[[ subsetting much slower than $ #780

Closed
ebein opened this issue Jun 3, 2020 · 9 comments
Labels
bug (an unexpected problem or unintended behavior), performance 🏎️

Comments

ebein commented Jun 3, 2020

Starting with tibble 3.0.0, column subsetting using [[ is much slower than $. This causes slowdowns in functions that call [[ many times, for example data.matrix on a wide tibble.

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar        6.8us    8.1us   100956.    7.96KB      0  
#> 2 bracket     190.3us  211.2us     3998.  165.09KB     12.0

Created on 2020-06-03 by the reprex package (v0.3.0)
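To see how the per-call cost adds up in the `data.matrix()` case mentioned above, here is a rough sketch that times `data.matrix()` on a wide data frame and on the same data as a tibble. The width of 500 columns and the 20 repetitions are arbitrary choices for illustration, and the tibble part only runs if the package is installed. Timings depend heavily on the tibble version and the machine.

```r
# Build a wide data frame: 10 rows, 500 numeric columns (sizes are arbitrary).
n_cols <- 500
wide_df <- as.data.frame(matrix(runif(n_cols * 10), nrow = 10))

# Time repeated conversions of the plain data frame.
time_df <- system.time(for (i in 1:20) m_df <- data.matrix(wide_df))
print(c(data.frame = time_df[["elapsed"]]))

# Repeat with a tibble, if the package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  wide_tbl <- tibble::as_tibble(wide_df)
  time_tbl <- system.time(for (i in 1:20) m_tbl <- data.matrix(wide_tbl))
  print(c(tibble = time_tbl[["elapsed"]]))
}
```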

Member

hadley commented Jun 5, 2020

The call to vectbl_as_col_location2() is responsible for the speed difference, probably due to its use of tryCatch(), which is slow.
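As a rough illustration of the `tryCatch()` overhead, the sketch below times a trivial lookup called directly and wrapped in `tryCatch()`. The `lookup()` function is a hypothetical stand-in, not tibble's internal code, and the loop count is arbitrary. It only shows that installing a handler on every call has a measurable cost.

```r
# Hypothetical stand-in for a cheap column-location lookup (not tibble's code).
lookup <- function(x, name) match(name, names(x))

x <- list(a = 1, b = 2, c = 3)
n <- 1e5

# Direct calls.
plain <- system.time(for (i in seq_len(n)) lookup(x, "b"))

# Same calls, each wrapped in tryCatch() with an error handler.
wrapped <- system.time(
  for (i in seq_len(n)) tryCatch(lookup(x, "b"), error = function(e) NA_integer_)
)

print(c(plain = plain[["elapsed"]], wrapped = wrapped[["elapsed"]]))
```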

hadley added the bug label (an unexpected problem or unintended behavior) on Jun 5, 2020

@md0u80c9

Out of interest, do you know if $ was significantly quicker prior to tibble 3 - or was performance more equal?

Author

ebein commented Jun 10, 2020

Here is the same benchmark with tibble 2.1.1, run on a different machine. So it seems like $ was at most ~2x faster on 2.1.1 (the medians are identical) and is 25-30x faster on 3.0.1.

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar     682.65ns   1.36us   514262.    6.28KB        0
#> 2 bracket      1.02us   1.36us   472464.     5.3KB        0

Created on 2020-06-10 by the reprex package (v0.2.1)

md0u80c9 commented Jun 10, 2020

Thanks @ebein. All my machines were on tibble 3, and given the hassle of doing a full reinstall to check, I was cheeky and just asked the question! Very interesting: from a practical perspective it may be better to train my muscle memory to use $ where possible (obviously [[ has benefits where the column name isn't a constant).
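For readers wondering about the non-constant case: a minimal sketch, using a base data frame to stay self-contained, of why `[[` is still needed when the column name lives in a variable. The column names and values are made up for illustration.

```r
df <- data.frame(x = 1:3, y = 4:6)
col <- "y"  # column name held in a variable

by_value <- df[[col]]  # looks up the column named by the value of `col`
by_name <- df$col      # looks for a column literally called "col"

print(by_value)
print(by_name)
```

Here `df$col` does not use the variable at all, so `$` only works when the column name is written out in the code.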

Member

krlmlr commented Jun 13, 2020

Once we remove the vectbl_as_col_location2() call and the associated overhead, run time drops to 9 µs. Still way too much, compared to 130 ns for base lists.

Member

krlmlr commented Jun 13, 2020

Pure S3 dispatch without doing actual work is already 1.3 µs. Oh well...
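The dispatch overhead can be sketched without tibble at all. Below, `demo_tbl` is a hypothetical class whose `[[` method does nothing except forward to the internal `.subset2()`. Comparing it against a plain list gives a feel for the cost of S3 dispatch alone. Absolute numbers will differ from those quoted in this thread.

```r
# A plain list, and the same list with a hypothetical S3 class attached.
plain <- list(x = 1)
classed <- structure(list(x = 1), class = "demo_tbl")

# Method that does no real work: it only forwards to the internal subsetter.
`[[.demo_tbl` <- function(x, i) .subset2(x, i)

n <- 1e5
t_plain <- system.time(for (k in seq_len(n)) plain[["x"]])
t_classed <- system.time(for (k in seq_len(n)) classed[["x"]])

print(c(plain = t_plain[["elapsed"]], classed = t_classed[["elapsed"]]))
```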

Member

krlmlr commented Jun 14, 2020

Now:

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar       4.29µs   4.67µs   199184.    17.6KB        0
#> 2 bracket      3.91µs   4.25µs   219202.    90.6KB        0

Created on 2020-06-14 by the reprex package (v0.3.0)

Member

krlmlr commented Jun 14, 2020

We can strive for even faster processing (closer to 2 µs), but I suspect that needs a full rewrite in C. The current speed should be fast enough for most use cases.

krlmlr added a commit that referenced this issue Feb 25, 2021
tibble 3.0.2

- `[[` works with classed indexes again, e.g. created with `glue::glue()` (#778).
- `add_column()` works without warning for 0-column data frames (#786).
- `tribble()` now better handles named inputs (#775) and objects of non-vtrs classes like `lubridate::Period` (#784) and `formattable::formattable` (#785).

- Subsetting and subassignment are faster (#780, #790, #794).
- `is.null()` is preferred over `is_null()` for speed.
- Implement continuous benchmarking (#793).

- `is_vector_s3()` is no longer reexported from pillar (#789).

@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

github-actions bot locked and limited conversation to collaborators on Jun 15, 2021

4 participants