-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
group_by is sorting and does not maintain original order #3279
Comments
Below is a simple example of how much time can be spend in sorting. It can easily be up to 50% of the time spend in group_by. Perhaps even up to 80% in this example if you include time to check # check sorting times
l <- 2e3
int <- sample(1L:l,l*l,replace = TRUE)
double <- sample(runif(l),l*l,replace = TRUE)
x <- tibble(int,double)
# Amount of time it will take to sort
system.time(s <- arrange(x,double,int))
#> user system elapsed
#> 3.51 0.00 3.53
# Amount of time it will take if already sorted ~ (time to check)
system.time(s <- arrange(s,double,int))
#> user system elapsed
#> 1.54 0.01 1.56
# Amount of group_by time including the sorting
system.time(g1 <- group_by(x,double,int))
#> user system elapsed
#> 5.52 0.16 5.70
# Amount of group_by time when tibble is already pre-sorted
# This is about %50 of previous time, but it still includes time to check the sort
system.time(g2 <- group_by(s,double,int))
#> user system elapsed
#> 2.41 0.17 2.58 |
I think your analysis falsely assumes that the computation cost of sorting is independent of the cost of computing the indices. |
Thanks for the analysis. The difference in timing is interesting, but could be attributed to cache effects or branch prediction (https://stackoverflow.com/q/11227809/946850). We'd need to dig deeper to know for sure, preferably with a profiler. The factor is already sorted, so I'm not opposed to adding a |
The groups are sorted, but the data is left alone. library(dplyr, warn.conflicts = FALSE)
library(purrr)
set.seed(4)
char <- sample(LETTERS[1:20],40,replace = TRUE)
int <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)
x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))
sorted <- function(.data, ...){
group_by(.data, ...) %>%
group_data() %>%
pull(1) %>%
negate(is.unsorted)()
}
sorted(x, char)
#> [1] TRUE
sorted(x, int)
#> [1] TRUE
sorted(x, double)
#> [1] TRUE
sorted(x, fact)
#> [1] TRUE
all.equal(group_by(x, char)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, int)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, double)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, fact)[1,], x[1,] )
#> [1] TRUE So we pay for sorting, but on small data. I get this now with the devel version (esp after #341 and #3492). > # check sorting times
> l <- 2e3
> int <- sample(1L:l,l*l,replace = TRUE)
> double <- sample(runif(l),l*l,replace = TRUE)
> x <- tibble(int,double)
>
> # Amount of time it will take to sort
> system.time(s <- arrange(x,double,int))
user system elapsed
3.403 0.025 3.436
>
> # Amount of time it will take if already sorted ~ (time to check)
> system.time(s <- arrange(s,double,int))
user system elapsed
0.284 0.016 0.302
>
> # Amount of group_by time including the sorting
> system.time(g1 <- group_by(x,double,int))
user system elapsed
4.165 0.211 4.381
>
> # Amount of group_by time when tibble is already pre-sorted
> # This is about %50 of previous time, but it still includes time to check the sort
> system.time(g2 <- group_by(s,double,int))
user system elapsed
2.998 0.137 3.137 I don't think we would want a In group_by, this is where there is sorting: https://github.com/tidyverse/dplyr/blob/master/src/group_indices.cpp#L377 Without this sort, the groups would appear in unpredictable (governed by a hash map) order. This is where to experiment about the real price of sorting. I don't believe we need an option to not sort, so my involvement stops here ... Also, with the more care about empty groups, there is no real concept of data order. |
If I want to get the unsorted vector back with
|
It feels wrong to me though that if I use I wouldn't like the first option in fact, and it seems the second is problematic, but I thought it'd be worth mentioning. I often see on SO users having to rearrange their data after grouping operations, it can be handled by converting to factors before grouping but it adds verbosity and is not obvious to all. Here's the latest example : https://stackoverflow.com/questions/55256777
|
What I believe (which is probably irrelevant because we sort of have to obey the status quo here) is that group_by() should indeed reorder rows. Rows should be grouped by, that’s in the name. But this is not what happens now. Rows of the data are untouched and group_by() only adds metadata about the keys and indices of each group. Might be too late to changz that in dplyr. In other package i’ve experimented with (dance lately) making thz data contiguous within groups feels right and opens all sorts of possibilities implementation wise. But this probably won’t happen, would break way too many dependencies. |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:
Not sure why group_by is sorting. It seems like it's unnecessary including the additional computational effort. This would make the behavior more like the base function
unique
or dplyr functiondistinct
, which does not sort either.Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.
See for older discussion (but with incorrect finding/conclusion) #2159
The text was updated successfully, but these errors were encountered: