Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

group_by is sorting and does not maintain original order #3279

Closed
ghaarsma opened this issue Jan 2, 2018 · 8 comments
Closed

group_by is sorting and does not maintain original order #3279

ghaarsma opened this issue Jan 2, 2018 · 8 comments

Comments

@ghaarsma
Copy link

ghaarsma commented Jan 2, 2018

It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:

set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])

# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]

Not sure why group_by is sorting. It seems like it's unnecessary including the additional computational effort. This would make the behavior more like the base function unique or dplyr function distinct, which does not sort either.

Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.

See for older discussion (but with incorrect finding/conclusion) #2159

@ghaarsma
Copy link
Author

ghaarsma commented Jan 2, 2018

Below is a simple example of how much time can be spend in sorting. It can easily be up to 50% of the time spend in group_by. Perhaps even up to 80% in this example if you include time to check

# check sorting times
l <- 2e3
int    <- sample(1L:l,l*l,replace = TRUE)
double <- sample(runif(l),l*l,replace = TRUE)
x <- tibble(int,double)

# Amount of time it will take to sort
system.time(s <- arrange(x,double,int))
#>    user  system elapsed 
#>    3.51    0.00    3.53
# Amount of time it will take if already sorted  ~ (time to check)
system.time(s <- arrange(s,double,int))
#>    user  system elapsed 
#>    1.54    0.01    1.56

# Amount of group_by time including the sorting
system.time(g1 <- group_by(x,double,int))
#>    user  system elapsed 
#>    5.52    0.16    5.70
# Amount of group_by time when tibble is already pre-sorted
# This is about %50 of previous time, but it still includes time to check the sort
system.time(g2 <- group_by(s,double,int))
#>    user  system elapsed 
#>    2.41    0.17    2.58

@hadley
Copy link
Member

hadley commented Jan 2, 2018

I think your analysis falsely assumes that the computation cost of sorting is independent of the cost of computing the indices.

@krlmlr
Copy link
Member

krlmlr commented Jan 17, 2018

Thanks for the analysis. The difference in timing is interesting, but could be attributed to cache effects or branch prediction (https://stackoverflow.com/q/11227809/946850). We'd need to dig deeper to know for sure, preferably with a profiler.

The factor is already sorted, so group_by() doesn't change the order. This should change if fact = factor(char) is used.

I'm not opposed to adding a .sort argument to group_by() after we assess the impact. This would also help consistency with database backends: They currently won't sort automatically.

@romainfrancois
Copy link
Member

The groups are sorted, but the data is left alone.

library(dplyr, warn.conflicts = FALSE)
library(purrr)
set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

sorted <- function(.data, ...){
  group_by(.data, ...) %>% 
    group_data() %>% 
    pull(1) %>% 
    negate(is.unsorted)()
}
sorted(x, char)
#> [1] TRUE
sorted(x, int)
#> [1] TRUE
sorted(x, double)
#> [1] TRUE
sorted(x, fact)
#> [1] TRUE

all.equal(group_by(x, char)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, int)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, double)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, fact)[1,], x[1,] )
#> [1] TRUE

So we pay for sorting, but on small data. I get this now with the devel version (esp after #341 and #3492).

> # check sorting times
> l <- 2e3
> int    <- sample(1L:l,l*l,replace = TRUE)
> double <- sample(runif(l),l*l,replace = TRUE)
> x <- tibble(int,double)
> 
> # Amount of time it will take to sort
> system.time(s <- arrange(x,double,int))
   user  system elapsed 
  3.403   0.025   3.436 
> 
> # Amount of time it will take if already sorted  ~ (time to check)
> system.time(s <- arrange(s,double,int))
   user  system elapsed 
  0.284   0.016   0.302 
> 
> # Amount of group_by time including the sorting
> system.time(g1 <- group_by(x,double,int))
   user  system elapsed 
  4.165   0.211   4.381 
> 
> # Amount of group_by time when tibble is already pre-sorted
> # This is about %50 of previous time, but it still includes time to check the sort
> system.time(g2 <- group_by(s,double,int))
   user  system elapsed 
  2.998   0.137   3.137 

I don't think we would want a .sort argument in group_by.

In group_by, this is where there is sorting: https://github.com/tidyverse/dplyr/blob/master/src/group_indices.cpp#L377

Without this sort, the groups would appear in unpredictable (governed by a hash map) order. This is where to experiment about the real price of sorting. I don't believe we need an option to not sort, so my involvement stops here ...

Also, with the more care about empty groups, there is no real concept of data order.

@jjesusfilho
Copy link

If I want to get the unsorted vector back with tidyr::uncount, it will not be possible:

set.seed(347)

s<-sample(c("a","b","c"),20,replace = TRUE)

aggregated<-tibble::tibble(original=s) %>% 
    dplyr::count(original,sort=FALSE)

disaggregated<-tidyr::uncount(aggregated,n)

@moodymudskipper
Copy link

moodymudskipper commented Mar 20, 2019

It feels wrong to me though that if I use mutate on grouped data the order will not be affected, but summarize and group_map will sort by groups. I think a consistent behavior would either sort the rows at the group_by step, or leave the rows in their initial order.

I wouldn't like the first option in fact, and it seems the second is problematic, but I thought it'd be worth mentioning.

I often see on SO users having to rearrange their data after grouping operations, it can be handled by converting to factors before grouping but it adds verbosity and is not obvious to all.

Here's the latest example : https://stackoverflow.com/questions/55256777

df %>% group_by(id) %>% group_map(~arrange(., var1)) %>% ungroup() could have been a solution (though maybe not the best) if row order had been preserved.

@romainfrancois
Copy link
Member

What I believe (which is probably irrelevant because we sort of have to obey the status quo here) is that group_by() should indeed reorder rows.

Rows should be grouped by, that’s in the name. But this is not what happens now. Rows of the data are untouched and group_by() only adds metadata about the keys and indices of each group.

Might be too late to changz that in dplyr. In other package i’ve experimented with (dance lately) making thz data contiguous within groups feels right and opens all sorts of possibilities implementation wise.

But this probably won’t happen, would break way too many dependencies.

@lock
Copy link

lock bot commented Sep 16, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 16, 2019
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Projects
None yet
Development

No branches or pull requests

6 participants