group_by is sorting and does not maintain original order #3279

ghaarsma · 2018-01-02T11:35:16Z

It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:

set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])

# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]

Not sure why group_by is sorting. It seems like it's unnecessary including the additional computational effort. This would make the behavior more like the base function unique or dplyr function distinct, which does not sort either.

Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.

See for older discussion (but with incorrect finding/conclusion) #2159

The text was updated successfully, but these errors were encountered:

ghaarsma · 2018-01-02T11:57:29Z

Below is a simple example of how much time can be spend in sorting. It can easily be up to 50% of the time spend in group_by. Perhaps even up to 80% in this example if you include time to check

# check sorting times
l <- 2e3
int    <- sample(1L:l,l*l,replace = TRUE)
double <- sample(runif(l),l*l,replace = TRUE)
x <- tibble(int,double)

# Amount of time it will take to sort
system.time(s <- arrange(x,double,int))
#>    user  system elapsed 
#>    3.51    0.00    3.53
# Amount of time it will take if already sorted  ~ (time to check)
system.time(s <- arrange(s,double,int))
#>    user  system elapsed 
#>    1.54    0.01    1.56

# Amount of group_by time including the sorting
system.time(g1 <- group_by(x,double,int))
#>    user  system elapsed 
#>    5.52    0.16    5.70
# Amount of group_by time when tibble is already pre-sorted
# This is about %50 of previous time, but it still includes time to check the sort
system.time(g2 <- group_by(s,double,int))
#>    user  system elapsed 
#>    2.41    0.17    2.58

hadley · 2018-01-02T14:23:19Z

I think your analysis falsely assumes that the computation cost of sorting is independent of the cost of computing the indices.

krlmlr · 2018-01-17T08:04:58Z

Thanks for the analysis. The difference in timing is interesting, but could be attributed to cache effects or branch prediction (https://stackoverflow.com/q/11227809/946850). We'd need to dig deeper to know for sure, preferably with a profiler.

The factor is already sorted, so group_by() doesn't change the order. This should change if fact = factor(char) is used.

I'm not opposed to adding a .sort argument to group_by() after we assess the impact. This would also help consistency with database backends: They currently won't sort automatically.

romainfrancois · 2018-05-30T10:23:46Z

The groups are sorted, but the data is left alone.

library(dplyr, warn.conflicts = FALSE)
library(purrr)
set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

sorted <- function(.data, ...){
  group_by(.data, ...) %>% 
    group_data() %>% 
    pull(1) %>% 
    negate(is.unsorted)()
}
sorted(x, char)
#> [1] TRUE
sorted(x, int)
#> [1] TRUE
sorted(x, double)
#> [1] TRUE
sorted(x, fact)
#> [1] TRUE

all.equal(group_by(x, char)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, int)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, double)[1,], x[1,] )
#> [1] TRUE
all.equal(group_by(x, fact)[1,], x[1,] )
#> [1] TRUE

So we pay for sorting, but on small data. I get this now with the devel version (esp after #341 and #3492).

> # check sorting times
> l <- 2e3
> int    <- sample(1L:l,l*l,replace = TRUE)
> double <- sample(runif(l),l*l,replace = TRUE)
> x <- tibble(int,double)
> 
> # Amount of time it will take to sort
> system.time(s <- arrange(x,double,int))
   user  system elapsed 
  3.403   0.025   3.436 
> 
> # Amount of time it will take if already sorted  ~ (time to check)
> system.time(s <- arrange(s,double,int))
   user  system elapsed 
  0.284   0.016   0.302 
> 
> # Amount of group_by time including the sorting
> system.time(g1 <- group_by(x,double,int))
   user  system elapsed 
  4.165   0.211   4.381 
> 
> # Amount of group_by time when tibble is already pre-sorted
> # This is about %50 of previous time, but it still includes time to check the sort
> system.time(g2 <- group_by(s,double,int))
   user  system elapsed 
  2.998   0.137   3.137

I don't think we would want a .sort argument in group_by.

In group_by, this is where there is sorting: https://github.com/tidyverse/dplyr/blob/master/src/group_indices.cpp#L377

Without this sort, the groups would appear in unpredictable (governed by a hash map) order. This is where to experiment about the real price of sorting. I don't believe we need an option to not sort, so my involvement stops here ...

Also, with the more care about empty groups, there is no real concept of data order.

jjesusfilho · 2018-10-31T09:41:26Z

If I want to get the unsorted vector back with tidyr::uncount, it will not be possible:

set.seed(347)

s<-sample(c("a","b","c"),20,replace = TRUE)

aggregated<-tibble::tibble(original=s) %>% 
    dplyr::count(original,sort=FALSE)

disaggregated<-tidyr::uncount(aggregated,n)

moodymudskipper · 2019-03-20T11:38:48Z

It feels wrong to me though that if I use mutate on grouped data the order will not be affected, but summarize and group_map will sort by groups. I think a consistent behavior would either sort the rows at the group_by step, or leave the rows in their initial order.

I wouldn't like the first option in fact, and it seems the second is problematic, but I thought it'd be worth mentioning.

I often see on SO users having to rearrange their data after grouping operations, it can be handled by converting to factors before grouping but it adds verbosity and is not obvious to all.

Here's the latest example : https://stackoverflow.com/questions/55256777

df %>% group_by(id) %>% group_map(~arrange(., var1)) %>% ungroup() could have been a solution (though maybe not the best) if row order had been preserved.

romainfrancois · 2019-03-20T13:34:50Z

What I believe (which is probably irrelevant because we sort of have to obey the status quo here) is that group_by() should indeed reorder rows.

Rows should be grouped by, that’s in the name. But this is not what happens now. Rows of the data are untouched and group_by() only adds metadata about the keys and indices of each group.

Might be too late to changz that in dplyr. In other package i’ve experimented with (dance lately) making thz data contiguous within groups feels right and opens all sorts of possibilities implementation wise.

But this probably won’t happen, would break way too many dependencies.

lock · 2019-09-16T13:57:50Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr added data frame performance 🚀 labels Jan 17, 2018

mem48 mentioned this issue May 24, 2018

sum_network_routes results differ between sf and sp data ropensci/stplanr#249

Closed

romainfrancois closed this as completed May 30, 2018

lock bot locked and limited conversation to collaborators Sep 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

group_by is sorting and does not maintain original order #3279

group_by is sorting and does not maintain original order #3279

ghaarsma commented Jan 2, 2018

ghaarsma commented Jan 2, 2018

hadley commented Jan 2, 2018

krlmlr commented Jan 17, 2018 •

edited

Loading

romainfrancois commented May 30, 2018

jjesusfilho commented Oct 31, 2018

moodymudskipper commented Mar 20, 2019 •

edited

Loading

romainfrancois commented Mar 20, 2019

lock bot commented Sep 16, 2019

group_by is sorting and does not maintain original order #3279

group_by is sorting and does not maintain original order #3279

Comments

ghaarsma commented Jan 2, 2018

ghaarsma commented Jan 2, 2018

hadley commented Jan 2, 2018

krlmlr commented Jan 17, 2018 • edited Loading

romainfrancois commented May 30, 2018

jjesusfilho commented Oct 31, 2018

moodymudskipper commented Mar 20, 2019 • edited Loading

romainfrancois commented Mar 20, 2019

lock bot commented Sep 16, 2019

krlmlr commented Jan 17, 2018 •

edited

Loading

moodymudskipper commented Mar 20, 2019 •

edited

Loading