Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Bug or inconsistent or illogical processing of "suffix" in consecutive joins. #6553

Closed
abalter opened this issue Nov 18, 2022 · 3 comments
Closed

Comments

@abalter
Copy link

abalter commented Nov 18, 2022

When joining tibbles with columns that have the same name but are not join columns, the suffix argument allows you to provide a scheme to rename the columns from the tibbles. However, when you perform consecutive joins there is an inconsistent behavior. The first join correctly applies the suffixes. The second does not. All subsequent ones appear to.

I was especially surprised that it wasn't just with piped joined. Saving the intermediate joins to variables gives the same effect.

My expected output from the reprex below is:

#> # A tibble: 3 × 5
#>       A B1    B2    B3     B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l

There may be a "reason" why dplyr behaves this way. But it is very confusing and, more importantly, would require an elaborate workaround to achieve the desired result.

library(tidyverse)

df1 = tibble(A=1:3, B=letters[1:3])
df2 = tibble(A=1:3, B=letters[4:6])
df3 = tibble(A=1:3, B=letters[7:9])
df4 = tibble(A=1:3, B=letters[10:12])

df1 %>%
inner_join(df2, by="A", suffix=c("1", "2")) %>%
inner_join(df3, by="A", suffix=c("", "3")) %>%
inner_join(df4, by="A", suffix=c("", "4"))
#> # A tibble: 3 × 5
#>       A B1    B2    B     B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l


j12 = inner_join(df1, df2, by="A", suffix=c("1", "2"))
j12
#> # A tibble: 3 × 3
#>       A B1    B2   
#>   <int> <chr> <chr>
#> 1     1 a     d    
#> 2     2 b     e    
#> 3     3 c     f
j123 = inner_join(j12, df3, by="A", suffix=c("", "3"))
j123
#> # A tibble: 3 × 4
#>       A B1    B2    B    
#>   <int> <chr> <chr> <chr>
#> 1     1 a     d     g    
#> 2     2 b     e     h    
#> 3     3 c     f     i
j1234 = inner_join(j123, df4, by="A", suffix=c("", "4"))
j1234
#> # A tibble: 3 × 5
#>       A B1    B2    B     B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l

Created on 2022-11-18 with reprex v2.0.2

@hadley
Copy link
Member

hadley commented Nov 19, 2022

I don’t see any inconsistency — there are no non-join columns with the same name when joining a table with columns A, B1, and B2 and a table with columns A and B.

@hadley hadley closed this as completed Nov 19, 2022
@eutwt
Copy link
Contributor

eutwt commented Nov 19, 2022

fwiw, this is how I'd get your expected result using the method described in #5700. You might also look at that issue for an explanation of how to do this in a more standardized / repeatable way with the {dm} package

library(dplyr, warn.conflicts = FALSE)

df1 <- tibble(A = 1:3, B = letters[1:3])
df2 <- tibble(A = 1:3, B = letters[4:6])
df3 <- tibble(A = 1:3, B = letters[7:9])
df4 <- tibble(A = 1:3, B = letters[10:12])

df1 %>%
  inner_join(df2, by = "A", suffix = c("", "2")) %>%
  inner_join(df3, by = "A", suffix = c("", "3")) %>%
  inner_join(df4, by = "A", suffix = c("1", "4"))
#> # A tibble: 3 × 5
#>       A B1    B2    B3    B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l

Created on 2022-11-19 with reprex v2.0.2

@abalter
Copy link
Author

abalter commented Nov 19, 2022

@hadley -- thank you for pointing out to focus on the matched columns. That did lead me to two solutions that feel consistent and intuitive to me.

df1 %>%
  inner_join(df2, by="A", suffix=c("1", "")) %>%
  inner_join(df3, by="A", suffix=c("2", "")) %>%
  inner_join(df4, by="A", suffix=c("3", "4"))
#> # A tibble: 3 x 5
#>       A B1    B2    B3    B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l

df1 %>%
  inner_join(df2, by="A", suffix=c("1", "")) %>%
  inner_join(df3, by="A", suffix=c("2", "")) %>%
  inner_join(df4, by="A", suffix=c("3", "")) %>%
  rename(B4=B)
#> # A tibble: 3 x 5
#>       A B1    B2    B3    B4   
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     d     g     j    
#> 2     2 b     e     h     k    
#> 3     3 c     f     i     l

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants