-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Multi-join suffixes #5700
Comments
I'm not sure this is in scope for dplyr, but it's come up a couple of times, so I think it's worth considering. Maybe we just need a function that takes a list of data frames, and adds the suffixes for you. Very very roughly: add_suffixes <- function(...) {
dfs <- list2(...)
no_name <- is.na(names2(dfs)
names(dfs)[no_name] <- letters[no_name]
names <- unlist(map(dfs, tbl_vars))
dups <- ...
# loop over each df, renaming any duplicates with the suffix
# return list of data frames.
} |
@eggrandio My solution when I have a similar issue is to add the first suffix ( > df <- full_join(x, y, by = "col1", suffix = c("","_y")) %>%
full_join(z, by = "col1", suffix = c("_x","_z"))
> df
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2 |
A question related to this issue is how to specify the |
There is good support for this in the dm package. Note how it also can print the library(conflicted)
library(dm)
a <- data.frame(col1 = c("a", "b", "c"))
x_tbl <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3, col3 = 1:3)
y_tbl <- data.frame(col1 = c("b", "c", "d"), col2 = 4:6, col3 = 1:3)
z_tbl <- data.frame(col1 = c("c", "d", "a"), col2 = 7:9, col3 = 1:3)
dm <-
dm(a, x_tbl, y_tbl, z_tbl) %>%
dm_add_fk(a, col1, x_tbl, col1) %>%
dm_add_fk(a, col1, y_tbl, col1) %>%
dm_add_fk(a, col1, z_tbl, col1)
dm %>% dm_draw() dm %>%
dm_disambiguate_cols()
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `a`, `x_tbl`, `y_tbl`, `z_tbl`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 3
dm %>%
dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1 a 1 1 NA NA 9 3
#> 2 b 2 2 4 1 NA NA
#> 3 c 3 3 5 2 7 1
#> 4 d NA NA 6 3 8 2
dm %>%
dm_disambiguate_cols() %>%
dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1 a 1 1 NA NA 9 3
#> 2 b 2 2 4 1 NA NA
#> 3 c 3 3 5 2 7 1
#> 4 d NA NA 6 3 8 2 Created on 2022-08-19 by the reprex package (v2.0.1) |
Given that dm has such a nice solution, I think we don't need to implement something in dplyr. |
Hi,
I want to merge several data.frames with some common columns and append a suffix to the column names to keep track from where does the data for each column come from.
I can do it easily with the suffix term in the first full_join, but when I do the second join, no suffixes are added. I can rename the third data.frame so it has suffixes, but I wanted to know if there is another way of doing it using the suffix term.
Here is an example code:
I was expecting that col2 and col3 from data.frame z would have a "_z" suffix. I have tried using empty suffixes while merging two data.frames and it works.
I can work around by renaming the columns in z before doing the second full_join, but in my real data I have several common columns, and if I wanted to merge more data.frames it would complicate the code. This is my expected output.
I have seen other similar problems in which adding an extra column to keep track of the source data.frame is used, but I was wondering why does not the suffix term work with multiple joins.
PS: If I keep the first suffix empty, I can add suffixes in the second join, but that will leave the col2 and col3 form x without suffix.
I might be missing why this is not working the way I expected, but in principle it should be very simple to add a suffix to the "newly added" columns.
Thanks!
The text was updated successfully, but these errors were encountered: