Better way to handle name collisions in joins #4028
Comments
I would appreciate something like this. Unintentional column renamings are such a common error that I would probably just make a habit of putting in .resolve="stop", even when I don't anticipate it being necessary. Currently as a hack I set both suffixes to be equal, which only raises an error when they end up getting used. Arguably I am making my code unnecessarily wordy but I think it's always worth some extra text to make my expectations explicit ¯_(ツ)_/¯
I think the simplest interface might be to make default
I'd also like to vote in favor of some improvement here. I just saw a bunch of unexpected behavior in an analysis when a table had gained a column(*) that was already present in a table it was being joined with. The fact that there is no warning or message stating that columns are being renamed makes finding these kinds of bugs very difficult. I think the default should either be no renaming at all, or, if you want to keep backwards compatibility, rename but issue at least a message, if not a warning.
(*) To clarify: I was rerunning the analysis with new input data, and one of the input tables had unexpectedly gained a new column.
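As a rough illustration of the "at least a message" idea, here is a minimal base R sketch. The helper names `shared_non_key_columns()` and `notify_collisions()` and the example tables are hypothetical and not part of dplyr; the sketch only reports which non-key columns exist in both tables before a join would rename them.

```r
# Hypothetical helper (not part of dplyr): non-key columns shared by both tables.
shared_non_key_columns <- function(x, y, by) {
  setdiff(intersect(names(x), names(y)), by)
}

# Hypothetical wrapper: emit a message when a join would have to rename columns.
notify_collisions <- function(x, y, by) {
  dup <- shared_non_key_columns(x, y, by)
  if (length(dup) > 0) {
    message(
      "Columns present in both tables will be renamed: ",
      paste(dup, collapse = ", ")
    )
  }
  invisible(dup)
}

# Made-up data: `new` is `old` after it unexpectedly gained a `group` column.
old <- data.frame(id = 1:3, score = c(10, 20, 30))
new <- data.frame(id = 1:3, score = c(11, 21, 31), group = c("a", "b", "c"))
lookup <- data.frame(id = 1:3, group = c("x", "y", "z"))

notify_collisions(old, lookup, by = "id") # silent: only the key is shared
notify_collisions(new, lookup, by = "id") # messages about `group`
```

Calling such a check right before the join would have surfaced the new column at the point where the input data changed.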
Lately I have wished to have something like
See also #5700
Would love the option This should only have an effect if Maybe should inform also if you have
df1 <- data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
value1 = c(1, 5, 7)
)
df2 <- data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
value2 = c(1, 6, 7)
)
df3 <- data.frame(
id = c(1, 2, 3),
name = c("name.1", "name.2", "name.3"),
value2 = c(1, 5, 7)
)
df1 |> left_join(df2, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y`
#> Mapping is compatible, you should use `join_by(id, name)`
df1 |> left_join(df3, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y` and is not the same
#> Either delete the `name` variable from `x` or `y`, or use suffix.
my main reasoning behind specifying
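The two proposed error messages depend on whether the colliding column agrees between the tables for matching keys. A minimal base R sketch of that check follows; `collision_is_compatible()` is a hypothetical helper name, not a dplyr function, and the data repeats the example above.

```r
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  value1 = c(1, 5, 7)
)
df2 <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  value2 = c(1, 6, 7)
)
df3 <- data.frame(
  id = c(1, 2, 3),
  name = c("name.1", "name.2", "name.3"),
  value2 = c(1, 5, 7)
)

# Hypothetical helper: TRUE when `col` has the same value in `x` and `y`
# for every matching key, i.e. `col` could safely be added to the join keys.
collision_is_compatible <- function(x, y, by, col) {
  # merge() suffixes the shared non-key column as `<col>.x` and `<col>.y`
  m <- merge(x[c(by, col)], y[c(by, col)], by = by)
  identical(
    as.character(m[[paste0(col, ".x")]]),
    as.character(m[[paste0(col, ".y")]])
  )
}

collision_is_compatible(df1, df2, by = "id", col = "name")
collision_is_compatible(df1, df3, by = "id", col = "name")
```

Under these assumptions the first call corresponds to the "Mapping is compatible" case and the second to the "is not the same" case.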
I'd suggest that there's room for enhancements to checking/transformation of column names and join keys that goes beyond the scope of the
My feature wish list
Ideas for an API:
Possible arguments which could be used to achieve the above:
The existing
Currently, non-join columns available in both tables are given suffixes .x and .y. Occasionally one might want to raise an error or keep only the lhs columns in these situations. (This would also make it easier to adopt universal/unique renaming here.)
Created on 2018-12-17 by the reprex package (v0.2.1.9000)
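To make the three behaviours concrete (suffix, raise an error, keep only the lhs columns), here is a minimal sketch built on base R `merge()`. The function `join_resolve()` and its `resolve` argument are hypothetical and only illustrate the request; they are not an existing or planned dplyr interface.

```r
# Hypothetical left-join wrapper with an explicit collision policy.
join_resolve <- function(x, y, by, resolve = c("suffix", "stop", "keep_x")) {
  resolve <- match.arg(resolve)
  # Non-key columns that exist in both tables
  dup <- setdiff(intersect(names(x), names(y)), by)
  if (length(dup) > 0) {
    if (resolve == "stop") {
      stop(
        "Non-key columns present in both tables: ",
        paste(dup, collapse = ", "),
        call. = FALSE
      )
    }
    if (resolve == "keep_x") {
      # Drop the colliding columns from the rhs so the lhs versions survive
      y <- y[setdiff(names(y), dup)]
    }
  }
  # all.x = TRUE keeps every row of x; leftover collisions get .x / .y suffixes
  merge(x, y, by = by, all.x = TRUE)
}

df1 <- data.frame(id = c(1, 2, 3), name = c("name1", "name2", "name3"), value1 = c(1, 5, 7))
df2 <- data.frame(id = c(1, 2, 3), name = c("name1", "name2", "name3"), value2 = c(1, 6, 7))

names(join_resolve(df1, df2, by = "id"))                     # suffixed columns
names(join_resolve(df1, df2, by = "id", resolve = "keep_x")) # lhs `name` only
res <- try(join_resolve(df1, df2, by = "id", resolve = "stop"), silent = TRUE)
inherits(res, "try-error")
```

With `resolve = "stop"`, an unexpected new column fails loudly at the join instead of silently producing renamed columns downstream.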