-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add fillcombinations function #3012
Conversation
I think the logic is fine, though it's not easy to check without a docstring or tests. :-) I'm not sure about the name. It's true that |
@nalimilan I have added tests. The PR is good for review now. |
# levels drops missing, handle the case where missing values are present | ||
# All levels are retained, missing is added only if present | ||
if any(ismissing, df[!, col]) | ||
tempcol = vcat(levels(df[!, col]), missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this has come up several times before: it would make sense to add an argument to levels
to preserve missing
so that we don't have to do this, which is suboptimal for performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have opened JuliaData/DataAPI.jl#44
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/data.jl
Outdated
@test isequal_coltyped(completecombinations(df1, [:a, :b], allcols=true, fill="X", allowduplicates=ad), | ||
DataFrame(a=[1, 2, missing, 1, 2, missing], | ||
b=[1, 1, 1, 2, 2, 2], | ||
c=[11, 12, "X", "X", "X", missing], | ||
d=[111, 112, "X", "X", "X", 113])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default dplyr also replaces missing values in the input with fill
(explicit=TRUE
). But I find this weird so I agree what you do is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I made this choice intentionally. The same behavior of fill
is in unstack
.
test/data.jl
Outdated
@test isequal_coltyped(completecombinations(df1, :c, allowduplicates=ad), | ||
DataFrame(c=categorical([12, 11, 10, missing]))) | ||
@test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad), | ||
DataFrame(c=categorical([12, 11, 10, missing, 12, 11, 10, missing]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test that levels are preserved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
duplicates are allowed. They are not repeated if `allcols` is `false` | ||
only unique combinations are produced then, but if `allcols` is `true` | ||
the duplicates are included. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior differs from dplyr's complete
, which retains duplicates. Any particular reason to drop them? Or should we allow three options about how duplicates should be handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a consequence of the current design. I.e. it is natural to drop them in the way it is implemented.
But we can discuss what to do.
Now I am thinking that maybe we do not need the allcols
kwarg and can simplify the API. Maybe we should have allcols=true
always and expect from the user to pass a data frame only with columns that are wanted in the output?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
If duplicates are allowed they are always kept. also since we always keep all cols the column order of the source data frame is retained (earlier I put the expanded columns first, but since now we always keep all columns I think it makes more sense to keep their original order). This should be good for a review. |
@nalimilan - no rush, but it would be good to finalize this PR before we forget what we wanted. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just minor comments.
Maybe we should check that the name sounds OK by asking people on Slack?
test/data.jl
Outdated
@test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad), | ||
DataFrame(a=[2; 1; fill(missing, 6)], | ||
b=[1, 1, 1, 1, 2, 2, 2, 2], | ||
c=categorical([12, 11, 10, missing, 12, 11, 10, missing]), | ||
d=[112; 111; fill(missing, 5); 113])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the test for levels
. Am I missing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the tests - as usual tricky cases are present.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Proposal for the name from Slack I like best is Other: |
@nalimilan - so what do you think about the best name for the function given the discussion? |
I also prefer |
OK - changed to |
Thank you! |
Update of #1864.