Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add fillcombinations function #3012

Merged
merged 15 commits into from
Mar 29, 2022
Merged

Add fillcombinations function #3012

merged 15 commits into from
Mar 29, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Feb 19, 2022

Update of #1864.

@bkamins bkamins added this to the 1.4 milestone Feb 19, 2022
@nalimilan
Copy link
Member

I think the logic is fine, though it's not easy to check without a docstring or tests. :-)

I'm not sure about the name. It's true that expand might be too general. OTOH that term doesn't seem to be used by any packages currently, and I'm not sure what else it could mean. expandlevels isn't totally accurate as the function doesn't just expand levels, it adds rows for missing combinations.

@bkamins bkamins changed the title Add expand function WIP: Add expand function Feb 20, 2022
@bkamins bkamins changed the title WIP: Add expand function Add expand function Feb 20, 2022
@bkamins
Copy link
Member Author

bkamins commented Feb 20, 2022

@nalimilan I have added tests. The PR is good for review now.
The bad thing is that it is quite complex in corner cases and essentially every test I add is relevant and checks something important.

Comment on lines 1522 to 1525
# levels drops missing, handle the case where missing values are present
# All levels are retained, missing is added only if present
if any(ismissing, df[!, col])
tempcol = vcat(levels(df[!, col]), missing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has come up several times before: it would make sense to add an argument to levels to preserve missing so that we don't have to do this, which is suboptimal for performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened JuliaData/DataAPI.jl#44

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test/data.jl Outdated
Comment on lines 501 to 505
@test isequal_coltyped(completecombinations(df1, [:a, :b], allcols=true, fill="X", allowduplicates=ad),
DataFrame(a=[1, 2, missing, 1, 2, missing],
b=[1, 1, 1, 2, 2, 2],
c=[11, 12, "X", "X", "X", missing],
d=[111, 112, "X", "X", "X", 113]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default dplyr also replaces missing values in the input with fill (explicit=TRUE). But I find this weird so I agree what you do is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I made this choice intentionally. The same behavior of fill is in unstack.

test/data.jl Outdated
@test isequal_coltyped(completecombinations(df1, :c, allowduplicates=ad),
DataFrame(c=categorical([12, 11, 10, missing])))
@test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad),
DataFrame(c=categorical([12, 11, 10, missing, 12, 11, 10, missing]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test that levels are preserved?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 1470 to 1472
duplicates are allowed. They are not repeated if `allcols` is `false`
only unique combinations are produced then, but if `allcols` is `true`
the duplicates are included.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior differs from dplyr's complete, which retains duplicates. Any particular reason to drop them? Or should we allow three options about how duplicates should be handled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a consequence of the current design. I.e. it is natural to drop them in the way it is implemented.
But we can discuss what to do.

Now I am thinking that maybe we do not need the allcols kwarg and can simplify the API. Maybe we should have allcols=true always and expect from the user to pass a data frame only with columns that are wanted in the output?

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Mar 11, 2022

allcols kwarg is now removed.

If duplicates are allowed they are always kept.

also since we always keep all cols the column order of the source data frame is retained (earlier I put the expanded columns first, but since now we always keep all columns I think it makes more sense to keep their original order).

This should be good for a review.

@bkamins
Copy link
Member Author

bkamins commented Mar 16, 2022

@nalimilan - no rush, but it would be good to finalize this PR before we forget what we wanted. Thank you!

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just minor comments.

Maybe we should check that the name sounds OK by asking people on Slack?

test/data.jl Outdated
Comment on lines 506 to 510
@test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad),
DataFrame(a=[2; 1; fill(missing, 6)],
b=[1, 1, 1, 1, 2, 2, 2, 2],
c=categorical([12, 11, 10, missing, 12, 11, 10, missing]),
d=[112; 111; fill(missing, 5); 113]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the test for levels. Am I missing it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the tests - as usual tricky cases are present.

bkamins and others added 2 commits March 20, 2022 11:49
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Mar 20, 2022

Proposal for the name from Slack I like best is fillcombinations.

Other: add_combinations, fill_combinations, complete_combinations

@bkamins
Copy link
Member Author

bkamins commented Mar 28, 2022

@nalimilan - so what do you think about the best name for the function given the discussion?

@nalimilan
Copy link
Member

I also prefer fillcombinations, though completecombinations is OK too.

@bkamins bkamins changed the title Add expand function Add fillcombinations function Mar 28, 2022
@bkamins
Copy link
Member Author

bkamins commented Mar 28, 2022

OK - changed to fillcombinations.

@bkamins bkamins merged commit bb2629d into main Mar 29, 2022
@bkamins bkamins deleted the bk/expand branch March 29, 2022 09:41
@bkamins
Copy link
Member Author

bkamins commented Mar 29, 2022

Thank you!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants