Add fillcombinations function #3012

bkamins · 2022-02-19T15:48:17Z

Update of #1864.

nalimilan · 2022-02-19T17:02:51Z

I think the logic is fine, though it's not easy to check without a docstring or tests. :-)

I'm not sure about the name. It's true that expand might be too general. OTOH that term doesn't seem to be used by any packages currently, and I'm not sure what else it could mean. expandlevels isn't totally accurate as the function doesn't just expand levels, it adds rows for missing combinations.

bkamins · 2022-02-20T22:03:16Z

@nalimilan I have added tests. The PR is good for review now.
The bad thing is that it is quite complex in corner cases, so essentially every test I add is relevant and checks something important.

nalimilan · 2022-02-24T21:05:10Z

src/abstractdataframe/abstractdataframe.jl

+        # levels drops missing, handle the case where missing values are present
+        # All levels are retained, missing is added only if present
+        if any(ismissing, df[!, col])
+            tempcol = vcat(levels(df[!, col]), missing)


I think this has come up several times before: it would make sense to add an argument to levels to preserve missing so that we don't have to do this, which is suboptimal for performance.

I have opened JuliaData/DataAPI.jl#44

JuliaData/DataAPI.jl#46
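
A minimal sketch of the workaround shown in the diff above, assuming DataFrames.jl is loaded so that `levels` is in scope. The helper name `levels_with_missing` is hypothetical and not part of the PR; it only illustrates why the extra `any(ismissing, ...)` pass is needed.

```julia
using DataFrames  # brings `levels` (defined in DataAPI.jl) into scope

# Hypothetical helper mirroring the diff: `levels` skips `missing`,
# so append it by hand only when the column actually contains one.
function levels_with_missing(col::AbstractVector)
    lv = levels(col)
    return any(ismissing, col) ? vcat(lv, missing) : lv
end

with_missing = levels_with_missing([2, 1, missing, 2])  # non-missing levels, then missing
without_missing = levels_with_missing([2, 1, 2])        # nothing appended
```

The column is scanned twice here (once by `levels`, once by `any`), which is the performance cost the comment refers to.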

nalimilan · 2022-02-24T21:47:21Z

test/data.jl

+        @test isequal_coltyped(completecombinations(df1, [:a, :b], allcols=true, fill="X", allowduplicates=ad),
+                               DataFrame(a=[1, 2, missing, 1, 2, missing],
+                                         b=[1, 1, 1, 2, 2, 2],
+                                         c=[11, 12, "X", "X", "X", missing],
+                                         d=[111, 112, "X", "X", "X", 113]))


By default tidyr's complete also replaces missing values in the input with fill (explicit=TRUE). But I find this weird, so I agree that what you do is better.

Yes - I made this choice intentionally. fill behaves the same way in unstack.
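
A small sketch of the fill behavior discussed here, written against the merged `fillcombinations` API rather than the `completecombinations` name used in the test above. The toy data frame is made up for illustration.

```julia
using DataFrames

# Toy data (hypothetical): the combinations (a=2, b=1) and (a=1, b=2) are absent,
# and the existing row (a=2, b=2) already holds a missing in column c.
df = DataFrame(a=[1, 2], b=[1, 2], c=[10, missing])

res = fillcombinations(df, [:a, :b], fill=0)

# Only the rows added for absent combinations receive the fill value;
# the missing that was already in the input is left untouched.
n_filled = count(isequal(0), res.c)
n_missing = count(ismissing, res.c)
```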

nalimilan · 2022-02-24T21:59:31Z

test/data.jl

+        @test isequal_coltyped(completecombinations(df1, :c, allowduplicates=ad),
+                               DataFrame(c=categorical([12, 11, 10, missing])))
+        @test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad),
+                               DataFrame(c=categorical([12, 11, 10, missing, 12, 11, 10, missing]),


Test that levels are preserved?

nalimilan · 2022-02-24T22:01:31Z

src/abstractdataframe/abstractdataframe.jl

+    duplicates are allowed. They are not repeated if `allcols` is `false`
+    only unique combinations are produced then, but if `allcols` is `true`
+    the duplicates are included.


This behavior differs from tidyr's complete, which retains duplicates. Any particular reason to drop them? Or should we offer three options for how duplicates are handled?

It is a consequence of the current design, i.e. dropping them is natural given the way it is implemented.
But we can discuss what to do.

Now I am thinking that maybe we do not need the allcols kwarg and can simplify the API. Maybe we should always behave as with allcols=true and expect the user to pass a data frame containing only the columns that are wanted in the output?

src/abstractdataframe/abstractdataframe.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2022-03-11T19:16:46Z

The allcols kwarg is now removed.

If duplicates are allowed, they are always kept.

Also, since we always keep all columns, the column order of the source data frame is retained (earlier I put the expanded columns first, but now that all columns are always kept I think it makes more sense to keep their original order).
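
A short sketch of the two points above, using the final `fillcombinations` name. The data is made up for illustration, with the value column deliberately placed before the index columns.

```julia
using DataFrames

# Toy data (hypothetical): the combination (a=1, b=1) appears twice.
df = DataFrame(v=[1, 2, 3], a=[1, 1, 2], b=[1, 1, 2])

res = fillcombinations(df, [:a, :b], allowduplicates=true)

cols = names(res)               # same column order as in df
n_rows = nrow(res)              # three input rows plus one row per absent combination
n_new = count(ismissing, res.v) # rows added for (a=2, b=1) and (a=1, b=2)
```

Without `allowduplicates=true` the same call is expected to throw, since duplicate combinations in the index columns are rejected by default.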

This should be good for a review.

bkamins · 2022-03-16T20:50:30Z

@nalimilan - no rush, but it would be good to finalize this PR before we forget what we wanted. Thank you!

nalimilan

Looks good! Just minor comments.

Maybe we should check that the name sounds OK by asking people on Slack?

nalimilan · 2022-03-19T14:44:06Z

test/data.jl

+        @test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad),
+                               DataFrame(a=[2; 1; fill(missing, 6)],
+                                         b=[1, 1, 1, 1, 2, 2, 2, 2],
+                                         c=categorical([12, 11, 10, missing, 12, 11, 10, missing]),
+                                         d=[112; 111; fill(missing, 5); 113]))


I don't see the test for levels. Am I missing it?

I have added the tests. As usual, tricky cases are present.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2022-03-20T11:13:05Z

The name proposal from Slack that I like best is fillcombinations.

Other proposals: add_combinations, fill_combinations, complete_combinations

bkamins · 2022-03-28T06:37:06Z

@nalimilan - so what do you think about the best name for the function given the discussion?

nalimilan · 2022-03-28T12:31:38Z

I also prefer fillcombinations, though completecombinations is OK too.

bkamins · 2022-03-28T13:02:21Z

OK - changed to fillcombinations.

bkamins · 2022-03-29T09:41:59Z

Thank you!

bkamins mentioned this pull request Feb 19, 2022

[WIP] complete and expand df #1864

Closed

bkamins added the feature label Feb 19, 2022

bkamins added this to the 1.4 milestone Feb 19, 2022

add expand function

99323b5

bkamins force-pushed the bk/expand branch from 0e921a9 to 99323b5 Compare February 20, 2022 12:40

improved implementation

7034e62

bkamins changed the title ~~Add expand function~~ WIP: Add expand function Feb 20, 2022

small improvement

b6a05cd

bkamins mentioned this pull request Feb 20, 2022

Problem with unstack from DataFrames.jl on CategoricalVector JuliaData/CategoricalArrays.jl#380

Closed

add tests

57e0304

bkamins changed the title ~~WIP: Add expand function~~ Add expand function Feb 20, 2022

bkamins added 2 commits February 20, 2022 23:19

make sure input is consistent

7eafa46

improve test coverage

cbc936c

nalimilan reviewed Feb 24, 2022

Apply suggestions from code review

a004c62

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins mentioned this pull request Mar 11, 2022

add kwarg to levels to keep missing JuliaData/DataAPI.jl#44

Closed

bkamins added 2 commits March 11, 2022 12:18

fixes after code review

c999436

remove allcols

7083a8e

bkamins added 2 commits March 11, 2022 21:48

fix typo

b073389

Merge branch 'main' into bk/expand

1bf2288

nalimilan reviewed Mar 19, 2022

bkamins and others added 2 commits March 20, 2022 11:49

Apply suggestions from code review

2760262

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

apply suggestions from code review

77ae5cf

bkamins changed the title ~~Add expand function~~ Add fillcombinations function Mar 28, 2022

change to fillcombinations

e7f72b2

nalimilan approved these changes Mar 28, 2022

improve comments

4b601f5

bkamins merged commit bb2629d into main Mar 29, 2022

bkamins deleted the bk/expand branch March 29, 2022 09:41
