Add `skipmissing` argument to `levels` #46

nalimilan · 2022-03-20T10:45:12Z

Currently if one wants to get levels in their appropriate order, but add missing if present, the only solution is to do something like union(levels(x), unique(x)), which is inefficient.
Support skipmissing=false to allow doing this in a single pass over the data.

Use @inline to ensure that the return type can be inferred when the value of skipmissing is known statically. Also fix a type instability which existed for ranges.

Of course this new feature won't work for custom types which override this method (like CategoricalArray) until packages implement it. Unfortunately there's no way for packages which would like to rely on it (like DataFrames) to require an appropriate version.

There will also be a decision to make in CategoricalArrays as to whether missing should be returned only when present in the data (like the method defined here) or all the time as long as the eltype allows for it (like for other levels, which is more efficient).

Fixes #44.

Currently if one wants to get levels in their appropriate order, but add `missing` if present, the only solution is to do something like `union(levels(x), unique(x))`, which is inefficient. Support `skipmissing=false` to allow doing this in a single pass over the data. Use `@inline` to ensure that the return type can be inferred when the value of `skipmissing` is known statically. Also fix a type instability which existed for ranges.

codecov · 2022-03-20T10:46:31Z

Codecov Report

Merging #46 (8f36ce9) into main (1b6c589) will increase coverage by 2.52%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   92.59%   95.12%   +2.52%     
==========================================
  Files           1        1              
  Lines          27       41      +14     
==========================================
+ Hits           25       39      +14     
  Misses          2        2

Impacted Files	Coverage Δ
src/DataAPI.jl	`95.12% <100.00%> (+2.52%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b6c589...8f36ce9. Read the comment docs.

bkamins · 2022-03-20T11:19:51Z

CI on 1.0 fails.

bkamins

Implementation looks good. The only problem is the one that you have noted - we cannot rely on this behavior in library codes anyway.
Maybe we can have this for user's code, but then add something like levelsmissing that is internal and intended for library code to use (your current _levels_missing which is internal but not indented to be used).

nalimilan · 2022-03-20T11:33:52Z

Maybe we can have this for user's code, but then add something like levelsmissing that is internal and intended for library code to use (your current _levels_missing which is internal but not indented to be used).

So the advantage would be that we would be sure that it doesn't throw an error? If a type that uses a custom order of levels doesn't implement it, it would return them in their order of appearance though. Not sure whether it's better to throw an explicit error so that users can complain or give them suboptimal results...

bkamins · 2022-03-20T13:05:20Z

Having thought of it again, actually your solution is OK. We only would need in _levels_missing and _levels_skipmissing check if for the passed type the source package defines levels. If it does then levels would be appropriately called.

The point is that we would get to these calls only if package defines levels(x::SomeType) but does not define levels(x::SomeType; skipmissing::Bool=true). The reason is that if it would define new API then we would never reach the internal methods. On the other hand if it does not define the new signature, but defines the old one we know it is levels without missing.

nalimilan · 2022-03-20T13:58:13Z

Ah, interesting. You mean that _levels_missing would do something like this?

if which(DataAPI.levels, Tuple{typeof(x)}) === which(DataAPI.levels, Tuple{Vector})
    ... # Current code
elseif any(ismissing, col)
    [_levels_skipmissing(col); missing]
else
    _levels_skipmissing(col)
end

bkamins · 2022-03-20T14:20:25Z

In _levels_missing I mean the following:

if which(DataAPI.levels, (typeof(x),_) === which(DataAPI.levels, (Any,))
    ... # Current code
elseif any(ismissing, col)
    [levels(col); missing]
else
    levels(col)
end

The point is that if which(DataAPI.levels, (typeof(x),_) === which(DataAPI.levels, (Any,)) is false then "old style" levels is defined for typeof(x).

nalimilan · 2022-03-20T16:07:52Z

Ah right. Though doing this would mean that DataFrames would have to call _levels_missing rather than levels (otherwise the method defined by CategoricalArrays would be called, throwing an error since it doesn't accept skipmissing yet). I'd rather either 1) do the which check in DataFrames instead, or 2) bite the bullet and call the function levels_missing and document it.

bkamins · 2022-03-20T18:57:50Z

No it would not.

Note what would happen under my approach if you called levels(x, skipmissing=false) in DataFrames.jl and x were CategoricalArray and it would not define levels with skipmissing option. The process would be the following:

Default levels would be called from DataAPI.jl (as levels from CategoricalArrays.jl does not define kwargs - this is one of these cases when kwargs influence dispatch)
then this default method would call _levels_missing
inside _levels_missing we would see that levels is defined for CategoricalArray
so _levels_missing would execute [levels(x); missing] if indeed there would be missing in x

This would not be super fast, but this would be a default fallback and I think it would be good enough.

However, your option 2 is I think also OK and we could just add a second function not to complicate things if you prefer so.

Having said that the question is if CategoricalArray allows Missing should missing be included in levels (as it is a potential level) or not? I see in your code that you decided that not, but this is not consistent with a general notion of levels as a function listing potentially possible values.

nalimilan · 2022-03-20T21:34:48Z

OK, great, I hadn't realized that dispatch would choose the method which supports keyword arguments. I've pushed a commit to do that.

Unfortunately, @inferred seems to be confused by the recursive call to levels, as it thinks the result is type-unstable even though @code_warntype doesn't. I couldn't find a solution for now. Also the result is quite complex so I wonder whether it's worth it (DataAPI is supposed to be a lightweight package...). At any rate I'd have to add a test to ensure the backward compatibility works.

Having said that the question is if CategoricalArray allows Missing should missing be included in levels (as it is a potential level) or not? I see in your code that you decided that not, but this is not consistent with a general notion of levels as a function listing potentially possible values.

Yes that's the debate I mentioned above. For now I didn't decide anything, given that we only define the fallback method here, which is documented to be equivalent to unique but with possibly a different order. CategoricalArrays can do something slightly different if appropriate. Though it's difficult to decide, as missing isn't really a level. I'm worried that including missing even when it's not present in the data would be annoying and unexpected. For example, if you call completecases to skip missing values, the eltype won't change, yet adding missing would probably not be what you want.

bkamins · 2022-03-20T21:46:11Z

Yes that's the debate I mentioned above

I know. I just did not want to forget about it. Let us agree that missing is included only if it is present and explicitly document this.

Also the result is quite complex so I wonder whether it's worth it

I agree this is tricky (I just wanted to show it so that we consider such approach and decide if we want it). If you prefer - as I have written above - option 2, i.e. having two separate methods is I think also OK I think.

nalimilan · 2022-03-26T15:42:47Z

OK, I've found a solution to make the function inferrable using a different type for the default value.

bkamins · 2022-03-26T16:45:11Z

src/DataAPI.jl

+    elseif any(ismissing, x)
+        return [levels(x); missing]
+    else
+        return convert(AbstractArray{eltype(x)}, levels(x))


this line seems not to be covered by tests

bkamins

Looks good. Thank you. Only test coverage needs improvement.

quinnj

LGTM; if you get a chance, would you mind running Arrow.jl test suite to ensure everything works there?

nalimilan · 2022-04-19T07:02:13Z

Arrow tests pass.

The argument is added by DataAPI 1.10 (JuliaData/DataAPI.jl#46). When `skipmissing=true`, the method for `CategoricalArray` can be slightly more efficient than the fallback defined in DataAPI as it avoids calling `unique`.

nalimilan requested a review from bkamins March 20, 2022 10:45

nalimilan mentioned this pull request Mar 20, 2022

Add fillcombinations function JuliaData/DataFrames.jl#3012

Merged

bkamins reviewed Mar 20, 2022

View reviewed changes

Fix tests on Julia 1.0

7700f0f

Add backward compatibility strategy based on which

b4b9db5

nalimilan added 2 commits March 26, 2022 16:21

Fix inference

88ffd6d

Fix Julia 1.0

675536a

bkamins reviewed Mar 26, 2022

View reviewed changes

bkamins approved these changes Mar 26, 2022

View reviewed changes

Improve tests

8f36ce9

nalimilan requested a review from quinnj March 29, 2022 21:03

quinnj approved these changes Apr 9, 2022

View reviewed changes

nalimilan merged commit 3e905a6 into main Apr 19, 2022

nalimilan deleted the nl/skipmissing branch April 19, 2022 07:03

nalimilan mentioned this pull request Apr 19, 2022

Implement skipmissing argument to levels JuliaData/CategoricalArrays.jl#391

Merged

nalimilan mentioned this pull request Jun 26, 2022

Metadata on data frame and column level JuliaData/DataFrames.jl#3055

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `skipmissing` argument to `levels` #46

Add `skipmissing` argument to `levels` #46

nalimilan commented Mar 20, 2022

codecov bot commented Mar 20, 2022 •

edited

Loading

bkamins commented Mar 20, 2022

bkamins left a comment

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022 •

edited

Loading

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 26, 2022

bkamins Mar 26, 2022

bkamins left a comment

quinnj left a comment

nalimilan commented Apr 19, 2022 •

edited

Loading

Add skipmissing argument to levels #46

Add skipmissing argument to levels #46

Conversation

nalimilan commented Mar 20, 2022

codecov bot commented Mar 20, 2022 • edited Loading

Codecov Report

bkamins commented Mar 20, 2022

bkamins left a comment

Choose a reason for hiding this comment

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022 • edited Loading

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 20, 2022

bkamins commented Mar 20, 2022

nalimilan commented Mar 26, 2022

bkamins Mar 26, 2022

Choose a reason for hiding this comment

bkamins left a comment

Choose a reason for hiding this comment

quinnj left a comment

Choose a reason for hiding this comment

nalimilan commented Apr 19, 2022 • edited Loading

Add `skipmissing` argument to `levels` #46

Add `skipmissing` argument to `levels` #46

codecov bot commented Mar 20, 2022 •

edited

Loading

bkamins commented Mar 20, 2022 •

edited

Loading

nalimilan commented Apr 19, 2022 •

edited

Loading