Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add skipmissing argument to levels #46

Merged
merged 6 commits into from
Apr 19, 2022
Merged

Add skipmissing argument to levels #46

merged 6 commits into from
Apr 19, 2022

Conversation

nalimilan
Copy link
Member

Currently if one wants to get levels in their appropriate order, but add missing if present, the only solution is to do something like union(levels(x), unique(x)), which is inefficient.
Support skipmissing=false to allow doing this in a single pass over the data.

Use @inline to ensure that the return type can be inferred when the value of skipmissing is known statically. Also fix a type instability which existed for ranges.

Of course this new feature won't work for custom types which override this method (like CategoricalArray) until packages implement it. Unfortunately there's no way for packages which would like to rely on it (like DataFrames) to require an appropriate version.

There will also be a decision to make in CategoricalArrays as to whether missing should be returned only when present in the data (like the method defined here) or all the time as long as the eltype allows for it (like for other levels, which is more efficient).

Fixes #44.

Currently if one wants to get levels in their appropriate order,
but add `missing` if present, the only solution is to do something like
`union(levels(x), unique(x))`, which is inefficient.
Support `skipmissing=false` to allow doing this in a single pass over the data.

Use `@inline` to ensure that the return type can be inferred when the value
of `skipmissing` is known statically. Also fix a type instability which existed
for ranges.
@codecov
Copy link

codecov bot commented Mar 20, 2022

Codecov Report

Merging #46 (8f36ce9) into main (1b6c589) will increase coverage by 2.52%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   92.59%   95.12%   +2.52%     
==========================================
  Files           1        1              
  Lines          27       41      +14     
==========================================
+ Hits           25       39      +14     
  Misses          2        2              
Impacted Files Coverage Δ
src/DataAPI.jl 95.12% <100.00%> (+2.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b6c589...8f36ce9. Read the comment docs.

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

CI on 1.0 fails.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good. The only problem is the one that you have noted - we cannot rely on this behavior in library codes anyway.
Maybe we can have this for user's code, but then add something like levelsmissing that is internal and intended for library code to use (your current _levels_missing which is internal but not indented to be used).

@nalimilan
Copy link
Member Author

Maybe we can have this for user's code, but then add something like levelsmissing that is internal and intended for library code to use (your current _levels_missing which is internal but not indented to be used).

So the advantage would be that we would be sure that it doesn't throw an error? If a type that uses a custom order of levels doesn't implement it, it would return them in their order of appearance though. Not sure whether it's better to throw an explicit error so that users can complain or give them suboptimal results...

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

Having thought of it again, actually your solution is OK. We only would need in _levels_missing and _levels_skipmissing check if for the passed type the source package defines levels. If it does then levels would be appropriately called.

The point is that we would get to these calls only if package defines levels(x::SomeType) but does not define levels(x::SomeType; skipmissing::Bool=true). The reason is that if it would define new API then we would never reach the internal methods. On the other hand if it does not define the new signature, but defines the old one we know it is levels without missing.

@nalimilan
Copy link
Member Author

Ah, interesting. You mean that _levels_missing would do something like this?

if which(DataAPI.levels, Tuple{typeof(x)}) === which(DataAPI.levels, Tuple{Vector})
    ... # Current code
elseif any(ismissing, col)
    [_levels_skipmissing(col); missing]
else
    _levels_skipmissing(col)
end

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

In _levels_missing I mean the following:

if which(DataAPI.levels, (typeof(x),_) === which(DataAPI.levels, (Any,))
    ... # Current code
elseif any(ismissing, col)
    [levels(col); missing]
else
    levels(col)
end

The point is that if which(DataAPI.levels, (typeof(x),_) === which(DataAPI.levels, (Any,)) is false then "old style" levels is defined for typeof(x).

@nalimilan
Copy link
Member Author

Ah right. Though doing this would mean that DataFrames would have to call _levels_missing rather than levels (otherwise the method defined by CategoricalArrays would be called, throwing an error since it doesn't accept skipmissing yet). I'd rather either 1) do the which check in DataFrames instead, or 2) bite the bullet and call the function levels_missing and document it.

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

No it would not.

Note what would happen under my approach if you called levels(x, skipmissing=false) in DataFrames.jl and x were CategoricalArray and it would not define levels with skipmissing option. The process would be the following:

  1. Default levels would be called from DataAPI.jl (as levels from CategoricalArrays.jl does not define kwargs - this is one of these cases when kwargs influence dispatch)
  2. then this default method would call _levels_missing
  3. inside _levels_missing we would see that levels is defined for CategoricalArray
  4. so _levels_missing would execute [levels(x); missing] if indeed there would be missing in x

This would not be super fast, but this would be a default fallback and I think it would be good enough.

However, your option 2 is I think also OK and we could just add a second function not to complicate things if you prefer so.

Having said that the question is if CategoricalArray allows Missing should missing be included in levels (as it is a potential level) or not? I see in your code that you decided that not, but this is not consistent with a general notion of levels as a function listing potentially possible values.

@nalimilan
Copy link
Member Author

OK, great, I hadn't realized that dispatch would choose the method which supports keyword arguments. I've pushed a commit to do that.

Unfortunately, @inferred seems to be confused by the recursive call to levels, as it thinks the result is type-unstable even though @code_warntype doesn't. I couldn't find a solution for now. Also the result is quite complex so I wonder whether it's worth it (DataAPI is supposed to be a lightweight package...). At any rate I'd have to add a test to ensure the backward compatibility works.

Having said that the question is if CategoricalArray allows Missing should missing be included in levels (as it is a potential level) or not? I see in your code that you decided that not, but this is not consistent with a general notion of levels as a function listing potentially possible values.

Yes that's the debate I mentioned above. For now I didn't decide anything, given that we only define the fallback method here, which is documented to be equivalent to unique but with possibly a different order. CategoricalArrays can do something slightly different if appropriate. Though it's difficult to decide, as missing isn't really a level. I'm worried that including missing even when it's not present in the data would be annoying and unexpected. For example, if you call completecases to skip missing values, the eltype won't change, yet adding missing would probably not be what you want.

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

Yes that's the debate I mentioned above

I know. I just did not want to forget about it. Let us agree that missing is included only if it is present and explicitly document this.

Also the result is quite complex so I wonder whether it's worth it

I agree this is tricky (I just wanted to show it so that we consider such approach and decide if we want it). If you prefer - as I have written above - option 2, i.e. having two separate methods is I think also OK I think.

@nalimilan
Copy link
Member Author

OK, I've found a solution to make the function inferrable using a different type for the default value.

elseif any(ismissing, x)
return [levels(x); missing]
else
return convert(AbstractArray{eltype(x)}, levels(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line seems not to be covered by tests

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you. Only test coverage needs improvement.

@nalimilan nalimilan requested a review from quinnj March 29, 2022 21:03
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; if you get a chance, would you mind running Arrow.jl test suite to ensure everything works there?

@nalimilan
Copy link
Member Author

nalimilan commented Apr 19, 2022

Arrow tests pass.

@nalimilan nalimilan merged commit 3e905a6 into main Apr 19, 2022
@nalimilan nalimilan deleted the nl/skipmissing branch April 19, 2022 07:03
nalimilan added a commit to JuliaData/CategoricalArrays.jl that referenced this pull request Apr 19, 2022
The argument is added by DataAPI 1.10 (JuliaData/DataAPI.jl#46).
When `skipmissing=true`, the method for `CategoricalArray` can be
slightly more efficient than the fallback defined in DataAPI
as it avoids calling `unique`.
nalimilan added a commit to JuliaData/CategoricalArrays.jl that referenced this pull request Sep 25, 2022
The argument is added by DataAPI 1.10 (JuliaData/DataAPI.jl#46).
When `skipmissing=true`, the method for `CategoricalArray` can be
slightly more efficient than the fallback defined in DataAPI
as it avoids calling `unique`.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add kwarg to levels to keep missing
3 participants