Add `skipmissing` argument to `levels` #46
Currently if one wants to get levels in their appropriate order, but add `missing` if present, the only solution is to do something like `union(levels(x), unique(x))`, which is inefficient. Support `skipmissing=false` to allow doing this in a single pass over the data. Use `@inline` to ensure that the return type can be inferred when the value of `skipmissing` is known statically. Also fix a type instability which existed for ranges.
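To make the difference between the two-pass workaround and the single-pass keyword concrete, here is a minimal sketch in plain Julia. `sketch_levels` is a hypothetical stand-in written for illustration only; it is not the DataAPI implementation, and it assumes the values are sortable.

```julia
# Hypothetical stand-in for `levels`; NOT the DataAPI implementation.
# Assumes the non-missing values can be sorted.
function sketch_levels(x; skipmissing::Bool=true)
    seen = unique(x)                          # single pass over the data
    vals = sort(collect(Base.skipmissing(seen)))  # drop `missing`, narrow the eltype
    if skipmissing
        return vals
    elseif any(ismissing, seen)
        return [vals; missing]                # append `missing` after the ordered levels
    else
        # keep the eltype of `x` so the result type does not depend on the data
        return convert(Vector{eltype(x)}, vals)
    end
end

x = [3, missing, 1, 3, 2]

# Workaround described above: compute the levels, then scan the data again.
two_pass = union(sketch_levels(x), unique(x))

# Keyword form: ordered levels plus `missing` from one call.
one_pass = sketch_levels(x, skipmissing=false)

println(isequal(two_pass, one_pass))  # both are [1, 2, 3, missing]
```

The qualified call `Base.skipmissing` is needed inside the function because the keyword argument shadows the Base function of the same name.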
Codecov Report
@@ Coverage Diff @@
## main #46 +/- ##
==========================================
+ Coverage 92.59% 95.12% +2.52%
==========================================
Files 1 1
Lines 27 41 +14
==========================================
+ Hits 25 39 +14
Misses 2 2
CI on 1.0 fails.
Implementation looks good. The only problem is the one that you have noted: we cannot rely on this behavior in library code anyway.
Maybe we can have this for users' code, but then add something like `levelsmissing` that is internal and intended for library code to use (your current `_levels_missing`, which is internal but not intended to be used).
So the advantage would be that we would be sure that it doesn't throw an error? If a type that uses a custom order of levels doesn't implement it, it would return them in their order of appearance though. Not sure whether it's better to throw an explicit error so that users can complain or give them suboptimal results...
Having thought of it again, actually your solution is OK. We only would need in The point is that we would get to these calls only if package defines
Ah, interesting. You mean that
In
The point is that if
Ah right. Though doing this would mean that DataFrames would have to call
No it would not. Note what would happen under my approach if you called
This would not be super fast, but it would be a default fallback and I think it would be good enough. However, I think your option 2 is also OK, and we could just add a second function so as not to complicate things, if you prefer. Having said that the question is if
OK, great, I hadn't realized that dispatch would choose the method which supports keyword arguments. I've pushed a commit to do that. Unfortunately,
Yes, that's the debate I mentioned above. For now I didn't decide anything, given that we only define the fallback method here, which is documented to be equivalent to
I know. I just did not want to forget about it. Let us agree that
I agree this is tricky (I just wanted to show it so that we consider such an approach and decide if we want it). If you prefer option 2, as I have written above, i.e. having two separate methods, I think that is also OK.
OK, I've found a solution to make the function inferrable using a different type for the default value.
    elseif any(ismissing, x)
        return [levels(x); missing]
    else
        return convert(AbstractArray{eltype(x)}, levels(x))
this line seems not to be covered by tests
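A test along the following lines could exercise the uncovered `else` branch: the input's eltype allows `missing`, but no `missing` is present. This is only a sketch. `with_missing` is a hypothetical helper that mirrors the three branches of the diff, its first branch (`if skipmissing`) is an assumption because the diff excerpt starts at `elseif`, and `lv` stands in for the result of `levels(x)`.

```julia
using Test

# Hypothetical helper mirroring the branches in the diff excerpt.
# The leading `if skipmissing` branch is assumed; `lv` plays the role of `levels(x)`.
function with_missing(x, lv; skipmissing::Bool=true)
    if skipmissing
        return lv
    elseif any(ismissing, x)
        return [lv; missing]
    else
        return convert(AbstractArray{eltype(x)}, lv)
    end
end

# eltype allows `missing`, but none is present: this reaches the `else` branch
x = Union{Int, Missing}[2, 1, 2]
res = with_missing(x, [1, 2], skipmissing=false)

@test res == [1, 2]
@test eltype(res) == Union{Int, Missing}
```

Checking the eltype as well as the values matters here, since the point of the `convert` call is to keep the element type tied to `eltype(x)` rather than to whether a `missing` happens to occur.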
Looks good. Thank you. Only test coverage needs improvement.
LGTM; if you get a chance, would you mind running the Arrow.jl test suite to ensure everything works there?
Arrow tests pass.
The argument is added by DataAPI 1.10 (JuliaData/DataAPI.jl#46). When `skipmissing=true`, the method for `CategoricalArray` can be slightly more efficient than the fallback defined in DataAPI as it avoids calling `unique`.
Of course this new feature won't work for custom types which override this method (like `CategoricalArray`) until packages implement it. Unfortunately there's no way for packages which would like to rely on it (like DataFrames) to require an appropriate version.

There will also be a decision to make in CategoricalArrays as to whether `missing` should be returned only when present in the data (like the method defined here) or all the time as long as the eltype allows for it (like for other levels, which is more efficient).

Fixes #44.
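The two policies mentioned for CategoricalArrays can be contrasted with a small sketch. `StoredLevels` and both functions are hypothetical names invented for illustration; nothing here is taken from CategoricalArrays itself. The only assumption is a container that stores its levels separately from the data.

```julia
# Hypothetical container that stores its levels in a custom order.
# None of these names come from CategoricalArrays.
struct StoredLevels{T}
    data::Vector{Union{T, Missing}}
    lv::Vector{T}
end

# Policy A: report `missing` only when it occurs in the data (requires a scan).
function levels_if_present(a::StoredLevels)
    if any(ismissing, a.data)
        return [a.lv; missing]
    else
        return convert(Vector{eltype(a.data)}, a.lv)
    end
end

# Policy B: report `missing` whenever the eltype allows it (no scan needed).
levels_if_allowed(a::StoredLevels) = [a.lv; missing]

a = StoredLevels{String}(["low", "high", "low"], ["low", "high"])

println(levels_if_present(a))  # no `missing` in the data, so none is reported
println(levels_if_allowed(a))  # `missing` is reported regardless of the data
```

Policy A matches the fallback described in this PR but has to look at the data; policy B never touches the data, which is the efficiency argument made above.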