Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

cut: option to have missing as level #377

Closed
greimel opened this issue Dec 21, 2021 · 15 comments · Fixed by #378
Closed

cut: option to have missing as level #377

greimel opened this issue Dec 21, 2021 · 15 comments · Fixed by #378

Comments

@greimel
Copy link
Contributor

greimel commented Dec 21, 2021

Sometimes I would like to have "missing" as a level after cutting a variable. What would you think about a PR that adds a keyword missing_as_level (defaulting to false) that adds this feature?

@nalimilan
Copy link
Member

What do you mean exactly? cut already propagates missing values by default.

@bkamins
Copy link
Member

bkamins commented Dec 21, 2021

I understand that OP wants missing to be a level. Now it is treated as missing not as a level.

@greimel
Copy link
Contributor Author

greimel commented Dec 21, 2021

julia> categorical(["a", "b", missing]) |> levels
2-element Vector{String}:
 "a"
 "b"

instead, I'd like to have ["a", "b", "missing"]

@nalimilan
Copy link
Member

Ah, OK. Then I don't think it should be a specific option of cut. Instead it should be a general function similar to fct_explicit_na in forcats. More generally we should review the list of convenience functions provided by forcats and see which ones we want to implement and under what name.

@nalimilan
Copy link
Member

That said you can already do recode(cut(...), missing => "missing_level") so a new function wouldn't make this much shorter.

@greimel
Copy link
Contributor Author

greimel commented Dec 21, 2021

Ah, nice! I wasn't aware of that.

@greimel
Copy link
Contributor Author

greimel commented Dec 21, 2021

But since cut is supposed to be convenience function (cut(vec, 5) is not so much shorter than cut(vec, quintiles(vec, 0:1/5:1))), wouldn't it still add some value to have that keyword missing_as_level?

@nalimilan
Copy link
Member

As I said I could be convinced to add a function so that you can do e.g. missing_as_level(cut(...)) instead recode(cut(...), missing => "missing_level")). But I don't really want to add an argument to cut, as then the same could be done with other functions (e.g. recode, levels...). This is not an operation which is specific to cut IMO.

@greimel
Copy link
Contributor Author

greimel commented Dec 21, 2021

I disagree.

  1. This only applies to constructors of categorical arrays. So, categorical, categorical! and cut. It is a statement if missing should stay missing or become "missing".
  2. cut already has special handling of missing (extend=missing, do other functions have that?). So one could add extend = :missing_as_level which wouldn't make the API much more complicated.

@bkamins
Copy link
Member

bkamins commented Dec 21, 2021

recode(cut(...), missing => "missing_level")) is explicit.

If we wanted to follow that OP wants we cannot make missing be turned into "missing" (or any other default value) as "missing" as a string might already be a valid value in a categorical array.

What we would need to do is to make it an option for missing (not any conversion of it) to actually be a level of CategoricalArray. In order to handle this we would need to change the type internals as currently it is not possible.

@nalimilan
Copy link
Member

  • This only applies to constructors of categorical arrays. So, categorical, categorical! and cut. It is a statement if missing should stay missing or become "missing".

Why should this be specific to CategoricalArrays? Replacing missing with another value makes sense for any type of vector. Better have an operation that works for all arrays (just like levels works for non-categorical arrays).

  • cut already has special handling of missing (extend=missing, do other functions have that?). So one could add extend = :missing_as_level which wouldn't make the API much more complicated.

extend=missing is needed as otherwise one couldn't handle values outside of the provided breaks. The only way to do that would be to extract the names of the first and last levels and set them to missing, which admittedly is significantly more complex than what we discuss here.

@greimel
Copy link
Contributor Author

greimel commented Dec 21, 2021

Alright, I am closing this.

Thank you for the feedback and the discussion!

@greimel greimel closed this as completed Dec 21, 2021
@bkamins
Copy link
Member

bkamins commented Dec 21, 2021

@nalimilan, I reopen this as I would like to ask you to give a clear statement as a lead maintainer what is the recommended way for users to turn missing into a level in CategoricalArray before closing? (maybe it would be even good to add to the documentation).

Can you confirm that it is recode? It can be problematic in the case of e.g. CategoricalArray of Int.

Thank you!

@bkamins bkamins reopened this Dec 21, 2021
@nalimilan
Copy link
Member

Both recode or replace give the same result. But indeed the new level has to be convertible to the eltype of the array, so if it's e.g. Int you cannot use "missing" as a level.

As I said I could make sense to add a convenience function for this. Anyway yes we can mention this in the manual.

@nalimilan
Copy link
Member

See #378.

@bkamins bkamins linked a pull request Dec 22, 2021 that will close this issue
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants