ENH: Add option to make final interval closed for right-open intervals in pd.cut #42212

Closed
jotasi opened this issue Jun 24, 2021 · 4 comments
Labels: Closing Candidate (may be closeable, needs more eyeballs), cut (cut, qcut), Enhancement

Comments

@jotasi
Contributor

jotasi commented Jun 24, 2021

Is your feature request related to a problem?

I would like to use pd.cut to sort values into half-open bins whose lower boundary is the closed end (i.e. [0, 5), obtained by setting right=False), while still being able to include the upper bound of the last interval. In other words, the last interval should be closed on both ends, via something like include_highest=True, analogous to include_lowest=True for right=True. I also ran into this with infinite boundaries, where adding a small number to the last edge is not an option. As a workaround, one can fillna the result, since the only remaining NaNs are those at the infinite right boundary.

I.e. while I can make the first interval closed for pd.cut:

In [1]: import pandas as pd

In [2]: pd.cut(pd.Series([0, 1, 2, 3]), bins=[0, 1, 2, 3], include_lowest=True, retbins=False)
Out[2]:
0    (-0.001, 1.0]
1    (-0.001, 1.0]
2       (1.0, 2.0]
3       (2.0, 3.0]
dtype: category
Categories (3, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0]]

I can't do the same for right=False, where include_lowest=True seems to have no effect:

In [3]: pd.cut(pd.Series([0, 1, 2, 3]), bins=[0, 1, 2, 3], right=False, include_lowest=True, retbins=False)
Out[3]:
0    [0.0, 1.0)
1    [1.0, 2.0)
2    [2.0, 3.0)
3           NaN
dtype: category
Categories (3, interval[int64, left]): [[0, 1) < [1, 2) < [2, 3)]

(I would want the last value, 3, to fall into the final bin, i.e. that bin should behave like [2, 3] rather than [2, 3).)
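The fillna workaround mentioned above for infinite boundaries might look roughly like the following. This is only a sketch: the data is made up for illustration, and it assumes that every NaN in the result comes from a value sitting exactly on the last edge (no NaN inputs and nothing below the first edge).

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the report.
values = pd.Series([0.0, 1.5, 2.0, np.inf])
bins = [0, 1, 2, np.inf]

binned = pd.cut(values, bins=bins, right=False)

# np.inf equals the last edge, so it is not inside [2.0, inf) and becomes NaN.
n_missing_before = int(binned.isna().sum())

# Fill the remaining NaNs with the last category. This is only safe under the
# assumption stated above: all NaNs stem from the upper edge.
last_bin = binned.cat.categories[-1]
filled = binned.fillna(last_bin)

print(filled)
```

Note that the label of the last category still reads as right-open; only the assignment of values changes.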

Describe the solution you'd like

One option is to rename include_lowest to final_interval_closed (or similar), so that it acts like include_lowest for right=True and like include_highest for right=False. This would break the API (see below) but would make the function behave symmetrically for right=True and right=False. Alternatively, such a parameter could be added alongside include_lowest, although as far as I can see that would make include_lowest more or less obsolete. A third option, to make the API more symmetric, is to add another parameter include_highest, which does nothing for right=True but makes the last interval closed on both ends for right=False.
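To make the intended include_highest semantics concrete, here is a minimal user-side sketch. The helper name cut_include_highest and its signature are hypothetical and not part of pandas; it wraps pd.cut with right=False and then moves values equal to the last edge into the final category by editing the category codes.

```python
import pandas as pd


def cut_include_highest(values, bins, include_highest=True):
    """Hypothetical helper: right-open bins whose last bin also takes bins[-1]."""
    values = pd.Series(values)
    binned = pd.cut(values, bins=bins, right=False)
    if not include_highest:
        return binned

    # Category codes: -1 marks NaN, otherwise the position of the bin.
    codes = binned.cat.codes.to_numpy().copy()
    on_last_edge = (values == bins[-1]).to_numpy()
    codes[on_last_edge] = len(binned.cat.categories) - 1

    # Rebuild the categorical with the same (ordered) interval categories.
    result = pd.Categorical.from_codes(codes, dtype=binned.dtype)
    return pd.Series(result, index=values.index)


print(cut_include_highest([0, 1, 2, 3], bins=[0, 1, 2, 3]))
```

Values strictly above the last edge still come back as NaN, so unlike a blanket fillna this only touches the one boundary value. The category label itself remains right-open, which is the display limitation discussed in #23164.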

API breaking implications

Changing the parameter include_lowest to final_interval_closed or similar would break the API. The alternative solutions (adding either final_interval_closed or include_highest) would add a parameter to pd.cut; if the former were added, include_lowest could potentially be deprecated down the line.

Describe alternatives you've considered

See the three alternatives under "Describe the solution you'd like".

jotasi added the Enhancement and Needs Triage labels on Jun 24, 2021
@simonjayhawkins
Member

Thanks @jotasi for the report. This looks similar to the discussion in #23164?

simonjayhawkins added the Closing Candidate and cut labels and removed the Needs Triage label on Jun 25, 2021
@jotasi
Contributor Author

jotasi commented Jun 26, 2021

I think it is somewhat related, but that discussion seems to be about the right=True (default) case. I'm arguing for an option similar to include_lowest for right=False, which (AFAICS) doesn't exist. So basically, for left-open intervals you can (with the caveats discussed in #23164) make the outermost open end closed(-ish) by specifying include_lowest=True, and I propose to allow the same for right-open intervals.

Nonetheless, depending on the solution to #23164 (changing the docs vs. extending IntervalIndex to actually support a single closed interval), it might be a good idea to fix both together.

@attack68
Contributor

Linking #40245 for relevance.

@mroeschke
Member

I think we can loop in the right=True|False for include_lowest in the same issue #23164, so closing in favor of continuing the discussion there.
