Skip to content

DEPR: groupby.grouper #56521

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 5 commits into from
Dec 18, 2023
Merged

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Dec 16, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Ref: #55738 (comment)

This reverts the deprecation of grouper.reconstructed_codes and grouper.group_keys_seq from #56149 in favor of deprecating all of grouper.

@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas labels Dec 16, 2023
@rhshadrach rhshadrach added this to the 2.2 milestone Dec 16, 2023
@rhshadrach rhshadrach marked this pull request as draft December 16, 2023 22:54
@rhshadrach rhshadrach marked this pull request as ready for review December 17, 2023 16:25
@@ -472,7 +472,7 @@ Other Deprecations
- Deprecated strings ``H``, ``S``, ``U``, and ``N`` denoting units in :func:`to_timedelta` (:issue:`52536`)
- Deprecated strings ``H``, ``T``, ``S``, ``L``, ``U``, and ``N`` denoting units in :class:`Timedelta` (:issue:`52536`)
- Deprecated strings ``T``, ``S``, ``L``, ``U``, and ``N`` denoting frequencies in :class:`Minute`, :class:`Second`, :class:`Milli`, :class:`Micro`, :class:`Nano` (:issue:`52536`)
- Deprecated the :class:`.BaseGrouper` attributes ``group_keys_seq`` and ``reconstructed_codes``; these will be removed in a future version of pandas (:issue:`56148`)
- Deprecated the :class:`.DataFrameGroupBy.grouper` and :class:`SeriesGroupBy.grouper`; these will be removed in a future version of pandas (:issue:`56521`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either :attr: or the word "attribute" somewhere in here?

gen = gb.grouper.get_iterator(obj, axis=gb.axis)
msg = "DataFrameGroupBy.grouper is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
gen = gb.grouper.get_iterator(obj, axis=gb.axis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just change to _grouper?

assert s.groupby(s).grouper.names == ["name"]
msg = "SeriesGroupBy.grouper is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
assert s.groupby(s).grouper.names == ["name"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much all of these can just change grouper->_grouper i think, as long as there are a couple that test the deprecation

)
with tm.assert_produces_warning(FutureWarning, match=msg):
tm.assert_numpy_array_equal(
gr.grouper.group_info[1], np.array([], dtype=np.dtype(np.intp))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if for some reason this doesnt change to _grouper, i have a medium-strong preference for doing as little inside the context as possible, so e.g. gpr = gr.grouper would go inside the context and the assert_numpy... would go in a separate line

@rhshadrach
Copy link
Member Author

Thanks @jbrockmendel - changes made.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge when ready @jbrockmendel

@jbrockmendel jbrockmendel merged commit ac170fd into pandas-dev:main Dec 18, 2023
@jbrockmendel
Copy link
Member

thanks @rhshadrach

@rhshadrach rhshadrach deleted the depr_gb_grouper branch December 18, 2023 23:20
cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
* DEPR: groupby.grouper

* DEPR: groupby.grouper

* fix whatsnew, tests

* Restore test
@jorisvandenbossche
Copy link
Member

@rhshadrach one usage of this that I noticed in downstream (seaborn, mwaskom/seaborn#3615, https://github.com/mwaskom/seaborn/blob/a3cb0f1b33551eb25fed1907d0abdd0237fac215/seaborn/categorical.py#L640) is to access the result_index.

Do we have a good alternative for that?
One could do something like pd.Index(grouped.indices.keys()), but that's less robust (converts from python scalars)

I haven't followed the PR where you are refactoring result_index, so don't know if that will change anything (or only how it is calculated), but would we want to expose something similar on the groupby object itself as alternative?

@rhshadrach
Copy link
Member Author

Using result_index is not currently reliable:

df = pd.DataFrame(
    {
        "a": pd.Categorical([1], categories=[1, 2]),
        "b": pd.Categorical([1], categories=[1, 2]),
        "c": 5,
    }
)
gb = df.groupby(["a", "b"])
print(gb.grouper.result_index)
# MultiIndex([(1, 1)], names=['a', 'b'])
print(gb.size())
# a  b
# 1  1    1
#    2    0
# 2  1    0
#    2    0
# dtype: int64

though #55738 will fix this in all cases I know of. An alternative is to perform a reduction and take the index from the result. I recommend size for this as it doesn't require as much computation as other methods.

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.randint(0, 100, size), 'c': np.random.random(size)})
%timeit df.groupby(['a', 'b']).size().index
# 3.68 ms ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby(['a', 'b'])._grouper.result_index
# 3.36 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We do not need to deprecate grouper for #55738, only certain grouper attributes (see #56149)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants