DEPR: DataFrameGroupBy.apply operating on the group keys #52477

rhshadrach · 2023-04-06T03:06:38Z

Closes API: way to exclude the grouped column with apply #7155
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

topper-123 · 2023-04-08T11:40:27Z

I will look in depth in a few hours; I have some obligations first. Just initially:

Is it correct that this will help simplify the code paths for other groupby methods or is this a change that only will affect groupby apply itself? If yes, could you explain how the other groupby methods use/are used by apply?

rhshadrach · 2023-04-08T12:18:50Z

> Is it correct that this will help simplify the code paths for other groupby methods or is this a change that only will affect groupby apply itself?

This change is only for apply itself. Ignoring filtrations (which by definition include the grouping columns when applicable), almost all of groupby does not operate on the grouping columns. apply is the major exception, and this deprecation is to correct that.
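A minimal sketch of the distinction described above, assuming pandas is installed. The frame and the column names `key` and `val` are illustrative and not taken from the PR. Explicit column selection is used so that the outcome should not depend on which pandas version's default behavior for apply is in effect.

```python
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1, 2, 3, 4]})
g = df.groupby("key")

# A built-in reduction does not operate on the grouping column:
# "key" becomes the index and only "val" is summed.
summed = g.sum()

# With an explicit selection, the UDF sees exactly the selected columns,
# so the grouping column is operated on only when it is selected.
without_key = g[["val"]].apply(lambda x: x.sum())
with_key = g[["key", "val"]].apply(lambda x: x.sum())

print(summed.columns.tolist())       # ['val']
print(without_key.columns.tolist())  # ['val']
print(with_key.columns.tolist())     # ['key', 'val']
```

In `with_key` the `key` column holds the per-group sums of the grouping values themselves (2 and 4), which is the kind of output this deprecation is meant to stop producing by default.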

> If yes, could you explain how the other groupby methods use/are used by apply?

Almost no groupby methods use apply internally. Those that do:

first, last
value_counts
is_monotonic_increasing, is_monotonic_decreasing
dtype
diff (for axis=1 only)

Of the above, all except for diff are only used in the Series case where this deprecation does not apply. For diff, axis=1 has been deprecated.

topper-123 · 2023-04-08T15:50:35Z

So this can't affect SeriesGroupBy at all and no DataFrameGroupBy method uses apply. That means basically no worries about effects on other methods, which is great.

I've looked at the code in GroupBy.apply and saw the check in the if clause: self._selected_obj.shape != self._obj_with_exclusions.shape, which surprised me. So I tried:

>>> df = pd.DataFrame({"A": ["a", "a", None, None, "b", "b", "c"], "B": [1, 1, 2, 2, 3, 3, 1]})
>>> g = df.groupby('A')[df.columns]
>>> g.sum()  # sums over grouping column
    A  B
A
a  aa  2
b  bb  6
c   c  1

I never knew this was possible (it also works with apply), but I think this is a good thing: by default let's exclude grouping columns from apply/agg etc., but this lets users get them back in the rare case they need that (there's always an edge case...).

Do you know if this is documented anywhere? IMO we should officially support that pattern with docs and tests. Probably also show this as an example in the deprecation text for this change?

topper-123

A few comments/questions.

Can you especially look into if we can avoid giving the deprecation warning in groupby.resample?

topper-123 · 2023-04-08T15:54:10Z

pandas/core/groupby/groupby.py

+                        f"columns. This behavior is deprecated, and in a future "
+                        f"version of pandas the grouping columns will be excluded "
+                        f"from the operation. Subset the data to exclude the "
+                        f"groupings and silence this warning."


"Subset the data to exclude the groupings and silence this warning." -> "Subset the data to silence this warning."?

What's the benefit of removing the phrase "to exclude the groupings"?

My idea was that by keeping the grouping columns in the subsetting, users are guaranteed to get the same result as before, but without the warning:

>>> df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x.sum())
   A  B
A
1  2  3
2  4  7
>>> g[df.columns].apply(lambda x: x.sum())
   A  B
A
1  2  3
2  4  7

Or they may actually not want to include them, in which case they remove them from the subsetting:

>>> g[df.columns.drop("A")].apply(lambda x: x.sum())
   B
A
1  3
2  7

The idea is just that they may or may not want to remove the groupings in the subset, and "exclude the groupings" may not be what they want in all cases.

I see - this makes sense, but I find it confusing to call this "subsetting". I'll see what I can do for the warning message here.

Updated. What do you think?

topper-123 · 2023-04-08T15:55:16Z

doc/source/whatsnew/v2.1.0.rst

@@ -149,6 +149,7 @@ Other API changes

 Deprecations
 ~~~~~~~~~~~~
+- Deprecated :meth:`.DataFrameGroupBy.apply` operating on the grouping column(s) (:issue:`7155`)


A small explanation about what to do to avoid the warning?

Thanks - done.

topper-123 · 2023-04-08T15:56:06Z

pandas/core/resample.py

+            "DataFrame.resample operated on the grouping columns. "
+            "This behavior is deprecated, and in a future version of "
+            "pandas the grouping columns will be excluded from the operation. "
+            "Subset the data to exclude the groupings and silence this warning."


"Subset the data to exclude the groupings and silence this warning." -> "Subset the data to silence this warning."

rhshadrach · 2023-04-09T14:12:26Z

> Do you know if this is documented anywhere? IMO we should officially support that pattern with docs and tests.

Added a line in the user guide; I know we test this because that's how I discovered it but I'll have to dig up where that is.

> Probably also show this as an example in the deprecation text for this change?

In the whatsnew, we could add a notable deprecations section to do this, but I'm not sure it's worth it. I'm rather against having an example in the FutureWarning that is raised.

topper-123

Nice, LGTM.

mroeschke

Looks good. Some of the doctests look like they need updating, though.

mroeschke · 2023-04-12T15:57:27Z

Thanks @rhshadrach

phofl · 2023-04-24T08:17:27Z

This is super noisy when you are fine with grouping columns not excluded in 3.0. Is there a way to make this less inconvenient apart from adjusting every groupby apply call?

rhshadrach · 2023-04-25T02:04:15Z

The warning should be triggered only when both of the following apply:

The UDF with the grouping columns does not raise
Raising would mean the fallback passes different groups (namely, excluding the grouping columns)

The only thing I could see doing in addition is seeing (perhaps on just the first group) if the result differs whether the grouping columns are included or not. What do you think about this option?
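The first-group comparison floated above could look roughly like the following sketch. `depends_on_groupings` is a hypothetical helper written for illustration, not part of pandas or of this PR, and it assumes the UDF returns either a pandas object or a plain scalar.

```python
import pandas as pd


def depends_on_groupings(df, key, udf):
    """Hypothetical helper: report whether udf's result on the first group
    changes when the grouping column is dropped.

    Assumes udf returns a pandas object or a plain scalar.
    """
    # Iterating a groupby yields (name, group); the group includes the key column.
    _, first = next(iter(df.groupby(key)))
    with_key = udf(first)
    without_key = udf(first.drop(columns=key))
    if isinstance(with_key, (pd.Series, pd.DataFrame)):
        return not with_key.equals(without_key)
    return bool(with_key != without_key)


df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1, 2, 3, 4]})

# Summing every column touches "key", so the result changes.
sums_all = depends_on_groupings(df, "key", lambda x: x.sum())
# Summing only "val" never looks at "key", so the result is unchanged.
sums_val = depends_on_groupings(df, "key", lambda x: x["val"].sum())

print(sums_all, sums_val)  # True False
```

A check like this runs the UDF twice on the first group, so it would add cost and could be misleading for UDFs with side effects; that trade-off is not discussed in the thread and is noted here only as a caveat of the sketch.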

Another option would be changing without deprecation. I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition. Perhaps this could go behind that option as well.

phofl · 2023-04-25T08:07:18Z

If I understand you here, the default case would raise a warning:

df = pd.DataFrame({"a": [1, 2, 3], "b": 1.5, "c": True, "d": "d"})


def test(x):
    return x


df.groupby("d").apply(test)

which it does right now. This seems pretty noisy, since there is no way to get rid of the warning as a user apart from changing your code everywhere?

I think an option is a good idea, if you want to add something like this anyway for other reasons.

rhshadrach · 2023-04-25T21:23:07Z

> This seems pretty noisy

I would typically describe "noisy" as warning when behavior won't change, so just want to be clear that this will.

# Current
     a    b     c  d
d                   
d 0  1  1.5  True  d
  1  2  1.5  True  d
  2  3  1.5  True  d

# Future
     a    b     c
d                   
d 0  1  1.5  True
  1  2  1.5  True
  2  3  1.5  True

To suppress the warning and adopt future behavior, one would do df.groupby("d")[["a", "b", "c"]].apply(test); to keep current behavior it's df.groupby("d")[["a", "b", "c", "d"]].apply(test). Neither of these warn.

But agreed it's noisy in the sense of "someone using apply a lot may see a lot of warnings" - and experience a lot of behavior changes!

I'll put up a PR to revert.

phofl · 2023-04-25T21:30:11Z

Yes you are correct, but I guess that most users don't care about grouping columns in apply? Values are constant anyway, so why would you need the information? That's what I mean with noisy. I am assuming that a vast majority of users just doesn't care about the grouping columns

rhshadrach · 2023-04-25T21:34:19Z

> Values are constant anyway, so why would you need the information? That's what I mean with noisy. I am assuming that a vast majority of users just doesn't care about the grouping columns

It sounds like you're saying you'd be okay changing this behavior without deprecation?

The users may not want them, but they have them now. So if their workflow drops the grouping columns from the result, that workflow will break if we were to change this behavior without deprecation.

phofl · 2023-04-25T21:36:37Z

> I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition

I'd prefer this solution if you end up creating an option anyway.

But I think I'd also be ok with doing this without a deprecation. Personally, I'd be annoyed that I'd have to change all my apply calls just to silence the warning even though I don't care about the grouping columns

rhshadrach · 2023-06-15T20:56:50Z

> > I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition
>
> I'd prefer this solution if you end up creating an option anyway.

We've moved away from introducing this option. I'm now looking for ways to introduce this deprecation again.

> But I think I'd also be ok with doing this without a deprecation. Personally, I'd be annoyed that I'd have to change all my apply calls just to silence the warning even though I don't care about the grouping columns

I imagine this is a popular method and we're changing the output in what I fear could be a common case. Doing this without deprecation seems too likely to break a lot of code. I don't fully understand your comment about "changing apply calls just to silence the warning". Wouldn't your code have a good chance of breaking if you didn't change it ahead of time when we change the behavior?

phofl · 2023-06-16T09:26:00Z

> I imagine this is a popular method and we're changing the output in what I fear could be a common case. Doing this without deprecation seems too likely to break a lot of code. I don't fully understand your comment about "changing apply calls just to silence the warning". Wouldn't your code have a good chance of breaking if you didn't change it ahead of time when we change the behavior?

Not really if you don't care about the group keys being included in your group or not. This was often the case for me when I used apply.

But I agree, not deprecating this in some way is probably a bad idea.

rhshadrach · 2023-06-19T20:54:23Z

I believe the checks on the warning that were included here completely predict when the output would change. In other words, we have no way of making this less noisy by merely emitting warnings in fewer cases.

What do you think about an option to adopt the future behavior? Something like mode.groupings_in_apply. Defaults to True (current behavior). When set to False, it would adopt the future behavior. Then we deprecate the option in 3.x. Is that palatable?

If that's a good way forward, I think we still make the changes to the tests as they are in this PR.

topper-123 · 2023-06-19T22:26:42Z

Could an idea be to silence the deprecation warning until after 2.1 or maybe even 2.2?

IMO we need a deprecation here, but it may not be necessary to have it effective in 2.1, especially if silencing/unsilencing is low effort?

rhshadrach · 2023-06-19T22:40:48Z

I'm fine with waiting to deprecate; but I think to @phofl's point, silencing is not low effort (especially when you have many columns you want to include)

And just to make sure, with the option I'm proposing, anyone who has it set to True (the default) will see the deprecation warning.
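For frames with many columns, the explicit selections described earlier in this thread can be built from `df.columns` instead of being typed out. The sketch below reuses the example frame from this conversation and assumes pandas is installed; `identity`, `keep_cols`, and `drop_cols` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": 1.5, "c": True, "d": "d"})


def identity(x):
    # UDF that returns each group unchanged.
    return x


key = "d"
# Select every column, including the grouping column (current behavior).
keep_cols = list(df.columns)
# Select every column except the grouping column (future behavior).
drop_cols = [c for c in df.columns if c != key]

current = df.groupby(key)[keep_cols].apply(identity)
future = df.groupby(key)[drop_cols].apply(identity)

print(current.columns.tolist())  # ['a', 'b', 'c', 'd']
print(future.columns.tolist())   # ['a', 'b', 'c']
```

This still requires touching each apply call, which is the inconvenience raised above; it only removes the need to list the columns by hand.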

topper-123 · 2023-06-19T23:18:59Z

Sorry, I meant that in pandas/core/groupby/groupby.py we could surround the FutureWarning with warnings.catch_warnings() and warnings.filterwarnings code, then remove that code when we want to re-activate the warnings, e.g. in v2.2.
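The suppression mechanism mentioned above can be sketched with the standard library alone. `emit_deprecated` and `call_silenced` are hypothetical names for illustration, not pandas functions, and the warning text is made up. `warnings.simplefilter` is used here; `warnings.filterwarnings` would serve equally.

```python
import warnings


def emit_deprecated(values):
    """Hypothetical stand-in for library code that emits a FutureWarning."""
    warnings.warn(
        "operating on the grouping columns is deprecated",
        FutureWarning,
        stacklevel=2,
    )
    return sum(values)


def call_silenced(values):
    """Call emit_deprecated with its FutureWarning temporarily ignored."""
    with warnings.catch_warnings():
        # The filter is active only inside this block and is undone on exit.
        warnings.simplefilter("ignore", FutureWarning)
        return emit_deprecated(values)


# Record warnings to compare the two call paths.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud = emit_deprecated([1, 2, 3])
    n_loud = len(caught)
    quiet = call_silenced([1, 2, 3])
    n_quiet = len(caught) - n_loud

print(loud, n_loud)    # 6 1
print(quiet, n_quiet)  # 6 0
```

Removing the `catch_warnings` block later re-activates the warning without any other change, which is the low-effort toggle the comment has in mind.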

rhshadrach · 2023-06-21T02:19:12Z

@topper-123: I'd personally prefer just to wait until 2.1 is released. When you wrote:

> especially if silencing/unsilencing is low effort?

I thought you meant from a user perspective. From an implementation perspective, almost all the work is with the tests. But I'm okay with having to redo a little work if the tests have changed from the implementation here by the time 2.1 is released.

topper-123 · 2023-06-25T06:17:43Z

Ok, sounds good to me.

DEPR: DataFrameGroupBy.apply operating on the group keys

90c0ecf

rhshadrach added the Groupby, Deprecate (Functionality to remove in pandas), and Apply (Apply, Aggregate, Transform, Map) labels Apr 6, 2023

rhshadrach added 2 commits April 8, 2023 06:54

Merge branch 'main' of https://github.com/pandas-dev/pandas into depr…

8d93efc

…_apply_on_groupings Conflicts: doc/source/whatsnew/v2.1.0.rst

Reorder whatsnew

0337e61

rhshadrach requested review from mroeschke and topper-123 April 8, 2023 11:04

topper-123 reviewed Apr 8, 2023

topper-123 added this to the 2.1 milestone Apr 8, 2023

rhshadrach added 4 commits April 9, 2023 09:33

Remove warnings from pivot, minor refinements

dd5f0f8

Handle warning in docs

c6fa2dc

Improve warning message

fe66a65

Add note to user guide

d3b2a65

topper-123 reviewed Apr 9, 2023

Improve whatsnew

7475648

topper-123 approved these changes Apr 11, 2023

mroeschke requested changes Apr 11, 2023

rhshadrach added 2 commits April 11, 2023 20:43

Adjust docstrings

574993d

Merge branch 'main' of https://github.com/pandas-dev/pandas into depr…

1ddd766

…_apply_on_groupings

mroeschke approved these changes Apr 12, 2023

mroeschke merged commit 9b20759 into pandas-dev:main Apr 12, 2023

rhshadrach deleted the depr_apply_on_groupings branch April 12, 2023 17:33

topper-123 mentioned this pull request Apr 12, 2023

BUG: core.groupby.GroupBy.apply unexpected behavior with TypeError raised in UDF #46324

Closed

3 tasks

rhshadrach mentioned this pull request Apr 22, 2023

DEPR: List of deprecations to be removed in 3.0 #50578

Open

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Apr 25, 2023

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pan…

700c9b8

…das-dev#52477)" This reverts commit 9b20759.

rhshadrach mentioned this pull request Apr 25, 2023

DEPR: Revert DataFrameGroupBy.apply operating on the group keys #52921

Merged

5 tasks

rhshadrach restored the depr_apply_on_groupings branch April 25, 2023 21:30

phofl pushed a commit that referenced this pull request Apr 28, 2023

DEPR: Revert DataFrameGroupBy.apply operating on the group keys (#52921)

9f5b44c

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (#52477)" This reverts commit 9b20759.

rhshadrach mentioned this pull request Jun 15, 2023

API: way to exclude the grouped column with apply #7155

Closed

rhshadrach mentioned this pull request Sep 2, 2023

DEPR: DataFrameGroupBy.apply operating on the group keys #54950

Merged

5 tasks
