DEPR: DataFrameGroupBy.apply operating on the group keys #52477
…_apply_on_groupings Conflicts: doc/source/whatsnew/v2.1.0.rst
I will look in depth in a few hours; I have some obligations first. Just initially: is it correct that this will help simplify the code paths for other groupby methods, or is this a change that will only affect groupby
This change is only for apply itself. Ignoring filtrations (which by design include the grouping columns when applicable), almost all of groupby does not operate on the grouping columns. apply is the major exception, and this deprecation is to correct that.
Almost no groupby methods use
Of the above, all except for diff are only used in the Series case where this deprecation does not apply. For diff, axis=1 has been deprecated.
So this can't affect I've looked at the code in
>>> df = pd.DataFrame({"A": ["a", "a", None, None, "b", "b", "c"], "B": [1, 1, 2, 2, 3, 3, 1]})
>>> g = df.groupby('A')[df.columns]
>>> g.sum() # sums over grouping column
A B
A
a aa 2
b bb 6
c c 1
I never knew this was possible (works also with Do you know if this is documented anywhere? IMO we should officially support that pattern with docs and tests. Probably also show this as an example in the deprecation text for this change?
A few comments/questions.
Can you especially look into whether we can avoid giving the deprecation warning in groupby.resample?
pandas/core/groupby/groupby.py
Outdated
f"columns. This behavior is deprecated, and in a future "
f"version of pandas the grouping columns will be excluded "
f"from the operation. Subset the data to exclude the "
f"groupings and silence this warning."
"Subset the data to exclude the groupings and silence this warning." -> "Subset the data to silence this warning."?
What's the benefit of removing the phrase "to exclude the groupings"?
My idea was that by keeping the grouping columns in the subsetting, users are guaranteed to get the same result as before, but without the warning:
>>> df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x.sum())
A B
A
1 2 3
2 4 7
>>> g[df.columns].apply(lambda x: x.sum())
A B
A
1 2 3
2 4 7
Or they may not actually want to include the grouping column, in which case they can remove it from the subsetting:
>>> g[df.columns.drop("A")].apply(lambda x: x.sum())
B
A
1 3
2 7
The idea is just that they may or may not want to remove the groupings in the subset, and "exclude the groupings" may not be what they want in all cases.
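As a minimal, self-contained sketch of the second variant above (the same small frame as in the comment; the names `value_columns` and `out` are illustrative only), selecting just the non-grouping columns on the groupby object keeps the grouping column out of what the applied function receives:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})

# Select only the non-grouping columns before calling apply, so the
# function passed to apply never sees the grouping column "A".
value_columns = df.columns.drop("A")
out = df.groupby("A")[value_columns].apply(lambda x: x.sum())

print(out)
```

The variant that keeps "A" in the selection is deliberately left out of this sketch, since its result is exactly the behavior under discussion in this PR.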
I see - this makes sense, but I find it confusing to call this "subsetting". I'll see what I can do for the warning message here.
Updated. What do you think?
doc/source/whatsnew/v2.1.0.rst
Outdated
@@ -149,6 +149,7 @@ Other API changes

Deprecations
~~~~~~~~~~~~
- Deprecated :meth:`.DataFrameGroupBy.apply` operating on the grouping column(s) (:issue:`7155`)
A small explanation about what to do to avoid the warning?
Thanks - done.
pandas/core/resample.py
Outdated
"DataFrame.resample operated on the grouping columns. "
"This behavior is deprecated, and in a future version of "
"pandas the grouping columns will be excluded from the operation. "
"Subset the data to exclude the groupings and silence this warning."
"Subset the data to exclude the groupings and silence this warning." -> "Subset the data to silence this warning."
Added a line in the user guide; I know we test this because that's how I discovered it but I'll have to dig up where that is.
In the whatsnew, we could add a notable deprecations section to do this, but I'm not sure it's worth it. I'm rather against having an example in the FutureWarning that is raised.
Nice, LGTM.
Looks good. Some of the doctests look like they need updating, though.
Thanks @rhshadrach
This is super noisy when you are fine with grouping columns not excluded in 3.0. Is there a way to make this less inconvenient apart from adjusting every groupby apply call?
The warning should be triggered only when both of the following apply:
The only thing I could see doing in addition is seeing (perhaps on just the first group) if the result differs whether the grouping columns are included or not. What do you think about this option? Another option would be changing without deprecation. I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition. Perhaps this could go behind that option as well.
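To make the "check on just the first group" idea concrete, here is a rough user-level sketch. It is not how pandas implements or would implement the check; the helper name `result_depends_on_keys` and its signature are hypothetical and exist only for illustration.

```python
import pandas as pd


def result_depends_on_keys(df, keys, func):
    """Hypothetical helper: report whether func gives a different result
    on the first group once the grouping columns are dropped."""
    # Row positions of each group; take only the first group to keep it cheap.
    positions = next(iter(df.groupby(keys).indices.values()))
    first = df.iloc[positions]
    with_keys = func(first)
    without_keys = func(first.drop(columns=keys))
    # pandas objects are compared with .equals; scalars with !=.
    if hasattr(with_keys, "equals"):
        return not with_keys.equals(without_keys)
    return bool(with_keys != without_keys)


df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})

# Summing every column touches the grouping column, so the result changes.
sums_differ = result_depends_on_keys(df, ["A"], lambda x: x.sum())
# Summing only "B" never touches the grouping column, so it does not.
b_only_differs = result_depends_on_keys(df, ["A"], lambda x: x["B"].sum())
```

A check like this only inspects one group, so it can miss functions whose output depends on the grouping columns for some groups but not others.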
If I understand you here, the default case would raise a warning:
which it does right now. This seems pretty noisy, since there is no way to get rid of the warning as a user apart from changing your code everywhere? I think an option is a good idea, if you want to add something like this anyway for other reasons.
I would typically describe "noisy" as warning when behavior won't change, so just want to be clear that this will.
To suppress the warning and adopt future behavior, one would do But agreed it's noisy in the sense of "someone using apply a lot may see a lot of warnings" - and experience a lot of behavior changes! I'll put up a PR to revert.
…das-dev#52477)" This reverts commit 9b20759.
Yes, you are correct, but I guess that most users don't care about grouping columns in apply? Values are constant anyway, so why would you need the information? That's what I mean by noisy. I am assuming that the vast majority of users just don't care about the grouping columns.
It sounds like you're saying you'd be okay changing this behavior without deprecation? The users may not want them, but they have them now. So if their workflow drops the grouping columns from the result, that workflow will break if we were to change this behavior without deprecation.
I'd prefer this solution if you end up creating an option anyway. But I think I'd also be ok with doing this without a deprecation. Personally, I'd be annoyed that I'd have to change all my apply calls just to silence the warning even though I don't care about the grouping columns.
…as-dev#52921) Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#52477)" This reverts commit 9b20759.
We've moved away from introducing this option. I'm now looking for ways to introduce this deprecation again.
I imagine this is a popular method and we're changing the output in what I fear could be a common case. Doing this without deprecation seems too likely to break a lot of code. I don't fully understand your comment about "changing apply calls just to silence the warning". Wouldn't your code have a good chance of breaking if you didn't change it ahead of time when we change the behavior?
Not really, if you don't care about the group keys being included in your group or not. This was often the case for me when I used apply. But I agree, not deprecating this in some way is probably a bad idea.
I believe the checks on the warning that were included here completely predict when the output would change. In other words, we have no way of making this less noisy by merely emitting warnings in fewer cases. What do you think about an option to adopt the future behavior? Something like If that's a good way forward, I think we still make the changes to the tests as they are in this PR.
Could an idea be to silence the deprecation warning until after 2.1 or maybe even 2.2? IMO we need a deprecation here, but it may not be necessary to have it effective in 2.1, especially if silencing/unsilencing is low effort?
I'm fine with waiting to deprecate, but I think, to @phofl's point, silencing is not low effort (especially when you have many columns you want to include). And just to make sure: with the option I'm proposing, anyone who has it set to True (the default) will see the deprecation warning.
Sorry, I meant if we in the
@topper-123: I'd personally prefer just to wait until 2.1 is released. When you wrote:
I thought you meant from a user perspective. From an implementation perspective, almost all the work is with the tests. But I'm okay with having to redo a little work if the tests have changed from the implementation here by the time 2.1 is released.
Ok, sounds good to me.