Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

DEPR: DataFrameGroupBy.apply operating on the group keys #52477

Merged
merged 10 commits into from
Apr 12, 2023

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Apr 6, 2023

@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas Apply Apply, Aggregate, Transform, Map labels Apr 6, 2023
…_apply_on_groupings

� Conflicts:
�	doc/source/whatsnew/v2.1.0.rst
@topper-123
Copy link
Contributor

I will look in depth in a few hours, I got some obligations. Just initially:

Is it correct that this will help simplify the code paths for other groupby methods or is this a change that only will affect groupby apply itself? If yes, could you explain how the other groupby methods use/are used by apply?

@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 8, 2023

Is it correct that this will help simplify the code paths for other groupby methods or is this a change that only will affect groupby apply itself?

This change is only for apply itself. Ignoring filtrations (which by de#clude the grouping columns when applicable), almost all of groupby does not operate on the grouping columns. apply is the major exception, and this deprecation is to correct that.

If yes, could you explain how the other groupby methods use/are used by apply?

Almost no groupby methods use apply internally. Those that do:

  • first, last
  • value_counts
  • is_monotonic_increasing, is_monotonic_decreasing
  • dtype
  • diff (for axis=1 only)

Of the above, all except for diff are only used in the Series case where this deprecation does not apply. For diff, axis=1 has been deprecated.

@topper-123
Copy link
Contributor

topper-123 commented Apr 8, 2023

So this can't affect SeriesGroupby at all and no DataFrameGroupby method uses apply. So that means basically no worries wrt. effect on other method, which is great.

I've looked at the code in Groupby.apply and you did in the if clause: self._selected_obj.shape != self._obj_with_exclusions.shape, which surprised me. So I tried:

>>> df = pd.DataFrame({"A": ["a", "a", None, None, "b", "b", "c"], "B": [1, 1, 2, 2, 3, 3, 1]})
>>> g = df.groupby('A')[df.columns]
>>> g.sum()  # sums over grouping column
    A  B
A
a  aa  2
b  bb  6
c   c  1

I never knew this was possible (works also with apply), but I think this is a good thing: by default let's exclude grouping columns from apply/agg etc., but this let's users get them back in in the rare case they need that (there's always an edge case...).

Do you know if this is documented anywhere? IMO we should officially support that pattern with docs and tests. Probably also show this as an example in the deprecation text for this change?

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments/questions.

Can you especially look into if we can avoid giving the deprecation warning in groupby.resample?

f"columns. This behavior is deprecated, and in a future "
f"version of pandas the grouping columns will be excluded "
f"from the operation. Subset the data to exclude the "
f"groupings and silence this warning."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Subset the data to exclude the groupings and silence this warning." -> "Subset the data to silence this warning."?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of removing the phrase "to exclude the groupings"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was that by keeping the grouping columns in the subsetting, the users are guaranteed to get the same rseult as before, but without the warning:

>>> df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1,  2, 3, 4]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x.sum())
   A  B
A
1  2  3
2  4  7
>>> g[df.columns].apply(lambda x: x.sum())
   A  B
A
1  2  3
2  4  7

Or they may actually not want to include, then remove it from the subsetting,

>>> g[df.columns.drop("A")].apply(lambda x: x.sum())
   B
A
1  3
2  7

the idea is just that they may/may not want to remove the groupings in the subset and "exclude the groupings" may not be what they want in all cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - this makes sense, but I find it confusing to call this "subsetting". I'll see what I can do for the warning message here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. What do you think?

@@ -149,6 +149,7 @@ Other API changes

Deprecations
~~~~~~~~~~~~
- Deprecated :meth:`.DataFrameGroupBy.apply` operating on the grouping column(s) (:issue:`7155`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small explanation about what to do to avoid the warning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done.

"DataFrame.resample operated on the grouping columns. "
"This behavior is deprecated, and in a future version of "
"pandas the grouping columns will be excluded from the operation. "
"Subset the data to exclude the groupings and silence this warning."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Subset the data to exclude the groupings and silence this warning." -> Subset the data to silence this warning."

@topper-123 topper-123 added this to the 2.1 milestone Apr 8, 2023
@rhshadrach
Copy link
Member Author

Do you know if this is documented anywhere? IMO we should officially support that pattern with docs and tests.

Added a line in the user guide; I know we test this because that's how I discovered it but I'll have to dig up where that is.

Probably also show this as an example in the deprecation text for this change?

In the whatsnew, we could add a notable deprecations section to do this, but I'm not sure it's worth it. I'm rather against having an example in the FutureWarning that is raised.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM..

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some of the doctest look to need updating though

@mroeschke mroeschke merged commit 9b20759 into pandas-dev:main Apr 12, 2023
@mroeschke
Copy link
Member

Thanks @rhshadrach

@phofl
Copy link
Member

phofl commented Apr 24, 2023

This is super noisy when you are fine with grouping columns not excluded in 3.0. Is there a way to make this less inconvenient apart from adjusting every groupby apply call?

@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 25, 2023

The warning should be triggered only when both of the following apply:

  • The UDF with the grouping columns does not raise
  • Raising would mean the fallback passes different groups (namely, excluding the grouping columns)

The only thing I could see doing in addition is seeing (perhaps on just the first group) if the result differs whether the grouping columns are included or not. What do you think about this option?

Another option would be changing without deprecation. I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition. Perhaps this could go behind that option as well.

@phofl
Copy link
Member

phofl commented Apr 25, 2023

If I understand you here, the default case would raise a warning:

df = pd.DataFrame({"a": [1, 2, 3], "b": 1.5, "c": True, "d": "d"})


def test(x):
    return x


df.groupby("d").apply(test)

which it does right now. This seems pretty noisy, since there is no way to get rid of the warning as a user apart from changing your code everywhere?

I think an option is a good idea, if you want to add something like this anyway for other reasons.

@rhshadrach
Copy link
Member Author

This seems pretty noisy

I would typically describe "noisy" as warning when behavior won't change, so just want to be clear that this will.

# Current
     a    b     c  d
d                   
d 0  1  1.5  True  d
  1  2  1.5  True  d
  2  3  1.5  True  d

# Future
     a    b     c
d                   
d 0  1  1.5  True
  1  2  1.5  True
  2  3  1.5  True

To suppress the warning and adopt future behavior, one would do df.groupby("d")[["a", "b", "c"]].apply(test); to keep current behavior it's df.groupby("d")[["a", "b", "c", "d"]].apply(test). Neither of these warn.

But agreed it's noisy in the sense of "someone using apply a lot of may see a lot of warnings" - and experience a lot of behavior changes!

I'll put up a PR to revert.

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Apr 25, 2023
@phofl
Copy link
Member

phofl commented Apr 25, 2023

Yes you are correct, but I guess that most users don't care about grouping columns in apply? Values are constant anyway, so why would you need the information? That's what I mean with noisy. I am assuming that a vast majority of users just doesn't care about the grouping columns

@rhshadrach rhshadrach restored the depr_apply_on_groupings branch April 25, 2023 21:30
@rhshadrach
Copy link
Member Author

Values are constant anyway, so why would you need the information? That's what I mean with noisy. I am assuming that a vast majority of users just doesn't care about the grouping columns

It sounds like you're saying you'd be okay changing this behavior without deprecation?

The users may not want them, but they have them now. So if their workflow drops the grouping columns from the result, that workflow will break if we were to change this behavior without deprecation.

@phofl
Copy link
Member

phofl commented Apr 25, 2023

I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition

I'd prefer this solution if you end up creating an option anyway.

But I think I'd also be ok with doing this without a deprecation. Personally, I'd be annoyed that I'd have to change all my apply calls just to silence the warning even though I don't care about the grouping columns

phofl pushed a commit that referenced this pull request Apr 28, 2023
Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (#52477)"

This reverts commit 9b20759.
NumanIjaz pushed a commit to NumanIjaz/pandas that referenced this pull request May 1, 2023
…as-dev#52921)

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#52477)"

This reverts commit 9b20759.
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 7, 2023
…as-dev#52921)

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#52477)"

This reverts commit 9b20759.
Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
…as-dev#52921)

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#52477)"

This reverts commit 9b20759.
@rhshadrach
Copy link
Member Author

I'm also planning to put up a PDEP to propose a bunch of changes to agg/apply/transform that get put up behind an option for smoother transition

I'd prefer this solution if you end up creating an option anyway.

We've moved away from introducing this option. I'm now looking for ways to introduce this deprecation again.

But I think I'd also be ok with doing this without a deprecation. Personally, I'd be annoyed that I'd have to change all my apply calls just to silence the warning even though I don't care about the grouping columns

I imagine this is a popular method and we're changing the output in what I fear could be a common case. Doing this without deprecation seems too likely to break a lot of code. I don't fully understand your comment about "changing apply calls just to silence the warning". Wouldn't your code have a good chance of breaking if you didn't change it ahead of time when we change the behavior?

@phofl
Copy link
Member

phofl commented Jun 16, 2023

I imagine this is a popular method and we're changing the output in what I fear could be a common case. Doing this without deprecation seems too likely to break a lot of code. I don't fully understand your comment about "changing apply calls just to silence the warning". Wouldn't your code have a good chance of breaking if you didn't change it ahead of time when we change the behavior?

Not really if you don't care about the group keys being included in your group or not. This was often the case for me when I used apply.

But I agree, not deprecating this in some way is probably a bad idea.

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 19, 2023

I believe the checks on the warning that were included here completely predict when the output would change. In other words, we have no way of making this less noisy by merely emitting warnings in less cases.

What do you think about an option to adopt the future behavior? Something like mode.groupings_in_apply. Defaults to True (current behavior). When set to False, it would adopt the future behavior. Then we deprecate the option in 3.x. Is that palatable?

If that's a good way forward, I think we still make the changes to the tests as they are in this PR.

@topper-123
Copy link
Contributor

Could an idea be to silence the deprecation warning until after 2.1 or maybe even 2.2?

IMO we need a deprecation here, but it may not be necessary to have it effective in 2.1, especially if silencing/unsilencing is low effort?

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 19, 2023

I'm fine with waiting to deprecate; but I think to @phofl's point, silencing is not low effort (especially when you have many columns you want to include)

And just to make sure, with the option I'm proposing, anyone who has it set to True (the default) will see the deprecation warning.

@topper-123
Copy link
Contributor

Sorry, I meant if we in the pandas/core/groupby/groupby.py surround the FutureWarning with warnings.catch_warnings(): and warnings.filter_warning code, then remove that code when we want to re-activate the warnings e.g. in v2.2.

@rhshadrach
Copy link
Member Author

@topper-123: I'd personally prefer just to wait until 2.1 is released. When you wrote:

especially if silencing/unsilencing is low effort?

I thought you meant think from a user perspective. From an implementation perspective, almost all the work is with the tests. But I'm okay with having to redo a little work if the tests have changes from the implementation here by the time 2.1 is release.

@topper-123
Copy link
Contributor

topper-123 commented Jun 25, 2023

Ok, sounds good to me.

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…as-dev#52921)

Revert "DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#52477)"

This reverts commit 9b20759.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: way to exclude the grouped column with apply
4 participants