Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

DOC: Closed parameter not intuitively documented in DataFrame.rolling #60485

Closed
1 task done
caballerofelipe opened this issue Dec 3, 2024 · 7 comments · Fixed by #60832 or #60844
Closed
1 task done

DOC: Closed parameter not intuitively documented in DataFrame.rolling #60485

caballerofelipe opened this issue Dec 3, 2024 · 7 comments · Fixed by #60832 or #60844
Assignees
Labels
Docs Window rolling, ewma, expanding

Comments

@caballerofelipe
Copy link

caballerofelipe commented Dec 3, 2024

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.rolling.html

Documentation problem

I believe the parameter closed is not very intuitively documented.

(I'm using Pandas 2.2.2 on a macOS Sequoia)

Window size used for closed should be window+1

For this parameter to work, the actual window size should be thought of window+1. So for instance when window=3 this is how closed should be thought:

  • closed='right': from a window size of 4 (window=3+1), take the current element and 2 (4-2) elements just before the current one. Totals to 3 elements.
  • closed='left': from a window size of 4 (window=3+1), don't take the current element but take the 3 (4-1) elements just before the current one. Totals to 3 elements.
  • closed='both': from a window size of 4 (window=3+1), take the current element and 3 (4-1) elements just before the current one. Totals to 4 elements.

closed='neither'

Either 'neither' isn't working or what it does isn't straight forward to me. See examples below, both examples return NaN in every position.

Intuitively I would guess this parameter would diminish the window size by two from a window size of window+1. So if window=3 it would mean the actual calculation would be done in 3+1-2=2 window but as you see below I only get NaN.

Example 1: mean()

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8]})

# Rolling mean with 'left' closed
df['rolling_mean_left'] = df['A'].rolling(window=3, closed='left').mean()

# Rolling mean with 'both' closed
df['rolling_mean_both'] = df['A'].rolling(window=3, closed='both').mean()

# Rolling mean with 'right' closed
df['rolling_mean_right'] = df['A'].rolling(window=3, closed='right').mean()

# Rolling mean with neither closed
df['rolling_mean_neither'] = df['A'].rolling(window=3, closed='neither').mean()

df
Capture d’écran 2024-12-03 à 14 17 33

Example 2: sum()

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8]})

# Rolling sum with 'left' closed
df['rolling_sum_left'] = df['A'].rolling(window=3, closed='left').sum()

# Rolling sum with 'both' closed
df['rolling_sum_both'] = df['A'].rolling(window=3, closed='both').sum()

# Rolling sum with 'right' closed
df['rolling_sum_right'] = df['A'].rolling(window=3, closed='right').sum()

# Rolling sum with neither closed
df['rolling_sum_neither'] = df['A'].rolling(window=3, closed='neither').sum()

df
Capture d’écran 2024-12-03 à 14 18 32

Suggested fix for documentation

I would suggest stating that the window size taken into consideration for closed is actually the parameter window + 1, then what's stated in the docs would make sense. OR, actually use the actual window parameter which would make way much more sense to me. From the current docs:

  • If 'right', the first point in the window is excluded from calculations.
  • If 'left', the last point in the window is excluded from calculations.
  • If 'both', no point in the window is excluded from calculations.

Maybe even add an image example like the ones I posted above.

As for 'neither', I don't have suggestions as I don't fully understand it from my testing.

Finally, I don't like the name closed for the parameter, is doesn't mean much to me. I would maybe prefer something like ends or ends_used. I believe it would be more intuitive.

@caballerofelipe caballerofelipe added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 3, 2024
@rhshadrach
Copy link
Member

rhshadrach commented Dec 3, 2024

Thanks for the report! Agreed there can be many improvements here.

So for instance when window=3 this is how closed should be thought:

I don't think this is the most natural way to think of the rolling operation. Instead, I recommend using interval notation. In your sum example, consider computing the values for the row indexed by 3 (so the 4th row, with the first being indexed by 0). The window size is 3 and center=False, so the window is ☐0,3☐ where the stands for one of the brackets [, (, ), ]. We then have 4 cases for closed:

  • Left: [0, 3)
  • Right: (0, 3]
  • Both: [0, 3]
  • Neither: (0, 3)

In each case, I think you'll see that pandas is including the points that fall in this interval. This includes the case of neither, where the points are 1 and 2 which is only 2 valid observations. This falls short (and will always fall short) of the default minimum, and therefore result in NaN.

Note this is where "closed" comes from - the mathematical terminology of having an interval be closed on the left or right.

Also note that for each of the intervals above, the size of the interval is indeed 3.

@rhshadrach rhshadrach added Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 3, 2024
@caballerofelipe
Copy link
Author

Hi @rhshadrach, thanks for the mathematical rigorosity added and the explanation of the term 'closed' 😊.

However, I would disagree about the window parameter's meaning. I agree with your usage of "interval" but in the docs it states window is size:

Size of the moving window.

And if you think about size, in the window=3 example, well size is 4 and not 3. What changes are the parameters taken into consideration, 4, 3 or 2.

So maybe some clarification should be added to avoid this confusion.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 3, 2024

Agreed that the docs need updated. But the size (length) of the interval [0, 3] is 3. This is what is meant by size of the window.

@jackson-tsang578
Copy link

take

@SebastianOuslis
Copy link
Contributor

take

@rhshadrach
Copy link
Member

Reopening since #60832 did not address the reason for the confusion, namely that window=k with k an integer is not the number of data points in the window.

@SebastianOuslis
Copy link
Contributor

@rhshadrach readdressed the missing information about the sizing of the window, let me know if it needs more detail on how closed configurations impact the size of the window.

# for free to join this conversation on GitHub. Already have an account? # to comment