Skip to content

BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

viable-alternative
Copy link

@viable-alternative viable-alternative commented Apr 14, 2025

Summary:

  • Fixes a 3-year-old bug with incorrect min/max rolling calculation for custom window sizes. Adds an error check for invalid inputs.
  • Speed improvement of ~10% by not using additional queue for handling NaNs.
  • For complex cases, incurs additional multiplicative log(k) complexity (where k is max window size), but this cost is only incurred in cases that produced invalid result before. For example, for constant window size this cost is not incurred.

Changed behavior:

Has additional validity check, which will raise ValueError if the function detects a condition it cannot work with, namely improper ordering of start/end bounds. The existing method would happily consume such input and would produce a wrong result. There is a new unit test to check for the raised ValueError.

Note on invalid inputs:

It is possible to make the method work for an arbitrary stream of start/end window bounds, but it will require sorting. It is very unlikely that such work is worth the effort, and it is estimated to have extremely low need, if any. Let someone create an enhancement request first.
If sorting is to be implemented: it can be done with only incurring performance hit in the case of unsorted input: copy and sort the start/end arrays, producing a permutation, run the main method on the copy, and then extract the result back using the permutation. To detect if the start/end array pair is properly sorted will only take O(N). (Soring is N*log(N), does not have to be stable, but the input array is extremely likely to be “almost” sorted, and you have to pick your poison of a sorting method that works well with nearly sorted array, or use efficient soring methods, most of which do not offer additional speed on nearly sorted arrays.) Working such intermediate step (without copying and pasting) into 3 different implementations will require some less than straightforward work in the “apply” family of methods used by other rolling functions, and therefore will bear risk. If this is decided to be done, it is recommended to have an additional parameter to optionally skip the “sorted” check. (The user may already know that the arrays are properly sorted).

How to Debug numba

You can temporarily change 2 lines of code in order to Python-debug numba implementation with VS Code or another Python debugger:

  • Comment out the numba.jit decorator on the function(sliding_min_max() in min_max_.py).
  • Do the same with the column_looper() function defined inside the generate_apply_looper() function in executor.py.
  • Your breakpoint inside the function will now hit!

Misc Notes

The speed improvement of ~10% was confirmed in two ways:

  • As measured by pandas’ supplied asv benchmark suite (0.80-0.91 coefficient (depending on particular test) on my hardware).
  • With a custom load test over a 2MM-long rolling window on a 300MM-long data set. (See the supplied bench.py.txt.) A single run of the test takes approx. 6-8 seconds and consumes ~15GB of RAM on a 32-GB RAM PC.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

Before fully reviewing, it would be great if some things conformed more closely to our codebase, namely

  1. Remove bug_hunters directory
  2. Remove pandas/core/window/_min_max.py if it's not used by the Cython or Numba implementation
  3. Instead of a separate test_minmax.py create tests that would close the original issue in tests/window/test_rolling.py and tests/window/test_numba.py

@viable-alternative
Copy link
Author

@mroeschke, all done!
Edited my original comment to reflect what is actually being done. Thx!
(CI is still in progress. I will fix ti up if anything fails.)

Dominators: list = [] # this is a stack
# END-OF numba-only init part

# Basic idea of the algorithm: https://stackoverflow.com/a/12239580
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove most of the comments in this file if it's mostly described by this stackoverflow link

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack overflow (SO) link does not describe the method; just provides the code for a special case (fixed window). We can confirm that by noting that one one of the SO comments states "please explain the algorithm more clearly ??" I don't mind taking out all comments. It will not be an easy task to understand what this method is doing and why it works (or maybe doesn't in some cases!) without them. Further, this is an extension to the SO method (like a 2nd dimension added to a 1-dimentional case) so the SO article cannot describe that extension.

Suggestion: keep all in-code comments.
I am new to GItHub, and I don't feel I know enough to click on "resolve this conversation". Feel free to "resolve" and tell me if I was supposed to "resolve" it instead. Thx!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I appreciate the thoroughness of the comments and willingness to explain things, I find it fairly distracting to have more comments than code.

If you could reduce this by giving variables clearer names and maybe have at most 3 lines of comments at the beginning of this function explaining what you're doing, that would be a good compromise.

Copy link
Member

@rhshadrach rhshadrach Apr 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @mroeschke. This is an opinionated take, but excessive comments like this are a code smell. The best code is code that doesn't require any comments. If code requires an excessive amount of comments, it can be (not is) an indication the code can be improved. Of course, highly technical algorithms like this almost always necessitate some amount of comments.

It seems to me individual comments can be made more concise, which may help a bit. E.g. instead of

This is also likely 99+% of all use cases, thus we want to make sure we do not go into bisection as to incur neither the *log(k) penalty nor the function call penalty for this very common case.

could be

This is a common case, ensure we avoid bisection for performance.

In addition, it may be a good idea to have a paragraph of a concise high-level description of the algorithm prior to starting the main code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-code comments addressed as requested by @mroeschke.
As for the request to "giving variables clearer names": all new variables were given clear names in the original commit. There are two reasons for keeping existing variable names the same: the "Q" variable name has to stay to align with the linked SO article as the only surviving piece of documentation. The remaining variables kept as is, as to not complicate the review of what has changed. There is not much room for renaming them at any rate.

@@ -1946,3 +1947,172 @@ def test_rolling_timedelta_window_non_nanoseconds(unit, tz):
df.index = df.index.as_unit("ns")

tm.assert_frame_equal(ref_df, df)


class StandardWindowIndexer(BaseIndexer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of all this, can we have a test like

class MultiWindowIndexer(...)
    # from the original issue

def test_max_with_custom_indexer():
    df = pd.DataFrame({"window": [not_random_data], "data": [not_random_data]})
    result = df.rolling(MultiWindowIndexer(df.window)).max()
    expected = pd.DataFrame(...)
    tm.assert_frame_equal(result, expected)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, are you asking to remove pseudo-random data with seed and replace it with specific typed-in data, like a list of literals? What is a good way of storing 1,000 (or possibly more) numbers in the test? One would suggest practicality of storing of 1,000 or more numbers to be questionable. On the other hand, reducing the number to something manageable for storing (like 10 or 20) drastically reduces the power of the test. It could be that I am misreading the request? The current tests run quickly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not occur to me to bring this up in the first response. There is a reason these tests are such. Earlier versions of this fix worked perfectly fine with no NaN data, and by adding the "right" amount of NaNs (30%, 60%, etc) and choosing a certain proportion of the window length to the size of the data frame and NaN fraction, one or the other kind of bugs would come out. I can quickly write a short test that fails on main and passes with this fix. I would not recommend my customer (pandas project) to consider such unit test satisfactory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future maintainability of these test, we would appreciate deterministic tests. You could still use pytest.mark.parmeterize to specify pairs of input test data of 10 elements with varying proportion of NaNs and their expected outputs to gain satisfactory coverage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on the 10-element tests as we speak. I would like to highlight that the tests being offered are deterministic. Do not be misled by (pseudo!)random number generators (RNG). All RNG calls in these tests have a seed. The existing CI controls on the repo will not allow truly random tests at any rate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000-element randomized deterministic tests replaced with 10-element handwritten tests as requested. This closes all pending requests.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite an algorithm - this is very nice!

Q, Dominators, and TestData go against PEP8 naming guidelines. I'm surprised our linters don't flag these. In any case, can we adhere to PEP8 here - I'd suggest queue, dominators, and test_data, but would also be okay with q.

@@ -18,6 +21,19 @@
from pandas._typing import npt


@numba.njit(nogil=True, parallel=False)
def bisect_left(a: list[Any], x: Any, lo: int = 0, hi: int = -1) -> int:
if hi == -1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring here; I think just a single line is fine. Something like

"""Get index of smallest value greater than or equal to x between lo and hi."""


# Discard entries outside and left of current window
while Q and Q[0] <= start[i] - 1:
Q: list = [] # this is a queue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on what this queue tracks. Something like

# Track indices of potential extrema. Corresponding values will always be monotonic (increasing/decreasing depending on min/max).

Also, I think type-hint can be list[int].

# Discard entries outside and left of current window
while Q and Q[0] <= start[i] - 1:
Q: list = [] # this is a queue
Dominators: list = [] # this is a stack
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on what this stack tracks. Something like

# Track `start` values we need to consider when building `Q`.

Can also be list[int].

int64_t lo=0,
int64_t hi=-1
) nogil:
cdef int64_t mid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, can you add a similar docstring.

@viable-alternative
Copy link
Author

Thank you for the thoughtful review. Give me some time (hopefully under a week) to read up on PEP8 and address other comments. Thanks for the helpful naming suggestions: only takes a minute, and saves me hours of time digging too deep into PEP8.

viable-alternative and others added 6 commits April 21, 2025 10:35
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Comment on lines +590 to +598
class TestMinMaxNumba:
parent = TestMinMax()

@pytest.mark.parametrize("is_max, has_nan, exp_list", TestMinMax.TestData)
def test_minmax(self, is_max, has_nan, exp_list):
TestMinMaxNumba.parent.test_minmax(is_max, has_nan, exp_list, "numba")

def test_wrong_order(self):
TestMinMaxNumba.parent.test_wrong_order("numba")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you essentially duplicate the test in pandas/tests/window/test_rolling.py? The pandas test suite avoids test dependencies like this.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for accommodating my requests. This is looking close to merging!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min
4 participants