BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

viable-alternative · 2025-04-14T12:15:42Z

closes BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min #46726
Tests added and passed
All code checks passed.
Added type annotations to new arguments/methods/functions.
Speed improved by ~10% as measured by ASV.
DID NOT add an entry in the latest doc/source/whatsnew/vX.X.X.rst as I do not know what version this will go out with. I am happy to add such section pretty quickly if you tell me the version number.

Summary:

Fixes a 3-year-old bug with incorrect min/max rolling calculation for custom window sizes. Adds an error check for invalid inputs.
Speed improvement of ~10% by not using additional queue for handling NaNs.
For complex cases, incurs additional multiplicative log(k) complexity (where k is max window size), but this cost is only incurred in cases that produced invalid result before. For example, for constant window size this cost is not incurred.

Changed behavior:

Has additional validity check, which will raise ValueError if the function detects a condition it cannot work with, namely improper ordering of start/end bounds. The existing method would happily consume such input and would produce a wrong result. There is a new unit test to check for the raised ValueError.

Note on invalid inputs:

It is possible to make the method work for an arbitrary stream of start/end window bounds, but it will require sorting. It is very unlikely that such work is worth the effort, and it is estimated to have extremely low need, if any. Let someone create an enhancement request first.
If sorting is to be implemented: it can be done with only incurring performance hit in the case of unsorted input: copy and sort the start/end arrays, producing a permutation, run the main method on the copy, and then extract the result back using the permutation. To detect if the start/end array pair is properly sorted will only take O(N). (Soring is N*log(N), does not have to be stable, but the input array is extremely likely to be “almost” sorted, and you have to pick your poison of a sorting method that works well with nearly sorted array, or use efficient soring methods, most of which do not offer additional speed on nearly sorted arrays.) Working such intermediate step (without copying and pasting) into 3 different implementations will require some less than straightforward work in the “apply” family of methods used by other rolling functions, and therefore will bear risk. If this is decided to be done, it is recommended to have an additional parameter to optionally skip the “sorted” check. (The user may already know that the arrays are properly sorted).

How to Debug numba

You can temporarily change 2 lines of code in order to Python-debug numba implementation with VS Code or another Python debugger:

Comment out the numba.jit decorator on the function(sliding_min_max() in min_max_.py).
Do the same with the column_looper() function defined inside the generate_apply_looper() function in executor.py.
Your breakpoint inside the function will now hit!

Misc Notes

The speed improvement of ~10% was confirmed in two ways:

As measured by pandas’ supplied asv benchmark suite (0.80-0.91 coefficient (depending on particular test) on my hardware).
With a custom load test over a 2MM-long rolling window on a 300MM-long data set. (See the supplied bench.py.txt.) A single run of the test takes approx. 6-8 seconds and consumes ~15GB of RAM on a 32-GB RAM PC.

mroeschke

Thanks for the PR.

Before fully reviewing, it would be great if some things conformed more closely to our codebase, namely

Remove bug_hunters directory
Remove pandas/core/window/_min_max.py if it's not used by the Cython or Numba implementation
Instead of a separate test_minmax.py create tests that would close the original issue in tests/window/test_rolling.py and tests/window/test_numba.py

viable-alternative · 2025-04-17T18:56:11Z

@mroeschke, all done!
Edited my original comment to reflect what is actually being done. Thx!
(CI is still in progress. I will fix ti up if anything fails.)

pandas/_libs/window/aggregations.pyx

pandas/core/_numba/kernels/min_max_.py

mroeschke · 2025-04-18T16:19:01Z

pandas/core/_numba/kernels/min_max_.py

+    Dominators: list = []  # this is a stack
+    # END-OF numba-only init part
+
+    # Basic idea of the algorithm: https://stackoverflow.com/a/12239580


I think we can remove most of the comments in this file if it's mostly described by this stackoverflow link

The stack overflow (SO) link does not describe the method; just provides the code for a special case (fixed window). We can confirm that by noting that one one of the SO comments states "please explain the algorithm more clearly ??" I don't mind taking out all comments. It will not be an easy task to understand what this method is doing and why it works (or maybe doesn't in some cases!) without them. Further, this is an extension to the SO method (like a 2nd dimension added to a 1-dimentional case) so the SO article cannot describe that extension.

Suggestion: keep all in-code comments.
I am new to GItHub, and I don't feel I know enough to click on "resolve this conversation". Feel free to "resolve" and tell me if I was supposed to "resolve" it instead. Thx!

While I appreciate the thoroughness of the comments and willingness to explain things, I find it fairly distracting to have more comments than code.

If you could reduce this by giving variables clearer names and maybe have at most 3 lines of comments at the beginning of this function explaining what you're doing, that would be a good compromise.

Agree with @mroeschke. This is an opinionated take, but excessive comments like this are a code smell. The best code is code that doesn't require any comments. If code requires an excessive amount of comments, it can be (not is) an indication the code can be improved. Of course, highly technical algorithms like this almost always necessitate some amount of comments.

It seems to me individual comments can be made more concise, which may help a bit. E.g. instead of

This is also likely 99+% of all use cases, thus we want to make sure we do not go into bisection as to incur neither the *log(k) penalty nor the function call penalty for this very common case.

could be

This is a common case, ensure we avoid bisection for performance.

In addition, it may be a good idea to have a paragraph of a concise high-level description of the algorithm prior to starting the main code.

In-code comments addressed as requested by @mroeschke.
As for the request to "giving variables clearer names": all new variables were given clear names in the original commit. There are two reasons for keeping existing variable names the same: the "Q" variable name has to stay to align with the linked SO article as the only surviving piece of documentation. The remaining variables kept as is, as to not complicate the review of what has changed. There is not much room for renaming them at any rate.

mroeschke · 2025-04-18T16:23:47Z

pandas/tests/window/test_rolling.py

@@ -1946,3 +1947,172 @@ def test_rolling_timedelta_window_non_nanoseconds(unit, tz):
    df.index = df.index.as_unit("ns")

    tm.assert_frame_equal(ref_df, df)
+
+
+class StandardWindowIndexer(BaseIndexer):


Instead of all this, can we have a test like

class MultiWindowIndexer(...) # from the original issue def test_max_with_custom_indexer(): df = pd.DataFrame({"window": [not_random_data], "data": [not_random_data]}) result = df.rolling(MultiWindowIndexer(df.window)).max() expected = pd.DataFrame(...) tm.assert_frame_equal(result, expected)

Just to clarify, are you asking to remove pseudo-random data with seed and replace it with specific typed-in data, like a list of literals? What is a good way of storing 1,000 (or possibly more) numbers in the test? One would suggest practicality of storing of 1,000 or more numbers to be questionable. On the other hand, reducing the number to something manageable for storing (like 10 or 20) drastically reduces the power of the test. It could be that I am misreading the request? The current tests run quickly.

Did not occur to me to bring this up in the first response. There is a reason these tests are such. Earlier versions of this fix worked perfectly fine with no NaN data, and by adding the "right" amount of NaNs (30%, 60%, etc) and choosing a certain proportion of the window length to the size of the data frame and NaN fraction, one or the other kind of bugs would come out. I can quickly write a short test that fails on main and passes with this fix. I would not recommend my customer (pandas project) to consider such unit test satisfactory.

For future maintainability of these test, we would appreciate deterministic tests. You could still use pytest.mark.parmeterize to specify pairs of input test data of 10 elements with varying proportion of NaNs and their expected outputs to gain satisfactory coverage.

Working on the 10-element tests as we speak. I would like to highlight that the tests being offered are deterministic. Do not be misled by (pseudo!)random number generators (RNG). All RNG calls in these tests have a seed. The existing CI controls on the repo will not allow truly random tests at any rate.

1000-element randomized deterministic tests replaced with 10-element handwritten tests as requested. This closes all pending requests.

rhshadrach

Quite an algorithm - this is very nice!

Q, Dominators, and TestData go against PEP8 naming guidelines. I'm surprised our linters don't flag these. In any case, can we adhere to PEP8 here - I'd suggest queue, dominators, and test_data, but would also be okay with q.

rhshadrach · 2025-04-20T14:07:57Z

pandas/core/_numba/kernels/min_max_.py

@@ -18,6 +21,19 @@
    from pandas._typing import npt


+@numba.njit(nogil=True, parallel=False)
+def bisect_left(a: list[Any], x: Any, lo: int = 0, hi: int = -1) -> int:
+    if hi == -1:


Can you add a docstring here; I think just a single line is fine. Something like

"""Get index of smallest value greater than or equal to x between lo and hi."""

rhshadrach · 2025-04-20T14:10:37Z

pandas/core/_numba/kernels/min_max_.py


-        # Discard entries outside and left of current window
-        while Q and Q[0] <= start[i] - 1:
+    Q: list = []  # this is a queue


Can you comment on what this queue tracks. Something like

# Track indices of potential extrema. Corresponding values will always be monotonic (increasing/decreasing depending on min/max).

Also, I think type-hint can be list[int].

rhshadrach · 2025-04-20T14:13:44Z

pandas/core/_numba/kernels/min_max_.py

-        # Discard entries outside and left of current window
-        while Q and Q[0] <= start[i] - 1:
+    Q: list = []  # this is a queue
+    Dominators: list = []  # this is a stack


Can you comment on what this stack tracks. Something like

# Track `start` values we need to consider when building `Q`.

Can also be list[int].

rhshadrach · 2025-04-20T14:23:55Z

pandas/_libs/window/aggregations.pyx

+    int64_t lo=0,
+    int64_t hi=-1
+) nogil:
+    cdef int64_t mid


Same here, can you add a similar docstring.

viable-alternative · 2025-04-20T15:58:40Z

Thank you for the thoughtful review. Give me some time (hopefully under a week) to read up on PEP8 and address other comments. Thanks for the helpful naming suggestions: only takes a minute, and saves me hours of time digging too deep into PEP8.

…max rolling calc.

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

@mroeschke

fuinction, as requested by @mroeschke

@mroeschke

…ts as requested by @mroeschke

mroeschke · 2025-04-21T16:07:46Z

pandas/tests/window/test_numba.py

+class TestMinMaxNumba:
+    parent = TestMinMax()
+
+    @pytest.mark.parametrize("is_max, has_nan, exp_list", TestMinMax.TestData)
+    def test_minmax(self, is_max, has_nan, exp_list):
+        TestMinMaxNumba.parent.test_minmax(is_max, has_nan, exp_list, "numba")
+
+    def test_wrong_order(self):
+        TestMinMaxNumba.parent.test_wrong_order("numba")


Can you essentially duplicate the test in pandas/tests/window/test_rolling.py? The pandas test suite avoids test dependencies like this.

mroeschke

Thanks for accommodating my requests. This is looking close to merging!

viable-alternative requested a review from WillAyd as a code owner April 14, 2025 12:15

viable-alternative force-pushed the rolling_mm_upstream branch 3 times, most recently from 7c4b8a9 to 1c2ce94 Compare April 14, 2025 19:35

simonjayhawkins mentioned this pull request Apr 17, 2025

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min #46726

Open

3 tasks

simonjayhawkins added Bug Window rolling, ewma, expanding labels Apr 17, 2025

mroeschke requested changes Apr 17, 2025

View reviewed changes

viable-alternative force-pushed the rolling_mm_upstream branch from b67a773 to 54b62c7 Compare April 17, 2025 18:22

viable-alternative requested a review from mroeschke April 17, 2025 18:56

mroeschke reviewed Apr 18, 2025

View reviewed changes

pandas/_libs/window/aggregations.pyx Outdated Show resolved Hide resolved

mroeschke reviewed Apr 18, 2025

View reviewed changes

pandas/_libs/window/aggregations.pyx Outdated Show resolved Hide resolved

mroeschke reviewed Apr 18, 2025

View reviewed changes

pandas/core/_numba/kernels/min_max_.py Outdated Show resolved Hide resolved

mroeschke reviewed Apr 18, 2025

View reviewed changes

mroeschke requested changes Apr 18, 2025

View reviewed changes

viable-alternative force-pushed the rolling_mm_upstream branch from 144fdc9 to 8b93478 Compare April 19, 2025 17:44

viable-alternative requested a review from mroeschke April 19, 2025 18:34

rhshadrach requested changes Apr 20, 2025

View reviewed changes

viable-alternative and others added 6 commits April 21, 2025 10:35

BUG: Fix pandas-dev#46726; wrong result with varying window size min/…

d77b53c

…max rolling calc.

Whitespace recommendations from code review.

5a8a480

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

Whitespace recommendations from code review.

15eabc2

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

removed comments in code

b272c34

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

In-code comments replaced with minimal comment at the top of the

3a70973

fuinction, as requested by @mroeschke

Randomized deterministic unit tests replaced with 10-row prebuilt tes…

922ea45

…ts as requested by @mroeschke

viable-alternative force-pushed the rolling_mm_upstream branch from 8b93478 to 922ea45 Compare April 21, 2025 14:35

mroeschke reviewed Apr 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

viable-alternative commented Apr 14, 2025 •

edited

Loading

mroeschke left a comment

viable-alternative commented Apr 17, 2025

mroeschke Apr 18, 2025

viable-alternative Apr 18, 2025

mroeschke Apr 18, 2025

rhshadrach Apr 19, 2025 •

edited

Loading

viable-alternative Apr 19, 2025

mroeschke Apr 18, 2025

viable-alternative Apr 18, 2025

viable-alternative Apr 18, 2025

mroeschke Apr 18, 2025

viable-alternative Apr 19, 2025

viable-alternative Apr 19, 2025

rhshadrach left a comment

rhshadrach Apr 20, 2025

rhshadrach Apr 20, 2025

rhshadrach Apr 20, 2025

rhshadrach Apr 20, 2025

viable-alternative commented Apr 20, 2025

mroeschke Apr 21, 2025

mroeschke left a comment

BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

Are you sure you want to change the base?

BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288

Conversation

viable-alternative commented Apr 14, 2025 • edited Loading

Summary:

Changed behavior:

Note on invalid inputs:

How to Debug numba

Misc Notes

mroeschke left a comment

Choose a reason for hiding this comment

viable-alternative commented Apr 17, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhshadrach Apr 19, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhshadrach left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viable-alternative commented Apr 20, 2025

Choose a reason for hiding this comment

mroeschke left a comment

Choose a reason for hiding this comment

viable-alternative commented Apr 14, 2025 •

edited

Loading

rhshadrach Apr 19, 2025 •

edited

Loading