-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
base: main
Are you sure you want to change the base?
BUG: Fix #46726; wrong result with varying window size min/max rolling calc. #61288
Conversation
7c4b8a9
to
1c2ce94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
Before fully reviewing, it would be great if some things conformed more closely to our codebase, namely
- Remove
bug_hunters
directory - Remove
pandas/core/window/_min_max.py
if it's not used by the Cython or Numba implementation - Instead of a separate
test_minmax.py
create tests that would close the original issue intests/window/test_rolling.py
andtests/window/test_numba.py
b67a773
to
54b62c7
Compare
@mroeschke, all done! |
Dominators: list = [] # this is a stack | ||
# END-OF numba-only init part | ||
|
||
# Basic idea of the algorithm: https://stackoverflow.com/a/12239580 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove most of the comments in this file if it's mostly described by this stackoverflow link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack overflow (SO) link does not describe the method; just provides the code for a special case (fixed window). We can confirm that by noting that one one of the SO comments states "please explain the algorithm more clearly ??" I don't mind taking out all comments. It will not be an easy task to understand what this method is doing and why it works (or maybe doesn't in some cases!) without them. Further, this is an extension to the SO method (like a 2nd dimension added to a 1-dimentional case) so the SO article cannot describe that extension.
Suggestion: keep all in-code comments.
I am new to GItHub, and I don't feel I know enough to click on "resolve this conversation". Feel free to "resolve" and tell me if I was supposed to "resolve" it instead. Thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I appreciate the thoroughness of the comments and willingness to explain things, I find it fairly distracting to have more comments than code.
If you could reduce this by giving variables clearer names and maybe have at most 3 lines of comments at the beginning of this function explaining what you're doing, that would be a good compromise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @mroeschke. This is an opinionated take, but excessive comments like this are a code smell. The best code is code that doesn't require any comments. If code requires an excessive amount of comments, it can be (not is) an indication the code can be improved. Of course, highly technical algorithms like this almost always necessitate some amount of comments.
It seems to me individual comments can be made more concise, which may help a bit. E.g. instead of
This is also likely 99+% of all use cases, thus we want to make sure we do not go into bisection as to incur neither the *log(k) penalty nor the function call penalty for this very common case.
could be
This is a common case, ensure we avoid bisection for performance.
In addition, it may be a good idea to have a paragraph of a concise high-level description of the algorithm prior to starting the main code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In-code comments addressed as requested by @mroeschke.
As for the request to "giving variables clearer names": all new variables were given clear names in the original commit. There are two reasons for keeping existing variable names the same: the "Q" variable name has to stay to align with the linked SO article as the only surviving piece of documentation. The remaining variables kept as is, as to not complicate the review of what has changed. There is not much room for renaming them at any rate.
pandas/tests/window/test_rolling.py
Outdated
@@ -1946,3 +1947,172 @@ def test_rolling_timedelta_window_non_nanoseconds(unit, tz): | |||
df.index = df.index.as_unit("ns") | |||
|
|||
tm.assert_frame_equal(ref_df, df) | |||
|
|||
|
|||
class StandardWindowIndexer(BaseIndexer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of all this, can we have a test like
class MultiWindowIndexer(...)
# from the original issue
def test_max_with_custom_indexer():
df = pd.DataFrame({"window": [not_random_data], "data": [not_random_data]})
result = df.rolling(MultiWindowIndexer(df.window)).max()
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, are you asking to remove pseudo-random data with seed and replace it with specific typed-in data, like a list of literals? What is a good way of storing 1,000 (or possibly more) numbers in the test? One would suggest practicality of storing of 1,000 or more numbers to be questionable. On the other hand, reducing the number to something manageable for storing (like 10 or 20) drastically reduces the power of the test. It could be that I am misreading the request? The current tests run quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not occur to me to bring this up in the first response. There is a reason these tests are such. Earlier versions of this fix worked perfectly fine with no NaN data, and by adding the "right" amount of NaNs (30%, 60%, etc) and choosing a certain proportion of the window length to the size of the data frame and NaN fraction, one or the other kind of bugs would come out. I can quickly write a short test that fails on main and passes with this fix. I would not recommend my customer (pandas project) to consider such unit test satisfactory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future maintainability of these test, we would appreciate deterministic tests. You could still use pytest.mark.parmeterize
to specify pairs of input test data of 10 elements with varying proportion of NaNs and their expected outputs to gain satisfactory coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on the 10-element tests as we speak. I would like to highlight that the tests being offered are deterministic. Do not be misled by (pseudo!)random number generators (RNG). All RNG calls in these tests have a seed. The existing CI controls on the repo will not allow truly random tests at any rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000-element randomized deterministic tests replaced with 10-element handwritten tests as requested. This closes all pending requests.
144fdc9
to
8b93478
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite an algorithm - this is very nice!
Q
, Dominators
, and TestData
go against PEP8 naming guidelines. I'm surprised our linters don't flag these. In any case, can we adhere to PEP8 here - I'd suggest queue
, dominators
, and test_data
, but would also be okay with q
.
@@ -18,6 +21,19 @@ | |||
from pandas._typing import npt | |||
|
|||
|
|||
@numba.njit(nogil=True, parallel=False) | |||
def bisect_left(a: list[Any], x: Any, lo: int = 0, hi: int = -1) -> int: | |||
if hi == -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a docstring here; I think just a single line is fine. Something like
"""Get index of smallest value greater than or equal to x between lo and hi."""
|
||
# Discard entries outside and left of current window | ||
while Q and Q[0] <= start[i] - 1: | ||
Q: list = [] # this is a queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on what this queue tracks. Something like
# Track indices of potential extrema. Corresponding values will always be monotonic (increasing/decreasing depending on min/max).
Also, I think type-hint can be list[int]
.
# Discard entries outside and left of current window | ||
while Q and Q[0] <= start[i] - 1: | ||
Q: list = [] # this is a queue | ||
Dominators: list = [] # this is a stack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on what this stack tracks. Something like
# Track `start` values we need to consider when building `Q`.
Can also be list[int]
.
int64_t lo=0, | ||
int64_t hi=-1 | ||
) nogil: | ||
cdef int64_t mid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, can you add a similar docstring.
Thank you for the thoughtful review. Give me some time (hopefully under a week) to read up on PEP8 and address other comments. Thanks for the helpful naming suggestions: only takes a minute, and saves me hours of time digging too deep into PEP8. |
…max rolling calc.
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
fuinction, as requested by @mroeschke
…ts as requested by @mroeschke
8b93478
to
922ea45
Compare
class TestMinMaxNumba: | ||
parent = TestMinMax() | ||
|
||
@pytest.mark.parametrize("is_max, has_nan, exp_list", TestMinMax.TestData) | ||
def test_minmax(self, is_max, has_nan, exp_list): | ||
TestMinMaxNumba.parent.test_minmax(is_max, has_nan, exp_list, "numba") | ||
|
||
def test_wrong_order(self): | ||
TestMinMaxNumba.parent.test_wrong_order("numba") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you essentially duplicate the test in pandas/tests/window/test_rolling.py
? The pandas test suite avoids test dependencies like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for accommodating my requests. This is looking close to merging!
doc/source/whatsnew/vX.X.X.rst
as I do not know what version this will go out with. I am happy to add such section pretty quickly if you tell me the version number.Summary:
Changed behavior:
Has additional validity check, which will raise ValueError if the function detects a condition it cannot work with, namely improper ordering of start/end bounds. The existing method would happily consume such input and would produce a wrong result. There is a new unit test to check for the raised ValueError.
Note on invalid inputs:
It is possible to make the method work for an arbitrary stream of start/end window bounds, but it will require sorting. It is very unlikely that such work is worth the effort, and it is estimated to have extremely low need, if any. Let someone create an enhancement request first.
If sorting is to be implemented: it can be done with only incurring performance hit in the case of unsorted input: copy and sort the start/end arrays, producing a permutation, run the main method on the copy, and then extract the result back using the permutation. To detect if the start/end array pair is properly sorted will only take O(N). (Soring is N*log(N), does not have to be stable, but the input array is extremely likely to be “almost” sorted, and you have to pick your poison of a sorting method that works well with nearly sorted array, or use efficient soring methods, most of which do not offer additional speed on nearly sorted arrays.) Working such intermediate step (without copying and pasting) into 3 different implementations will require some less than straightforward work in the “apply” family of methods used by other rolling functions, and therefore will bear risk. If this is decided to be done, it is recommended to have an additional parameter to optionally skip the “sorted” check. (The user may already know that the arrays are properly sorted).
How to Debug numba
You can temporarily change 2 lines of code in order to Python-debug numba implementation with VS Code or another Python debugger:
numba.jit
decorator on the function(sliding_min_max()
inmin_max_.py
).column_looper()
function defined inside thegenerate_apply_looper()
function in executor.py.Misc Notes
The speed improvement of ~10% was confirmed in two ways: