Skip to content

DES: How much perf penalty will we accept to get rid of libreduction? #40263

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
jbrockmendel opened this issue Mar 6, 2021 · 4 comments
Closed
Labels
Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance

Comments

@jbrockmendel
Copy link
Member

xref some discussion #40171 (comment)

libreduction and the associated callers are a disproportionate maintenance headache [citation needed]. It would be nice to be able to rip it out and just have one path for those methods, but that would entail a non-trivial performance hit. Recently, though, we've managed to optimize the pure-python path a bit, and im optimistic we can shave off some more of the difference.

The question: how much of a perf penalty are we willing to accept in order to remove libreduction?

Copying from #40171 (comment)

I'll throw out a number: if we could get worst-case down to within about 3x and most-cases within 1.5x, I'd be open to removing the cython paths. They are a disproportionate producer of headaches. (From #36459 (possibly out of date) "entirely disabling the cython path leads to 4 currently-xfailed tests passing")

Besides which, if/when cython3 becomes available, we may have to get rid of these anyway.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 6, 2021
@jorisvandenbossche
Copy link
Member

The referenced benchmark is:

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
df.groupby("key").apply(lambda x: 1)

Running this with current master, I get:

In [2]: %timeit df.groupby("key").apply(lambda x: 1)
5.45 ms ± 92.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When disabling the usage of libreduction fast_apply using this patch:

--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -390,6 +390,7 @@ class BaseGrouper:
             # for now -> relies on BlockManager internals
             pass
         elif (
+            False and
             com.get_callable_name(f) not in base.plotting_methods
             and isinstance(splitter, FrameSplitter)
             and axis == 0

I get the following timing:

In [4]: %timeit df.groupby("key").apply(lambda x: 1)
16.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So a 3-4x slowdown.

However, the applied function is not doing anything useful (just returning a constant), so basically this benchmark is only measuring the overhead. And whether a 3-4x slowdown in the overhead is significant in a real use case, depends on how much time this overhead itself takes.

So using a slightly more complex example, calculating the mean of one of the columns (which is still a relatively simple/fast function, I think). With master, this gives

In [4]: %timeit df.groupby("key").apply(lambda x: x['value1'].mean())
147 ms ± 5.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

with libreduction disabled, I get:

In [6]: %timeit df.groupby("key").apply(lambda x: x['value1'].mean())
182 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So still slower, but no longer a 3-4x slowdown.
And this slowdown is for me in the range of what's acceptable to get rid of libreduction, I think.

(it's probably useful to see if those numbers are similar on different machines)

@lithomas1 lithomas1 added Performance Memory or execution speed performance Needs Discussion Requires discussion from core team before further action Internals Related to non-user accessible pandas implementation and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2021
@mroeschke
Copy link
Member

@jbrockmendel did #42992 close this?

@jbrockmendel
Copy link
Member Author

Only half of it

@jbrockmendel
Copy link
Member Author

Closed by #43189

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants