[C++][Compute] Hash aggregation is slowish #45741

Open
pitrou opened this issue Mar 11, 2025 · 4 comments

@pitrou
Member

pitrou commented Mar 11, 2025

Describe the enhancement requested

Running some simple benchmarks from Python, I was a bit surprised by the performance of group-by aggregations:

  • 10000 groups:
>>> n = 10000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
496 μs ± 439 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
708 μs ± 1.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • 100000 groups:
>>> n = 100000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
5.93 ms ± 11.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
8.23 ms ± 28.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
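
For reproducing these timings outside of IPython, a self-contained equivalent should look roughly like this (a sketch, assuming a PyArrow build that provides the pivot_wider aggregation and pc.PivotWiderOptions, as used above):

import timeit
import pyarrow as pa
import pyarrow.compute as pc

n = 100_000
a = pa.table({'group': list(range(n)) * 2,
              'key': ['h'] * n + ['w'] * n,
              'value': range(n * 2)})

def bench_sum():
    a.group_by('group', use_threads=False).aggregate([('value', 'sum')])

def bench_pivot():
    a.group_by('group', use_threads=False).aggregate(
        [(('key', 'value'), 'pivot_wider',
          pc.PivotWiderOptions(key_names=('h', 'w')))])

# timeit.repeat returns the total seconds for `number` calls; take the best run.
for name, fn in [('sum', bench_sum), ('pivot_wider', bench_pivot)]:
    best = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    print(f"{name}: {best * 1e3:.2f} ms per call")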

I was initially expecting pivot_wider to be much slower than sum, both because it does a secondary grouping using a naive std::unordered_map, and because it does a row-to-column transposition of grouped values. But pivot_wider only appears to be 50% slower than a simple sum.
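To make that transposition concrete, here is a minimal sketch of what pivot_wider computes on a toy table (the exact output layout may vary by Arrow version; the point is that each group's 'h' and 'w' values end up side by side in a single output row):

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical 4-row table: two groups, each with one 'h' and one 'w' value.
t = pa.table({'group': [1, 1, 2, 2],
              'key':   ['h', 'w', 'h', 'w'],
              'value': [10, 11, 20, 21]})

# pivot_wider gathers, per group, the value matching each key name,
# i.e. it transposes the grouped rows into columns.
r = t.group_by('group', use_threads=False).aggregate(
    [(('key', 'value'), 'pivot_wider',
      pc.PivotWiderOptions(key_names=('h', 'w')))])
print(r)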

In absolute numbers, it seems group-by summing hovers at around 30-40M rows/second. Given that we're supposed to use a high-performance hash table ("swiss table" with AVX2 optimizations) and the group ids above are trivially distributed integers, this doesn't seem like a very high number.
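The rows/second figure follows directly from the 100000-group timings above (assuming each quoted time covers one aggregation over all n*2 = 200000 input rows):

rows = 2 * 100_000           # input rows per aggregate() call
t_sum = 5.93e-3              # seconds per 'sum' call, from the timing above
t_pivot = 8.23e-3            # seconds per 'pivot_wider' call
print(rows / t_sum / 1e6)    # ~33.7 M rows/s for the plain sum
print(rows / t_pivot / 1e6)  # ~24.3 M rows/s for pivot_wider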

What should the expectations be here? @zanmato1984

Component(s)

C++

@pitrou
Member Author

pitrou commented Mar 11, 2025

For the record, Pandas is slower, but not astonishingly so either:

  • 10000 groups
>>> n = 10000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> df = a.to_pandas()
>>> %timeit df.groupby('group').sum('value')
906 μs ± 406 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • 100000 groups
>>> n = 100000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> df = a.to_pandas()
>>> %timeit df.groupby('group').sum('value')
6.76 ms ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@zanmato1984
Contributor

I can't comment on the numbers for now. Do you have any flame graphs for sum and pivot_wider? If so, I can take a look before I benchmark them myself (though I can't promise a timeline).

@pitrou
Member Author

pitrou commented Mar 11, 2025

No, I don't have any flame graphs. It's not a pressing issue either, and I don't actually need faster hashing :). I was just surprised and thought I'd share the results in case other people care.

@zanmato1984
Contributor

No problem. This is indeed interesting and worth paying attention to. Thank you for sharing! I'll take a look when my schedule allows.
