[C++][Compute] Hash aggregation is slowish #45741

Open
pitrou opened this issue Mar 11, 2025 · 4 comments

@pitrou
Member

pitrou commented Mar 11, 2025

Describe the enhancement requested

Running some simple benchmarks from Python, I was a bit surprised by the performance of group-by aggregations:

  • 10000 groups:
>>> n = 10000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
496 μs ± 439 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
708 μs ± 1.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • 100000 groups:
>>> n = 100000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
5.93 ms ± 11.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
8.23 ms ± 28.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
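
For reproducing these timings outside of IPython, a self-contained equivalent should look roughly like this (a sketch, assuming a PyArrow build that provides the pivot_wider aggregation and pc.PivotWiderOptions, as used above):

import timeit
import pyarrow as pa
import pyarrow.compute as pc

n = 100_000
a = pa.table({'group': list(range(n)) * 2,
              'key': ['h'] * n + ['w'] * n,
              'value': range(n * 2)})

def bench_sum():
    a.group_by('group', use_threads=False).aggregate([('value', 'sum')])

def bench_pivot():
    a.group_by('group', use_threads=False).aggregate(
        [(('key', 'value'), 'pivot_wider',
          pc.PivotWiderOptions(key_names=('h', 'w')))])

# timeit.repeat returns the total seconds for `number` calls; take the best run.
for name, fn in [('sum', bench_sum), ('pivot_wider', bench_pivot)]:
    best = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    print(f"{name}: {best * 1e3:.2f} ms per call")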

I was initially expecting pivot_wider to be much slower than sum, both because it does a secondary grouping using a naive std::unordered_map, and because it does a row-to-column transposition of grouped values. But pivot_wider only appears to be 50% slower than a simple sum.
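To make that transposition concrete, here is a minimal sketch of what pivot_wider computes on a toy table (the exact output layout may vary by Arrow version; the point is that each group's 'h' and 'w' values end up side by side in a single output row):

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical 4-row table: two groups, each with one 'h' and one 'w' value.
t = pa.table({'group': [1, 1, 2, 2],
              'key':   ['h', 'w', 'h', 'w'],
              'value': [10, 11, 20, 21]})

# pivot_wider gathers, per group, the value matching each key name,
# i.e. it transposes the grouped rows into columns.
r = t.group_by('group', use_threads=False).aggregate(
    [(('key', 'value'), 'pivot_wider',
      pc.PivotWiderOptions(key_names=('h', 'w')))])
print(r)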

In absolute numbers, it seems group-by summing hovers at around 30-40M rows/second. Given that we're supposed to use a high-performance hash table ("swiss table" with AVX2 optimizations) and the group ids above are trivially distributed integers, this doesn't seem like a very high number.
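The rows/second figure follows directly from the 100000-group timings above (assuming each quoted time covers one aggregation over all n*2 = 200000 input rows):

rows = 2 * 100_000           # input rows per aggregate() call
t_sum = 5.93e-3              # seconds per 'sum' call, from the timing above
t_pivot = 8.23e-3            # seconds per 'pivot_wider' call
print(rows / t_sum / 1e6)    # ~33.7 M rows/s for the plain sum
print(rows / t_pivot / 1e6)  # ~24.3 M rows/s for pivot_wider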

What should the expectations be here? @zanmato1984

Component(s)

C++

@pitrou
Member Author

pitrou commented Mar 11, 2025

For the record, Pandas is slower, but not astonishingly so either:

  • 10000 groups
>>> n = 10000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> df = a.to_pandas()
>>> %timeit df.groupby('group').sum('value')
906 μs ± 406 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • 100000 groups
>>> n = 100000
>>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
>>> df = a.to_pandas()
>>> %timeit df.groupby('group').sum('value')
6.76 ms ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@zanmato1984
Contributor

I can't comment on the numbers for now. Do you have any flame graphs for sum and pivot_wider? If so, I can take a look before I benchmark them myself (though I can't promise a timeline).

@pitrou
Member Author

pitrou commented Mar 11, 2025

No, I don't have any flame graphs. It's not a pressing issue either, and I don't actually need faster hashing :). I was just surprised and thought I'd share the results in case other people care.

@zanmato1984
Contributor

No problem. This is indeed interesting and worth paying attention to. Thank you for sharing! I'll take a look when my schedule allows.
