[Epic] High cardinality aggregation performance wishlist #11679
Labels
enhancement
New feature or request
Is your feature request related to a problem or challenge?
DataFusion uses a two phase approach to aggregation (see Accumulator::state for details).
For low cardinality aggregates (where there are only a few distinct groups), this works great 👌 👨🍳
However, for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing this path. See the background and ASCII art on #7957 for why the intermediate cardinality increases.
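To make the cardinality point concrete, here is a minimal sketch of two phase aggregation in plain Rust. It is not DataFusion's implementation: the names `partial_aggregate`, `final_aggregate`, `intermediate_rows` and `make_partitions` are hypothetical, the state is a simple (sum, count) pair, and the sketch omits any repartitioning between the phases. It only shows how many intermediate rows the first phase hands to the second.

```rust
use std::collections::HashMap;

/// Partial state for an AVG-like aggregate: (sum, count) per group.
type State = (i64, u64);

/// Phase 1: each partition aggregates only its own rows into per-group state.
fn partial_aggregate(rows: &[(u64, i64)]) -> HashMap<u64, State> {
    let mut groups: HashMap<u64, State> = HashMap::new();
    for &(key, value) in rows {
        let state = groups.entry(key).or_insert((0, 0));
        state.0 += value;
        state.1 += 1;
    }
    groups
}

/// Phase 2: merge the partial states produced by all partitions.
fn final_aggregate(partials: Vec<HashMap<u64, State>>) -> HashMap<u64, State> {
    let mut merged = HashMap::new();
    for partial in partials {
        for (key, (sum, count)) in partial {
            let state = merged.entry(key).or_insert((0, 0));
            state.0 += sum;
            state.1 += count;
        }
    }
    merged
}

/// Total number of (group, state) rows that phase 1 passes on to phase 2.
fn intermediate_rows(partials: &[HashMap<u64, State>]) -> usize {
    partials.iter().map(|p| p.len()).sum()
}

/// Hypothetical test data: rows are dealt round-robin to partitions,
/// keys cycle over `n_groups`, and every value is 1.
fn make_partitions(n_rows: u64, n_groups: u64, n_partitions: usize) -> Vec<Vec<(u64, i64)>> {
    let mut parts = vec![Vec::new(); n_partitions];
    for i in 0..n_rows {
        parts[(i as usize) % n_partitions].push((i % n_groups, 1));
    }
    parts
}
```

With 1000 rows, 5 groups and 4 partitions, the first phase shrinks the input to 20 intermediate rows. With 1000 distinct groups it emits 1000 intermediate rows, so the partial phase does hashing work without reducing the data at all.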
This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months #11631 )
Together with the StringView work in #10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries
Also I hear that @avantgardnerio may be interested in helping here
Describe the solution you'd like
Here is my wishlist:
CoalesceBatchesExec to improve performance #7957 (I have a prototype and some ideas)
Describe alternatives you've considered
Do nothing and let DuckDB pass us by ;)
Additional context
Other potential things to do: