-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation #13123
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Comments
@2010YOUY01 says in #13090 (comment)
DF doesn't have a buffer pool in the traditional sense, and the way arrow-rs allocates memory directly from the system allocator makes it quite hard to implement. However, I think the fact that we have arrow-rs and the arrow IPC offers lots of opportunity.
I think starting with stability and then optimizing is a great idea 💯 Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins Another potential option would be to work on running clickbench with a very small memory (100MB)? Or maybe we could figure out another large dataset 🤔 |
Maybe @comphead 's work to get SMJ working in #13111 will help this (e.g. we could always use SMJ for the large TPCH queries 🤔 ) |
This is a good idea, we should get clickbench work under memory constraints before TPCH |
Here is a PR to optimize the spill format: |
This is a collection of items to improve external (spilling) aggregation
Background
in the Solid State Age (DuckDB external aggregation paper))
DataFusion has supported memory limited / spilling hash aggregation since @kazuyukitanimura added it last year in #7400.
We can likely improve this feature and @2010YOUY01 is considering working on it
Tasks the solution you'd like
The text was updated successfully, but these errors were encountered: