[EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation #13123

alamb · 2024-10-26T10:52:29Z

This is a collection of items to improve external (spilling) aggregation.

Background

Abstract—Analytical database systems offer high-performance in-memory aggregation. If there are many unique groups, temporary query intermediates may not fit RAM, requiring the use of external storage. However, switching from an in-memory to an external algorithm can degrade performance sharply

Source: Robust External Hash Aggregation in the Solid State Age (DuckDB external aggregation paper)

DataFusion has supported memory limited / spilling hash aggregation since @kazuyukitanimura added it last year in #7400.

We can likely improve this feature, and @2010YOUY01 is considering working on it.
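For readers new to the topic, the general shape of memory-limited (spilling) aggregation can be sketched as follows. This is a toy illustration under stated assumptions, not DataFusion's implementation: the names `spilling_group_count`, `drain_sorted`, `merge_runs`, and the `max_groups` budget are all hypothetical, the "memory limit" is just a cap on the number of groups, and spilled runs are kept in `Vec`s where a real engine would write them to disk.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// One spilled run: partial (key, count) pairs sorted by key.
type Run = Vec<(String, u64)>;

/// Empty the in-memory hash table into a key-sorted run.
fn drain_sorted(table: &mut HashMap<String, u64>) -> Run {
    let mut run: Run = table.drain().collect();
    run.sort();
    run
}

/// K-way merge of sorted runs, summing partial counts of equal keys.
fn merge_runs(runs: &[Run]) -> Run {
    // Heap entries are (key, run index, position in run), smallest key first.
    let mut heap = BinaryHeap::new();
    for (r, run) in runs.iter().enumerate() {
        if let Some((key, _)) = run.first() {
            heap.push(Reverse((key.as_str(), r, 0usize)));
        }
    }
    let mut out: Run = Vec::new();
    while let Some(Reverse((key, r, i))) = heap.pop() {
        let count = runs[r][i].1;
        let same_group = out.last().map_or(false, |(last, _)| last.as_str() == key);
        if same_group {
            out.last_mut().unwrap().1 += count;
        } else {
            out.push((key.to_string(), count));
        }
        // Advance within the run this entry came from.
        if let Some((next, _)) = runs[r].get(i + 1) {
            heap.push(Reverse((next.as_str(), r, i + 1)));
        }
    }
    out
}

/// COUNT(*) GROUP BY key with at most `max_groups` groups held in memory.
/// Returns the final groups (sorted by key) and the number of spills.
fn spilling_group_count(input: &[&str], max_groups: usize) -> (Run, usize) {
    assert!(max_groups > 0, "the budget must allow at least one group");
    let mut table: HashMap<String, u64> = HashMap::new();
    let mut runs: Vec<Run> = Vec::new();
    for &key in input {
        // A new group would exceed the budget: spill the current table first.
        if !table.contains_key(key) && table.len() >= max_groups {
            runs.push(drain_sorted(&mut table));
        }
        *table.entry(key.to_string()).or_insert(0) += 1;
    }
    let spills = runs.len();
    if !table.is_empty() {
        runs.push(drain_sorted(&mut table));
    }
    (merge_runs(&runs), spills)
}

fn main() {
    let input = ["a", "b", "a", "c", "b", "a", "d", "c"];
    let (result, spills) = spilling_group_count(&input, 2);
    println!("{spills} spills: {result:?}");
}
```

The sketch collects the merged output into a `Vec` only so that it is easy to inspect. The merge itself touches one entry per run at a time, which is what lets the final pass run in bounded memory when the runs live on disk.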

Tasks

alamb · 2024-10-26T10:58:18Z

@2010YOUY01 says in #13090 (comment)

Really nice paper, we can implement the same benchmark and compare in the future 😄 They implemented a unified buffer pool for both the table data cache and operator (like aggregation) intermediate results, to easily support spilling in various operators. I think they didn't mention any optimization specific to the spilling part of aggregation, and just use a simple LRU policy in the buffer pool. Maybe there are some spilling- and merging-specific optimizations we can explore (which memory-limited aggregate, SortMergeJoin, and Sort could all benefit from).

DF doesn't have a buffer pool in the traditional sense, and the way arrow-rs allocates memory directly from the system allocator makes it quite hard to implement. However, I think the fact that we have arrow-rs and the arrow IPC offers lots of opportunity.

Also, are you interested in improving DataFusion's external aggregation capabilities? I think it is a non-trivial gap at the moment and would be great to improve (and I would be interested in helping do so).
If you are, I can start organizing the work into some tickets to see if we can get some others to check it out too.

Yes, I'm starting to look at related components now. Perhaps we can start by making memory-limited SQL queries more stable (e.g. more tests, making sure TPCH-SF1000 is able to run correctly on a laptop), and optimize later.

I think starting with stability and then optimizing is a great idea 💯

Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins

Memory Limited Joins (Externalized / Spill) #1599

Another potential option would be to work on running clickbench with a very small memory (100MB)?

Or maybe we could figure out another large dataset 🤔

alamb · 2024-10-26T11:43:27Z

Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins

Maybe @comphead 's work to get SMJ working in #13111 will help this (e.g. we could always use SMJ for the large TPCH queries 🤔 )

2010YOUY01 · 2024-10-27T06:46:55Z

Another potential option would be to work on running clickbench with a very small memory (100MB)?

This is a good idea, we should get clickbench working under memory constraints before TPCH.

alamb · 2025-01-10T13:59:25Z

Here is a PR to optimize the spill format:

Optimized spill file format #14078

alamb added the enhancement New feature or request label Oct 26, 2024

alamb mentioned this issue Oct 26, 2024

Oct 21, 2024: This week in DataFusion #13035

Closed

4 tasks

alamb mentioned this issue Oct 29, 2024

Oct 28, 2024: This week in DataFusion #13167

Closed

3 tasks

2010YOUY01 mentioned this issue Nov 15, 2024

Improve test coverage for spilling (memory-limited) sort/aggregation/sort-merge-join #13431

Open

alamb changed the title ~~[EPIC] Improved Externalized / Spilling / Out of core Hash Aggregation~~ [EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation Jan 10, 2025

alamb mentioned this issue Jan 10, 2025

[Epic] A collection of items related to processing larger than memory datasets (via spilling, externalized algorithm, etc) #14077

Closed

6 tasks

Rachelint mentioned this issue Feb 5, 2025

Project Ideas for GSoC 2025 (Google Summer of Code) #14478

Open

ding-young mentioned this issue May 16, 2025

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

Open

9 tasks
