[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

ding-young · 2025-05-16T14:37:55Z

Is your feature request related to a problem or challenge?

To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.

This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.
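For readers new to the topic, here is a minimal sketch of what spilling means for a sort: buffer rows up to a budget, write each full buffer to disk as a sorted run, then merge the runs. This is an illustration only, not DataFusion's implementation. The names `external_sort`, `spill_run`, and `merge_runs`, the row-count budget, and the one-value-per-line text format are all assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Sort `input` while buffering at most `max_in_memory` values at a time.
/// Hypothetical helper for illustration; the result is collected into a Vec
/// for simplicity, whereas a real operator would stream the merged output.
pub fn external_sort(
    input: impl Iterator<Item = i64>,
    max_in_memory: usize,
    spill_dir: &Path,
) -> io::Result<Vec<i64>> {
    assert!(max_in_memory > 0, "budget must allow at least one value");
    let mut buffer: Vec<i64> = Vec::with_capacity(max_in_memory);
    let mut runs: Vec<PathBuf> = Vec::new();

    for value in input {
        buffer.push(value);
        if buffer.len() == max_in_memory {
            // Budget exhausted: sort the buffer and spill it as one run.
            let index = runs.len();
            let path = spill_run(&mut buffer, spill_dir, index)?;
            runs.push(path);
        }
    }

    if runs.is_empty() {
        // Everything fit in memory, so no disk I/O is needed.
        buffer.sort_unstable();
        return Ok(buffer);
    }
    if !buffer.is_empty() {
        let index = runs.len();
        let path = spill_run(&mut buffer, spill_dir, index)?;
        runs.push(path);
    }

    let merged = merge_runs(&runs)?;
    for path in &runs {
        let _ = fs::remove_file(path); // best-effort cleanup of spill files
    }
    Ok(merged)
}

/// Sort the buffer, write it to a run file (one value per line), and clear it.
fn spill_run(buffer: &mut Vec<i64>, dir: &Path, index: usize) -> io::Result<PathBuf> {
    buffer.sort_unstable();
    let path = dir.join(format!("run-{}.txt", index));
    let mut writer = BufWriter::new(File::create(&path)?);
    for value in buffer.iter() {
        writeln!(writer, "{}", value)?;
    }
    writer.flush()?;
    buffer.clear();
    Ok(path)
}

/// K-way merge of sorted run files using a min-heap of (value, run index).
fn merge_runs(runs: &[PathBuf]) -> io::Result<Vec<i64>> {
    let mut readers = Vec::with_capacity(runs.len());
    for path in runs {
        readers.push(BufReader::new(File::open(path)?).lines());
    }
    let mut heap = BinaryHeap::new();
    for (i, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((parse(&line?)?, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((value, i))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run that just produced a value.
        if let Some(line) = readers[i].next() {
            heap.push(Reverse((parse(&line?)?, i)));
        }
    }
    Ok(out)
}

fn parse(line: &str) -> io::Result<i64> {
    line.trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Calling `external_sort(values.into_iter(), 3, &dir)` on seven values would write three run files into `dir` before merging them. A real engine would budget in bytes rather than rows and spill a columnar format instead of text.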

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

Enable TrackConsumersPool by default in datafusion-cli
Improve error messages to guide users on how to adjust parameters
Enable sort query fuzzing with limited memory #15517
Migrate tests to insta
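To illustrate the kind of error message the first two items aim for, here is a rough sketch of a pool that tracks reservations per consumer and names the largest consumers when a reservation fails. It is not the API of DataFusion's TrackConsumersPool. `TrackingPool`, `try_grow`, `shrink`, and the message wording are hypothetical names chosen for this example.

```rust
use std::collections::HashMap;

/// Hypothetical memory pool that records how many bytes each consumer holds.
pub struct TrackingPool {
    limit: usize,
    used: usize,
    by_consumer: HashMap<String, usize>,
}

impl TrackingPool {
    pub fn new(limit: usize) -> Self {
        TrackingPool { limit, used: 0, by_consumer: HashMap::new() }
    }

    /// Reserve `bytes` for `consumer`, or return a descriptive error message.
    pub fn try_grow(&mut self, consumer: &str, bytes: usize) -> Result<(), String> {
        if self.used.saturating_add(bytes) > self.limit {
            return Err(self.report(consumer, bytes));
        }
        self.used += bytes;
        *self.by_consumer.entry(consumer.to_string()).or_insert(0) += bytes;
        Ok(())
    }

    /// Release up to `bytes` previously reserved by `consumer`.
    pub fn shrink(&mut self, consumer: &str, bytes: usize) {
        if let Some(held) = self.by_consumer.get_mut(consumer) {
            let freed = bytes.min(*held);
            *held -= freed;
            self.used -= freed;
        }
    }

    /// Build an error message that lists the three largest consumers.
    fn report(&self, consumer: &str, bytes: usize) -> String {
        let mut top: Vec<(&String, &usize)> = self.by_consumer.iter().collect();
        // Largest reservation first; ties broken by name for stable output.
        top.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
        let top: Vec<String> = top
            .iter()
            .take(3)
            .map(|(name, held)| format!("{}={}B", name, held))
            .collect();
        format!(
            "Failed to reserve {} bytes for {}: {} of {} bytes already in use. \
             Top consumers: {}. Consider raising the memory limit.",
            bytes,
            consumer,
            self.used,
            self.limit,
            top.join(", ")
        )
    }
}
```

With a 100-byte limit, reserving 60 bytes for "sort" and 30 for "agg" and then requesting 20 for "join" would fail with a message that names "sort" and "agg" as the current holders, which tells the user where the memory went.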

Sort

Aggregate

Integrate ExternalSorter

Join

Memory-limited nested loop join #15760

2. Optimize Spill File Format

TBD

Describe alternatives you've considered

While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.

Additional context

Related work:


2010YOUY01 · 2025-05-17T08:52:25Z

Welcome aboard! We're excited to collaborate with you for this GSoC project 😄

Regarding the plan, I can see the following sub-tasks:

Stabilize external sort and aggregate.
Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.
Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
(And also improve UX/performance along the way)

I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.
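As a rough picture of the memory-limited NLJ idea, here is a sketch of a block nested loop join: buffer a bounded block of left rows, rescan the right side once per block, and evaluate an arbitrary predicate, so non-equality conditions work. This is only one possible approach under stated assumptions, not the design tracked in #15760. `block_nested_loop_join`, the row-count budget, and the `scan_right` callback are hypothetical.

```rust
/// Join `left` with a rescannable right side, holding at most `block_rows`
/// left rows in memory at once. Hypothetical sketch for illustration.
pub fn block_nested_loop_join<L, R, LI, RI, S, F>(
    mut left: LI,
    block_rows: usize,
    scan_right: S,
    predicate: F,
) -> Vec<(L, R)>
where
    L: Clone,
    R: Clone,
    LI: Iterator<Item = L>,
    RI: Iterator<Item = R>,
    S: Fn() -> RI,           // restarts a scan of the right input
    F: Fn(&L, &R) -> bool,   // any join condition, not only equality
{
    assert!(block_rows > 0, "budget must allow at least one row");
    let mut out = Vec::new();
    loop {
        // Buffer the next block of left rows, bounded by the budget.
        let block: Vec<L> = left.by_ref().take(block_rows).collect();
        if block.is_empty() {
            break;
        }
        // One full pass over the right side per buffered block.
        for r in scan_right() {
            for l in &block {
                if predicate(l, &r) {
                    out.push((l.clone(), r.clone()));
                }
            }
        }
    }
    out
}
```

A smaller `block_rows` lowers peak memory at the cost of more passes over the right input, which is the trade-off a memory-limited implementation would need to tune.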

Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internals. Are there any stream reader/writer-related tasks we could also consider? @alamb

ding-young added the enhancement (New feature or request) label on May 16, 2025
