Skip to content

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
9 tasks
ding-young opened this issue May 16, 2025 · 1 comment
Open
9 tasks

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

ding-young opened this issue May 16, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@ding-young
Copy link
Contributor

ding-young commented May 16, 2025

Is your feature request related to a problem or challenge?

To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.

This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

Sort

Aggregate

  • Integrate ExternalSorter

Join

2. Optimize Spill File Format

TBD

Describe alternatives you've considered

While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.

Additional context

Related work:

@ding-young ding-young added the enhancement New feature or request label May 16, 2025
@2010YOUY01
Copy link
Contributor

2010YOUY01 commented May 17, 2025

Welcome aboard! We're excited to collaborate with you for this GSoC project 😄

Regarding the plan, I can see the following sub-tasks:

  1. Stabilize external sort and aggregate.
  2. Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.
  3. Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
    (And also improve UX/performance along the way)

I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.

Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internal, are there any stream reader/writer–related tasks we could also consider? @alamb

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants