Welcome aboard! We're excited to collaborate with you for this GSoC project 😄
Regarding the plan, I can see the following sub-tasks:
Stabilize external sort and aggregate.
Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.
Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
(And also improve UX/performance along the way)
I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.
Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internals. Are there any stream reader/writer-related tasks we could also consider? @alamb
Is your feature request related to a problem or challenge?
To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.
This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.
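For readers new to the topic, the general spill-and-merge idea behind an external sort can be sketched with only the Rust standard library. This is a toy illustration, not DataFusion's implementation: `external_sort`, `spill_run`, `merge_runs`, the one-integer-per-line text spill format, and the row-count "memory limit" are all assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Sorts `input` while buffering at most `max_in_memory` values at a time.
/// Sorted values are passed to `emit`; the number of spilled runs is returned.
/// Hypothetical helper for illustration only, not a DataFusion API.
pub fn external_sort(
    input: impl IntoIterator<Item = i64>,
    max_in_memory: usize,
    spill_dir: &Path,
    emit: impl FnMut(i64),
) -> io::Result<usize> {
    assert!(max_in_memory > 0, "memory budget must allow at least one value");
    let mut runs: Vec<PathBuf> = Vec::new();
    let mut buffer: Vec<i64> = Vec::with_capacity(max_in_memory);

    for value in input {
        buffer.push(value);
        // Budget reached: sort the buffer and spill it as one run.
        if buffer.len() == max_in_memory {
            runs.push(spill_run(&mut buffer, spill_dir, runs.len())?);
        }
    }
    // Simplification: the final partial buffer is spilled too, so the merge
    // step only ever reads from files.
    if !buffer.is_empty() {
        runs.push(spill_run(&mut buffer, spill_dir, runs.len())?);
    }

    merge_runs(&runs, emit)?;
    for path in &runs {
        fs::remove_file(path)?;
    }
    Ok(runs.len())
}

/// Sorts the buffer, writes it to a new run file, and clears the buffer.
fn spill_run(buffer: &mut Vec<i64>, dir: &Path, run_id: usize) -> io::Result<PathBuf> {
    buffer.sort_unstable();
    let path = dir.join(format!("run-{}.txt", run_id));
    let mut writer = BufWriter::new(File::create(&path)?);
    for value in buffer.iter() {
        writeln!(writer, "{}", value)?;
    }
    writer.flush()?;
    buffer.clear();
    Ok(path)
}

/// K-way merge: a min-heap holds the next unread value of every run.
fn merge_runs(runs: &[PathBuf], mut emit: impl FnMut(i64)) -> io::Result<()> {
    let mut readers = Vec::with_capacity(runs.len());
    for path in runs {
        readers.push(BufReader::new(File::open(path)?).lines());
    }

    let mut heap = BinaryHeap::new();
    for (i, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((parse(&line?)?, i)));
        }
    }
    while let Some(Reverse((value, i))) = heap.pop() {
        emit(value);
        // Refill the heap from the run that just produced a value.
        if let Some(line) = readers[i].next() {
            heap.push(Reverse((parse(&line?)?, i)));
        }
    }
    Ok(())
}

fn parse(s: &str) -> io::Result<i64> {
    s.trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

A real engine would account for memory in bytes rather than rows, spill columnar batches instead of text lines, and may need multiple merge passes when the number of runs is large. Those are the kinds of areas this epic is meant to track.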
Describe the solution you'd like
1. Stabilize Larger-Than-Memory Queries
User Experience & Testing
Enable TrackConsumersPool by default in datafusion-cli
insta
Sort
SortExec
#16042
Aggregate
Join
2. Optimize Spill File Format
TBD
Describe alternatives you've considered
While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.
Additional context
Related work: