[Epic] A collection of items related to processing larger than memory datasets (via spilling, externalized algorithm, etc) #14077

Closed
6 tasks
alamb opened this issue Jan 10, 2025 · 2 comments
Labels
enhancement New feature or request

alamb commented Jan 10, 2025

Is your feature request related to a problem or challenge?

This epic organizes efforts to improve DataFusion's ability to process datasets that are larger than the configured memory budget.

Some of DataFusion's "pipeline blocking" operations (SortExec and HashGroupBy) already work with datasets that do not fit in memory, but their performance and usability could be improved.

Note: Joins are another operation that can run out of memory, and they will error rather than fall back to some other strategy (Sort-Merge-Join, for example). If people are interested in making this better, I think we could organize another project.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

alamb commented Mar 17, 2025

Since there was recent excitement and activity around better sorting behavior, I filed an epic for just that:

alamb commented Mar 17, 2025

I broke this ticket into two follow-on tickets, one focused on hashing and one focused on sorting:

So let's close this one.

alamb closed this as not planned on Mar 17, 2025