Evaluate use of selection vectors in scan-filter-join operations #745

Open
Tracked by #717
andygrove opened this issue Jul 31, 2024 · 2 comments
Labels
enhancement, performance

Comments

@andygrove
Member

What is the problem the feature request solves?

It is very common to have scan -> filter as inputs to a join. The copying of data in the filter can be expensive when the batch contains strings and complex types, and the result of the filter is discarded after the join.

I believe that it would be more efficient to have the join use a selection vector to read inputs from the scanned batch rather than perform a filter.

This issue tracks the work to build a small prototype that demonstrates the idea. If it is successful, we can discuss changes in upstream DataFusion to add support for a new ColumnarValue::ArrayWithSelectionVector, and then add a specialization in SortMergeJoin to take advantage of it.
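
To make the trade-off concrete, here is a minimal sketch in plain Rust with no Arrow or DataFusion dependency. `Batch`, `filter_copy`, `filter_select`, and `join_with_selection` are hypothetical names invented for illustration; they are not DataFusion or arrow-rs APIs, and the toy probe below is not how SortMergeJoin works. It only contrasts a filter that copies surviving strings with one that records row indices and lets the join read through them.

```rust
/// Hypothetical stand-in for a scanned batch: a join key column plus a string column.
struct Batch {
    keys: Vec<i64>,
    payload: Vec<String>,
}

/// Conventional filter: materializes a new batch, cloning every surviving string.
fn filter_copy(batch: &Batch, pred: impl Fn(i64) -> bool) -> Batch {
    let mut keys = Vec::new();
    let mut payload = Vec::new();
    for (k, p) in batch.keys.iter().zip(&batch.payload) {
        if pred(*k) {
            keys.push(*k);
            payload.push(p.clone()); // this copy is the cost the issue describes
        }
    }
    Batch { keys, payload }
}

/// Selection-vector filter: records only the indices of surviving rows.
fn filter_select(batch: &Batch, pred: impl Fn(i64) -> bool) -> Vec<u32> {
    let mut selection = Vec::new();
    for (i, k) in batch.keys.iter().enumerate() {
        if pred(*k) {
            selection.push(i as u32);
        }
    }
    selection
}

/// Toy join probe (assumption: a simple key lookup, not a sort-merge join).
/// It reads the original batch through the selection vector and borrows the
/// payload strings instead of copying them.
fn join_with_selection<'a>(
    batch: &'a Batch,
    selection: &[u32],
    build_keys: &[i64],
) -> Vec<(i64, &'a str)> {
    let mut out = Vec::new();
    for &i in selection {
        let i = i as usize;
        if build_keys.contains(&batch.keys[i]) {
            out.push((batch.keys[i], batch.payload[i].as_str()));
        }
    }
    out
}
```

In this sketch the string column is never copied for rows that pass the filter but fail the join, which is the saving the prototype would need to measure against the cost of indirect reads.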

Describe the potential solution

No response

Additional context

No response

@andygrove added the enhancement and performance labels Jul 31, 2024
@andygrove added this to the 0.2.0 milestone Jul 31, 2024
@viirya
Member

viirya commented Jul 31, 2024

Related issue at arrow-rs: apache/arrow-rs#3620

@andygrove
Member Author

This paper may have useful information:

"Filter Representation in Vectorized Query Execution"
https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf

@andygrove removed this from the 0.2.0 milestone Aug 16, 2024
2 participants