Potentially improve join performance by implementing a version of the take kernel that accepts an iterator of indices #13620
Comments
@alamb I think you might have some useful guidance here. Do you think it makes sense to have something like
I wonder if some approach like that could be taken. This avoids allocating new memory but is still compatible with the current code.
I agree with @Dandandan -- and specifically it isn't clear to me that an iterator-based approach will be faster than the current `take`-based one. If the issue is that the indices themselves take up too much space, then perhaps we can put in some more effort to incrementally generate them and reuse the arrays, as suggested by @Dandandan. Here is an example in grouping where we reuse indexes: datafusion/datafusion/physical-plan/src/aggregates/row_hash.rs Lines 402 to 404 in 8773846
I think the idea is that there are some cases where indices have a pattern and can be "procedurally generated" (in CV-speak). Is that right, @alihan-synnada?
Ah -- makes sense -- check out how filter does it in arrow-rs. Maybe we can expose that API somehow 🤔
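To make the idea concrete, here is a minimal plain-Rust sketch of the kind of selectivity-adaptive strategy being referred to: copy contiguous runs when most rows are selected, and iterate individual indices when few are. The function name, the 80% threshold, and the `Vec`-based types are all illustrative assumptions, not arrow-rs's actual API:

```rust
/// Gather `values[i]` for every `i` where `mask[i]` is true,
/// choosing a strategy based on how selective the mask is.
fn gather_selected<T: Copy>(values: &[T], mask: &[bool]) -> Vec<T> {
    let selected = mask.iter().filter(|&&b| b).count();
    // High selectivity (> 80% kept, an arbitrary threshold for this
    // sketch): runs of set bits are likely long, so copying contiguous
    // slices amortizes better than gathering one element at a time.
    if selected * 10 > mask.len() * 8 {
        let mut out = Vec::with_capacity(selected);
        let mut i = 0;
        while i < mask.len() {
            if mask[i] {
                let start = i;
                while i < mask.len() && mask[i] {
                    i += 1;
                }
                out.extend_from_slice(&values[start..i]);
            } else {
                i += 1;
            }
        }
        out
    } else {
        // Low selectivity: a lazy index iterator avoids materializing
        // an intermediate index array entirely.
        mask.iter()
            .enumerate()
            .filter(|&(_, &b)| b)
            .map(|(i, _)| values[i])
            .collect()
    }
}
```

The same shape, with a tuned threshold and bitmap-backed masks, is roughly what an exposed iterator-based API could build on.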
Related to apache/datafusion-comet#901 (comment)
Agreed that `take` may be the bottleneck. I tried it; even doing my best to reuse the buffer,
PoC link: it requires a patched version of
I'm not very confident in the way I set up the benchmark, but I think the results are promising. Note that selectivity only goes up to 30 because it gets really slow past that point. Selectivity is used in the filter as "sum of all columns % 100 = selectivity". Batch size is shown in log2 on the chart (i.e. batch size 13 means 2^13 = 8192). So it isn't really useful at the default batch size of 8192, but anything between 2^4 (16) and 2^10 (1024) might benefit from it.
AFAIK the algorithm of NLJ is as follows:
Optionally, some extra logic per join type is applied (updating visited rows). I think there are a couple of things that might be optimized in nested loop join:
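For reference, the batch-based loop structure described above can be sketched in plain Rust. This is an illustrative inner-join skeleton only: the names are made up, there are no arrow types, no per-join-type visited-row logic, and the predicate stands in for the join filter:

```rust
/// Hedged sketch of a batch-based nested loop join: for each pair of
/// row indices (l, r) whose values satisfy `pred`, emit the pair.
/// The cross-product indices are generated lazily with iterators
/// rather than materialized up front.
fn nlj_indices<T: Copy>(
    left: &[T],
    right: &[T],
    pred: impl Fn(T, T) -> bool,
) -> Vec<(usize, usize)> {
    (0..right.len())
        // For each right row, iterate all left rows (left varies fastest,
        // matching the index layout discussed elsewhere in this issue).
        .flat_map(|r| (0..left.len()).map(move |l| (l, r)))
        .filter(|&(l, r)| pred(left[l], right[r]))
        .collect()
}
```

In the real operator the matching pairs would then drive `take` on both sides to build the output batch; the question in this issue is whether that last materialization step can consume the iterator directly.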
That makes sense. Alternatively, there is a good set of NLJ optimizations in this video: https://www.youtube.com/watch?v=RcEW0P8iVTc I went quickly through our NLJ implementation and it looks like a page (batch) based NLJ, and some of the optimizations from the video could potentially be applied.
Yeah, nice! Thanks for bringing this to our attention |
Is your feature request related to a problem or challenge?
Collecting the filtered indices into a PrimitiveArray takes a lot of memory and time. Using an iterator-based design instead would save a lot of intermediate memory and could speed up the join operation due to fewer cache misses and less copying.
Describe the solution you'd like
Create a version of arrow's compute::take kernel that accepts an iterator of indices. Benchmark it and figure out where iterators are worth using over PrimitiveArrays.
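A possible shape for such a kernel, sketched over plain slices rather than arrow arrays; `take_with_iter`, its signature, and the `Vec` output are hypothetical, and a real kernel would also handle nulls and out-of-bounds indices:

```rust
/// Hypothetical iterator-based take: instead of a materialized array
/// of indices, accept any iterator of indices and gather lazily.
fn take_with_iter<T: Copy>(
    values: &[T],
    indices: impl Iterator<Item = usize>,
) -> Vec<T> {
    // size_hint lets us pre-allocate when the iterator knows its length
    // (e.g. an exact-size range), without requiring materialization.
    let (lower, _) = indices.size_hint();
    let mut out = Vec::with_capacity(lower);
    for i in indices {
        out.push(values[i]); // real kernel: bounds checks, null handling
    }
    out
}
```

The benchmark question is then whether this beats `take` with a materialized index array, which can use batched, validated copies internally.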
Describe alternatives you've considered
No response
Additional context
I have tried to create a POC, but I seem to get different results every time I benchmark it and I couldn't figure out what's wrong. I also had lots of trouble taking ownership of the values buffer of PrimitiveArrays that are guaranteed to have no nulls (namely the indices cache and the mask produced by the filter).
Furthermore, it's possible to use the mask's `.values().set_indices()` iterator to generate the indices used in the join, because the left-right index pairs are a function of the index into the mask, in the form of `(index % left_batch.num_rows(), index / left_batch.num_rows())`, and it vectorizes nicely. I plan to share the entirety of my POC (here's a potential implementation of take_with_iter) and benchmarks once I can tame them and generate the same results deterministically.
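A minimal plain-Rust sketch of that index derivation, assuming the mask is laid out over the cross product with the left side varying fastest (as the formula above implies); `decode` and `mask_to_pairs` are made-up names, with a `&[bool]` slice standing in for arrow's bitmap and its `set_indices` iterator:

```rust
/// Decode a flat cross-product index into a (left, right) row pair,
/// per the formula above: left varies fastest within the mask.
fn decode(index: usize, left_rows: usize) -> (usize, usize) {
    (index % left_rows, index / left_rows)
}

/// Yield the (left, right) pairs for every set bit in the mask,
/// without materializing an intermediate index array.
fn mask_to_pairs(
    mask: &[bool],
    left_rows: usize,
) -> impl Iterator<Item = (usize, usize)> + '_ {
    mask.iter()
        .enumerate()
        .filter(|&(_, &b)| b)
        .map(move |(i, _)| decode(i, left_rows))
}
```

Feeding such an iterator straight into an iterator-accepting take kernel is the combination this issue proposes to benchmark.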