Range/inequality joins are slow #8393
Comments
I just noticed that what I really want is a RIGHT join. That is, if there is no matching # for a timestamp, it should give null. With the query changed to that, DataFusion is much faster. I believe this is because with a RIGHT join, # becomes the outer table (single partition), while timestamps becomes the inner table (unspecified partitioning), which allows for greater parallelism (see https://github.com/apache/arrow-datafusion/blob/e19c669855baa8b78ff86755803944d2ddf65536/datafusion/physical-plan/src/joins/nested_loop_join.rs#L72-L77C4). But I think the issue should still be open, since the LEFT join is still slower.
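The reasoning in this comment can be illustrated with a toy model. This is only a sketch under assumptions, not DataFusion's implementation: the interval and timestamp data, the `lo <= ts < hi` predicate, and the names `probe_partition` and `partitioned_join` are all hypothetical. One side is held as a single shared list, the other side is split into partitions that are probed independently, and a probe row without a match is paired with `None`, as an outer join would produce a null.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_partition(build_rows, probe_rows):
    """Nested-loop join of one probe partition against the shared build side.

    Every probe row is kept. A probe row with no match is paired with None,
    mimicking the null an outer join would produce.
    """
    out = []
    for ts in probe_rows:
        matched = False
        for lo, hi, label in build_rows:
            if lo <= ts < hi:  # hypothetical range predicate
                out.append((label, ts))
                matched = True
        if not matched:
            out.append((None, ts))
    return out


def partitioned_join(build_rows, probe_partitions, max_workers=4):
    """Probe each partition independently and concatenate the results in order.

    Threads are used only to show that partitions do not depend on each
    other; in CPython they will not give a real CPU-bound speedup.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(lambda p: probe_partition(build_rows, p), probe_partitions)
        return [row for part in parts for row in part]


# Hypothetical data: two intervals (single shared list) and two probe partitions.
intervals = [(0, 10, "a"), (10, 20, "b")]
timestamp_partitions = [[1, 12], [25, 9]]
result = partitioned_join(intervals, timestamp_partitions)
```

Because each partition only reads the shared list, the amount of available parallelism in this model is set by how the probe side is partitioned, which is the effect the comment attributes to swapping the join direction.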
I think
I started trying to collect a list of various join improvements in #8398
I am interested in this ticket. Since it is a pretty major project, I will write a proposal first.
Thank you @my-vegetable-has-exploded -- that is a great idea. cc @korowa / @viirya / @metesynnada, who have been involved in Join implementations recently and who may be interested as well.
Disregarding IEJoin: if I'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs. UPD: in addition, to make join reordering useful, NLJoin itself also needs to be modified, since it currently chooses the build side based on the logical join type.
I think it is a good idea to improve performance in this scenario, and your PR also looks good to me. But I think it is also OK to keep the old parallelism strategy. In my opinion, the old parallelism strategy should work, but the check in I think another way may be to write a new enforce_distribution strategy for
I don't think that's the proper way to go -- it will give some benefit in terms of runtime, but it will be suboptimal in terms of memory utilization and CPU time (as we'll need to perform BuildSideRows * NumberOfPartitions filter evaluations instead of BuildSideRows * 1, where 1 is probe-side input batches).
I don't think this issue should be closed. #9676 seems to take care of ordering, but I think it doesn't improve range/inequality joins much?
My intention was to fix the NLJoin parallelism issue caused by the fixed build-side choice (since a right join instead of a left join had acceptable performance, as was claimed above). At the same time we also have #318 for a specialized operator implementation, so I assumed #9676 would be enough. I don't mind keeping it open, though.
Could anyone do me a favour here?
Describe the bug
Joins where the `ON` filter is not an equality but an inequality such as `<`, `>`, etc. seem slow, at least compared to DuckDB, which seems like a direct "competitor".
The main difference between the DuckDB and DataFusion plans seems to be that DataFusion uses a `NestedLoopJoinExec`, while DuckDB uses an `IEJoin`.
Note that the query could be written better with an ASOF join, but DataFusion does not support that (see issue #318).
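To make the difference in plan shape more concrete, here is a minimal sketch comparing a nested-loop evaluation of a single inequality predicate with a sort-based one. It is not the IEJoin algorithm and not what either engine actually runs; the predicate `l < r`, the sample data, and the function names are hypothetical. It only counts matching pairs, which is enough to show why sorting one side avoids comparing every pair.

```python
from bisect import bisect_right


def nested_loop_count(left, right):
    """Count pairs (l, r) with l < r by comparing every pair: O(n * m)."""
    return sum(1 for l in left for r in right if l < r)


def sorted_count(left, right):
    """Count the same pairs by sorting one side and binary searching.

    After sorting `right`, the values greater than `l` form a suffix, so
    one binary search per left row replaces a full scan:
    O((n + m) log m) comparisons.
    """
    right_sorted = sorted(right)
    n = len(right_sorted)
    return sum(n - bisect_right(right_sorted, l) for l in left)


# Hypothetical sample data.
left = [5, 1, 9, 3]
right = [4, 8, 2, 9]
pairs_nested = nested_loop_count(left, right)
pairs_sorted = sorted_count(left, right)
```

Materializing the matching rows can still produce a quadratic number of output rows in the worst case, so the saving is in predicate evaluations on non-matching pairs rather than in output size.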
To Reproduce
Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:
$ duckdb < repro-dataset.sql
We will compare the performance of the following query in DuckDB and DataFusion. The query is saved as `repro-range-query.sql`.
DuckDB performance:
DataFusion performance:
$ time datafusion-cli -f repro-range-query.sql ... real 0m8.269s user 0m6.358s sys 0m1.907s
Expected behavior
It would be nice if the above query (or something equivalent) were faster in DataFusion.
If someone knows of a better way to express the query, then that could also be a workaround for me.
Additional context
Machine tested on:
CPU: Ryzen 3900x
OS: Ubuntu 22.04
Versions used: