Skip to content

DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
viirya opened this issue May 20, 2024 · 1 comment · May be fixed by #10584
Open

DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583

viirya opened this issue May 20, 2024 · 1 comment · May be fixed by #10584
Assignees
Labels
bug Something isn't working

Comments

@viirya
Copy link
Member

viirya commented May 20, 2024

Describe the bug

During working on apache/datafusion-comet#437, a few Spark join tests are failed when delegating to DataFusion HashJoin.

It is because that DataFusion HashJoin LeftAnti Join returns incorrect results when it is a null aware anti join.

To Reproduce

Added a test to join.slt:

statement ok
CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
(1, 1),
(2, 2),
(3, 3),
(4, null),
(null, 0);

query II
SELECT * FROM test_table t1 WHERE (c1 NOT IN (SELECT c2 FROM test_table)) = true
----
4 NULL
NULL 0

Expected behavior

Above query should return empty relation.

Additional context

No response

@viirya viirya added the bug Something isn't working label May 20, 2024
@viirya viirya linked a pull request May 20, 2024 that will close this issue
@viirya viirya self-assigned this May 20, 2024
@viirya viirya changed the title HashJoin LeftAnti Join handles nulls incorrectly DataFusion HashJoin LeftAnti doesn't support Null aware anti join May 21, 2024
@viirya viirya changed the title DataFusion HashJoin LeftAnti doesn't support Null aware anti join DataFusion HashJoin LeftAnti doesn't support null aware anti join May 21, 2024
@korowa
Copy link
Contributor

korowa commented Aug 10, 2024

I'll just add one more example here in order to show that the problem is related not only to non-correlated subqueries

CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
(1, 1),
(2, 2),
(3, 3),
(4, null),
(4, 7),
(null, 0);

SELECT * FROM test_table t1 WHERE c1 NOT IN (SELECT c2 FROM test_table t2 where t1.c1 = t2.c1);
+----+----+
| c1 | c2 |
+----+----+
| 4  |    |
| 4  | 7  |
|    | 0  |
+----+----+

while PG returns

CREATE TABLE
INSERT 0 6
 c1 | c2 
----+----
    |  0
(1 row)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants