Expression simplifier does not simplify `A = B AND B = A` #8724

Jefffrey · 2024-01-03T00:11:55Z

Is your feature request related to a problem or challenge?

Given this case:

DataFusion CLI v34.0.0
❯ CREATE TABLE kumachan (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.004 seconds.

❯ explain select * from kumachan where wakana = 'ookuma' and 'ookuma' = wakana;
+---------------+-------------------------------------------------------------------------------+
| plan_type     | plan                                                                          |
+---------------+-------------------------------------------------------------------------------+
| logical_plan  | Filter: kumachan.wakana = Utf8("ookuma") AND Utf8("ookuma") = kumachan.wakana |
|               |   TableScan: kumachan projection=[wakana]                                     |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                   |
|               |   FilterExec: wakana@0 = ookuma AND ookuma = wakana@0                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                             |
|               |                                                                               |
+---------------+-------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

❯

I would expect the final plan to simplify the FilterExec to only be wakana@0 = ookuma or ookuma = wakana@0, as the AND of these conditions is redundant.

Describe the solution you'd like

Possible solutions:

introduce new optimizer rule to reorder the left and right fields of all BinaryExpr with op of Operator::Eq to a consistent order, which would be placed high in the rules list (during analysis?) to ensure downstream rules (like simplify_expressions) can properly determine this equality and simplify the condition
modify BinaryExpr to have a custom PartialEq/Eq implementation which disregards order of its left/right fields when checking equality (only for Operator::eq)

https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/expr/src/expr.rs#L208-L217

Option 1 seems the cleaner solution, since don't have to muck with manual implementation of PartialEq/Eq

Describe alternatives you've considered

No response

Additional context

Found this while checking if #4732 can be closed

The text was updated successfully, but these errors were encountered:

alamb · 2024-01-03T11:42:15Z

introduce new optimizer rule to reorder the left and right fields of all BinaryExpr with op of Operator::Eq to a consistent order, which would be placed high in the rules list (during analysis?) to ensure downstream rules (like simplify_expressions) can properly determine this equality and simplify the condition

I think the idea of a canonicalize phase sounds very good. Some potential ideas:

Reorder all BinaryExpr that are <col> <op> <literal> or <literal> <op> <col> to <col> <op> <literal>
Combine all AND chains of <col> <op> <literal> into <col> IN (<literal>) (even if they are eventually rewritten back to AND chains)

Using rules 1 and 2 would probably handle the case in this ticket. It would also help with things like col > 5 AND 5 < col

Maybe we could add this simplification directly as as a rule to ExprSimplifier

alamb · 2024-01-03T11:42:46Z

Also, I recently implemented more general logic to extract "literal guarantees" for PhysicalExprs (aka if columns must be one or not one of a set of constants for the expression to evaluate to try) which could potentially serve as some inspiration https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/physical-expr/src/utils/guarantee.rs#L119-L223

alamb · 2024-01-03T11:43:07Z

@Jefffrey if you agree with adding this directly as a rule to the ExprSimplifier, I think it would potentially make a good first project for someone

Jefffrey · 2024-01-03T20:38:06Z

Just a note, that this doesn't apply only for <col> <op> <lietral>, but also <col1> <op> <col2>:

DataFusion CLI v34.0.0
❯ CREATE TABLE kumachan1 (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.004 seconds.

❯ CREATE TABLE kumachan2 (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.003 seconds.

❯ select * from kumachan1 k1 join kumachan2 k2 on k1.wakana = k2.wakana and k2.wakana = k1.wakana;
+--------+--------+
| wakana | wakana |
+--------+--------+
| ookuma | ookuma |
+--------+--------+
1 row in set. Query took 0.007 seconds.

❯ explain select * from kumachan1 k1 join kumachan2 k2 on k1.wakana = k2.wakana and k2.wakana = k1.wakana;
+---------------+----------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                               |
+---------------+----------------------------------------------------------------------------------------------------+
| logical_plan  | Inner Join: k1.wakana = k2.wakana, k1.wakana = k2.wakana                                           |
|               |   SubqueryAlias: k1                                                                                |
|               |     TableScan: kumachan1 projection=[wakana]                                                       |
|               |   SubqueryAlias: k2                                                                                |
|               |     TableScan: kumachan2 projection=[wakana]                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                        |
|               |   HashJoinExec: mode=Partitioned, join_type=Inner, on=[(wakana@0, wakana@0), (wakana@0, wakana@0)] |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                    |
|               |       RepartitionExec: partitioning=Hash([wakana@0, wakana@0], 12), input_partitions=1             |
|               |         MemoryExec: partitions=1, partition_sizes=[1]                                              |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                    |
|               |       RepartitionExec: partitioning=Hash([wakana@0, wakana@0], 12), input_partitions=1             |
|               |         MemoryExec: partitions=1, partition_sizes=[1]                                              |
|               |                                                                                                    |
+---------------+----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.005 seconds

So ideally before the extract_equijoin_predicate optimizer rule runs, the join condition filter should be simplified first to reduce that duplication. I just chose using a literal in the original issue as it had a smaller MRE.

@Jefffrey if you agree with adding this directly as a rule to the ExprSimplifier, I think it would potentially make a good first project for someone

Yes this sounds good

alamb · 2024-01-04T22:49:29Z

I am marking this as a good first issue but it is really a medium sized project

However, I think it is well specified and the existing code is straightforward to extend

The goal is to add this simplification directly to ExprSimplifier

Canonicalize

First canonicalize any BinaryExprs so:

<literal> <op> <col> is rewritten to <col> <op> <literal> (remember to switch the operator)
<col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b);

Remove reundancy

For any chain of <expr1> AND <expr2> AND <expr3> remove any identical exprs
For any chain of <expr1> OR <expr2> OR <expr3> remove any identical exprs

So for example I would expect the following to be simplified:

A=1 AND 1 = A AND A = 3 --> A = 1 AND A = 3

(A=1 AND (B> 3 OR 3 < B)) --> (A = 1 AND B > 3)

alamb · 2024-01-04T22:49:53Z

I took a shot at writing up a description @Jefffrey -- let me know if that makes sense

Jefffrey · 2024-01-04T22:57:13Z

The canonicalize part sounds good.

The remove redundancy, I think is already taken care of by expr_simplifier?

e.g.

https://github.com/apache/arrow-datafusion/blob/e5036d0e760b637724e8ac59c32924f126311d39/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L704-L714

alamb · 2024-01-05T15:23:40Z

The remove redundancy, I think is already taken care of by expr_simplifier?

Nice! I didn't know / remember that.

yyy1000 · 2024-01-07T20:42:06Z

After reading this issue, I'd like a try. :)

yyy1000 · 2024-01-07T22:25:28Z

I submitted #8780 that may help this.
Not so familiar with Rust so the PR may need later change. :)

berkaysynnada · 2024-01-08T06:55:21Z

@alamb Does it make sense to use IntervalArithmetics in this kind of optimizations? It cannot handle strings yet but I don't think it will take much effort.

alamb · 2024-01-08T17:47:11Z

@alamb Does it make sense to use IntervalArithmetics in this kind of optimizations? It cannot handle strings yet but I don't think it will take much effort.

That is a good question @berkaysynnada and I am not sure what the best answer is.

I think interval analysis has the most promise in filter analysis (aka proving that boolean expressions can not be true). IN that context, I suspect this particular simplification is already covered (as in the interval analysis could alreaady handle it).

However, I need to think more about how interval analysis could be used to simplify expressions 🤔

One challenge may be that DataFusion has a split between Logical Expr and physical PhysicalExpr -- and I believe Interval arithmetic is focused mostly on the physical exprs and this simplification happens on the LogicalExpressions

Jefffrey added the enhancement New feature or request label Jan 3, 2024

Jefffrey mentioned this issue Jan 3, 2024

[EPIC] More SQL optimizer optimizations for Datafusion #5923

Open

7 tasks

alamb added the good first issue Good for newcomers label Jan 4, 2024

yyy1000 mentioned this issue Jan 7, 2024

Enhance simplifier by adding Canonicalize #8780

Merged

alamb closed this as completed in #8780 Jan 24, 2024

alamb mentioned this issue Jan 24, 2024

Improve Canonicalize API #8983

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expression simplifier does not simplify `A = B AND B = A` #8724

Expression simplifier does not simplify `A = B AND B = A` #8724

Jefffrey commented Jan 3, 2024 •

edited

Loading

alamb commented Jan 3, 2024

alamb commented Jan 3, 2024

alamb commented Jan 3, 2024

Jefffrey commented Jan 3, 2024

alamb commented Jan 4, 2024

alamb commented Jan 4, 2024

Jefffrey commented Jan 4, 2024

alamb commented Jan 5, 2024

yyy1000 commented Jan 7, 2024

yyy1000 commented Jan 7, 2024

berkaysynnada commented Jan 8, 2024

alamb commented Jan 8, 2024

Expression simplifier does not simplify A = B AND B = A #8724

Expression simplifier does not simplify A = B AND B = A #8724

Comments

Jefffrey commented Jan 3, 2024 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Jan 3, 2024

alamb commented Jan 3, 2024

alamb commented Jan 3, 2024

Jefffrey commented Jan 3, 2024

alamb commented Jan 4, 2024

Canonicalize

Remove reundancy

alamb commented Jan 4, 2024

Jefffrey commented Jan 4, 2024

alamb commented Jan 5, 2024

yyy1000 commented Jan 7, 2024

yyy1000 commented Jan 7, 2024

berkaysynnada commented Jan 8, 2024

alamb commented Jan 8, 2024

Expression simplifier does not simplify `A = B AND B = A` #8724

Expression simplifier does not simplify `A = B AND B = A` #8724

Jefffrey commented Jan 3, 2024 •

edited

Loading