Skip to content

Expression simplifier does not simplify A = B AND B = A #8724

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Tracked by #5923
Jefffrey opened this issue Jan 3, 2024 · 12 comments · Fixed by #8780
Closed
Tracked by #5923

Expression simplifier does not simplify A = B AND B = A #8724

Jefffrey opened this issue Jan 3, 2024 · 12 comments · Fixed by #8780
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 3, 2024

Is your feature request related to a problem or challenge?

Given this case:

DataFusion CLI v34.0.0
❯ CREATE TABLE kumachan (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.004 seconds.

❯ explain select * from kumachan where wakana = 'ookuma' and 'ookuma' = wakana;
+---------------+-------------------------------------------------------------------------------+
| plan_type     | plan                                                                          |
+---------------+-------------------------------------------------------------------------------+
| logical_plan  | Filter: kumachan.wakana = Utf8("ookuma") AND Utf8("ookuma") = kumachan.wakana |
|               |   TableScan: kumachan projection=[wakana]                                     |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                   |
|               |   FilterExec: wakana@0 = ookuma AND ookuma = wakana@0                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                             |
|               |                                                                               |
+---------------+-------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

❯

I would expect the final plan to simplify the FilterExec to only be wakana@0 = ookuma or ookuma = wakana@0, as the AND of these conditions is redundant.

Describe the solution you'd like

Possible solutions:

  1. introduce new optimizer rule to reorder the left and right fields of all BinaryExpr with op of Operator::Eq to a consistent order, which would be placed high in the rules list (during analysis?) to ensure downstream rules (like simplify_expressions) can properly determine this equality and simplify the condition
  2. modify BinaryExpr to have a custom PartialEq/Eq implementation which disregards order of its left/right fields when checking equality (only for Operator::eq)

https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/expr/src/expr.rs#L208-L217

Option 1 seems the cleaner solution, since don't have to muck with manual implementation of PartialEq/Eq

Describe alternatives you've considered

No response

Additional context

Found this while checking if #4732 can be closed

@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

introduce new optimizer rule to reorder the left and right fields of all BinaryExpr with op of Operator::Eq to a consistent order, which would be placed high in the rules list (during analysis?) to ensure downstream rules (like simplify_expressions) can properly determine this equality and simplify the condition

I think the idea of a canonicalize phase sounds very good. Some potential ideas:

  1. Reorder all BinaryExpr that are <col> <op> <literal> or <literal> <op> <col> to <col> <op> <literal>
  2. Combine all AND chains of <col> <op> <literal> into <col> IN (<literal>) (even if they are eventually rewritten back to AND chains)

Using rules 1 and 2 would probably handle the case in this ticket. It would also help with things like col > 5 AND 5 < col

Maybe we could add this simplification directly as as a rule to ExprSimplifier

@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

Also, I recently implemented more general logic to extract "literal guarantees" for PhysicalExprs (aka if columns must be one or not one of a set of constants for the expression to evaluate to try) which could potentially serve as some inspiration https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/physical-expr/src/utils/guarantee.rs#L119-L223

@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

@Jefffrey if you agree with adding this directly as a rule to the ExprSimplifier, I think it would potentially make a good first project for someone

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

Just a note, that this doesn't apply only for <col> <op> <lietral>, but also <col1> <op> <col2>:

DataFusion CLI v34.0.0
❯ CREATE TABLE kumachan1 (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.004 seconds.

❯ CREATE TABLE kumachan2 (wakana string) AS VALUES ('ookuma');
0 rows in set. Query took 0.003 seconds.

❯ select * from kumachan1 k1 join kumachan2 k2 on k1.wakana = k2.wakana and k2.wakana = k1.wakana;
+--------+--------+
| wakana | wakana |
+--------+--------+
| ookuma | ookuma |
+--------+--------+
1 row in set. Query took 0.007 seconds.

❯ explain select * from kumachan1 k1 join kumachan2 k2 on k1.wakana = k2.wakana and k2.wakana = k1.wakana;
+---------------+----------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                               |
+---------------+----------------------------------------------------------------------------------------------------+
| logical_plan  | Inner Join: k1.wakana = k2.wakana, k1.wakana = k2.wakana                                           |
|               |   SubqueryAlias: k1                                                                                |
|               |     TableScan: kumachan1 projection=[wakana]                                                       |
|               |   SubqueryAlias: k2                                                                                |
|               |     TableScan: kumachan2 projection=[wakana]                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                        |
|               |   HashJoinExec: mode=Partitioned, join_type=Inner, on=[(wakana@0, wakana@0), (wakana@0, wakana@0)] |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                    |
|               |       RepartitionExec: partitioning=Hash([wakana@0, wakana@0], 12), input_partitions=1             |
|               |         MemoryExec: partitions=1, partition_sizes=[1]                                              |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                    |
|               |       RepartitionExec: partitioning=Hash([wakana@0, wakana@0], 12), input_partitions=1             |
|               |         MemoryExec: partitions=1, partition_sizes=[1]                                              |
|               |                                                                                                    |
+---------------+----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.005 seconds

So ideally before the extract_equijoin_predicate optimizer rule runs, the join condition filter should be simplified first to reduce that duplication. I just chose using a literal in the original issue as it had a smaller MRE.

@Jefffrey if you agree with adding this directly as a rule to the ExprSimplifier, I think it would potentially make a good first project for someone

Yes this sounds good

@alamb alamb added the good first issue Good for newcomers label Jan 4, 2024
@alamb
Copy link
Contributor

alamb commented Jan 4, 2024

I am marking this as a good first issue but it is really a medium sized project

However, I think it is well specified and the existing code is straightforward to extend

The goal is to add this simplification directly to ExprSimplifier

Canonicalize

First canonicalize any BinaryExprs so:

  1. <literal> <op> <col> is rewritten to <col> <op> <literal> (remember to switch the operator)
  2. <col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b);

Remove reundancy

  1. For any chain of <expr1> AND <expr2> AND <expr3> remove any identical exprs
  2. For any chain of <expr1> OR <expr2> OR <expr3> remove any identical exprs

So for example I would expect the following to be simplified:

A=1 AND 1 = A AND A = 3 --> A = 1 AND A = 3
(A=1 AND (B> 3 OR 3 < B)) --> (A = 1 AND B > 3)

@alamb
Copy link
Contributor

alamb commented Jan 4, 2024

I took a shot at writing up a description @Jefffrey -- let me know if that makes sense

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 4, 2024

The canonicalize part sounds good.

The remove redundancy, I think is already taken care of by expr_simplifier?

e.g.

https://github.com/apache/arrow-datafusion/blob/e5036d0e760b637724e8ac59c32924f126311d39/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L704-L714

@alamb
Copy link
Contributor

alamb commented Jan 5, 2024

The remove redundancy, I think is already taken care of by expr_simplifier?

Nice! I didn't know / remember that.

@yyy1000
Copy link
Contributor

yyy1000 commented Jan 7, 2024

After reading this issue, I'd like a try. :)

@yyy1000
Copy link
Contributor

yyy1000 commented Jan 7, 2024

I submitted #8780 that may help this.
Not so familiar with Rust so the PR may need later change. :)

@berkaysynnada
Copy link
Contributor

@alamb Does it make sense to use IntervalArithmetics in this kind of optimizations? It cannot handle strings yet but I don't think it will take much effort.

@alamb
Copy link
Contributor

alamb commented Jan 8, 2024

@alamb Does it make sense to use IntervalArithmetics in this kind of optimizations? It cannot handle strings yet but I don't think it will take much effort.

That is a good question @berkaysynnada and I am not sure what the best answer is.

I think interval analysis has the most promise in filter analysis (aka proving that boolean expressions can not be true). IN that context, I suspect this particular simplification is already covered (as in the interval analysis could alreaady handle it).

However, I need to think more about how interval analysis could be used to simplify expressions 🤔

One challenge may be that DataFusion has a split between Logical Expr and physical PhysicalExpr -- and I believe Interval arithmetic is focused mostly on the physical exprs and this simplification happens on the LogicalExpressions

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants