Skip to content

Decorrelate scalar subqueries with more complex filter expressions #14554

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
Tracked by #5483 ...
duongcongtoai opened this issue Feb 8, 2025 · 15 comments
Open
Tracked by #5483 ...
Labels
enhancement New feature or request

Comments

@duongcongtoai
Copy link
Contributor

Is your feature request related to a problem or challenge?

Datafusion already support decorrelating simple scalar subqueries in this PR: #6457

This follow the first approach in TUM paper (simple unnesting), and allow decorrelating this simple query

explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id) < t1.t1_int

However, if we add an or condition this subquery

explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id or t1.t1_name=t2.t2_name) < t1.t1_int

Datafusion cannot decorrelate it

+--------------+----------------------------------------------------------------------------------------+
| plan_type    | plan                                                                                   |
+--------------+----------------------------------------------------------------------------------------+
| logical_plan | Projection: t1.t1_int                                                                  |
|              |   Filter: (<subquery>) < CAST(t1.t1_int AS Int64)                                      |
|              |     Subquery:                                                                          |
|              |       Projection: count(*)                                                             |
|              |         Aggregate: groupBy=[[]], aggr=[[count(Int64(1)) AS count(*)]]                  |
|              |           Filter: outer_ref(t1.t1_id) = t2.t2_id OR outer_ref(t1.t1_name) = t2.t2_name |
|              |             TableScan: t2                                                              |
|              |     TableScan: t1 projection=[t1_id, t1_name, t1_int]                                  |
+--------------+----------------------------------------------------------------------------------------+

Describe the solution you'd like

Support decorrelating this query following the second method mentioned in the paper

Describe alternatives you've considered

No response

Additional context

General framework for decorrelation maybe discussed here #5492

But the steps needed to make this work is followed

Allow decorrelation for this type of filter exprs in this code:

self.can_pull_over_aggregation = self.can_pull_over_aggregation

Add more logic to handle complex query decorrelation:

  • Build domain/magic relation
  • Rewrite the subquery to join inner table (table of the subquery) with domain/magic relation using its complex filter expression (i.e t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name)
  • Rewrite aggregation to group by the additional columns mentioned in the domain/magic relation
  • Join the outer relation with the newly built aggregation

For example the above mentioned query may be rewritten like

explain select t1.t1_int from t1,
(
    select count(*) as count_all, domain.t1_id as t1_id, domain.t1_name as t1_name from (
        select distinct t1_id, t1_name from t1
    ) as domain join t2 where t2.t2_id = domain.t1_id or t2.t2_name=domain.t1_name 
    group by domain.t1_id, domain.t1_name
) as pulled_up
where t1.t1_id=pulled_up.t1_id and pulled_up.count_all < t1.t1_int

Logical plan may look like

| logical_plan  | Projection: t1.t1_int                                                                                                                           |
|               |   Inner Join: t1.t1_id = pulled_up.t1_id Filter: pulled_up.count_all < CAST(t1.t1_int AS Int64)                                                 |
|               |     TableScan: t1 projection=[t1_id, t1_int]                                                                                                    |
|               |     SubqueryAlias: pulled_up                                                                                                                    |
|               |       Projection: count(*) AS count_all, domain.t1_id                                                                                           |
|               |         Aggregate: groupBy=[[domain.t1_id, domain.t1_name]], aggr=[[count(Int64(1)) AS count(*)]]                                               |
|               |           Projection: domain.t1_id, domain.t1_name                                                                                              |
|               |             Inner Join:  Filter: t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name                                                         |
|               |               SubqueryAlias: domain                                                                                                             |
|               |                 Aggregate: groupBy=[[t1.t1_id, t1.t1_name]], aggr=[[]]                                                                          |
|               |                   TableScan: t1 projection=[t1_id, t1_name]                                                                                     |
|               |               TableScan: t2 projection=[t2_id, t2_name] 
@duongcongtoai duongcongtoai added the enhancement New feature or request label Feb 8, 2025
@duongcongtoai
Copy link
Contributor Author

take

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

This follow the first approach in TUM paper (simple unnesting), and allow decorrelating this simple query

What do you think about implementing the more general approach to subquery unnesting described in that paper?

I think @xudong963 mentioned he had done something similar before

@duongcongtoai
Copy link
Contributor Author

duongcongtoai commented Feb 21, 2025

From what is see in current code, this struct PullUpCorrelatedExpr is applied for scalar subquery as well as predicate subquery.

For that paper implementation, i'll try my best to find time and figure out what usecases Datafusion cannot yet support. Will need to do it in steps/PRs

@alamb
Copy link
Contributor

alamb commented Feb 22, 2025

From what is see in current code, this struct PullUpCorrelatedExpr is applied for scalar subquery as well as predicate subquery.

For that paper implementation, i'll try my best to find time and figure out what usecases Datafusion cannot yet support. Will need to do it in steps/PRs

FWIW I think @xudong963 said he has experience implementing such code so perhaps he will be able to help / assist with the implementation and review

@xudong963
Copy link
Member

FWIW I think @xudong963 said he has experience implementing such code so perhaps he will be able to help / assist with the implementation and review

Yes, please ping me @duongcongtoai in your PR

@duongcongtoai
Copy link
Contributor Author

From this PR, there are several types of query mentioned that need support

  1. In Subquery contains limit/order by
select students where student_id in (
	select e.student_id from exams order by score limit 10
)
  1. Scalar subquery contains limit/order by
select * from student s where s.last_semester_avg_score > (
	select avg(score) from (
		select score from exam e where e.student_id=s.student_id 
		order by timestamp limit 3
	)
)
  1. There is union in subquery (the initial proposal of this issue)
 select * from student s where (select avg(score) from exam e where e.student_id = s.student_id or e.student_name=s.student_name) > 0.5
  1. Correlated expressions are in join condition
select * from students s join exam e on s.last_semester_avg_score > (
select avg(score) from exam e2 where e2.class_id=e.class_id
)
  1. Correlated expressions are in aggregation expressions
SELECT * from students s where 5 <
(
    SELECT max(student.last_semester_avg_score+b.score) as max_adjusted_score
    FROM bonus b
);
  1. Correlated expressions are in window expressions
    This i cannot find any example query

I'll start thinking about implementing unnesting for all these usecases

@ctsk
Copy link
Contributor

ctsk commented Mar 31, 2025

Hey @duongcongtoai,

I want to draw your attention on a follow-up paper on "Unnesting Arbitrary Queries": https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf

This paper improves on the original approach by better dealing with multiple nesting levels. It also describes the process in an algorithmic way that might be closer to the implementation

@duongcongtoai
Copy link
Contributor Author

thank you, i'll take a look at the PR

@duongcongtoai
Copy link
Contributor Author

duongcongtoai commented Apr 12, 2025

I think we can break down this story into multiple step:

  1. unify the optimizor for correlated query, regardless the query type (exists query, scalar query etc)
  2. support flexible decorrelation scheme (simple vs general approach), we can achieve this by following the algorithm mentioned in the 2nd paper. There is a prerequisite to introduce an index algebra during the rewrite. This index requires a pre-traversing over the whole query to detect all non-trivial subqueries, and answer the question whether simple unnesting is sufficient, or should the framework continue with the general approach
  3. Implement general purpose + recursive aware subquery decorrelation for the most major operators (projection, filter, group by) using the top-down algorithm mentioned in the 2nd paper
  4. Gradually support more complex expression (group by, order, limit, window function)

@alamb
Copy link
Contributor

alamb commented Apr 17, 2025

I really like the idea of the incremental approach -- I think it is practically speaking the only one we are likely to be able to pull off. Thank you @duongcongtoai

There are a bunch of related tickets listed on this epic:

What do you think about creating a new ticket with the steps you outline above @duongcongtoai ? I am pretty sure others are interested in this feature as well and may be able to help

@duongcongtoai
Copy link
Contributor Author

I think we can reuse this ticket right?: #5492

@xudong963
Copy link
Member

Also, there is a newer paper for the topic: https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf

@duongcongtoai
Copy link
Contributor Author

Yes, that paper basically gave pretty neat skeleton for a decorrelation framework

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

@alamb
Copy link
Contributor

alamb commented May 18, 2025

I recommend we continue the discussion on

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants