Decorrelate scalar subqueries with more complex filter expressions #14554

duongcongtoai · 2025-02-08T04:12:11Z

Is your feature request related to a problem or challenge?

DataFusion already supports decorrelating simple scalar subqueries, as of PR #6457.

This follows the first approach in the TUM paper (simple unnesting) and allows decorrelating this simple query:

explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id) < t1.t1_int

However, if we add an OR condition to this subquery:

explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id or t1.t1_name=t2.t2_name) < t1.t1_int

DataFusion cannot decorrelate it:

+--------------+----------------------------------------------------------------------------------------+
| plan_type    | plan                                                                                   |
+--------------+----------------------------------------------------------------------------------------+
| logical_plan | Projection: t1.t1_int                                                                  |
|              |   Filter: (<subquery>) < CAST(t1.t1_int AS Int64)                                      |
|              |     Subquery:                                                                          |
|              |       Projection: count(*)                                                             |
|              |         Aggregate: groupBy=[[]], aggr=[[count(Int64(1)) AS count(*)]]                  |
|              |           Filter: outer_ref(t1.t1_id) = t2.t2_id OR outer_ref(t1.t1_name) = t2.t2_name |
|              |             TableScan: t2                                                              |
|              |     TableScan: t1 projection=[t1_id, t1_name, t1_int]                                  |
+--------------+----------------------------------------------------------------------------------------+

Describe the solution you'd like

Support decorrelating this query following the second method mentioned in the paper

Describe alternatives you've considered

No response

Additional context

A general framework for decorrelation may be discussed in #5492.

But the steps needed to make this work are as follows:

Allow decorrelation for this type of filter expression in this code:

datafusion/datafusion/optimizer/src/decorrelate.rs

Line 162 in 813220d

self.can_pull_over_aggregation = self.can_pull_over_aggregation

Add more logic to handle complex query decorrelation:

Build domain/magic relation
Rewrite the subquery to join the inner table (the table of the subquery) with the domain/magic relation using its complex filter expression (i.e. t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name)
Rewrite aggregation to group by the additional columns mentioned in the domain/magic relation
Join the outer relation with the newly built aggregation

For example, the above-mentioned query may be rewritten as:

explain select t1.t1_int from t1,
(
    select count(*) as count_all, domain.t1_id as t1_id, domain.t1_name as t1_name from (
        select distinct t1_id, t1_name from t1
    ) as domain join t2 where t2.t2_id = domain.t1_id or t2.t2_name=domain.t1_name 
    group by domain.t1_id, domain.t1_name
) as pulled_up
where t1.t1_id=pulled_up.t1_id and pulled_up.count_all < t1.t1_int

The logical plan may look like:

| logical_plan  | Projection: t1.t1_int                                                                                                                           |
|               |   Inner Join: t1.t1_id = pulled_up.t1_id Filter: pulled_up.count_all < CAST(t1.t1_int AS Int64)                                                 |
|               |     TableScan: t1 projection=[t1_id, t1_int]                                                                                                    |
|               |     SubqueryAlias: pulled_up                                                                                                                    |
|               |       Projection: count(*) AS count_all, domain.t1_id                                                                                           |
|               |         Aggregate: groupBy=[[domain.t1_id, domain.t1_name]], aggr=[[count(Int64(1)) AS count(*)]]                                               |
|               |           Projection: domain.t1_id, domain.t1_name                                                                                              |
|               |             Inner Join:  Filter: t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name                                                         |
|               |               SubqueryAlias: domain                                                                                                             |
|               |                 Aggregate: groupBy=[[t1.t1_id, t1.t1_name]], aggr=[[]]                                                                          |
|               |                   TableScan: t1 projection=[t1_id, t1_name]                                                                                     |
|               |               TableScan: t2 projection=[t2_id, t2_name]
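
To sanity-check a rewrite of this shape, the original and a decorrelated form can be run against the same data and compared. The sketch below does that with Python's built-in sqlite3, which here is only a convenient stand-in engine, not DataFusion. The sample rows and NOT NULL constraints are made up for illustration. The decorrelated query is a hand-written variant of the rewrite above, not optimizer output, and it makes two assumptions of its own: a LEFT JOIN with a count over a t2 column, so that outer rows with no match still see a count of 0, and a join back to t1 on both domain columns.

```python
import sqlite3

# Original correlated form, as in the issue description.
CORRELATED = """
SELECT t1.t1_int FROM t1
WHERE (SELECT count(*) FROM t2
       WHERE t1.t1_id = t2.t2_id OR t1.t1_name = t2.t2_name) < t1.t1_int
"""

# Hand-written decorrelated variant (an assumption, not optimizer output).
# LEFT JOIN + count(t2.t2_id) keeps domain rows that match nothing in t2,
# so their count is 0 instead of the row disappearing.
DECORRELATED = """
SELECT t1.t1_int FROM t1
JOIN (
    SELECT d.t1_id, d.t1_name, count(t2.t2_id) AS count_all
    FROM (SELECT DISTINCT t1_id, t1_name FROM t1) AS d
    LEFT JOIN t2 ON t2.t2_id = d.t1_id OR t2.t2_name = d.t1_name
    GROUP BY d.t1_id, d.t1_name
) AS pulled_up
  ON t1.t1_id = pulled_up.t1_id AND t1.t1_name = pulled_up.t1_name
WHERE pulled_up.count_all < t1.t1_int
"""


def make_db():
    """Create an in-memory database with illustrative (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t1 (t1_id INT NOT NULL, t1_name TEXT NOT NULL, t1_int INT NOT NULL);
        CREATE TABLE t2 (t2_id INT NOT NULL, t2_name TEXT NOT NULL);
        INSERT INTO t1 VALUES (1, 'a', 0), (2, 'b', 2), (3, 'c', 1), (4, 'a', 3);
        INSERT INTO t2 VALUES (1, 'x'), (1, 'a'), (2, 'b'), (9, 'a');
    """)
    return conn


def run(conn, sql):
    """Return the single output column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


conn = make_db()
correlated_rows = run(conn, CORRELATED)
decorrelated_rows = run(conn, DECORRELATED)
print(correlated_rows, decorrelated_rows)
```

The t1 row with t1_id = 3 matches nothing in t2, so its subquery count is 0 and it passes the filter. A rewrite based on an inner join and count(*) would drop that row, which is worth keeping in mind when implementing the aggregation step.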


duongcongtoai · 2025-02-09T01:29:39Z

take

alamb · 2025-02-15T10:58:22Z

> This follows the first approach in the TUM paper (simple unnesting) and allows decorrelating this simple query

What do you think about implementing the more general approach to subquery unnesting described in that paper?

I think @xudong963 mentioned he had done something similar before

duongcongtoai · 2025-02-21T18:34:35Z

From what I see in the current code, the struct PullUpCorrelatedExpr is applied to scalar subqueries as well as predicate subqueries.

As for implementing that paper, I'll try my best to find time and figure out which use cases DataFusion cannot yet support. It will need to be done in steps/PRs.

alamb · 2025-02-22T13:03:57Z

> From what I see in the current code, the struct PullUpCorrelatedExpr is applied to scalar subqueries as well as predicate subqueries.
>
> As for implementing that paper, I'll try my best to find time and figure out which use cases DataFusion cannot yet support. It will need to be done in steps/PRs.

FWIW I think @xudong963 said he has experience implementing such code so perhaps he will be able to help / assist with the implementation and review

xudong963 · 2025-02-26T04:28:20Z

> FWIW I think @xudong963 said he has experience implementing such code so perhaps he will be able to help / assist with the implementation and review

Yes, please ping me @duongcongtoai in your PR

duongcongtoai · 2025-03-17T19:52:43Z

From this PR, several types of queries are mentioned that need support:

IN subquery contains LIMIT/ORDER BY

select * from students where student_id in (
	select e.student_id from exams e order by score limit 10
)

Scalar subquery contains limit/order by

select * from student s where s.last_semester_avg_score > (
	select avg(score) from (
		select score from exam e where e.student_id=s.student_id 
		order by timestamp limit 3
	)
)

There is a disjunction (OR) in the subquery filter (the initial proposal of this issue)

 select * from student s where (select avg(score) from exam e where e.student_id = s.student_id or e.student_name=s.student_name) > 0.5

Correlated expressions are in join condition

select * from students s join exam e on s.last_semester_avg_score > (
select avg(score) from exam e2 where e2.class_id=e.class_id
)

Correlated expressions are in aggregation expressions

SELECT * from students s where 5 <
(
    SELECT max(s.last_semester_avg_score+b.score) as max_adjusted_score
    FROM bonus b
);

Correlated expressions are in window expressions
For this case I cannot find an example query.

I'll start thinking about implementing unnesting for all these use cases.
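
For the "scalar subquery contains limit/order by" case listed above, one well-known way to express a per-row LIMIT without correlation is to number the rows per key with a window function and filter on that number. The sketch below compares the two forms on made-up data using Python's sqlite3 as a stand-in engine. This technique is not taken from this thread or from DataFusion's optimizer, and the table contents are illustrative. It also assumes timestamps are unique per student; with ties, LIMIT and row_number() may pick different rows.

```python
import sqlite3

# Correlated form, following the example in the list above.
CORRELATED = """
SELECT s.student_id FROM student s
WHERE s.last_semester_avg_score > (
    SELECT avg(score) FROM (
        SELECT score FROM exam e
        WHERE e.student_id = s.student_id
        ORDER BY timestamp LIMIT 3
    )
)
"""

# Hand-written decorrelated form (an assumption, not optimizer output):
# "ORDER BY ... LIMIT 3 per outer row" becomes "row_number() <= 3 per key".
DECORRELATED = """
SELECT s.student_id FROM student s
JOIN (
    SELECT student_id, avg(score) AS avg_score
    FROM (
        SELECT student_id, score,
               row_number() OVER (PARTITION BY student_id ORDER BY timestamp) AS rn
        FROM exam
    )
    WHERE rn <= 3
    GROUP BY student_id
) AS recent ON recent.student_id = s.student_id
WHERE s.last_semester_avg_score > recent.avg_score
"""


def make_db():
    """Create an in-memory database with illustrative (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (student_id INT NOT NULL, last_semester_avg_score REAL NOT NULL);
        CREATE TABLE exam (student_id INT NOT NULL, score REAL NOT NULL, timestamp INT NOT NULL);
        INSERT INTO student VALUES (1, 8.0), (2, 5.0), (3, 9.0);
        INSERT INTO exam VALUES
            (1, 6, 1), (1, 7, 2), (1, 8, 3), (1, 20, 4),
            (2, 9, 1), (2, 4, 2);
    """)
    return conn


def run(conn, sql):
    """Return the single output column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


conn = make_db()
correlated_ids = run(conn, CORRELATED)
decorrelated_ids = run(conn, DECORRELATED)
print(correlated_ids, decorrelated_ids)
```

Because avg over an empty input is NULL and the comparison then filters the row out, an inner join is enough here; the count(*) case differs because an empty input yields 0 rather than NULL.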

ctsk · 2025-03-31T09:41:38Z

Hey @duongcongtoai,

I want to draw your attention to a follow-up paper on "Unnesting Arbitrary Queries": https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf

This paper improves on the original approach by better dealing with multiple nesting levels. It also describes the process in an algorithmic way that might be closer to the implementation

duongcongtoai · 2025-04-03T17:19:56Z

Thank you, I'll take a look at the paper.

duongcongtoai · 2025-04-12T18:36:35Z

I think we can break down this story into multiple steps:

Unify the optimizer for correlated queries, regardless of the query type (EXISTS query, scalar query, etc.)
Support a flexible decorrelation scheme (simple vs. general approach). We can achieve this by following the algorithm described in the second paper. As a prerequisite, an index algebra has to be introduced during the rewrite. Building this index requires a pre-traversal of the whole query to detect all non-trivial subqueries and to decide whether simple unnesting is sufficient or the framework should continue with the general approach
Implement general-purpose, recursion-aware subquery decorrelation for the major operators (projection, filter, group by) using the top-down algorithm described in the second paper
Gradually support more complex expressions (group by, order by, limit, window functions)

alamb · 2025-04-17T19:24:44Z

I really like the idea of the incremental approach -- I think it is practically speaking the only one we are likely to be able to pull off. Thank you @duongcongtoai

There are a bunch of related tickets listed on this epic:

[EPIC] More Subquery support #5483

What do you think about creating a new ticket with the steps you outline above @duongcongtoai ? I am pretty sure others are interested in this feature as well and may be able to help

duongcongtoai · 2025-04-27T05:59:39Z

I think we can reuse this ticket, right? #5492

xudong963 · 2025-04-27T09:49:04Z

Also, there is a newer paper for the topic: https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf

duongcongtoai · 2025-04-28T18:28:11Z

Yes, that paper basically gives a pretty neat skeleton for a decorrelation framework.

alamb · 2025-04-28T21:54:32Z

I am trying to organize a join task force for planning joins / subqueries: [DISCUSSION] JOIN "task force" / project team #15885

alamb · 2025-05-18T13:01:55Z

I recommend we continue the discussion on

General framework to decorrelate the subqueries #5492

duongcongtoai added the enhancement New feature or request label Feb 8, 2025

This was referenced Apr 17, 2025

[EPIC] More Subquery support #5483

Open

Nested correlated subquery error with a depth exceeding 1 #15558

Open

alamb mentioned this issue Apr 28, 2025

[DISCUSSION] JOIN "task force" / project team #15885

Open

22 tasks
