[SPARK-16958] [SQL] Reuse subqueries within the same query #14548

Closed
wants to merge 4 commits into from

@davies
Contributor

davies commented Aug 8, 2016

What changes were proposed in this pull request?

A query can contain multiple subqueries that generate the same result; we can reuse that result instead of running the subquery multiple times.

This PR also cleans up how we run subqueries.

For the SQL query

select id,(select avg(id) from t) from t where id > (select avg(id) from t)

the explain output is

== Physical Plan ==
*Project [id#15L, Subquery subquery29 AS scalarsubquery()#35]
:  +- Subquery subquery29
:     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
:        +- Exchange SinglePartition
:           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
:              +- *Range (0, 1000, splits=4)
+- *Filter (cast(id#15L as double) > Subquery subquery29)
   :  +- Subquery subquery29
   :     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
   :        +- Exchange SinglePartition
   :           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
   :              +- *Range (0, 1000, splits=4)
   +- *Range (0, 1000, splits=4)
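
In the plan above, the same `Subquery subquery29` node appears under both the `Project` and the `Filter`. As a rough illustration of the idea only (this is not the code in this PR, and it does not use Spark's classes), the sketch below deduplicates subqueries whose plans are structurally equal, so that the shared instance is evaluated once. `Plan`, `RangeScan`, `Avg`, `Subquery`, and `ReuseSketch` are hypothetical names made up for this example.

```scala
import scala.collection.mutable

// Toy stand-ins for physical plan nodes; these are NOT Spark's classes.
sealed trait Plan
final case class RangeScan(start: Long, end: Long) extends Plan
final case class Avg(child: Plan) extends Plan

// A subquery wrapper that counts how often its plan is actually evaluated.
final class Subquery(val name: String, val plan: Plan) {
  var runs: Int = 0
  // lazy val: the body runs at most once per Subquery instance.
  lazy val result: Double = { runs += 1; ReuseSketch.eval(plan) }
}

object ReuseSketch {
  // Evaluate the only plan shape this toy supports: avg(id) over a range.
  def eval(plan: Plan): Double = plan match {
    case Avg(RangeScan(s, e)) => (s until e).sum.toDouble / (e - s)
    case other                => sys.error(s"unsupported toy plan: $other")
  }

  // Replace each subquery with the first one seen that has an equal plan.
  // Case-class equality is structural, so equal plans map to the same key.
  def reuse(subqueries: Seq[Subquery]): Seq[Subquery] = {
    val seen = mutable.HashMap.empty[Plan, Subquery]
    subqueries.map(sq => seen.getOrElseUpdate(sq.plan, sq))
  }
}
```

Given two subqueries built from `Avg(RangeScan(0L, 1000L))`, `ReuseSketch.reuse` returns the first instance in both positions, so reading `result` from either position evaluates the plan only once. How the actual rule decides that two physical plans produce the same result is not shown in this description; structural equality here is an assumption of the sketch.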

The visualized plan:

[image: reuse-subquery]

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63389 has finished for PR 14548 at commit 1348ba7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExecSubqueryExpression extends SubqueryExpression
    • case class InSubquery(
    • case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan]

@@ -502,15 +508,64 @@ case class OutputFakerExec(output: Seq[Attribute], child: SparkPlan) extends Spa

/**
 * Physical plan for a subquery.
 *
 * This is used to generate tree string for SparkScalarSubquery.
 */
case class SubqueryExec(name: String, child: SparkPlan) extends UnaryExecNode {
Contributor

A large part of this class is shared with BroadcastExchangeExec. Should we try to factor out common functionality?

Contributor Author

I think it's OK to have some duplicated code here; over-abstracted code is actually harder to read.

@hvanhovell
Contributor

@davies this looks pretty good. I am very excited about the SparkPlan clean-up!

@davies
Contributor Author

davies commented Aug 10, 2016

@hvanhovell I've posted a picture, check it out.

@SparkQA

SparkQA commented Aug 10, 2016

Test build #63560 has finished for PR 14548 at commit 8444447.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2016

Test build #63563 has finished for PR 14548 at commit dd1581b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Cool picture!

@hvanhovell
Contributor

LGTM

@davies
Contributor Author

davies commented Aug 11, 2016

Merging it into master, thanks!

@asfgit asfgit closed this in 0f72e4f Aug 11, 2016
asfgit pushed a commit that referenced this pull request Dec 5, 2018
## What changes were proposed in this pull request?

This code comes from PR #11190,
but it has not been used since PR #14548.
Let's continue to fix it. Thanks.

## How was this patch tested?

N / A

Closes #23227 from heary-cao/unuseSparkPlan.

Authored-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@JkSelf
Contributor

JkSelf commented Jan 16, 2019

@davies @hvanhovell @gatorsmile
Subquery reuse may not be working here. In my test, the visualized plan does show the subquery being executed once, as follows:
[image]

But in fact, the stage for the same subquery may execute more than once, as follows:
[image]
Maybe I am missing something; can you help verify this? Thanks for your help!

@hvanhovell
Contributor

@JkSelf can you file a JIRA ticket?

@JkSelf
Contributor

JkSelf commented Jan 17, 2019

@hvanhovell, thanks for your help. I have filed JIRA 26639.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019