
[BUG][AUDIT] SPARK-45652 - SPJ: Handle empty input partitions after dynamic filtering #9743

Closed
tgravescs opened this issue Nov 16, 2023 · 1 comment · Fixed by #9962
Labels: audit_3.4.2, audit_3.5.1, audit_4.0.0, bug

Comments

@tgravescs (Collaborator)

Describe the bug
https://issues.apache.org/jira/browse/SPARK-45652 was part of the storage partitioned join feature. It changes BatchScanExec, for which we have equivalent code in GpuBatchScanExec, so we should pull this change in.

apache/spark@75aed566018
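
For context, the behavior the upstream patch addresses can be sketched in plain Scala with stand-in types (an illustrative outline only, not the actual BatchScanExec code): when runtime filtering removes every input split from a key-grouped scan, the scan should still expose one (possibly empty) group per reported partition, so the RDD layout matches the declared KeyGroupedPartitioning.

// Illustrative sketch only; InputPartition here is a plain stand-in type.
object EmptyPartitionSketch {
  type InputPartition = String                 // stand-in for a DSv2 InputPartition
  type Group = Seq[InputPartition]             // splits grouped by join key

  // numGroups is what the scan already reported via its KeyGroupedPartitioning.
  def groupAfterRuntimeFiltering(
      filtered: Seq[(Int, InputPartition)],    // (group index, surviving split)
      numGroups: Int): Seq[Group] = {
    if (filtered.isEmpty) {
      // The gist of the fix: emit the declared number of (empty) groups instead of
      // collapsing to zero partitions, so the scan output still matches the
      // reported output partitioning.
      Seq.fill(numGroups)(Seq.empty)
    } else {
      (0 until numGroups).map { k => filtered.collect { case (`k`, p) => p } }
    }
  }

  def main(args: Array[String]): Unit = {
    println(groupAfterRuntimeFiltering(Nil, numGroups = 3))
    // List(List(), List(), List())
    println(groupAfterRuntimeFiltering(Seq(0 -> "a", 2 -> "b"), numGroups = 3))
    // Vector(List(a), List(), List(b))
  }
}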

@tgravescs added the bug, ? - Needs Triage, audit_3.4.2, audit_3.5.1, and audit_4.0.0 labels Nov 16, 2023
@sameerz removed the ? - Needs Triage label Nov 21, 2023
@razajafri (Collaborator)

I have tried reproducing this, but I am unable to see it in local mode. Here are the steps I took:

$> spark-homes/spark-3.4.1-bin-hadoop3/bin/spark-shell --conf spark.sql.autoBroadcastJoinThreshold="-1" --conf spark.sql.adaptive.enabled="false" --conf spark.sql.optimizer.dynamicPartitionPruning.enabled="true" --conf spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly="false" --conf spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio="10"

scala> import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf
scala> spark.sql("create table test_purchases (item_id long, price float, time timestamp) using parquet partitioned by (item_id)")
scala> spark.sql("create table test_table (id long, name string, price float, arrive_time timestamp) using parquet partitioned by (id)")
scala> spark.sql("insert into test_purchases values((19.5, cast('2020-02-01' as timestamp), 3))")
scala> spark.sql("insert into test_purchases values((11.0, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_purchases values((45.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_purchases values((44.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_purchases values((42.0, cast('2020-01-01' as timestamp), 1))")
scala> spark.sql("insert into test_table values(('cc', 15.5, cast('2020-02-01' as timestamp), 3))")
scala> spark.sql("insert into test_table values(('bb', 10.5, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_table values(('bb', 10.0, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_table values(('aa', 41.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_table values(('aa', 40.0, cast('2020-01-01' as timestamp), 1))")
scala> Seq(true, false).foreach { pushDownValues =>
     | Seq(true, false).foreach { partiallyClustered => {
     | spark.conf.set(SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key, partiallyClustered)
     | spark.conf.set(SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key, pushDownValues)
     | spark.sql("select /*+ REPARTITION(3)*/ p.price from test_table i, test_purchases p WHERE i.id = p.item_id AND i.price > 50.0").show
     | }
     | }
     | }
+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

After discussing this with @tgravescs, the next step he suggested was to build Spark 3.4.1 locally and run the unit tests.
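
Before rebuilding Spark, it may also be worth confirming that the repro query actually goes through BatchScanExec at all, since this change only affects the DataSourceV2 scan path. A minimal check, assuming the same spark-shell session and the tables created above:

scala> val plan = spark.sql("select p.price from test_table i, test_purchases p WHERE i.id = p.item_id").queryExecution.executedPlan
scala> plan.toString.contains("BatchScan")   // false would suggest the query is not using a DSv2 BatchScan, so the affected code path is never exercised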
