Skip data in merge based on MATCHED conditions #1851

johanl-db · 2023-06-21T07:25:58Z

Related to:
[Feature Request] Merge performance improvements
[BUG] Merge copy rows even when match clause is false. Potential for huge perf impact.

Description

This change adds data skipping in merge based on the MATCHED clause conditions when there are only MATCHED clauses, instead of relying only on the ON condition. The clause conditions are used to skip files when scanning the target table and to prune the list of modified files in findTouchedFiles.

For example:

MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED AND target.value = 1 THEN UPDATE SET *
WHEN MATCHED AND target.value = 2 THEN DELETE

The predicate target.value = 1 OR target.value = 2 is used to skip files when scanning, based on file statistics, and to remove from the result of findTouchedFiles any file that has no rows actually updated or deleted after the join.
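
As a rough illustration of the statistics-based part, the sketch below keeps a file only if at least one clause condition could hold for some row in it, judged from per-file min/max values. The names FileStats, ValueEquals and filesToScan are hypothetical and exist only for this example; they are not Delta's classes, and the real implementation works on Catalyst expressions and Delta's file statistics.

```scala
// Hypothetical model for illustration only; not Delta's actual classes or API.
object MatchedClauseSkipping {
  // Min/max of the `value` column recorded for one data file.
  final case class FileStats(path: String, minValue: Int, maxValue: Int)

  // A MATCHED clause condition of the form `target.value = literal`.
  final case class ValueEquals(literal: Int) {
    // The file can contain a qualifying row only if the literal lies in [min, max].
    def mayMatch(stats: FileStats): Boolean =
      stats.minValue <= literal && literal <= stats.maxValue
  }

  // Keep a file if at least one clause condition may hold in it,
  // i.e. evaluate the OR of all clause conditions against the file statistics.
  def filesToScan(files: Seq[FileStats], clauseConditions: Seq[ValueEquals]): Seq[FileStats] =
    files.filter(f => clauseConditions.exists(_.mayMatch(f)))

  def main(args: Array[String]): Unit = {
    val files = Seq(
      FileStats("part-0", minValue = 0, maxValue = 1),
      FileStats("part-1", minValue = 3, maxValue = 9),
      FileStats("part-2", minValue = 2, maxValue = 5))
    // Conditions taken from the two MATCHED clauses in the example merge.
    val conditions = Seq(ValueEquals(1), ValueEquals(2))
    println(filesToScan(files, conditions).map(_.path).mkString(", ")) // part-0, part-2
  }
}
```

Here part-1 is skipped because neither 1 nor 2 falls inside its [3, 9] range, so no row in it can satisfy any MATCHED clause.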

How was this patch tested?

Tests covering data skipping in different scenarios are added to MergeIntoSuiteBase.

spark/src/main/scala/org/apache/spark/sql/delta/DeltaTable.scala

tdas

This is a great optimization! Thank you for doing this.

felipepessoto · 2023-06-22T01:55:37Z

spark/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala

+    // When they are only MATCHED clauses, we prune after the join the files that have no rows that
+    // satisfy any of the clause conditions.
+    val matchedPredicate =
+      if (isMatchedOnly) {


A common scenario is:

MERGE INTO .... t
USING .... s
ON s.id = t.id
WHEN MATCHED AND NOT (s.name <=> t.name) THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)

In that case, the optimization won't apply because the merge has a WHEN NOT MATCHED clause?

johanl-db · 2023-06-22

We can't skip data when reading the target in this case. Let's assume there's a row {"id": 10, "name": "Felipe"} in both the source and target tables, i.e. it's a match, but it won't be updated by the first MATCHED clause.

If we apply NOT (s.name <=> t.name) as a matched predicate here to prune files, then the file that contains that row in the target table may get pruned from the list of modified files. It won't be included in the join in writeAllChanges, so the corresponding row in the source won't have a match anymore and will be wrongly inserted.
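
The scenario above can be reproduced with a small, self-contained sketch. Everything here (Row, DataFile, touchedFiles, insertedRows) is a hypothetical stand-in for illustration and does not mirror the real findTouchedFiles or writeAllChanges code; names are assumed non-null, so != behaves like NOT (s.name <=> t.name).

```scala
// Hypothetical simulation for illustration only; not Delta's implementation.
object UnsafePruningDemo {
  final case class Row(id: Int, name: String)
  final case class DataFile(path: String, rows: Seq[Row])

  // Files with at least one target row that matches a source row on `id` and,
  // when `applyMatchedCondition` is set, also satisfies NOT (s.name <=> t.name).
  def touchedFiles(
      target: Seq[DataFile],
      source: Seq[Row],
      applyMatchedCondition: Boolean): Seq[DataFile] =
    target.filter { file =>
      file.rows.exists { t =>
        source.exists(s => s.id == t.id && (!applyMatchedCondition || s.name != t.name))
      }
    }

  // Source rows that find no match among the touched files fall through to
  // WHEN NOT MATCHED and get inserted.
  def insertedRows(touched: Seq[DataFile], source: Seq[Row]): Seq[Row] = {
    val joinedIds = touched.flatMap(_.rows).map(_.id).toSet
    source.filterNot(s => joinedIds.contains(s.id))
  }

  def main(args: Array[String]): Unit = {
    val target = Seq(DataFile("part-0", Seq(Row(10, "Felipe"))))
    val source = Seq(Row(10, "Felipe"))

    // Pruning on the ON condition only: the row is matched, nothing is inserted.
    println(insertedRows(touchedFiles(target, source, applyMatchedCondition = false), source))
    // Pruning with the MATCHED condition: the file is dropped, so the source row
    // no longer finds its match and would be inserted as a duplicate.
    println(insertedRows(touchedFiles(target, source, applyMatchedCondition = true), source))
  }
}
```

The first call yields no inserted rows, while the second yields Row(10, "Felipe") again, which is the wrong insert described above.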

felipepessoto · 2023-06-22

Thanks for clarifying. Do you know whether, in this case, we rewrite the Parquet file (with the same data, given that the matched predicate is false)?

I'm asking because users can have large tables and constantly update them using merge, and even if the source and target are identical, we would rewrite all the data files. This is what I meant in #1812.

johanl-db · 2023-06-23

Yes, we would rewrite all files where at least one row matches the ON condition. I imagine avoiding that rewrite would be more complex to implement than the data skipping I'm adding here.
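
To make the cost concrete, here is a minimal sketch that counts how many files are rewritten and how many rows are merely copied when a file is selected because any of its rows matches the ON condition. Row, DataFile, Cost and rewriteCost are hypothetical names used only for this example, and the update condition is assumed to be a simple name comparison on non-null names.

```scala
// Hypothetical cost model for illustration only; not Delta's implementation.
object RewriteCost {
  final case class Row(id: Int, name: String)
  final case class DataFile(path: String, rows: Seq[Row])
  final case class Cost(filesRewritten: Int, rowsUpdated: Int, rowsCopied: Int)

  def rewriteCost(target: Seq[DataFile], source: Seq[Row]): Cost = {
    val sourceById = source.map(s => s.id -> s).toMap
    // A file is rewritten as soon as one of its rows matches the ON condition (s.id = t.id).
    val touched = target.filter(_.rows.exists(t => sourceById.contains(t.id)))
    val rows = touched.flatMap(_.rows)
    // Only rows whose name actually differs are changed; the rest are copied as-is.
    val updated = rows.count(t => sourceById.get(t.id).exists(_.name != t.name))
    Cost(touched.size, updated, rows.size - updated)
  }

  def main(args: Array[String]): Unit = {
    val target = Seq(
      DataFile("part-0", Seq(Row(1, "a"), Row(2, "b"))),
      DataFile("part-1", Seq(Row(3, "c"))))
    // The source rows are identical to their target counterparts.
    val source = Seq(Row(1, "a"), Row(3, "c"))
    println(rewriteCost(target, source)) // Cost(2,0,3)
  }
}
```

With an identical source, both files are rewritten even though no row changes, which is the situation raised in #1812.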

johanl-db · 2023-06-27T07:55:47Z

Closed by b104309.

#1852 was stacked on top of this PR and got merged first, picking up the changes from this PR in the same commit, b104309.

Data skipping and column pruning in merge

e27029d

This was referenced Jun 21, 2023

Use Merge Insert-only path with multiple NOT MATCHED clauses and merges with only inserted rows. #1852

Closed

[Feature Request] Merge performance improvements #1827

Closed

Improve generating and writing out changes in Merge #1854

Closed

tdas reviewed Jun 21, 2023

tdas approved these changes Jun 21, 2023

felipepessoto reviewed Jun 22, 2023

johanl-db force-pushed the merge-data-skipping branch from e453507 to 6298b6f Compare June 22, 2023 11:13

Reword read char padding comments

bedb36d

johanl-db force-pushed the merge-data-skipping branch from 6298b6f to bedb36d Compare June 22, 2023 11:13

johanl-db closed this Jun 27, 2023

johanl-db mentioned this pull request Apr 19, 2024

[Feature Request] [MERGE] Avoid copy rows when match clause is false. Potential for huge perf impact. #1812

Open

8 tasks
