Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Skip data in merge based on MATCHED conditions #1851

Closed
wants to merge 2 commits into from

Conversation

johanl-db
Copy link
Collaborator

@johanl-db johanl-db commented Jun 21, 2023

Related to:
[Feature Request] Merge performance improvements
[BUG] Merge copy rows even when match clause is false. Potential for huge perf impact.

Description

This change adds data skipping in merge when there are only MATCHED clause instead of only relying on the ON condition. The match conditions are used to skip files when scanning the table and to prune the list of modified files in findTouchedFiles.

For example:

MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED AND target.value = 1 THEN UPDATE SET *
WHEN MATCHED AND target.value = 2 THEN DELETE

A predicate target.value = 1 OR target.value = 2 is used to skip files to scan based on file statistics and to remove files that effectively have no rows updated or deleted after the join in findTouchedFiles

How was this patch tested?

Tests covering data skipping for different scenarios are added to MergeIntoSuiteBase

Copy link
Contributor

@tdas tdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great optimization! Thank you for doing this.

// When they are only MATCHED clauses, we prune after the join the files that have no rows that
// satisfy any of the clause conditions.
val matchedPredicate =
if (isMatchedOnly) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common scenario is:

MERGE INTO .... t
USING .... s
   ON s.id = t.id
   WHEN MATCHED AND NOT (s.name <=> t.name) THEN UPDATE SET t.name = s.name
   WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)

In that case the optimization won't work because it has when not matched?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't skip data when reading the target in this case: let's assume there's a row {"id": 10, "name": "Felipe"} in both the source and target tables, i.e. it's a match but it won't be updated by the first MATCHED clause.

If we apply NOT (s.name <=> t.name) as a matched predicate here to prune files, then the file that contains that row in the target table may get pruned from the list of modified files. It won't be included in the join in writeAllChanges and the corresponding row in the source won't have a match anymore and will be wrongly inserted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying it. Do you know if in this case, we are re-writing the parquet file (with the same data, given the matched predicate is false).

I'm asking because user can have large tables and constantly update it using merge, and even if source and target are identical, we would rewrite all the data files. This is what I meant in #1812

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we would rewrite all files where at least one row matches the ON condition. I imagine that would be more complex to implement than the data skipping I'm adding here.

@johanl-db johanl-db force-pushed the merge-data-skipping branch from e453507 to 6298b6f Compare June 22, 2023 11:13
@johanl-db johanl-db force-pushed the merge-data-skipping branch from 6298b6f to bedb36d Compare June 22, 2023 11:13
@johanl-db
Copy link
Collaborator Author

Closed by b104309.

#1852 was stacked on top of this PR and got merged first, picking the changes from this PR in the process in the same commit b104309.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants