-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Skip data in merge based on MATCHED conditions #1851
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great optimization! Thank you for doing this.
// When they are only MATCHED clauses, we prune after the join the files that have no rows that | ||
// satisfy any of the clause conditions. | ||
val matchedPredicate = | ||
if (isMatchedOnly) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common scenario is:
MERGE INTO .... t
USING .... s
ON s.id = t.id
WHEN MATCHED AND NOT (s.name <=> t.name) THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)
In that case the optimization won't work because it has when not matched
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't skip data when reading the target in this case: let's assume there's a row {"id": 10, "name": "Felipe"}
in both the source and target tables, i.e. it's a match but it won't be updated by the first MATCHED
clause.
If we apply NOT (s.name <=> t.name)
as a matched predicate here to prune files, then the file that contains that row in the target table may get pruned from the list of modified files. It won't be included in the join in writeAllChanges
and the corresponding row in the source won't have a match anymore and will be wrongly inserted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying it. Do you know if in this case, we are re-writing the parquet file (with the same data, given the matched predicate is false).
I'm asking because user can have large tables and constantly update it using merge, and even if source and target are identical, we would rewrite all the data files. This is what I meant in #1812
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we would rewrite all files where at least one row matches the ON condition. I imagine that would be more complex to implement than the data skipping I'm adding here.
e453507
to
6298b6f
Compare
6298b6f
to
bedb36d
Compare
Related to:
[Feature Request] Merge performance improvements
[BUG] Merge copy rows even when match clause is false. Potential for huge perf impact.
Description
This change adds data skipping in merge when there are only
MATCHED
clause instead of only relying on theON
condition. The match conditions are used to skip files when scanning the table and to prune the list of modified files infindTouchedFiles
.For example:
A predicate
target.value = 1 OR target.value = 2
is used to skip files to scan based on file statistics and to remove files that effectively have no rows updated or deleted after the join infindTouchedFiles
How was this patch tested?
Tests covering data skipping for different scenarios are added to
MergeIntoSuiteBase