Improve DV path canonicalization #1829
@larsk-db Could you take a look at this PR?
LGTM just a couple of nits
core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala
## Description
This PR improves the FILE_PATH canonicalization logic by avoiding an expensive `Path.toUri.toString` call for each row in a table. Canonicalized paths are now cached, and the UDF only needs to look them up.
Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.
A related PR targets the 2.4 branch: #1829.
## How was this patch tested?
Existing tests.
Closes #1836
Signed-off-by: Paddy Xu <xupaddy@gmail.com>
GitOrigin-RevId: c4810852f9136c36ec21f3519620ca26ed12bb04
// Build two maps, using Path or String as keys. The one with String keys is used in UDF.
val canonicalizedPathMap = buildCanonicalizedPathMap(txn.deltaLog, candidateFiles)
val canonicalizedPathStringMap =
  canonicalizedPathMap.map { case (k, v) => k.toString -> v }
Why have this second pass, for a single-callsite helper method? Can it just return a string-string map directly, since that's what we ultimately broadcast?
Thanks for the comment! I addressed your comment in a later PR #1770.
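For readers following the thread, the review suggestion above (have the helper return a string-keyed map directly instead of re-keying it in a second pass) could look roughly like the sketch below. It is not the code from this PR or from #1770: `java.net.URI` stands in for Hadoop's `Path`, and `SinglePassMapSketch`, `twoPass`, and `onePass` are hypothetical names used only for illustration.

```scala
import java.net.URI

object SinglePassMapSketch {
  // Two passes: build a URI-keyed map first, then re-key every entry by string.
  // This mirrors the shape of the snippet under review (URI is a stand-in for Path).
  def twoPass(paths: Seq[String]): Map[String, String] = {
    val byUri: Map[URI, String] = paths.map(p => new URI(p).normalize() -> p).toMap
    byUri.map { case (k, v) => k.toString -> v }
  }

  // One pass: emit the string key directly, so no intermediate map is built.
  def onePass(paths: Seq[String]): Map[String, String] =
    paths.map(p => new URI(p).normalize().toString -> p).toMap
}
```

Both functions produce the same map; the single-pass version just skips the intermediate collection, which matters only because the string-keyed map is the one that ends up being used.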
Closed in favor of #1770.
Description
This PR improves the FILE_PATH canonicalization logic by avoiding an expensive `Path.toUri.toString` call for each row in a table. Canonicalized paths are now cached, and the UDF only needs to look them up.
Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.
How was this patch tested?
Existing tests.
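To illustrate the caching idea from the description, here is a minimal, Spark-free sketch. It assumes each candidate file is canonicalized once up front and each row then does a plain string lookup. Everything in it is hypothetical: `FileEntry` stands in for a file action from the log, `canonicalize` stands in for the `Path.toUri.toString` conversion (implemented here with `java.net.URI`), and `buildCanonicalizedPathStringMap` is an illustrative name, not the helper from this PR.

```scala
import java.net.URI

// Hypothetical stand-in for a candidate file recorded in the table log.
final case class FileEntry(relativePath: String, numRecords: Long)

object CanonicalPathCacheSketch {
  // Stand-in for the expensive conversion: resolve the file against the
  // table root and normalize it into a canonical URI string.
  def canonicalize(tableRoot: String, relativePath: String): String = {
    val root = if (tableRoot.endsWith("/")) tableRoot else tableRoot + "/"
    new URI(root).resolve(relativePath).normalize().toString
  }

  // Canonicalize once per candidate file and cache the result by string key.
  def buildCanonicalizedPathStringMap(
      tableRoot: String,
      files: Seq[FileEntry]): Map[String, FileEntry] =
    files.map(f => canonicalize(tableRoot, f.relativePath) -> f).toMap

  // Per-row work is now a map lookup rather than a path conversion.
  def lookup(cache: Map[String, FileEntry], rowFilePath: String): Option[FileEntry] =
    cache.get(rowFilePath)
}
```

With this shape, the conversion cost scales with the number of candidate files rather than the number of rows. In a real job the map would presumably be shipped to executors (for example as a broadcast variable) and consulted inside the UDF.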