Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Improve DV path canonicalization #1829

Closed
wants to merge 2 commits into from

Conversation

xupefei
Copy link
Contributor

@xupefei xupefei commented Jun 9, 2023

Description

This PR improves the FILE_PATH canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table. Canonicalized paths are now cached and the UDF just needs to look it up.

Future improvement is possible for handling huge logs: build canonicalizedPathMap in a distributed way.

How was this patch tested?

Existing tests.

@xupefei
Copy link
Contributor Author

xupefei commented Jun 13, 2023

@larsk-db Could you take a look at this PR?

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a couple of nits

vkorukanti pushed a commit that referenced this pull request Jun 20, 2023
## Description

This PR improves the FILE_PATH canonicalization logic by avoiding calling expensive `Path.toUri.toString` calls for each row in a table. Canonicalized paths are now cached and the UDF just needs to look it up.

Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.

Related PR target the 2.4 branch: #1829.

Existing tests.

Closes #1836

Signed-off-by: Paddy Xu <xupaddy@gmail.com>
GitOrigin-RevId: c4810852f9136c36ec21f3519620ca26ed12bb04
// Build two maps, using Path or String as keys. The one with String keys is used in UDF.
val canonicalizedPathMap = buildCanonicalizedPathMap(txn.deltaLog, candidateFiles)
val canonicalizedPathStringMap =
canonicalizedPathMap.map { case (k, v) => k.toString -> v }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this second pass, for a single-callsite helper method? Can it just return a string-string map directly, since that's what we ultimately broadcast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! I addressed your comment in a later PR #1770.

@xupefei
Copy link
Contributor Author

xupefei commented Jun 30, 2023

Closed in favor of #1770.

@xupefei xupefei closed this Jun 30, 2023
@xupefei xupefei deleted the branch-2.4 branch June 30, 2023 08:34
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants