
[Spark] Fix the inconsistencies in min/max Delta Log stats for special characters #3430

Merged: 1 commit into delta-io:master on Aug 1, 2024

Conversation

@sumeet-db (Collaborator) commented on Jul 26, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

When truncating maxValue strings longer than 32 characters for statistics, the truncated string must remain lexicographically greater than or equal to the input string in UTF-8 encoded bytes; otherwise the recorded max would understate the true maximum, and data skipping based on these stats could incorrectly prune files.

Previously, we used the Unicode replacement character (U+FFFD) as the tie-breaker and compared it directly against one byte of the next character at a time. This was insufficient because the tie-breaker could incorrectly win against the non-first bytes of a multi-byte character: U+FFFD (EF BF BD) is less than 🌼 (U+1F33C, F0 9F 8C BC) as a whole, yet its first byte 0xEF is greater than the flower's second byte 0x9F. We now compare one full UTF-8 character at a time (i.e. one code point, which spans up to two Scala UTF-16 chars when surrogate pairs are involved) to address this issue.
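
For illustration, here is a small standalone Scala snippet (hypothetical, not code from this PR) that prints the UTF-8 bytes involved and shows how a byte-at-a-time check against U+FFFD can flip the ordering for 🌼 (U+1F33C):

```scala
// Illustrative only: why comparing the U+FFFD tie-breaker against single bytes
// of a multi-byte character gives the wrong answer.
object TieBreakerBytesDemo extends App {
  val replacement = "\uFFFD"                               // U+FFFD  -> EF BF BD
  val flower      = new String(Character.toChars(0x1F33C)) // U+1F33C -> F0 9F 8C BC

  def hex(s: String): String =
    s.getBytes("UTF-8").map(b => f"${b & 0xFF}%02X").mkString(" ")

  println(s"U+FFFD  bytes: ${hex(replacement)}")  // EF BF BD
  println(s"U+1F33C bytes: ${hex(flower)}")       // F0 9F 8C BC

  // Comparing whole code points gives the expected order:
  println(0xFFFD < 0x1F33C)   // true: the replacement char is smaller than the flower

  // But the old logic compared the tie-breaker against *one byte* of the next
  // character at a time; against the flower's second (continuation) byte it wins:
  println(0xEF > 0x9F)        // true: 0xEF beats the continuation byte 0x9F
}
```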

We also now use U+10FFFD, the largest Unicode code point that is not a noncharacter, as the tie-breaker.
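
A minimal sketch of the corrected approach, assuming a 32-code-point prefix and the U+10FFFD tie-breaker (illustration only, not the actual Delta implementation; the object and method names below are made up):

```scala
// Sketch: truncate a max-stat string to roughly `prefixLen` code points while
// keeping the result >= the original input under UTF-8 byte ordering.
object TruncateMaxSketch {
  private val TieBreakerCodePoint = 0x10FFFD
  private val TieBreaker = new String(Character.toChars(TieBreakerCodePoint))

  def truncateMax(input: String, prefixLen: Int = 32): String = {
    if (input.codePointCount(0, input.length) <= prefixLen) {
      input
    } else {
      // Cut at a code-point boundary so surrogate pairs (e.g. emoji) are never split.
      var end = input.offsetByCodePoints(0, prefixLen)

      // Compare whole code points, not bytes: while the next code point is >= the
      // tie-breaker, the tie-breaker alone would not dominate the dropped tail,
      // so extend the prefix by one more code point and check again.
      while (end < input.length && input.codePointAt(end) >= TieBreakerCodePoint) {
        end = input.offsetByCodePoints(end, 1)
      }
      input.substring(0, end) + TieBreaker
    }
  }
}
```

Because UTF-8 byte order agrees with code-point order for valid Unicode scalar values, the appended U+10FFFD then compares greater than or equal to whatever was truncated away, which is exactly the invariant the old byte-wise check could violate.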

How was this patch tested?

Unit tests (UTs).

Does this PR introduce any user-facing changes?

No

@cstavr (Contributor) left a comment:

LGTM!

@vkorukanti merged commit 890889a into delta-io:master on Aug 1, 2024
9 of 10 checks passed