-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Spark] Fix the inconsistencies in min/max Delta Log stats for specia…
…l characters (#3430) ## Description When truncating maxValue strings longer than 32 characters for statistics, it's crucial to ensure the final truncated string is lexicographically greater than or equal to the input string in UTF-8 encoded bytes. Previously, we used the Unicode replacement character as the tieBreaker, comparing it directly against one byte of the next character at a time. This approach was insufficient because the tieBreaker could incorrectly win against the non-first bytes of other characters (e.g., � < 🌼 but � > the second byte of 🌼). We now compare one UTF-8 character (i.e. upto 2 Scala UTF-16 characters depending on surrogates) at a time to address this issue. We also start using U+10FFFD i.e. character with highest Unicode code point as the tie-breaker now. ## How was this patch tested? UTs
- Loading branch information
Showing
3 changed files
with
122 additions
and
37 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters