[Spark] Fix the inconsistencies in min/max Delta Log stats for special characters #3430

sumeet-db · 2024-07-26T18:09:37Z

Which Delta project/connector is this regarding?

Description

When truncating maxValue strings longer than 32 characters for statistics, it's crucial to ensure the final truncated string is lexicographically greater than or equal to the input string in UTF-8 encoded bytes.

Previously, we used the Unicode replacement character (U+FFFD, �) as the tieBreaker and compared it against the next character one byte at a time. This approach was insufficient because the tieBreaker could incorrectly win against a non-first byte of another character (e.g., � < 🌼, but � > the second byte of 🌼). We now compare one whole character (code point) at a time, i.e. up to 2 Scala UTF-16 characters depending on surrogates, to address this issue.
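To make the byte-level mismatch concrete, here is a small illustrative sketch (not code from this patch; the object and helper names are hypothetical) that prints the UTF-8 bytes of U+FFFD and of 🌼 (U+1F33C) and compares them the two ways described above.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical demo object, for illustration only.
object TieBreakerBytes {
  /** UTF-8 bytes of `s`, widened to unsigned ints so they compare as raw bytes. */
  def utf8(s: String): Seq[Int] = s.getBytes(UTF_8).toSeq.map(_ & 0xFF)

  // U+FFFD REPLACEMENT CHARACTER (one UTF-16 code unit).
  val replacement: String = "\uFFFD"
  // U+1F33C BLOSSOM (a surrogate pair, i.e. two UTF-16 code units).
  val blossom: String = new String(Character.toChars(0x1F33C))

  def hex(bytes: Seq[Int]): String = bytes.map(b => f"$b%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    val r = utf8(replacement)
    val b = utf8(blossom)
    println(hex(r)) // EF BF BD
    println(hex(b)) // F0 9F 8C BC
    // Whole characters: the replacement character sorts before the blossom,
    // because the first bytes already differ (0xEF < 0xF0).
    println(r.head < b.head) // true
    // Against a continuation byte the result flips (0xEF > 0x9F), which is
    // the case where the old tie-breaker "won" incorrectly.
    println(r.head > b(1)) // true
  }
}
```

Comparing whole code points avoids this, since UTF-8 byte order and code point order agree only when characters are compared as units.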

We also now use U+10FFFD as the tie-breaker, i.e. the highest Unicode code point that is not a noncharacter (U+10FFFE and U+10FFFF are noncharacters).
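The idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in this PR: `truncateMax` and `compareUtf8` are hypothetical names, the prefix length is counted in code points, the input is assumed to be well-formed UTF-16, and no upper bound on prefix extension is applied.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical sketch, for illustration only.
object MaxTruncationSketch {
  // Tie-breaker named in the description: U+10FFFD (a surrogate pair on the JVM).
  val TieBreakerCodePoint: Int = 0x10FFFD
  val TieBreaker: String = new String(Character.toChars(TieBreakerCodePoint))

  /**
   * Truncates `s` to at least `prefixLen` code points and appends the
   * tie-breaker so that the result is >= `s` in UTF-8 byte order.
   * The prefix is extended while the next code point is not strictly
   * smaller than the tie-breaker; if the end is reached, `s` is returned.
   */
  def truncateMax(s: String, prefixLen: Int): String = {
    val cps = s.codePoints().toArray
    if (cps.length <= prefixLen) return s
    var i = prefixLen
    while (i < cps.length) {
      // Compare one whole code point at a time, never a single byte
      // or a single UTF-16 code unit.
      if (cps(i) < TieBreakerCodePoint) {
        return new String(cps, 0, i) + TieBreaker
      }
      i += 1
    }
    s
  }

  /** Unsigned lexicographic comparison of the UTF-8 encodings of `a` and `b`. */
  def compareUtf8(a: String, b: String): Int = {
    val x = a.getBytes(UTF_8)
    val y = b.getBytes(UTF_8)
    val n = math.min(x.length, y.length)
    var i = 0
    while (i < n) {
      val c = (x(i) & 0xFF) - (y(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    x.length - y.length
  }

  def main(args: Array[String]): Unit = {
    val blossom = new String(Character.toChars(0x1F33C))
    val input = ("a" * 32) + blossom + "tail"
    val truncated = truncateMax(input, 32)
    println(compareUtf8(truncated, input) >= 0) // true
  }
}
```

Comparing code points is enough here because UTF-8 byte order matches code point order for well-formed strings, whereas UTF-16 code unit order does not (surrogates sort below U+E000 to U+FFFF).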

How was this patch tested?

Unit tests.

Does this PR introduce any user-facing changes?

No

cstavr

LGTM!

sumeet-db force-pushed the utf-2 branch from 0384ed8 to aeaecf8 Compare July 29, 2024 18:06

cstavr approved these changes Jul 29, 2024


sumeet-db force-pushed the utf-2 branch from aeaecf8 to 37415e4 Compare July 30, 2024 08:39

Parallel calls

039b488

sumeet-db force-pushed the utf-2 branch from 37415e4 to 039b488 Compare July 30, 2024 22:12

vkorukanti merged commit 890889a into delta-io:master Aug 1, 2024
9 of 10 checks passed

jlowe mentioned this pull request Aug 15, 2024

Fix Delta Lake truncation of min/max string values [databricks] NVIDIA/spark-rapids#11335

Merged
