Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[refurb] Count codepoints not bytes for slice-to-remove-prefix-or-suffix (FURB188) #13631

Merged
merged 9 commits into from
Oct 7, 2024

Conversation

dylwil3
Copy link
Collaborator

@dylwil3 dylwil3 commented Oct 4, 2024

This PR fixes the calculation of string length for the purposes of verifying when to suggest removeprefix/removesuffix (FURB188). Before, we used text_len which was counting bytes rather than codepoints (chars) and therefore disagreed with Python's len for non-ASCII text.

Closes #13620

Copy link
Contributor

github-actions bot commented Oct 4, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dscorbett
Copy link

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

@dylwil3
Copy link
Collaborator Author

dylwil3 commented Oct 4, 2024

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

TIL @dscorbett - neat! Added a test for this, and it appears to be handled correctly (I think this happens in the guts of the parser, so by the time I'm looking at string_val here, the subtleties have been smoothed out).

@dscorbett
Copy link

I think the reason it works is that Ruff’s representation of a Python string as a Rust string replaces surrogates with replacement characters. That is fine for counting the code points but could be a problem for other rules.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. I only have two nit comments.

Comment on lines 338 to 339
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest converting to a u64 considering that you have to use usize::try_from anyways (for 32 bit platforms)

Suggested change
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.and_then(ast::Int::as_u64)
.and_then(|x| usize::try_from(x).ok())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could consider adding a as_usize method to ast::Int

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with the latter

// Only support prefix removal for size at most `u32::MAX`
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count()),
.is_some_and(|x| x == string_val.chars().count()),

@@ -370,7 +372,8 @@ fn affix_matches_slice_bound(data: &RemoveAffixData, semantic: &SemanticModel) -
value
.as_int()
.and_then(ast::Int::as_u32)
.is_some_and(|x| x == string_val.to_str().text_len().to_u32())
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count())
.is_some_and(|x| x == string_val.chars().count())

@MichaReiser MichaReiser added fixes Related to suggested fixes for violations preview Related to preview mode features rule Implementing or modifying a lint rule and removed preview Related to preview mode features fixes Related to suggested fixes for violations labels Oct 7, 2024
@MichaReiser MichaReiser merged commit 14ee5db into astral-sh:main Oct 7, 2024
20 checks passed
@dylwil3 dylwil3 deleted the affix-unicode branch October 7, 2024 15:36
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
preview Related to preview mode features rule Implementing or modifying a lint rule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FURB188 calculates lengths wrong for non-ASCII affixes
3 participants