[refurb] Count codepoints not bytes for `slice-to-remove-prefix-or-suffix (FURB188)` #13631

dylwil3 · 2024-10-04T16:24:43Z

This PR fixes how string length is calculated when deciding whether to suggest `removeprefix`/`removesuffix` (FURB188). Previously we used `text_len`, which counts bytes rather than code points (chars), so the result disagreed with Python's `len` for non-ASCII text.
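To make the mismatch concrete, here is a minimal Rust sketch. It is not the code from this PR: both helper names are made up for illustration, and the byte-based check is modelled with `str::len` rather than Ruff's `text_len`.

```rust
/// Hypothetical helper (not part of Ruff): true when `bound` equals the
/// number of Unicode scalar values in `affix`. For strings without lone
/// surrogates this is the value Python's `len` reports.
fn matches_python_len(affix: &str, bound: usize) -> bool {
    affix.chars().count() == bound
}

/// Hypothetical model of the old behaviour: `str::len` counts UTF-8 bytes,
/// so it only agrees with Python's `len` for ASCII-only text.
fn matches_byte_len(affix: &str, bound: usize) -> bool {
    affix.len() == bound
}
```

For the one-character affix `"\u{e9}"` (é), the code-point check accepts a slice bound of 1, while the byte-based check only accepts 2. That is the kind of disagreement described above.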

Closes #13620

github-actions · 2024-10-04T16:38:29Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

dscorbett · 2024-10-04T17:56:06Z

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, `len("\ud800\udc00") == 2` whereas `len("\U00010000") == 1`. I don’t know how Ruff distinguishes those two example strings or if `chars().count()` matches what Python does.

dylwil3 · 2024-10-04T18:26:53Z

> Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, `len("\ud800\udc00") == 2` whereas `len("\U00010000") == 1`. I don’t know how Ruff distinguishes those two example strings or if `chars().count()` matches what Python does.

TIL @dscorbett - neat! I added a test for this, and it appears to be handled correctly. (I think this happens in the guts of the parser, so by the time I'm looking at `string_val` here, the subtleties have been smoothed out.)

dscorbett · 2024-10-05T02:06:24Z

I think the reason it works is that Ruff’s representation of a Python string as a Rust string replaces surrogates with replacement characters. That is fine for counting the code points but could be a problem for other rules.
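The effect described here can be sketched in Rust. This is an illustration only, not Ruff's parser code: the function `lossy_from_code_points` is hypothetical, and it assumes each surrogate is replaced one-for-one with U+FFFD.

```rust
/// Illustrative only (NOT Ruff's parser): build a Rust `String` from a
/// sequence of Python code points, substituting U+FFFD for any value that
/// is not a valid Unicode scalar value. Surrogates (U+D800..=U+DFFF) are
/// rejected by `char::from_u32`, so each one becomes a replacement character.
fn lossy_from_code_points(code_points: &[u32]) -> String {
    code_points
        .iter()
        .map(|&cp| char::from_u32(cp).unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Under that assumption, `[0xD800, 0xDC00]` becomes two replacement characters and `[0x10000]` stays a single character, so `chars().count()` still agrees with Python's `len`. The original surrogate values are lost, though, which is why the same representation could matter for rules that inspect the string's contents.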

MichaReiser

Nice, thanks. I only have two nit comments.

MichaReiser · 2024-10-07T08:28:57Z

crates/ruff_linter/src/rules/refurb/rules/slice_to_remove_prefix_or_suffix.rs

+            .and_then(ast::Int::as_u32)
+            .and_then(|x| usize::try_from(x).ok())


I suggest converting to a `u64`, considering that you have to use `usize::try_from` anyway (for 32-bit platforms).

Suggested change

-            .and_then(ast::Int::as_u32)
-            .and_then(|x| usize::try_from(x).ok())
+            .and_then(ast::Int::as_u64)
+            .and_then(|x| usize::try_from(x).ok())

Or you could consider adding an `as_usize` method to `ast::Int`.

dylwil3 · 2024-10-07

Went with the latter.
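A rough sketch of what such a method could look like follows. The `Int` type and the `bound_matches` function are hypothetical stand-ins for illustration; the real `ast::Int` in Ruff is defined differently and its actual `as_usize` may not be implemented this way.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for an integer literal; NOT Ruff's `ast::Int`.
enum Int {
    /// A value that fits in 64 bits.
    Small(u64),
    /// A literal too large to represent as a `u64`.
    Big,
}

impl Int {
    fn as_u64(&self) -> Option<u64> {
        match self {
            Int::Small(value) => Some(*value),
            Int::Big => None,
        }
    }

    /// Converts via `u64`, returning `None` when the value does not fit
    /// in `usize` (possible on 32-bit platforms).
    fn as_usize(&self) -> Option<usize> {
        self.as_u64().and_then(|value| usize::try_from(value).ok())
    }
}

/// Hypothetical check in the spirit of the rule: does the slice bound
/// equal the number of code points in the affix?
fn bound_matches(bound: Option<&Int>, affix: &str) -> bool {
    bound
        .and_then(Int::as_usize)
        .is_some_and(|n| n == affix.chars().count())
}
```

Putting the fallible conversion inside the method keeps the call site to a single `and_then`, instead of repeating `usize::try_from(x).ok()` wherever a bound is compared against a length.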

MichaReiser · 2024-10-07T08:30:29Z

crates/ruff_linter/src/rules/refurb/rules/slice_to_remove_prefix_or_suffix.rs

+            // Only support prefix removal for size at most `u32::MAX`
+            .and_then(ast::Int::as_u32)
+            .and_then(|x| usize::try_from(x).ok())
+            .is_some_and(|x| x == string_val.to_str().chars().count()),


Suggested change

-            .is_some_and(|x| x == string_val.to_str().chars().count()),
+            .is_some_and(|x| x == string_val.chars().count()),

MichaReiser · 2024-10-07T08:30:57Z

crates/ruff_linter/src/rules/refurb/rules/slice_to_remove_prefix_or_suffix.rs

@@ -370,7 +372,8 @@ fn affix_matches_slice_bound(data: &RemoveAffixData, semantic: &SemanticModel) -
                value
                    .as_int()
                    .and_then(ast::Int::as_u32)
-                    .is_some_and(|x| x == string_val.to_str().text_len().to_u32())
+                    .and_then(|x| usize::try_from(x).ok())
+                    .is_some_and(|x| x == string_val.to_str().chars().count())


Suggested change

-                    .is_some_and(|x| x == string_val.to_str().chars().count())
+                    .is_some_and(|x| x == string_val.chars().count())

dylwil3 added 3 commits October 4, 2024 10:52

update test fixture

d98ed4e

count chars not bytes

8f41c68

update snapshot

98dd23e

zanieb requested review from MichaReiser and dhruvmanila October 4, 2024 17:27

dylwil3 added 3 commits October 4, 2024 13:20

update test fixture

53969f9

fix indentation

ff622e2

update snapshot

0500602

MichaReiser approved these changes Oct 7, 2024

MichaReiser added the fixes, preview, and rule labels and removed the preview and fixes labels Oct 7, 2024

dylwil3 added 3 commits October 7, 2024 08:30

add as_usize method to Int

70fc5ed

use as_usize

f90da17

unnecessary to_str

e19c17a

MichaReiser approved these changes Oct 7, 2024

MichaReiser merged commit 14ee5db into astral-sh:main Oct 7, 2024
20 checks passed

dylwil3 deleted the affix-unicode branch October 7, 2024 15:36

BrewTestBot mentioned this pull request Oct 17, 2024

ruff 0.7.0 Homebrew/homebrew-core#194818

Merged
