Make ASCII case conversions more than 4× faster #59283
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
This looks good to me, modulo the tidy warnings.
I like that the explanation is so much longer than the code :)
Looks good to me. r=me as soon as CI passes.
Might this be slower on platforms without SIMD, which can't take advantage of auto-vectorization, or does that not matter?
It's probably still faster than the status quo on those platforms because it does the computation without branches. If one cared deeply about those platforms, then the pseudo-SIMD approach could be resurrected. However, I think this is a pretty good compromise.
I guess it depends on whether LLVM can auto-vectorize based on "classic" …

I also just realized that, when doing one byte at a time, instead of the convoluted add-then-mask to emulate a comparison, we can use an actual comparison to obtain a boolean: `byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8)`. This even turns out to be slightly faster! I'll update the PR.
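For context, a minimal self-contained sketch of the byte-at-a-time, comparison-based conversion described above (function names are illustrative, not the exact libcore code):

```rust
/// Branchless ASCII uppercasing of one byte: the range comparison yields a
/// bool, which widens to 0 or 1 and is scaled to the 0x20 case bit.
fn to_ascii_uppercase(byte: u8) -> u8 {
    byte & !(0x20 * ((b'a' <= byte && byte <= b'z') as u8))
}

/// Same idea for lowercasing: set bit 5 when the byte is an uppercase letter.
fn to_ascii_lowercase(byte: u8) -> u8 {
    byte | (0x20 * ((b'A' <= byte && byte <= b'Z') as u8))
}

fn main() {
    assert_eq!(to_ascii_uppercase(b'q'), b'Q');
    assert_eq!(to_ascii_uppercase(b'7'), b'7');
    assert_eq!(to_ascii_lowercase(b'Q'), b'q');
}
```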
If instead of …

Benchmark results in GIF for "visual diff" (image attached in the original comment).

Benchmark results in text

Before:

test ascii::long::is_ascii ... bench: 187 ns/iter (+/- 0) = 37379 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 94 ns/iter (+/- 0) = 74361 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 1) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_control ... bench: 23 ns/iter (+/- 1) = 1391 MB/s
test ascii::medium::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_graphic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 22 ns/iter (+/- 2) = 1454 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_control ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_graphic ... bench: 25 ns/iter (+/- 0) = 280 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_lowercase ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 24 ns/iter (+/- 1) = 291 MB/s
test ascii::short::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 318 MB/s

After:

test ascii::long::is_ascii ... bench: 186 ns/iter (+/- 0) = 37580 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 96 ns/iter (+/- 0) = 72812 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 119 ns/iter (+/- 0) = 58739 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 134 ns/iter (+/- 0) = 52164 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_graphic ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 1523 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_graphic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 333 MB/s
test ascii::short::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 350 MB/s
Benchmark results from the original PR description, in case they end up being relevant:

6830 bytes string:
alloc_only ... bench: 109 ns/iter (+/- 0) = 62660 MB/s
black_box_read_each_byte ... bench: 1,708 ns/iter (+/- 5) = 3998 MB/s
lookup ... bench: 1,725 ns/iter (+/- 2) = 3959 MB/s
branch_and_subtract ... bench: 413 ns/iter (+/- 1) = 16537 MB/s
branch_and_mask ... bench: 411 ns/iter (+/- 2) = 16618 MB/s
branchless ... bench: 377 ns/iter (+/- 2) = 18116 MB/s
libcore ... bench: 378 ns/iter (+/- 2) = 18068 MB/s
fake_simd_u32 ... bench: 373 ns/iter (+/- 1) = 18310 MB/s
fake_simd_u64 ... bench: 374 ns/iter (+/- 0) = 18262 MB/s
32 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 2461 MB/s
black_box_read_each_byte ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
lookup ... bench: 25 ns/iter (+/- 0) = 1280 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branchless ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
libcore ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
fake_simd_u32 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
7 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 538 MB/s
black_box_read_each_byte ... bench: 22 ns/iter (+/- 0) = 318 MB/s
lookup ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branchless ... bench: 21 ns/iter (+/- 0) = 333 MB/s
libcore ... bench: 21 ns/iter (+/- 0) = 333 MB/s
fake_simd_u32 ... bench: 20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u64 ... bench: 23 ns/iter (+/- 0) = 304 MB/s
@@ -3794,7 +3794,8 @@ impl u8 {
     #[stable(feature = "ascii_methods_on_intrinsics", since = "1.23.0")]
     #[inline]
     pub fn to_ascii_uppercase(&self) -> u8 {
-        ASCII_UPPERCASE_MAP[*self as usize]
+        // Unset the fifth bit if this is a lowercase letter
+        *self & !((self.is_ascii_lowercase() as u8) << 5)
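For reference, the bit patterns behind the shifted boolean (a quick illustrative check, not part of the diff):

```rust
fn main() {
    // ASCII upper- and lowercase letters differ only in bit 5 (value 0x20):
    assert_eq!(b'a', 0b0110_0001); // 0x61
    assert_eq!(b'A', 0b0100_0001); // 0x41
    // `is_ascii_lowercase()` is 1 or 0 as u8; shifting it left by 5 yields
    // 0x20 for lowercase letters and 0 for everything else.
    assert_eq!((true as u8) << 5, 0x20);
    assert_eq!(b'a' & !0x20, b'A');
}
```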
Suggested change:
-        *self & !((self.is_ascii_lowercase() as u8) << 5)
+        *self - ((self.is_ascii_lowercase() as u8) << 5)
Using subtract is slightly faster for me:
test long::case12_mask_shifted_bool_match_range ... bench: 776 ns/iter (+/- 26) = 9007 MB/s
test long::case13_sub_shifted_bool_match_range ... bench: 734 ns/iter (+/- 49) = 9523 MB/s
This is also an improvement for me, but smaller:
test ascii::long::case12_mask_shifted_bool_match_range ... bench: 352 ns/iter (+/- 0) = 19857 MB/s
test ascii::long::case13_subtract_shifted_bool_match_range ... bench: 350 ns/iter (+/- 1) = 19971 MB/s
test ascii::medium::case12_mask_shifted_bool_match_range ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::medium::case13_subtract_shifted_bool_match_range ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::short::case12_mask_shifted_bool_match_range ... bench: 19 ns/iter (+/- 0) = 368 MB/s
test ascii::short::case13_subtract_shifted_bool_match_range ... bench: 18 ns/iter (+/- 0) = 388 MB/s
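As a side note, the masked and subtracted forms agree for every byte value, because the 0x20 bit is always set when `is_ascii_lowercase()` is true, and clearing a bit that is known to be set is the same as subtracting it. A quick exhaustive check (illustrative, not from the thread):

```rust
fn main() {
    for b in 0u8..=255 {
        let shifted = (b.is_ascii_lowercase() as u8) << 5; // 0x20 for a-z, else 0
        let masked = b & !shifted;
        let subtracted = b - shifted; // bit 5 is set whenever `shifted` is 0x20
        assert_eq!(masked, subtracted);
    }
    println!("mask and subtract agree for all 256 byte values");
}
```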
A quick benchmark using an i586 target shows that this can be slower than the lookup table for a target without SIMD.
What commit were these i586 results on? Because the …
It was just a recent nightly, so that's why …
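For contrast, a rough sketch of the table-lookup strategy the pre-PR code used (`ASCII_UPPERCASE_MAP[*self as usize]` in the diff above); the table construction below is illustrative, the real table in libcore was a precomputed static array:

```rust
/// Illustrative reconstruction of the lookup-table approach: one 256-entry
/// table indexed by the byte value, one memory load per byte on the hot path.
fn build_ascii_uppercase_map() -> [u8; 256] {
    let mut map = [0u8; 256];
    for i in 0..256 {
        let b = i as u8;
        map[i] = if b.is_ascii_lowercase() { b - 0x20 } else { b };
    }
    map
}

fn main() {
    let map = build_ascii_uppercase_map();
    assert_eq!(map[b'x' as usize], b'X');
    assert_eq!(map[b'!' as usize], b'!');
}
```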
@joshtriplett I pushed several changes since your review, could you have another look?
@bors r+ |
📌 Commit 7fad370 has been approved by `joshtriplett`
…=joshtriplett

Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to that before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~ **Update:** the algorithm was simplified while keeping the performance. See the `branchless` vs. `mask_shifted_bool_match_range` benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this however because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K: (updated from rust-lang#59283 (comment))

```
6830 bytes string:
alloc_only                           ... bench:   112 ns/iter (+/- 0)  = 62410 MB/s
black_box_read_each_byte             ... bench: 1,733 ns/iter (+/- 8)  = 4033 MB/s
lookup_table                         ... bench: 1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                  ... bench:   417 ns/iter (+/- 1)  = 16762 MB/s
branch_and_mask                      ... bench:   401 ns/iter (+/- 1)  = 17431 MB/s
branchless                           ... bench:   365 ns/iter (+/- 0)  = 19150 MB/s
libcore                              ... bench:   367 ns/iter (+/- 1)  = 19046 MB/s
fake_simd_u32                        ... bench:   361 ns/iter (+/- 2)  = 19362 MB/s
fake_simd_u64                        ... bench:   361 ns/iter (+/- 1)  = 19362 MB/s
mask_mult_bool_branchy_lookup_table  ... bench: 6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table          ... bench: 4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range           ... bench:   339 ns/iter (+/- 0)  = 20619 MB/s
mask_shifted_bool_match_range        ... bench:   339 ns/iter (+/- 1)  = 20619 MB/s

32 bytes string:
alloc_only                           ... bench:    15 ns/iter (+/- 0)  = 2133 MB/s
black_box_read_each_byte             ... bench:    29 ns/iter (+/- 0)  = 1103 MB/s
lookup_table                         ... bench:    24 ns/iter (+/- 4)  = 1333 MB/s
branch_and_subtract                  ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s
branch_and_mask                      ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s
branchless                           ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s
libcore                              ... bench:    15 ns/iter (+/- 0)  = 2133 MB/s
fake_simd_u32                        ... bench:    17 ns/iter (+/- 0)  = 1882 MB/s
fake_simd_u64                        ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s
mask_mult_bool_branchy_lookup_table  ... bench:    42 ns/iter (+/- 0)  = 761 MB/s
mask_mult_bool_lookup_table          ... bench:    35 ns/iter (+/- 0)  = 914 MB/s
mask_mult_bool_match_range           ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s
mask_shifted_bool_match_range        ... bench:    16 ns/iter (+/- 0)  = 2000 MB/s

7 bytes string:
alloc_only                           ... bench:    14 ns/iter (+/- 0)  = 500 MB/s
black_box_read_each_byte             ... bench:    22 ns/iter (+/- 0)  = 318 MB/s
lookup_table                         ... bench:    16 ns/iter (+/- 0)  = 437 MB/s
branch_and_subtract                  ... bench:    16 ns/iter (+/- 0)  = 437 MB/s
branch_and_mask                      ... bench:    16 ns/iter (+/- 0)  = 437 MB/s
branchless                           ... bench:    19 ns/iter (+/- 0)  = 368 MB/s
libcore                              ... bench:    20 ns/iter (+/- 0)  = 350 MB/s
fake_simd_u32                        ... bench:    18 ns/iter (+/- 0)  = 388 MB/s
fake_simd_u64                        ... bench:    21 ns/iter (+/- 0)  = 333 MB/s
mask_mult_bool_branchy_lookup_table  ... bench:    20 ns/iter (+/- 0)  = 350 MB/s
mask_mult_bool_lookup_table          ... bench:    19 ns/iter (+/- 0)  = 368 MB/s
mask_mult_bool_match_range           ... bench:    19 ns/iter (+/- 0)  = 368 MB/s
mask_shifted_bool_match_range        ... bench:    19 ns/iter (+/- 0)  = 368 MB/s
```
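The “fake SIMD” idea credited to @raphlinus in the description can be sketched as follows. This is one possible SWAR formulation under the stated assumption that all four bytes are ASCII; the constants and structure are illustrative, not the exact code from the benchmarks:

```rust
/// Uppercase four ASCII bytes packed in a `u32` at once.
/// Assumes every byte is < 0x80; otherwise the additions below can carry
/// into the neighbouring byte, which is exactly the bug noted in the PR text.
fn to_ascii_uppercase_4(chunk: u32) -> u32 {
    // Per byte: bit 7 of `ge_a` is set iff the byte is >= b'a' (0x61) ...
    let ge_a = chunk.wrapping_add(0x1f1f_1f1f); // 0x80 - 0x61 = 0x1f
    // ... and bit 7 of `gt_z` is set iff the byte is > b'z' (0x7a).
    let gt_z = chunk.wrapping_add(0x0505_0505); // 0x80 - 0x7b = 0x05
    // Lowercase letters pass the first test and fail the second.
    let is_lower = ge_a & !gt_z & 0x8080_8080;
    // Move that flag from bit 7 down to bit 5 (0x20) and subtract it.
    chunk - (is_lower >> 2)
}

fn main() {
    let input = u32::from_le_bytes(*b"ab!Z");
    assert_eq!(to_ascii_uppercase_4(input).to_le_bytes(), *b"AB!Z");
}
```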
Rollup of 18 pull requests

Successful merges:

- #57293 (Make some lints incremental)
- #57565 (syntax: Remove warning for unnecessary path disambiguators)
- #58253 (librustc_driver => 2018)
- #58837 (librustc_interface => 2018)
- #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
- #59283 (Make ASCII case conversions more than 4× faster)
- #59284 (adjust MaybeUninit API to discussions)
- #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
- #59390 (Make `ptr::eq` documentation mention fat-pointer behavior)
- #59393 (Refactor tuple comparison tests)
- #59420 ([CI] record docker image info for reuse)
- #59421 (Reject integer suffix when tuple indexing)
- #59430 (Renames `EvalContext` to `InterpretCx`)
- #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
- #59449 (fix: Make incremental artifact deletion more robust)
- #59451 (Add `Default` to `std::alloc::System`)
- #59459 (Add some tests)
- #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost
Version 1.35.0 (2019-05-23)
==========================

Language
--------
- [`FnOnce`, `FnMut`, and the `Fn` traits are now implemented for `Box<FnOnce>`, `Box<FnMut>`, and `Box<Fn>` respectively.][59500]
- [You can now coerce closures into unsafe function pointers.][59580] e.g.
  ```rust
  unsafe fn call_unsafe(func: unsafe fn()) {
      func()
  }

  pub fn main() {
      unsafe { call_unsafe(|| {}); }
  }
  ```

Compiler
--------
- [Added the `armv6-unknown-freebsd-gnueabihf` and `armv7-unknown-freebsd-gnueabihf` targets.][58080]
- [Added the `wasm32-unknown-wasi` target.][59464]

Libraries
---------
- [`Thread` will now show its ID in `Debug` output.][59460]
- [`StdinLock`, `StdoutLock`, and `StderrLock` now implement `AsRawFd`.][59512]
- [`alloc::System` now implements `Default`.][59451]
- [Expanded `Debug` output (`{:#?}`) for structs now has a trailing comma on the last field.][59076]
- [`char::{ToLowercase, ToUppercase}` now implement `ExactSizeIterator`.][58778]
- [All `NonZero` numeric types now implement `FromStr`.][58717]
- [Removed the `Read` trait bounds on the `BufReader::{get_ref, get_mut, into_inner}` methods.][58423]
- [You can now call the `dbg!` macro without any parameters to print the file and line where it is called.][57847]
- [In place ASCII case conversions are now up to 4× faster.][59283] e.g. `str::make_ascii_lowercase`
- [`hash_map::{OccupiedEntry, VacantEntry}` now implement `Sync` and `Send`.][58369]

Stabilized APIs
---------------
- [`f32::copysign`]
- [`f64::copysign`]
- [`RefCell::replace_with`]
- [`RefCell::map_split`]
- [`ptr::hash`]
- [`Range::contains`]
- [`RangeFrom::contains`]
- [`RangeTo::contains`]
- [`RangeInclusive::contains`]
- [`RangeToInclusive::contains`]
- [`Option::copied`]

Cargo
-----
- [You can now set `cargo:rustc-cdylib-link-arg` at build time to pass custom linker arguments when building a `cdylib`.][cargo/6298] Its usage is highly platform specific.

Misc
----
- [The Rust toolchain is now available natively for musl based distros.][58575]

[59460]: rust-lang/rust#59460
[59464]: rust-lang/rust#59464
[59500]: rust-lang/rust#59500
[59512]: rust-lang/rust#59512
[59580]: rust-lang/rust#59580
[59283]: rust-lang/rust#59283
[59451]: rust-lang/rust#59451
[59076]: rust-lang/rust#59076
[58778]: rust-lang/rust#58778
[58717]: rust-lang/rust#58717
[58369]: rust-lang/rust#58369
[58423]: rust-lang/rust#58423
[58080]: rust-lang/rust#58080
[57847]: rust-lang/rust#57847
[58575]: rust-lang/rust#58575
[cargo/6298]: rust-lang/cargo#6298
[`f32::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f32.html#method.copysign
[`f64::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f64.html#method.copysign
[`RefCell::replace_with`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.replace_with
[`RefCell::map_split`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.map_split
[`ptr::hash`]: https://doc.rust-lang.org/stable/std/ptr/fn.hash.html
[`Range::contains`]: https://doc.rust-lang.org/std/ops/struct.Range.html#method.contains
[`RangeFrom::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeFrom.html#method.contains
[`RangeTo::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeTo.html#method.contains
[`RangeInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeInclusive.html#method.contains
[`RangeToInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeToInclusive.html#method.contains
[`Option::copied`]: https://doc.rust-lang.org/std/option/enum.Option.html#method.copied