speed up `String::push` and `String::insert` #124810

lincot · 2024-05-06T16:39:28Z

Addresses the concerns described in #116235.

The performance gain comes mainly from avoiding temporary buffers.
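For context, the temporary-buffer shape that this PR moves away from can be sketched with public APIs only. The helper name `push_via_temp_buffer` is hypothetical, and it works on a `Vec<u8>` rather than on `String` internals; it is an illustration of the pattern, not a copy of the standard-library source.

```rust
/// Hypothetical stand-in for the temporary-buffer approach:
/// encode into a 4-byte stack array, then copy the used prefix.
fn push_via_temp_buffer(buf: &mut Vec<u8>, ch: char) {
    // Zero-initialized scratch space on the stack.
    let mut tmp = [0u8; 4];
    // `encode_utf8` fills 1 to 4 bytes and returns them as `&mut str`.
    let encoded = ch.encode_utf8(&mut tmp);
    // Variable-length copy from the scratch space into the vector.
    buf.extend_from_slice(encoded.as_bytes());
}
```

The bytes are written twice here (once into `tmp`, once into the vector), which is the cost the PR description refers to.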

Complex pattern matching in `encode_utf8` (introduced in #67569) has been simplified to a comparison and an exhaustive `match` in the `encode_utf8_raw_unchecked` helper function. It takes a slice of `MaybeUninit<u8>` because otherwise we'd have to construct a normal slice to uninitialized data, which is not desirable, I guess.
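A minimal sketch of what "a comparison and an exhaustive match" over a `MaybeUninit<u8>` slice could look like. The names `len_utf8_sketch` and `encode_utf8_sketch` are hypothetical, and unlike the unchecked helper in the PR this version keeps bounds checks so that it stays entirely safe code.

```rust
use std::mem::MaybeUninit;

/// Number of UTF-8 bytes needed for `code`, by plain comparisons.
fn len_utf8_sketch(code: u32) -> usize {
    if code < 0x80 {
        1
    } else if code < 0x800 {
        2
    } else if code < 0x1_0000 {
        3
    } else {
        4
    }
}

/// Hypothetical encoder: writes the UTF-8 bytes of `code` into `dst`
/// and returns how many bytes were written. Panics (via slice indexing)
/// if `dst` is too short; the PR's helper is unchecked instead.
fn encode_utf8_sketch(code: u32, dst: &mut [MaybeUninit<u8>]) -> usize {
    let len = len_utf8_sketch(code);
    match len {
        1 => {
            dst[0].write(code as u8);
        }
        2 => {
            dst[0].write(0xC0 | (code >> 6) as u8);
            dst[1].write(0x80 | (code & 0x3F) as u8);
        }
        3 => {
            dst[0].write(0xE0 | (code >> 12) as u8);
            dst[1].write(0x80 | ((code >> 6) & 0x3F) as u8);
            dst[2].write(0x80 | (code & 0x3F) as u8);
        }
        _ => {
            dst[0].write(0xF0 | (code >> 18) as u8);
            dst[1].write(0x80 | ((code >> 12) & 0x3F) as u8);
            dst[2].write(0x80 | ((code >> 6) & 0x3F) as u8);
            dst[3].write(0x80 | (code & 0x3F) as u8);
        }
    }
    len
}
```

Because `MaybeUninit::write` never reads the old value, the destination does not need to be zeroed or otherwise initialized first.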

Several functions still have that unneeded zeroing, but a single instruction is not that important, I guess.

@rustbot label T-libs C-optimization A-str

rustbot · 2024-05-06T16:39:35Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @scottmcm (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, to keep review lag to a minimum, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review or S-waiting-on-author) stays updated, invoking these commands when appropriate:

@rustbot author: the review is finished, PR author should check the comments and take action accordingly
@rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

scottmcm

I had a variety of thoughts; let me know what you think.

Also, is there anything here for which it would make sense to have a codegen test to confirm what's happening? Or some other test to help confirm it's better?

lincot · 2024-05-13T20:52:24Z

A codegen check for the absence of memcpy would be nice, since the original String::push has one.

rustbot · 2024-07-10T18:13:27Z

There are merge commits (commits with multiple parents) in your changes. We have a no merge policy so these commits will need to be removed for this pull request to be merged.

You can start a rebase with the following commands:

$ # rebase
$ git pull --rebase https://github.com/rust-lang/rust.git master
$ git push --force-with-lease

The following commits are merge commits:

9511918

lincot · 2024-10-06T17:06:50Z

The proposed implementation uses get_unchecked_mut, which cannot be used in char::encode_utf8, which is now const. Also, get_unchecked_mut actually has a cost when running with debug assertions. So I am reverting to using pointers. The codegen of String::push seems to be unchanged: godbolt.
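A rough sketch of the pointer-based direction described here, applied to a `Vec<u8>`. The names `encode_utf8_raw_sketch` and `push_raw_sketch` are hypothetical, and the sketch is deliberately not marked `const` so that it does not depend on a particular toolchain version; it is not the code from the PR.

```rust
/// Hypothetical pointer-based encoder; not the standard-library source.
///
/// # Safety
/// `dst` must be valid for writes of `ch.len_utf8()` bytes.
unsafe fn encode_utf8_raw_sketch(ch: char, dst: *mut u8) -> usize {
    let code = ch as u32;
    let len = ch.len_utf8();
    // Six bits of `code` starting at `shift`, tagged as a continuation byte.
    let cont = |shift: u32| 0x80 | ((code >> shift) & 0x3F) as u8;
    // SAFETY: the caller guarantees `len` writable bytes at `dst`.
    unsafe {
        match len {
            1 => dst.write(code as u8),
            2 => {
                dst.write(0xC0 | (code >> 6) as u8);
                dst.add(1).write(cont(0));
            }
            3 => {
                dst.write(0xE0 | (code >> 12) as u8);
                dst.add(1).write(cont(6));
                dst.add(2).write(cont(0));
            }
            _ => {
                dst.write(0xF0 | (code >> 18) as u8);
                dst.add(1).write(cont(12));
                dst.add(2).write(cont(6));
                dst.add(3).write(cont(0));
            }
        }
    }
    len
}

/// Appends `ch` by writing straight into the vector's spare capacity.
fn push_raw_sketch(buf: &mut Vec<u8>, ch: char) {
    buf.reserve(ch.len_utf8());
    let old_len = buf.len();
    // SAFETY: `reserve` guarantees room for `len_utf8()` bytes past
    // `old_len`, and those bytes are initialized before `set_len`.
    unsafe {
        let written = encode_utf8_raw_sketch(ch, buf.as_mut_ptr().add(old_len));
        buf.set_len(old_len + written);
    }
}
```

No slice over uninitialized memory is ever formed: the encoder only sees a raw pointer, so there are no per-index checks to pay for under debug assertions.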

tgross35

Sorry this has been sitting for so long. I have one last question, then I think we can merge this. Mind posting the results of library/alloctests/benches/string.rs if you have run those?

tgross35 · 2025-04-04T03:39:44Z

library/core/src/char/methods.rs

+#[doc(hidden)]
+#[inline]
+#[cfg_attr(bootstrap, rustc_allow_const_fn_unstable(const_mut_refs))]
+pub const unsafe fn encode_utf8_raw_unchecked(code: u32, dst: *mut u8) {


It's been long enough that I'm forgetting context here, but why was this changed away from MaybeUninit? Specifically thinking of a signature like

pub const unsafe fn encode_utf8_raw_unchecked(
    code: u32,
    dst: &mut [MaybeUninit<u8>],
) -> &mut [u8] {
    // Write the characters then call MaybeUninit::assume_init_ref
}

Then lengths get checked and push becomes slightly simpler with core::char::encode_utf8_raw_unchecked(ch as u32, self.vec.spare_capacity_mut()) (maybe needs an assert_unchecked(self.buf.capacity() - self.len > len) if LLVM doesn't pick up on that).
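One way the slice-returning shape suggested here might be fleshed out, as a sketch only. `encode_into_spare` and `push_with_spare_capacity` are hypothetical names, the functions are plain safe `fn`s rather than `const unsafe fn`, and the bounds checks are kept (so "lengths get checked" holds by construction). `Vec::spare_capacity_mut` is the real standard-library method mentioned in the comment.

```rust
use std::mem::MaybeUninit;

/// Hypothetical encoder with the slice-returning shape; not std's code.
/// Panics (bounds check) if `dst` is shorter than `ch.len_utf8()`.
fn encode_into_spare(ch: char, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    let len = ch.len_utf8();
    let mut rest = ch as u32;
    // Fill continuation bytes from the last one backwards, 6 bits each.
    for i in (1..len).rev() {
        dst[i].write(0x80 | (rest & 0x3F) as u8);
        rest >>= 6;
    }
    // What remains fits in the leading byte; only the tag bits differ.
    let tag: u8 = match len {
        1 => 0x00,
        2 => 0xC0,
        3 => 0xE0,
        _ => 0xF0,
    };
    dst[0].write(tag | rest as u8);
    let init = &mut dst[..len];
    // SAFETY: all `len` bytes of `init` were written above, and
    // `MaybeUninit<u8>` has the same layout as `u8`.
    unsafe { &mut *(init as *mut [MaybeUninit<u8>] as *mut [u8]) }
}

/// Appends `ch` using the vector's spare capacity as the destination.
fn push_with_spare_capacity(buf: &mut Vec<u8>, ch: char) {
    buf.reserve(ch.len_utf8());
    let written = encode_into_spare(ch, buf.spare_capacity_mut()).len();
    let new_len = buf.len() + written;
    // SAFETY: `written` bytes past the old length were just initialized.
    unsafe { buf.set_len(new_len) };
}
```

Whether the optimizer removes the remaining bounds checks in a shape like this is exactly the open question the comment raises about `assert_unchecked`; this sketch makes no claim either way.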

lincot · 2025-04-05T22:58:47Z

The benchmarks for String::push in library/alloctests/benches/string.rs seem to lack black_box.

Original benchmark results

before the patch:

string::bench_push_char_one_byte       10464.93ns/iter +/- 104.38
string::bench_push_char_two_bytes       8793.83ns/iter  +/- 33.15
string::bench_insert_char_long            40.98ns/iter   +/- 0.82
string::bench_insert_char_short           40.19ns/iter   +/- 0.35

after the patch:

string::bench_push_char_one_byte       10247.51ns/iter +/- 18.39
string::bench_push_char_two_bytes       9575.89ns/iter +/- 40.07
string::bench_insert_char_long            34.36ns/iter  +/- 2.17
string::bench_insert_char_short           34.17ns/iter  +/- 0.57

If we make the design of the String::push benchmarks similar to the String::insert benchmarks, the results begin to convey the difference, especially when pushing a two-byte character.

The improved benchmark code

#[bench]
fn bench_push_char_two_bytes(b: &mut Bencher) {
    b.bytes = REPETITIONS * 2;
    b.iter(|| {
        let mut r = String::new();
        for _ in 0..REPETITIONS {
            black_box(&mut r).push(black_box('â'));
        }
        r
    });
}

before the patch:

string::bench_push_char_one_byte       12670.38ns/iter +/- 151.43
string::bench_push_char_two_bytes      49854.63ns/iter +/- 367.59
string::bench_insert_char_long            40.36ns/iter   +/- 0.57
string::bench_insert_char_short           39.40ns/iter   +/- 0.63

after the patch:

string::bench_push_char_one_byte       12872.57ns/iter +/- 122.83
string::bench_push_char_two_bytes      14988.39ns/iter +/- 120.85
string::bench_insert_char_long            35.46ns/iter   +/- 0.50
string::bench_insert_char_short           35.04ns/iter   +/- 0.48

I can make an additional commit that improves the benchmark.
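The `#[bench]` harness used above requires a nightly toolchain. As a rough stable-Rust approximation of the same measurement, assuming `REPETITIONS` is 10,000 and using a hypothetical `time_it` helper, something like the following could be used. It relies only on `std::hint::black_box` and `std::time::Instant`, and gives much coarser numbers than a real benchmark harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Assumed value for this sketch.
const REPETITIONS: u64 = 10_000;

/// Same loop body as the improved benchmark: `black_box` hides both the
/// string and the character from the optimizer on every iteration.
fn push_two_byte_chars() -> String {
    let mut r = String::new();
    for _ in 0..REPETITIONS {
        black_box(&mut r).push(black_box('\u{e2}'));
    }
    r
}

/// Hypothetical helper: average wall-clock time per call over `runs`
/// calls. `runs` must be non-zero.
fn time_it(runs: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..runs {
        // Keep the result alive so the work cannot be discarded.
        black_box(push_two_byte_chars());
    }
    start.elapsed() / runs
}
```

Calling `time_it(1_000)` in a release build before and after a change gives a comparable pair of numbers, though without the variance estimate that the `+/-` column provides.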

lincot · 2025-04-06T19:17:52Z

In the message above you can observe a regression in string::bench_push_char_one_byte. I decided to run the benchmarks 10 more times per implementation:

without the patch: 12745.60, 12654.10, 12654.79, 12652.53, 12669.58, 12751.50, 12656.31, 12654.54, 12652.73, 12650.33 (average 12674.2 ns)
with the patch: 12655.30, 12657.40, 12829.48, 12651.96, 12653.84, 12653.74, 12655.01, 12660.19, 12664.05, 12664.41 (average 12674.5 ns)

Both give essentially the same average, so we can say that the performance doesn't change for a single-byte push.

I also tried adding a likely hint for the single-byte length in the len_utf8 function; the average time did not change.

For a two-byte push, though, the average time goes from 49850ns to 14785ns, giving a 3.37x speedup.
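The averages and the speedup quoted above can be re-derived from the listed timings. The snippet below is only a check of that arithmetic; the constant and function names are made up for the illustration.

```rust
// Per-run timings in ns/iter, copied from the two lists above.
const WITHOUT_PATCH: [f64; 10] = [
    12745.60, 12654.10, 12654.79, 12652.53, 12669.58,
    12751.50, 12656.31, 12654.54, 12652.73, 12650.33,
];
const WITH_PATCH: [f64; 10] = [
    12655.30, 12657.40, 12829.48, 12651.96, 12653.84,
    12653.74, 12655.01, 12660.19, 12664.05, 12664.41,
];

/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// How many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

`mean(&WITHOUT_PATCH)` and `mean(&WITH_PATCH)` come out at about 12674.2 and 12674.5, and `speedup(49850.0, 14785.0)` at about 3.37, matching the figures in the comment.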

tgross35 · 2025-04-08T03:45:15Z

The bench_push_char_one_byte change is within noise tolerance (results overlap when taking the +/- into account), but the others show pretty clear wins. Mind addressing #124810 (comment)? I might be missing something but if that change is possible while keeping the same performance wins, it seems slightly more accurate.

lincot · 2025-04-08T10:35:53Z

@tgross35 It seems that replies do not appear in the conversation section. In a reply I suggested a possible solution using MaybeUninit.

tgross35

Sorry about that, the GH UI got me. I missed the const reasoning but that makes sense, so the change LGTM.

Mind squashing the first two commits since the codegen change happens with the implementation change? r=me with that

Improve performance of `String` methods by avoiding unnecessary memcpy for the character bytes, with added codegen check to ensure compliance.

lincot · 2025-04-09T11:56:21Z

@bors r=tgross35

bors · 2025-04-09T11:56:24Z

@lincot: 🔑 Insufficient privileges: Not in reviewers

tgross35 · 2025-04-09T17:08:15Z

@bors r+

bors · 2025-04-09T17:08:18Z

📌 Commit ff248de has been approved by tgross35

It is now in the queue for this repository.

bors · 2025-04-09T17:58:29Z

⌛ Testing commit ff248de with merge 934880f...

bors · 2025-04-09T21:34:13Z

☀️ Test successful - checks-actions
Approved by: tgross35
Pushing 934880f to master...

github-actions · 2025-04-09T21:36:18Z

What is this?

This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 48f89e7 (parent) -> 934880f (this PR)

Test differences

Show 170 test diffs

Stage 1

[codegen] tests/codegen/string-push.rs: [missing] -> pass (J1)

Stage 2

[codegen] tests/codegen/string-push.rs: [missing] -> pass (J0)

Additionally, 168 doctest diffs were found. These are ignored, as they are noisy.

Job group index

Job duration changes

dist-x86_64-apple: 10157.9s -> 12286.7s (21.0%)
x86_64-apple-1: 10027.8s -> 8027.1s (-20.0%)
aarch64-apple: 4027.3s -> 4687.9s (16.4%)
x86_64-apple-2: 4827.3s -> 4064.1s (-15.8%)
dist-apple-various: 6854.3s -> 7883.0s (15.0%)
x86_64-rust-for-linux: 2672.0s -> 2845.4s (6.5%)
dist-loongarch64-musl: 5823.9s -> 5447.7s (-6.5%)
dist-x86_64-linux-alt: 7456.8s -> 7057.7s (-5.4%)
mingw-check: 1311.7s -> 1243.4s (-5.2%)
dist-x86_64-linux: 5529.6s -> 5253.4s (-5.0%)

How to interpret the job duration changes?

Job durations can vary a lot, based on the actual runner instance
that executed the job, system noise, invalidated caches, etc. The table above is provided
mostly for t-infra members, for simpler debugging of potential CI slow-downs.

rust-timer · 2025-04-09T23:43:19Z

Finished benchmarking commit (934880f): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

	mean	range	count
Regressions ❌ (primary)	0.8%	[0.8%, 0.8%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.3%	[-1.2%, -0.1%]	35
Improvements ✅ (secondary)	-0.3%	[-0.8%, -0.2%]	46
All ❌✅ (primary)	-0.3%	[-1.2%, 0.8%]	36

Max RSS (memory usage)

Results (primary -3.0%, secondary 3.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.9%	[1.3%, 2.4%]	2
Regressions ❌ (secondary)	3.6%	[1.6%, 4.7%]	5
Improvements ✅ (primary)	-4.9%	[-8.5%, -2.5%]	5
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-3.0%	[-8.5%, 2.4%]	7

Cycles

Results (primary 1.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.1%	[1.1%, 1.1%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.1%	[1.1%, 1.1%]	1

Binary size

Results (primary -0.1%, secondary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.0%, 0.3%]	19
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.3%	[-1.1%, -0.0%]	23
Improvements ✅ (secondary)	-0.1%	[-0.2%, -0.0%]	38
All ❌✅ (primary)	-0.1%	[-1.1%, 0.3%]	42

Bootstrap: 780.203s -> 779.708s (-0.06%)
Artifact size: 366.14 MiB -> 366.24 MiB (0.03%)

…ng-insert, r=tgross35 speed up `String::push` and `String::insert` Addresses the concerns described in rust-lang#116235. The performance gain comes mainly from avoiding temporary buffers. Complex pattern matching in `encode_utf8` (introduced in rust-lang#67569) has been simplified to a comparison and an exhaustive `match` in the `encode_utf8_raw_unchecked` helper function. It takes a slice of `MaybeUninit<u8>` because otherwise we'd have to construct a normal slice to uninitialized data, which is not desirable, I guess. Several functions still have that [unneeded zeroing](https://rust.godbolt.org/z/5oKfMPo7j), but a single instruction is not that important, I guess. `@rustbot` label T-libs C-optimization A-str

rustbot assigned scottmcm May 6, 2024

scottmcm reviewed May 13, 2024

library/core/src/char/methods.rs (outdated)

scottmcm reviewed May 13, 2024

library/alloc/src/string.rs

scottmcm reviewed May 13, 2024

library/core/src/char/methods.rs (outdated)

scottmcm requested changes May 13, 2024

rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 13, 2024

cuviper mentioned this pull request May 14, 2024

Remove the branches from len_utf8 #125129

Closed

rustbot added the has-merge-commits PR has merge commits, merge with caution. label Jul 10, 2024

lincot force-pushed the speed-up-string-push-and-string-insert branch from 9511918 to 89fa55e Compare July 10, 2024 19:08

rustbot removed the has-merge-commits PR has merge commits, merge with caution. label Jul 10, 2024

lincot force-pushed the speed-up-string-push-and-string-insert branch from 89fa55e to 2cb20b3 Compare August 6, 2024 19:00

rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 21, 2025

tgross35 reviewed Apr 4, 2025

tgross35 self-assigned this Apr 4, 2025

lincot force-pushed the speed-up-string-push-and-string-insert branch from 9cf92e5 to b6cf666 Compare April 8, 2025 14:26

tgross35 approved these changes Apr 8, 2025

lincot added 2 commits April 9, 2025 13:06

Speed up String::push and String::insert

09d5bcf

Improve performance of `String` methods by avoiding unnecessary memcpy for the character bytes, with added codegen check to ensure compliance.

Add missing black_box in String benchmarks

ff248de

lincot force-pushed the speed-up-string-push-and-string-insert branch from b6cf666 to ff248de Compare April 9, 2025 10:06

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 9, 2025

bors added the merged-by-bors This PR was explicitly merged by bors. label Apr 9, 2025

bors merged commit 934880f into rust-lang:master Apr 9, 2025
7 checks passed

rustbot added this to the 1.88.0 milestone Apr 9, 2025

lincot mentioned this pull request Apr 9, 2025

String::push is slow #116235

Closed

rustbot removed the perf-regression Performance regression. label Apr 9, 2025

MoSal mentioned this pull request Apr 13, 2025

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

Open
