Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Don't implement mem::replace with mem::swap. #83022

Merged
merged 1 commit into from
Mar 13, 2021

Conversation

m-ou-se
Copy link
Member

@m-ou-se m-ou-se commented Mar 11, 2021

swap is a complicated operation, so this changes the implementation of replace to use read and write instead.

See #83019.

I wrote there:

Implementing the simpler operation (replace) with the much more complicated operation (swap) doesn't make a whole lot of sense. replace is just read+write, and the primitive for moving out of a &mut. swap is for doing that to two &mut at the same time, which is both more niche and more complicated (as shown by swap_nonoverlapping_bytes).

This could be especially interesting for Option<VeryLargeStruct>::take(), since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

But also for small values where swap just reads/writes using temporary variable, this makes a replace or take operation simpler:
image

@rust-highfive
Copy link
Contributor

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 11, 2021
@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

1 similar comment
@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021
@bors
Copy link
Collaborator

bors commented Mar 11, 2021

⌛ Trying commit bf27819 with merge 9acd471bcf643c584381c0ff22e4d38a7617f313...

@rust-timer

This comment has been minimized.

@eddyb
Copy link
Member

eddyb commented Mar 11, 2021

FWIW I just tested this PR against Rust-GPU and it appears to handle the for i in 0..n motivation, from #83019's description.

@bors
Copy link
Collaborator

bors commented Mar 11, 2021

☀️ Try build successful - checks-actions
Build commit: 9acd471bcf643c584381c0ff22e4d38a7617f313 (9acd471bcf643c584381c0ff22e4d38a7617f313)

@rust-timer
Copy link
Collaborator

Queued 9acd471bcf643c584381c0ff22e4d38a7617f313 with parent 5c6d3bf, future comparison URL.

@nagisa
Copy link
Member

nagisa commented Mar 11, 2021

since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

@nagisa
Copy link
Member

nagisa commented Mar 11, 2021

x86_64, aarch64 assembly-level comparison.

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

@nagisa Nope, that just writes the tag of the Option, and nothing else. (Just like write()'ing a MaybeUninit::uninit() writes nothing at all.)

But looks like llvm is also smart enough to do that with the more compicated swap code: https://godbolt.org/z/TM1bcv

@camelid camelid added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 11, 2021
@khyperia
Copy link
Contributor

khyperia commented Mar 11, 2021

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

x86_64, aarch64 assembly-level comparison.

Looking at the x86_64 assembly, changing it to be this shows off how the ptr::write version optimizes the None case, where the swap version doesn't.

pub fn replace_large(dest: &mut Option<[u8; 127]>) -> Option<[u8; 127]> {
    replace(dest, None)
}

But looks like llvm is also smart enough to do that with the more compicated swap code

Not smart enough in all cases, I guess? No idea how this case is different than yours, though.

edit: discovered this weird thing, I mistakenly thought it was relevant but I missed something ugh. big shrug~

godbolt link

pub fn speedy_fast(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    std_replace(x, None)
}
// copied from std::mem::replace, just to be clear what's happening (same behavior with std::mem::replace)
fn std_replace(dest: &mut Option<[u64; 10]>, mut src: Option<[u64; 10]>) -> Option<[u64; 10]> {
    std::mem::swap(dest, &mut src);
    src
}

// manually inlined, should be the same? but isn't
pub fn big_slow_write(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    let mut blah = None;
    std::mem::swap(x, &mut blah);
    blah
}

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (9acd471bcf643c584381c0ff22e4d38a7617f313): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021
@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

🎉

Quite a significant improvement! Everything is green, except for one +174% that pushed one average into the red. I suppose that's just a random outlier:

image

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

r? @eddyb

@rust-highfive rust-highfive assigned eddyb and unassigned kennytm Mar 11, 2021
@klensy
Copy link
Contributor

klensy commented Mar 11, 2021

Can it rerun perf without try?

@rust-timer queue include=46449

@rust-timer
Copy link
Collaborator

Insufficient permissions to issue commands to rust-timer.

@nagisa
Copy link
Member

nagisa commented Mar 12, 2021

@rust-timer queue include=46449

I sure have permissions, lets see what happens.

EDIT: Looks like no, you can't rerun perf withouta another try build :(

@bors
Copy link
Collaborator

bors commented Mar 12, 2021

☀️ Try build successful - checks-actions
Build commit: 6537b60dde68a3362e54e06e38e45445cd1e920d (6537b60dde68a3362e54e06e38e45445cd1e920d)

@rust-timer
Copy link
Collaborator

Queued 6537b60dde68a3362e54e06e38e45445cd1e920d with parent 0cc64a3, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (6537b60dde68a3362e54e06e38e45445cd1e920d): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 12, 2021
@nagisa
Copy link
Member

nagisa commented Mar 12, 2021

Well, we're good to go, I guess!

@bors r+

@bors
Copy link
Collaborator

bors commented Mar 12, 2021

📌 Commit bf27819 has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 12, 2021
@SimonSapin
Copy link
Contributor

For what it’s worth I wasn’t saying in #83019 that we shouldn’t change anything, only that the current situation was not entirely accidental.

These macro-level results look great. Since this effectively undoes #40454 for replace it might be interesting to also look at micro-level effects, especially for large values. Unfortunately I don’t find code to reproduce the benchmarks in #40454.

@bors
Copy link
Collaborator

bors commented Mar 12, 2021

⌛ Testing commit bf27819 with merge 46a934a...

@bors
Copy link
Collaborator

bors commented Mar 13, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 46a934a to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 13, 2021
@bors bors merged commit 46a934a into rust-lang:master Mar 13, 2021
@rustbot rustbot added this to the 1.52.0 milestone Mar 13, 2021
@m-ou-se m-ou-se deleted the mem-replace-no-swap branch March 16, 2021 19:02
@jordens
Copy link

jordens commented Apr 21, 2021

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

@m-ou-se
Copy link
Member Author

m-ou-se commented Apr 21, 2021

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

Nice, that's a huge difference! Even with --opt-level set to z or s, it makes a big difference.

@scottmcm scottmcm mentioned this pull request Jun 24, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 6, 2023
…iler-errors

Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway.  After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s.  In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712).  So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them.  I don't want to try to bite off too much in this PR, though.  (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)

Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>

Before:
```llvm
define void `@swap_rgb48_old(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void `@llvm.lifetime.start.p0(i64` 6, ptr nonnull %a.i)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void `@llvm.lifetime.end.p0(i64` 6, ptr nonnull %a.i)
  ret void
}
```
Note it going to stack:
```nasm
swap_rgb48_old:                         # `@swap_rgb48_old`
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret
```

Now:
```llvm
define void `@swap_rgb48(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```
still lowers to `dword`+`word` operations, but has no stack traffic:
```nasm
swap_rgb48:                             # `@swap_rgb48`
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret
```

And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM changes in IR
```llvm
define void `@replace_short_array(ptr` noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```
but that lowers to reasonable `dword`+`qword` instructions still
```nasm
replace_short_array:                    # `@replace_short_array`
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
```
@joboet joboet mentioned this pull request Mar 16, 2024
fmease added a commit to fmease/rust that referenced this pull request Mar 16, 2024
Optimize `ptr::replace`

rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 17, 2024
Rollup merge of rust-lang#122601 - joboet:ptr_replace, r=workingjubilee

Optimize `ptr::replace`

rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.