[RFC] AtomicPerByte (aka "atomic memcpy") #3301
Conversation
cc @ojeda
This could mention the …
With some way for the language to express "this type is valid for any bit pattern", which project safe-transmute presumably will provide (and that already exists in the ecosystem as …). This would also require removing the safe …. That's extra complexity, but it means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
The "uninit" part of not without the fabled and legendary Freeze Intrinsic anyway. |
On the other hand, …
Note that LLVM already implements this operation: …
The trouble with that intrinsic is that …
> - In order for this to be efficient, we need an additional intrinsic hooking into special support in LLVM. (Which LLVM needs to have anyway for C++.)
How do you plan to implement this until LLVM supports it?
I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as-yet-unmerged C++ proposal is implemented in LLVM, that seems like a problem.
(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)
I haven't looked into the details yet of what's possible today with LLVM. There are a few possible outcomes:

- We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
- Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like `ptr::{read_volatile, write_volatile}` combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
- We use the existing `llvm.memcpy.element.unordered.atomic`, after figuring out the consequences of the `unordered` property.
- Until LLVM support appears, we implement it in the library using a loop of `AtomicUsize::load()`/`store()`s and a fence, possibly using an efficient inline assembly alternative for some popular architectures. (A sketch of this fallback follows below.)
I'm not fully sure yet which of these are feasible.
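For concreteness, here is a minimal sketch of what that last fallback could look like. This is hypothetical, not the RFC's implementation; it assumes the data is word-sized, word-aligned, and only ever accessed atomically:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

/// Acquire-ordered "atomic load memcpy" built from relaxed per-word loads
/// plus a trailing acquire fence (hypothetical fallback sketch).
///
/// Safety: `src` must point to `len` valid `AtomicUsize`s, and `dst` must
/// be valid for writing `len` words.
unsafe fn atomic_load_words(src: *const AtomicUsize, dst: *mut usize, len: usize) {
    for i in 0..len {
        // Relaxed per-word loads: tearing between words is allowed.
        let word = (*src.add(i)).load(Ordering::Relaxed);
        dst.add(i).write(word);
    }
    // Pairs with a release fence/stores on the writer side, giving the
    // copy as a whole acquire semantics.
    fence(Ordering::Acquire);
}
```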
I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about LLVM's. (It seems …)
> but it's easy to accidentally cause undefined behavior by using `load` to make an extra copy of data that shouldn't be copied.
> - Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?
Given these options and considering what the C++ paper chose, `AtomicPerByte` sounds OK and has the advantage of having `Atomic` as a prefix.
`AtomicPerByteMaybeUninit` or `AtomicPerByteManuallyDrop` to also resolve the other concern around dropping? Those are terrible names though...
Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed; it behaves a lot more like a non-atomic variable in this sense. In practical terms, in single-threaded scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without that fence having a cost on weakly ordered machines (e.g. without implementing it as an overly strong acquire/release fence). That said, I think we could add an intrinsic to LLVM that does what we want here; I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)
CC @RalfJung, who has stronger opinions on Unordered (and is the one who provided that link in the past). I think we can easily implement this with relaxed in compiler-builtins, but it should get a new intrinsic, since many platforms can implement it more efficiently.
We already have unordered atomic memcpy intrinsics in compiler-builtins, for 1, 2, 4 and 8 byte access sizes.
I'm not sure we'd want unordered, as mentioned above...
To clarify the difference between relaxed and unordered (in terms of loads and stores): if you have

```rust
static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}
```

then with relaxed the assertion can never fail, because relaxed guarantees a single total modification order for each individual atomic variable. In other words, for unordered, it would be legal for 2 to be stored before 1, or for `a` to read 2 while `b` reads 1.
Something that could work but would not be technically correct: … Those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics. Assuming you're on any modern CPU (except Alpha, iirc) it will behave like relaxed atomics, because that's what standard load/store instructions do.
Those fences aren't always no-ops at runtime; they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347); certainly plain stores can in some cases (this is easy to show happening in godbolt). Either way, my point has not been that we can't implement this. We absolutely can, and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.
I like …
```rust
loop {
    let s1 = self.seq.load(Acquire);
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
```
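For orientation, a fuller version of this reader loop might look like the following sketch; it assumes the writer keeps `seq` odd while a write is in progress, `read_data` is the per-byte atomic acquire load discussed in this RFC, and `Ordering::*` is imported as in the snippet above:

```rust
let value = loop {
    let s1 = self.seq.load(Acquire);
    if s1 % 2 == 1 {
        continue; // a write is in progress; retry
    }
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
    if s1 == s2 {
        break data; // seqnum unchanged: the copy is consistent
    }
    // seqnum changed: a writer intervened; retry
};
```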
There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the `load` here cannot return an outdated value that would prevent us from noticing a seqnum bump. The reason this is the case: if there is a concurrent `write`, and if any part of `data` reads from that write, then we have a release-acquire pair, so we are guaranteed to see at least the first `fetch_add` from `write`, and thus we will definitely see a version conflict. OTOH, if `s1` reads-from some second `fetch_add` in `write`, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)
(This is related to this comment by @cbeuw.)
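For readers following along, here is a sketch of the writer shape this argument assumes (single-writer case; `write_data` stands in for the hypothetical per-byte atomic release store):

```rust
fn write(&self, value: Data) {
    // First bump: seqnum becomes odd, signalling a write in progress.
    self.seq.fetch_add(1, Relaxed);
    // If any byte of a concurrent read_data(Acquire) reads from this
    // release store, the reader is guaranteed to also see the first bump.
    write_data(&self.data, value, Release);
    // Second bump: seqnum becomes even again. If the reader's first seq
    // load reads from this release RMW, the full data is visible to it.
    self.seq.fetch_add(1, Release);
}
```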
Yeah, exactly. This is why people sometimes ask for a "release-load" operation. This second load operation needs to happen "after" the `read_data()` part, but the usual (incorrect) `read_data` implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the `read_data()` operation.
Under a reordering-based atomic model (as CPUs use), a release load makes sense and works. Under the C11 model, release loads don't really work unless they are also RMWs (`fetch_add(0)`).
Yeah, the famous seqlock paper discusses "read-don't-modify-write" operations.
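Concretely, a read-don't-modify-write replaces the reader's final relaxed load with an RMW that leaves the value unchanged, which under C11 can carry release semantics. A sketch, continuing the reader loop above:

```rust
// Instead of `self.seq.load(Relaxed)` as the final check, a "release
// load" can be expressed as a no-op RMW (usually far more expensive
// than the fence/ordering-based formulation):
let s2 = self.seq.fetch_add(0, Release);
```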
> while the second one is basically a memory fence followed by a series of `AtomicU8::store`s.
> Except the implementation can be much more efficient.
> The implementation is allowed to load/store the bytes in any order,
> and doesn't have to operate on individual bytes.
The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.
I was originally thinking this would be specified as a series of AtomicU8
load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.
For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.
I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.
The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.
A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?
I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.
In the C++ paper they are basically as:
for (size_t i = 0; i < count; ++i) { reinterpret_cast<char*>(dest)[i] = atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed); } atomic_thread_fence(order);
and
atomic_thread_fence(order); for (size_t i = 0; i < count; ++i) { atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store( reinterpret_cast<char*>(source)[i], memory_order::relaxed); }
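For comparison, the store direction rendered in Rust might look like this sketch (assuming `dest` points to bytes that are only ever accessed atomically and `src` is valid for `count` reads):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// The fence comes first for stores: a release fence followed by relaxed
// per-byte stores, mirroring the C++ formulation above.
unsafe fn atomic_store_memcpy(dest: *const AtomicU8, src: *const u8, count: usize, order: Ordering) {
    fence(order); // `order` would be Release for a release store
    for i in 0..count {
        (*dest.add(i)).store(src.add(i).read(), Ordering::Relaxed);
    }
}
```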
> A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

> In the C++ paper they are basically specified as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example, changing a 4-byte atomic acquire memcpy to an `AtomicU32` acquire load would not be correct (even if we know everything is initialized and aligned etc.). Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access. (See the sketch below.)
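To illustrate that last point, here is a hand-inlined sketch (shrunk to a single byte) of why a fence-based acquire memcpy is observably stronger than an access-based acquire load:

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU8, Ordering};

static OTHER: AtomicU32 = AtomicU32::new(0);

// A fence-based acquire "memcpy" of one byte, inlined by hand.
fn fence_based_acquire_read(data: &AtomicU8) -> (u32, u8) {
    let x = OTHER.load(Ordering::Relaxed); // unrelated relaxed load
    let b = data.load(Ordering::Relaxed);  // the "memcpy" body
    fence(Ordering::Acquire);              // the trailing fence
    // The acquire fence upgrades *both* earlier relaxed loads: if either
    // one read from a release store, that store now synchronizes-with us.
    // A plain `data.load(Acquire)` would only give that guarantee for
    // `data`, not for OTHER, so the two specs are not interchangeable.
    (x, b)
}
```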
Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total modification order, but races are allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.
Well, I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.
Yeah, this is the use case - treating shared-memory IPC as a security boundary. IIUC that's what @RalfJung was responding to by saying that it's not well-defined since UB is a whole-program property. (I'm sure Ralf will push back and clarify that "not well-defined" is not an accurate characterization of what he said, but I've learned not to try to capture the subtleties 😛)
When reasoning about security, I would argue there are other ways to prove that the boundary is solid: if you can prove that for any malicious program M, there is a non-malicious program N that does the exact same shared memory operations at the hardware level, and that your program has no UB when combined with N, then you don't even need to consider M, because they are literally the same program. For x86 this seems plausible, since relaxed atomics generally compile down to no additional synchronization. I.e. the step from abstract machine to actual hardware is not injective, and if two different programs in the abstract machine translate to the same actual program, then it's clearly indistinguishable which program was used, because they are identical.
@Diggsey: Rust cannot be fully specified in terms of an abstract machine. Rust is a systems programming language, and that means that it must also make guarantees about how the abstract machine is implemented in terms of the concrete machine that the hardware and OS actually implement.
Can we please move shared-memory IPC with atomics off of this RFC thread? As I said, this isn't specific to atomic-per-byte memcpy at all, so this is really the wrong place for that discussion.
What is the current state of / blocker for progress on this RFC? Today I ran across another case where I needed this. It doesn't seem like much has happened since the end of summer.
@VorpalBlade This is still stuck on the details of the API. See #3301 (comment)
Why not have multiple store methods (perhaps not all 6, but enough to cover the use cases)? They could dispatch to the same underlying intrinsic internally. It isn't like Rust doesn't already do this in the standard library: `foo`, `foo_mut`, `unchecked_foo`, etc. Though perhaps coming up with suitable names will be just as difficult.
Because that would just result in confusion and unexpected behaviour. E.g. it's unclear what reasonable behaviour would be for types that need to be dropped.
What if the only option was:

```rust
pub fn store(&self, value: &MaybeUninit<T>, ordering: Ordering);
```

and, to make storing a copy more ergonomic:

```rust
impl<T> MaybeUninit<T> {
    // maybe have ?Sized bound? icr if that works with unions
    pub const fn from_ref(v: &T) -> &Self {
        // Safety: &Self can't be written to, so this works
        unsafe { &*(v as *const T as *const Self) }
    }
}
```

That way, if you want to store a copy of some type, you just use: … And if you want …, do remember that atomic …
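Presumably the elided usage is something along these lines (hypothetical, matching the `store` signature suggested above):

```rust
// Hypothetical: storing a copy of `value` via the single store method.
atomic.store(MaybeUninit::from_ref(&value), Ordering::Release);
```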
What about only providing the intrinsic as an …
The intrinsic seems more fundamental to me than the API around it.
I'd like to suggest an alternative approach to solving the same problem (which I was thinking of suggesting before I saw this thread): an unsafe intrinsic (which I think of as …)
By combining this with an acquire-ordered load before doing a …

This should also be very easy to implement; it's basically just an assembly-level load instruction that's "opaque" to the compiler, preventing it from performing optimisations based on knowledge of what address is being loaded. (I think it can be implemented as a load instruction written with inline assembly, where the compiler has to assume the instruction could place arbitrary bits into the returned value, because it can't see that it's a load instruction.) If there is no race, then the load instruction will load the pointer value. If there is a race, then the load instruction might or might not return useful data, but it will load some sequence of bits, which is valid to store in a …

This approach seems to be more powerful than requiring …
This is sufficient for synchronization, but not for functions like …
@ais523 Allowing racy reads on non-atomic accesses without UB has some very non-trivial consequences, and would be a huge departure from some of the fundamental principles that the C++ memory model (which we inherit) is based on. This paper explores this a bit by having two languages, where the first has full UB on read-write races but the second makes them return "poison", similar to what you suggested. We should not do this unless either C++ also does it, or we are ready to make our memory model independent from that of C++ (with all the consequences that entails, e.g. making it impossible to use atomic operations on memory shared with C++ code).
You could hardly be further from the truth here. ;) Remember that "implementing" any change to the concurrency memory model requires making sure that the model even still makes any sense and supports all the desired optimizations, which typically requires months of work by an expert (and there are very few experts able to do that kind of work; I am not one of them). Suggesting to "just" change something fundamental about the concurrency memory model is like suggesting to "just" change some detail about a rocket engine. These are non-trivial pieces of engineering, and you can't "just" change anything about them without great care.
@RalfJung: I agree that changing the memory model is a bad idea. My suggestion is designed to avoid needing to change the memory model, by confining the racy reads to a particular intrinsic/function that the compiler can't optimise around (and thus can't exploit the fact that the read would be undefined behaviour if done normally), and whose observable behaviour always matches something that could be done in the existing memory model. I agree that "very easy to implement" is quite different from "very easy to prove correct"! Nonetheless, I don't think this is too hard to prove correct on the basis of "the executable output by the compiler must match the behaviour of the source program". The idea is that, from the compiler's point of view, …
The compiler cannot take advantage of the "maybe the pointer isn't read" case to, e.g., move reads and writes around in a way that would stop the read working, because it doesn't know whether or not the opaque function reads the pointer, and has to assume (in any case where it can't prove a race exists) that there might be no race and the function might be reading the pointer. The compiler also cannot take advantage of the "maybe the pointer is read" case to assume no race and optimise on that basis, again because it doesn't know whether or not the opaque function reads the pointer; if the function chose to ignore the pointer on that call, there would be no race, and thus there would be no optimisation-enabling UB for it to exploit.

Another way to think about it is to imagine that we have a magic function that lets us know whether or not a read could race (or access mutably borrowed memory), and:

```rust
unsafe fn read_racy<T>(ptr: *const T) -> MaybeUninit<T> {
    if the_read_will_race(ptr) {
        MaybeUninit::uninit()
    } else {
        unsafe { core::ptr::read(ptr as *const MaybeUninit<T>) }
    }
}
```
Assuming the existence of … Although the function in question can't be implemented in Rust, due to there being no working … As for the paper you linked, it's basically discussing "what would happen if the memory model allowed any read to race with writes, producing an undefined value rather than undefined behaviour?", and its conclusion was "you would miss optimisations". By confining reads that can race with writes to a particular function/intrinsic, you avoid the missed optimisations in the code in general. The compiler will optimise less around a …
@RalfJung What if one was okay with these operations being opaque to the optimizer? That would allow them to desugar to …
@ais523 you are proposing to add an operation that does not currently exist in the memory model and that cannot be implemented with the operations that do exist. That is changing (more specifically, extending) the memory model.
So the situation here is basically that we have two implementations of `read_racy`:
a) an implementation that is entirely legal in the memory model and that has the desired semantics, but which is a hypothetical implementation that would be near-impossible to write in practice due to requiring a race-predicting function
b) an implementation that does the same thing and can be implemented in practice, but contains operations outside the memory model
and the trick is that the function is opaque to the compiler so that it can't tell which we're using. The compiler has to take into account the possibility that the function might be implemented as a), so it can't make any optimisations that would cause function a) to break or otherwise exploit the potential for races in any way. That means that if we actually implement the function as b) (in a way that is opaque to the compiler), everything will still work correctly because the compiler can't do anything with it that it couldn't do with function a).
I think it's reasonable to question whether this method of implementing things is legitimate, i.e. "the compiler can't distinguish between this function and an implementation of the same function that doesn't break the memory model, and thus must always compile the function correctly even if its internal implementation breaks the memory model, because there's no way it could vary its behaviour between the two cases". To me, this is one of those things that obviously works in practice, but adding the principle itself may be something of a large step to take.
@DemiMarie it is definitely not legal to do this with inline asm; the requirements for inline asm blocks are not met. And in fact, there are optimizations which are incompatible with the existence of an operation to do non-atomic reads where races are not full UB, see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.
This is not a reliable way to build a compiler -- so no, this is not legitimate. You must present a consistent formal semantics and show that it has all the desired properties, and the compiler must implement those semantics. Hand-waving something involving "this is opaque and hence …" does not suffice (unless you can produce a proper proof of correctness of your reasoning principle, of course). This is the reasoning we are applying to inline asm blocks, and to ensure soundness they are subject to a tight restriction, which means they are not suited to add the operation you are proposing. Also, I think this is getting off-topic for this RFC. Given how terrible GitHub is at threading, we should keep discussion here focused on the proposed new primitive, and explore possible alternatives elsewhere (a separate issue, a thread on IRLO, a topic on Zulip).
Note the paper says they decided that step wasn't allowed in LLVM: …
Yes, LLVM does not use the C++ memory model; they have their own. In the LLVM memory model, data races are not UB; they behave like …
You can definitely do an "inline assembly memcpy" from a raw pointer to e.g. a stack variable. I am quite sure that: …
AFAICT, LLVM at least pretends it lowers a C read into an "UB on race" read, but in addition to that, it supports a "poison on race" read [and you can turn "poison on race" into "non-deterministic but stable bytes on race" via freeze], where LLVM is allowed to convert a program with a conditional "UB on race" read into a program with an unconditional "poison on race" read and a conditional use. I don't see a contradiction in that. I personally believe that a good internal IR needs to have all of "UB on race", "poison on race", and "nondet but stable on race" reads, but I don't see why a semantics for Rust (as opposed to an internal IR) needs to have "poison on race". This RFC argues that the surface language needs to have "nondet but stable on race" as well, though not "poison on race"; I am still not sure if there is a need for poison in the surface semantics.
If that memory is also accessed outside inline asm blocks, then no, you cannot do that.
~~Why not? If the access is not racy, then this is well-defined code that returns the right value. If the access is racy, then it's the same as the assembly pulling the numbers from the environment, which is also well defined.~~ And now I realize angelic nondeterminism has weird interactions with malloc, which shouldn't matter here, but which are hard to prove don't matter.
Though there is a fairly strong reason that inline assembly "works" for seqlocks: you could have a "justification" of "suspend all other threads; if the seqlock is even, take a snapshot; if the seqlock is odd, return nondeterministic bytes; unsuspend all other threads", and if you consider "suspend all other threads" to be a valid operation (AFAICT it is), it will do the same thing as an AtomicPerByte memcpy [of course, this requires the inline asm to be able to access the seqlock, but this normally holds].
I do think we'll need to have "load a value from memory. If it's racy, return something nondeterministic" loads. Does the mere existence of these loads actually break any optimizations? And if they exist, inline asm has a "joker" behavior of picking the right way of doing it.
Figure 3 in the paper only shows that you have to pick whether any single load is "nondeterministic on race" or "UB on race", not that you can't have both. There are a bunch of optimizations you can do with "UB on race" loads but can't do with "nondeterministic on race" loads, like duplicating/deduplicating them, but AFAICT you can't do them with run-once asm blocks either, and the mere existence of "nondeterministic on race" loads does not make them less valid.
There is no AM operation that can check if an access is racy, so this is not a valid line of reasoning. You cannot use inline assembly to extend the AM with new operations, as correctness proofs involving the AM can and do rely on knowing the full set of operations that can be performed by arbitrary AM code. Correctness of optimizations assumes that all code accessing AM-visible state is Rust code; therefore inline asm can only perform actions on the AM state that can also be performed by Rust code. See this link I already posted above for more context.
Rendered