[RFC] AtomicPerByte (aka "atomic memcpy") #3301


Open
wants to merge 9 commits into
base: master

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Aug 14, 2022

@m-ou-se m-ou-se added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 14, 2022
@bjorn3
Member

bjorn3 commented Aug 14, 2022

cc @ojeda

@ibraheemdev
Member

ibraheemdev commented Aug 14, 2022

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).

@5225225

5225225 commented Aug 14, 2022

With some way for the language to express "this type is valid for any bit pattern" (which project safe transmute will presumably provide, and which already exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we/the ecosystem could provide a safe into_inner (returning a T) if T is valid for any bit pattern.

This would also require removing the safe uninit method. But you could presumably always use an AtomicPerByte<MaybeUninit<T>>, with no runtime cost to passing MaybeUninit::uninit() to new.

That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
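A rough sketch of that shape. AtomicPerByteRead and the AnyBitPattern bound here are hypothetical stand-ins (the real bound would come from project safe transmute, bytemuck, or zerocopy), so this is only an illustration of the proposed API surface:

```rust
use core::mem::MaybeUninit;

// Hypothetical result type for a tearing load, as proposed above.
pub struct AtomicPerByteRead<T>(pub MaybeUninit<T>);

// Stand-in for the "valid for any bit pattern" bound that project safe
// transmute (or bytemuck/zerocopy) would provide; not a real std trait.
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u32 {}

impl<T: AnyBitPattern> AtomicPerByteRead<T> {
    // Safe because any torn-but-initialized bit pattern is a valid T.
    // Note: this does not cover uninitialized bytes, only torn ones.
    pub fn into_inner(self) -> T {
        unsafe { self.0.assume_init() }
    }
}

fn main() {
    let torn = AtomicPerByteRead(MaybeUninit::new(0xDEAD_BEEF_u32));
    assert_eq!(torn.into_inner(), 0xDEAD_BEEF);
    println!("ok");
}
```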

@Lokathor
Contributor

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

@T-Dark0

T-Dark0 commented Aug 14, 2022

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, MaybeUninit would suffice, I think?

@programmerjake
Member

note that LLVM already implements this operation:
llvm.memcpy.element.unordered.atomic Intrinsic
with an additional fence operation for acquire/release.

@comex

comex commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

Comment on lines +180 to +181
- In order for this to be efficient, we need an additional intrinsic hooking into
special support in LLVM. (Which LLVM needs to have anyway for C++.)
Member

How do you plan to implement this until LLVM implements this?

I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as-yet-unmerged C++ proposal lands in LLVM, that seems to be a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Member Author

I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly using an efficient inline-assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.
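A minimal sketch of that last fallback option, using per-byte relaxed accesses plus a fence (function names are illustrative; a real version would operate on usize-sized chunks and use inline assembly where profitable):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// Acquire "atomic memcpy" load: relaxed per-byte loads, then an acquire fence.
fn load_per_byte(src: &[AtomicU8], dst: &mut [u8]) {
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = s.load(Ordering::Relaxed);
    }
    fence(Ordering::Acquire);
}

// Release "atomic memcpy" store: a release fence, then relaxed per-byte stores.
fn store_per_byte(src: &[u8], dst: &[AtomicU8]) {
    fence(Ordering::Release);
    for (s, d) in src.iter().zip(dst.iter()) {
        d.store(*s, Ordering::Relaxed);
    }
}

fn main() {
    let shared: Vec<AtomicU8> = (0..4).map(|_| AtomicU8::new(0)).collect();
    store_per_byte(&[1, 2, 3, 4], &shared);
    let mut out = [0u8; 4];
    load_per_byte(&shared, &mut out);
    assert_eq!(out, [1, 2, 3, 4]);
    println!("ok");
}
```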

@m-ou-se
Member Author

m-ou-se commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)

but it's easy to accidentally cause undefined behavior by using `load`
to make an extra copy of data that shouldn't be copied.

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.

Member

AtomicPerByteMaybeUninit or AtomicPerByteManuallyDrop to also resolve the other concern around dropping? Those are terrible names though...

@ojeda

ojeda commented Aug 15, 2022

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

@thomcc
Member

thomcc commented Aug 15, 2022

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense)

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

@thomcc
Member

thomcc commented Aug 15, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

@bjorn3
Member

bjorn3 commented Aug 15, 2022

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

@thomcc
Member

thomcc commented Aug 15, 2022

I'm not sure we'd want unordered, as mentioned above...

@thomcc
Member

thomcc commented Aug 16, 2022

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

thread2's assertion will never fail if O is Relaxed, but it could fail if O is (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

@programmerjake
Member

something that could work but not be technically correct is:
compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, iirc) it will behave like relaxed atomics, because that's what standard load/store instructions do.
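Spelled out in Rust, with volatile per-byte copies standing in for the unordered atomic memcpy (Rust exposes no unordered ordering), the scheme as literally described looks like this. As acknowledged above, it is not technically correct, only a "works in practice" sketch:

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Compiler-only fences around a volatile byte copy; the volatile accesses
// stand in for the unordered atomic memcpy intrinsic. The fences emit no
// instructions but restrict compiler reordering.
unsafe fn fenced_copy(src: *const u8, dst: *mut u8, len: usize) {
    compiler_fence(Ordering::Acquire);
    for i in 0..len {
        dst.add(i).write_volatile(src.add(i).read_volatile());
    }
    compiler_fence(Ordering::Release);
}

fn main() {
    let src = [9u8, 8, 7];
    let mut dst = [0u8; 3];
    unsafe { fenced_copy(src.as_ptr(), dst.as_mut_ptr(), 3) };
    assert_eq!(dst, [9, 8, 7]);
    println!("ok");
}
```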

@thomcc
Member

thomcc commented Aug 16, 2022

Those fences aren't always no-ops at runtime, they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347), certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

@tschuett

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable.
The LLVM guys started something similar in 2016:
https://reviews.llvm.org/D27133

loop {
let s1 = self.seq.load(Acquire);
let data = read_data(&self.data, Acquire);
let s2 = self.seq.load(Relaxed);
Member

@RalfJung RalfJung Aug 20, 2022

There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the load here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case is that if there is a concurrent write, and if any part of data reads from that write, then we have a release-acquire pair, so then we are guaranteed to see at least the first fetch_add from write, and thus we will definitely see a version conflict. OTOH if the s1 reads-from some second fetch_add in write, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)

Member Author

Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the read_data() part, but the usual (incorrect) read_data implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the read_data() operation.
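To make the role of each ordering concrete, here is a minimal, single-threaded-testable seqlock sketch in which read_data/write_data are spelled out as relaxed per-byte accesses plus fences (all names illustrative; a real implementation would use the proposed AtomicPerByte):

```rust
use std::sync::atomic::{fence, AtomicU8, AtomicUsize, Ordering::*};

struct SeqLock<const N: usize> {
    seq: AtomicUsize,
    data: [AtomicU8; N],
}

impl<const N: usize> SeqLock<N> {
    fn write(&self, value: &[u8; N]) {
        self.seq.fetch_add(1, Relaxed); // odd: writer active
        fence(Release); // release "atomic memcpy" store:
        for (d, s) in self.data.iter().zip(value) {
            d.store(*s, Relaxed);
        }
        self.seq.fetch_add(1, Release); // even again
    }

    fn read(&self) -> [u8; N] {
        loop {
            let s1 = self.seq.load(Acquire);
            if s1 % 2 == 1 {
                continue; // writer active, retry
            }
            // acquire "atomic memcpy" load: relaxed bytes, then acquire fence
            let mut out = [0u8; N];
            for (s, d) in self.data.iter().zip(out.iter_mut()) {
                *d = s.load(Relaxed);
            }
            fence(Acquire);
            let s2 = self.seq.load(Relaxed);
            if s1 == s2 {
                return out;
            }
        }
    }
}

fn main() {
    let lock = SeqLock {
        seq: AtomicUsize::new(0),
        data: std::array::from_fn(|_| AtomicU8::new(0)),
    };
    lock.write(&[1, 2, 3, 4]);
    assert_eq!(lock.read(), [1, 2, 3, 4]);
    println!("ok");
}
```

The acquire fence after the relaxed data loads is what makes the final Relaxed load of seq sufficient, exactly as described above.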

Member

@ibraheemdev ibraheemdev Aug 23, 2022

Under a reordering based atomic model (as CPUs use), a release load makes sense and works. Release loads don't really work unless they are also RMWs (fetch_add(0)) under the C11 model.
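A sketch of that "read don't-modify write" trick: fetch_add(0) reads the current value while being a release operation, which a plain load cannot be (load(Ordering::Release) panics in Rust):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A "release load" expressed as a read-don't-modify-write.
// `fetch_add(0, Release)` returns the current value and participates in
// release sequences, unlike a plain load.
fn release_load(seq: &AtomicUsize) -> usize {
    seq.fetch_add(0, Ordering::Release)
}

fn main() {
    let seq = AtomicUsize::new(42);
    assert_eq!(release_load(&seq), 42);
    println!("ok");
}
```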

Member

Yeah, the famous seqlock paper discusses "read dont-modify write" operations.

while the second one is basically a memory fence followed by series of `AtomicU8::store`s.
Except the implementation can be much more efficient.
The implementation is allowed to load/store the bytes in any order,
and doesn't have to operate on individual bytes.
Member

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.

Contributor

For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.

I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Member Author

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are basically specified as:

for (size_t i = 0; i < count; ++i) {
  reinterpret_cast<char*>(dest)[i] =
      atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);

and

atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
  atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
      reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}

Member

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

In the C++ paper they are basically as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access.

@RalfJung
Member

RalfJung commented Aug 20, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total mo order but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Well I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.

@joshlf
Contributor

joshlf commented Sep 30, 2024

edit: there is a difference if you need your IPC communication method to be a security boundary - in that case you also need to consider that the other process may do anything to that memory at any time. If all you're doing is reading using AtomicPerByte then that should never cause UB in your process though.

Yeah, this is the use case - treating shared-memory IPC as a security boundary. IIUC that's what @RalfJung was responding to by saying that it's not well-defined since UB is a whole-program property. (I'm sure Ralf will push back and clarify that "not well-defined" is not an accurate characterization of what he said, but I've learned not to try to capture the subtleties 😛 )

@Diggsey
Contributor

Diggsey commented Sep 30, 2024

When reasoning about security, I would argue there are other ways to prove that the boundary is solid:

If you can prove that for any malicious program M, there is a non-malicious program N that does the exact same shared memory operations at the hardware level, and that your program has no UB when combined with N, then you don't even need to consider M because they are literally the same program. For x86 this seems plausible since relaxed atomics generally compile down to no additional synchronization.

ie. the step from abstract machine to actual hardware is not injective, and if two different programs in the abstract machine translate to the same actual program, then it's clearly indistinguishable which program was used because they are identical.

@DemiMarie

@Diggsey: Rust cannot be fully specified in terms of an abstract machine. Rust is a systems programming language, and that means that it must also make guarantees about how the abstract machine is implemented in terms of the concrete machine that the hardware and OS actually implement.

@RalfJung
Member

RalfJung commented Oct 1, 2024

Can we please move shared-memory IPC with atomics off of this RFC thread? As I said, this isn't specific to atomic-per-byte memcpy at all, so this is really the wrong place for that discussion.

@VorpalBlade

What is the current state / blocker for progress on this RFC? Today I ran across another case where I needed this. It doesn't seem like much happened since end of summer.

@m-ou-se
Member Author

m-ou-se commented Dec 16, 2024

@VorpalBlade This is still stuck on the details of the API. See #3301 (comment)

@VorpalBlade

Why not have multiple store methods (perhaps not all 6, but enough to cover the use cases)? They could dispatch to the same underlying intrinsic internally.

It isn't like Rust doesn't already do this in the standard library: foo, foo_mut, unchecked_foo, etc. Though perhaps coming up with suitable names would be just as difficult.

@m-ou-se
Member Author

m-ou-se commented Dec 16, 2024

Because that would just result in confusion and unexpected behaviour. E.g. it's unclear what reasonable behaviour would be for types that need to be dropped.

@programmerjake
Member

what if the only option was:

pub fn store(&self, value: &MaybeUninit<T>, ordering: Ordering);

and to make storing a copy more ergonomic, MaybeUninit gains:

impl<T> MaybeUninit<T> { // maybe have ?Sized bound? icr if that works with unions
    pub const fn from_ref(v: &T) -> &Self {
        // Safety: &Self can't be written to, so this works
        unsafe { &*(v as *const T as *const Self) }
    }
}

that way if you want to store a copy of some type, you just use: a.store(MaybeUninit::from_ref(&my_value), Ordering::Relaxed)
and my_value will still be dropped later. or you can just use a reference if that's all you have access to.

and if you want my_value to not be dropped, just write:
a.store(&MaybeUninit::new(my_value), Ordering::Relaxed)

do remember that atomic memcpy is not terribly common so being a bit more verbose is fine.
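Outside std we can't add inherent methods to MaybeUninit, but the proposed from_ref shape can be tried today as a free function (illustrative name):

```rust
use core::mem::MaybeUninit;

// Free-function version of the proposed `MaybeUninit::from_ref`; we cannot
// add inherent methods to std types from outside std, so this is a stand-in.
fn maybe_uninit_from_ref<T>(v: &T) -> &MaybeUninit<T> {
    // Safety: a `&MaybeUninit<T>` cannot be written through, and every
    // valid `T` is also a valid `MaybeUninit<T>`.
    unsafe { &*(v as *const T as *const MaybeUninit<T>) }
}

fn main() {
    let s = String::from("still dropped later");
    let mu = maybe_uninit_from_ref(&s);
    // The referent is initialized here, so peeking at it is sound.
    assert_eq!(unsafe { mu.assume_init_ref() }.as_str(), "still dropped later");
    println!("ok");
    // `s` is dropped normally at the end of scope.
}
```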

@DemiMarie

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

@arielb1
Contributor

arielb1 commented Dec 17, 2024

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

The intrinsic seems more fundamental to me than the API around it.

@ais523

ais523 commented Apr 19, 2025

I'd like to suggest an alternative approach to solving the same problem (which I was thinking of suggesting before I saw this thread): an unsafe intrinsic (which I think of as read_racy) that behaves as follows:

  • The intrinsic takes a raw pointer *const T as an argument. This could be of any Rust type (it doesn't have to be an atomic and doesn't have to be made of UnsafeCells).
  • The return value is a MaybeUninit<T>, specified to be chosen as follows:
    • if the memory referenced by the pointer has been/is being/will be written to in a way that could cause a read through the pointer to form a data race with the write, it returns a MaybeUninit<T> holding an uninitialised value;
    • if the memory referenced by the pointer is currently mutably borrowed (even by another thread), it returns a MaybeUninit<T> holding an uninitialised value (and does not cause "access to a mutably borrowed value" undefined behaviour – conceptually the read does not occur in this case, although of course in practice the CPU would likely read potentially garbage data from the memory in question);
    • in other cases, it returns a MaybeUninit<T> holding the bit pattern of the memory referenced by the pointer.

By combining this with an acquire-ordered load before doing a read_racy of the memory and a release-ordered load after doing the read_racy, it becomes possible to implement sequence locks and similar code (i.e. first you attempt a read, then you discover whether it worked or not) – of course, this would need release-ordered loads to be added to the language, although as mentioned above they can be simulated by adding 0. (The read_racy itself would not be atomic – you synchronize it by using the ability of an acquire…release sequence to synchronize reads that happen between the acquire and release.) The big advantage of this approach is that you don't need to have any special handling of the pointed-to T; a sequence lock can safely coexist with arbitrary safe code that operates on the same memory, as long as it ensures that no such code was running before attempting to assume_init() the bytes.

This should also be very easy to implement – it's basically just an assembly-level load instruction that's "opaque" to the compiler, preventing it from performing optimisations related to knowledge of what address is being loaded. (I think it can be implemented as a load instruction written with inline assembly, that the compiler has to assume could place arbitrary bits into the returned value because it can't see that it's a load instruction.) If there is no race, then the load instruction will load the pointer value. If there is a race, then the load instruction might or might not return useful data, but it will load some sequence of bits, which is valid to store in a MaybeUninit as long as you don't actually try to do anything with the data. Thus, it complies with the specification I wrote above.

This approach seems to be more powerful than requiring T to be of a particular type (you can use it to, e.g., write a memory allocator that uses sequence locks to protect the memory being allocated), and simpler than the existing listed alternatives (because it doesn't, e.g., need an UnsafeCell).

@DemiMarie

This is sufficient for synchronization, but not for functions like copy_from_user that access data another (potentially malicious) process might be concurrently mutating.

@RalfJung
Member

@ais523 Allowing racy reads on non-atomic accesses without UB has some very non-trivial consequences and would be a huge departure from some of the fundamental principles that the C++ memory model (which we inherit) is based on. This paper explores this a bit by having two languages where the first has full UB on read-write races but the second makes them return "poison" similar to what you suggested. We should not do this unless either C++ also does it, or we are ready to make our memory model independent from that of C++ (with all the consequences that entails, e.g. making it impossible to use atomic operation on memory shared with C++ code).

This should also be very easy to implement

You could hardly be further from the truth here. ;) Remember that "implementing" any change to the concurrency memory model requires making sure that the model even still makes any sense and supports all the desired optimizations, which typically requires months of work by an expert (and there's very few experts that are able to do that kind of work; I am not one of them).

Suggesting to "just" change something fundamental about the concurrency memory model is like suggesting to "just" change some detail about a rocket engine. These are non-trivial pieces of engineering and you can't "just" change anything about them without great care.

@ais523

ais523 commented Apr 20, 2025

@RalfJung: I agree that changing the memory model is a bad idea. My suggestion is designed to avoid needing to change the memory model, via confining the racy reads to a particular intrinsic/function that the compiler can't optimise around (and thus can't exploit the fact that the read would be undefined behaviour if done normally) and whose observable behaviour always matches something that could be done in the existing memory model.

I agree that "very easy to implement" is quite different from "very easy to prove correct"! Nonetheless, I don't think this is too hard to prove correct on the basis of "the executable output by the compiler must match the behaviour of the source program". The idea is that, from the compiler's point of view, read_racy is an opaque/FFI function that takes in a pointer, and does one of two things (based on a condition that the compiler doesn't know):

  • either it reads the pointer, and returns the value stored there;
  • or it ignores the pointer and returns an arbitrary value.

The compiler cannot take advantage of the "maybe the pointer isn't read" case to, e.g., move reads and writes around in a way that would stop the read working, because it doesn't know whether or not the opaque function reads the pointer, and has to assume (in any case where it can't prove a race exists) that there might be no race and the function might be reading the pointer.

The compiler also cannot take advantage of the "maybe the pointer is read" case to assume no race and optimise on that basis, again because it doesn't know whether or not the opaque function reads the pointer; if the function chose to ignore the pointer on that call, there would be no race, and thus there would be no optimisation-enabling UB for it to exploit.

Another way to think about it is to imagine that we have a magic function that lets us know whether or not a read could race (or access mutably borrowed memory), and read_racy gets implemented as follows:

unsafe fn read_racy<T>(ptr: *const T) -> MaybeUninit<T> {
    if (the_read_will_race(ptr)) {
        MaybeUninit::uninit()
    } else {
        unsafe { core::ptr::read(ptr as *const MaybeUninit<T>) }
    }
}

Assuming the existence of the_read_will_race, the function works correctly entirely within the existing memory model – it has no data races, because it only ever reads data in a situation where no race exists.

Although the function in question can't be implemented in Rust, due to there being no working the_read_will_race function, it can be implemented in any language with a load instruction that returns an arbitrary or uninitialised value when a data race happens – the load instruction happens to implement both branches of the if at the same time, meaning that the the_read_will_race call can be optimised out (and in turn meaning that the function doesn't need to be defined). Most notably, LLVM defines its load instruction to read undef in the case of a data race, rather than causing undefined behaviour, so the function in question can be implemented using raw LLVM IR.

As for the paper you linked, it's basically discussing "what would happen if the memory model allowed any read to race with writes, producing an undefined value rather than undefined behaviour?" and its conclusion was "you would miss optimisations". By confining reads that can race with writes to a particular function/intrinsic, you avoid the missed optimisations in the code in general. The compiler will optimise less around a read_racy call, but that's the entire reason why it exists – to stop compiler optimisations that are correct in general but wrong when a racy read occurs.

@DemiMarie

@RalfJung What if one was okay with these operations being opaque to the optimizer? That would allow them to desugar to asm!, which Rust already supports.

@RalfJung
Member

RalfJung commented Apr 21, 2025 via email

@ais523

ais523 commented Apr 21, 2025 via email

@RalfJung
Member

RalfJung commented Apr 22, 2025

@DemiMarie it is definitely not legal to do this with inline asm; the requirements for inline asm blocks are not met. And in fact, there are optimizations which are incompatible with the existence of an operation to do non-atomic reads where races are not full UB, see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

@ais523

I think it's reasonable to question whether this method of implementing things is legitimate

This is not a reliable way to build a compiler -- so no, this is not legitimate. You must present a consistent formal semantics and show that it has all the desired properties, and the compiler must implement those semantics. Hand-waving something involving "this is opaque and hence" does not suffice (unless you can produce a proper proof of correctness of your reasoning principle, of course). This is the reasoning we are applying to inline asm blocks, and to ensure soundness they are subject to a tight restriction, which means they are not suited to add the operation you are proposing.

Also I think this is getting off-topic for this RFC. Given how terrible GitHub is at threading, we should keep discussion here focused on the proposed new primitive, and explore possible alternatives elsewhere (a separate issue, a thread on IRLO, a topic on Zulip).

@programmerjake
Member

see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

note the paper says they decided that step wasn't allowed in LLVM:

As a result of reporting this bug, the LLVM developers decided to restrict the second transformation rather than the first one, which means that the intended LLVM memory model is subtly different from the C11 model.

@RalfJung
Member

RalfJung commented Apr 22, 2025

Yes, LLVM does not use the C++ memory model, they have their own. In the LLVM memory model, data races are not UB, they behave like read_racy. However, Rust uses the C++ memory model, not the LLVM memory model -- and the LLVM model has been explored very little, so I'd caution against adopting it for Rust (aside from the fact that, as noted above, diverging from the C++ model would be an interop hazard).

@arielb1
Contributor

arielb1 commented Apr 22, 2025

You can definitely do an "inline assembly memcpy" from a raw pointer to e.g. a stack variable.

I am quite sure that:

  1. It is defined to at least put some non-deterministic but stable bytes in that address.
  2. If the memory is not being concurrently modified, it will put the right bytes in that address.

the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

AFAICT, LLVM at least pretends it lowers a C read into an "UB on race" read, but in addition to that, it supports a "poison on race" read [and you can turn "poison on race" to "non-deterministic but stable bytes on race" via freeze], where LLVM is allowed to convert a program with a conditional "UB on race" read to a program with an unconditional "poison on race" read and a conditional use. I don't see a contradiction in that.

I personally believe that a good internal IR needs to have all of "UB on race", "poison on race", and "nondet but stable on race" reads, but I don't see why a semantics for Rust (as opposed for an internal IR) needs to have "poison on race".

This RFC argues that the surface language needs to have "nondet but stable on race" as well. Tho not "poison on race", I am still not sure if there is a need for poison in the surface semantics.

@RalfJung
Member

RalfJung commented Apr 22, 2025 via email

@arielb1
Contributor

arielb1 commented Apr 22, 2025 via email

@RalfJung
Member

RalfJung commented Apr 23, 2025

Why not? If the access is not racy, then this is well defined code that returns the right value. If the access is racy, then it’s the same as the assembly pulling the numbers from the environment, which is also well defined.

There is no AM operation that can check if an access is racy, so this is not a valid line of reasoning. You cannot use inline assembly to extend the AM with new operations, as correctness proofs involving the AM can and do rely on knowing the full set of operations that can be performed by arbitrary AM code.

Correctness of optimizations assumes that all code accessing AM-visible state is Rust code, therefore inline asm can only perform actions on the AM state that can also be performed by Rust code. See this link I already posted above for more context.
