[RFC] AtomicPerByte (aka "atomic memcpy") #3301
Conversation
cc @ojeda
This could mention the …
With some way for the language to express "this type is valid for any bit pattern", which project safe-transmute presumably will provide (and that already exists in the ecosystem as …). This would also require removing the safe …. That's extra complexity, but it means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
The "uninit" part of not without the fabled and legendary Freeze Intrinsic anyway. |
On the other hand, …
Note that LLVM already implements this operation: …
The trouble with that intrinsic is that …
> - In order for this to be efficient, we need an additional intrinsic hooking into special support in LLVM. (Which LLVM needs to have anyway for C++.)
How do you plan to implement this until LLVM supports it?
I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as-yet-unmerged C++ proposal is implemented in LLVM, that seems like a problem.
(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)
I haven't looked into the details yet of what's possible today with LLVM. There are a few possible outcomes:

- We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
- Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like `ptr::{read_volatile, write_volatile}` combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
- We use the existing `llvm.memcpy.element.unordered.atomic`, after figuring out the consequences of the `unordered` property.
- Until LLVM support appears, we implement it in the library using a loop of `AtomicUsize::load()`/`store()`s and a fence, possibly using an efficient inline assembly alternative for some popular architectures. (A sketch of this fallback follows below.)
I'm not fully sure yet which of these are feasible.
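For concreteness, here is a minimal sketch of what that last fallback could look like. This is hypothetical, not the RFC's implementation; it assumes the data is word-sized, word-aligned, and only ever accessed atomically:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

/// Acquire-ordered "atomic load memcpy" built from relaxed per-word loads
/// plus a trailing acquire fence (hypothetical fallback sketch).
///
/// Safety: `src` must point to `len` valid `AtomicUsize`s, and `dst` must
/// be valid for writing `len` words.
unsafe fn atomic_load_words(src: *const AtomicUsize, dst: *mut usize, len: usize) {
    for i in 0..len {
        // Relaxed per-word loads: tearing between words is allowed.
        let word = (*src.add(i)).load(Ordering::Relaxed);
        dst.add(i).write(word);
    }
    // Pairs with a release fence/stores on the writer side, giving the
    // copy as a whole acquire semantics.
    fence(Ordering::Acquire);
}
```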
I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about LLVM's. (It seems …)
> but it's easy to accidentally cause undefined behavior by using `load` to make an extra copy of data that shouldn't be copied.
> - Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?
Given these options and considering what the C++ paper chose, `AtomicPerByte` sounds OK and has the advantage of having `Atomic` as a prefix.
`AtomicPerByteMaybeUninit` or `AtomicPerByteManuallyDrop` to also resolve the other concern around dropping? Those are terrible names though...
Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed; it behaves a lot more like a non-atomic variable in this sense. In practical terms, in single-threaded scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without that fence having a cost on weakly ordered machines (e.g. without implementing it as an overly strong acquire/release fence). That said, I think we could add an intrinsic to LLVM that does what we want here; I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)
CC @RalfJung, who has stronger opinions on Unordered (and is the one who provided that link in the past). I think we can easily implement this with relaxed in compiler-builtins, but it should get a new intrinsic, since many platforms can implement it more efficiently.
We already have unordered atomic memcpy intrinsics in compiler-builtins, for 1, 2, 4 and 8 byte access sizes.
I'm not sure we'd want unordered, as mentioned above...
To clarify the difference between relaxed and unordered (in terms of loads and stores): if you have

```rust
static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}
```

then with relaxed the assertion can never fail, because relaxed guarantees a single total modification order for each individual atomic variable. In other words, for unordered, it would be legal for 2 to be stored before 1, or for `a` to read 2 while `b` reads 1.
Something that could work but would not be technically correct: … Those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics. Assuming you're on any modern CPU (except Alpha, iirc) it will behave like relaxed atomics, because that's what standard load/store instructions do.
Those fences aren't always no-ops at runtime; they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347); certainly plain stores can in some cases (this is easy to show happening in godbolt). Either way, my point has not been that we can't implement this. We absolutely can, and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.
I like …
```rust
loop {
    let s1 = self.seq.load(Acquire);
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
```
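For orientation, a fuller version of this reader loop might look like the following sketch; it assumes the writer keeps `seq` odd while a write is in progress, `read_data` is the per-byte atomic acquire load discussed in this RFC, and `Ordering::*` is imported as in the snippet above:

```rust
let value = loop {
    let s1 = self.seq.load(Acquire);
    if s1 % 2 == 1 {
        continue; // a write is in progress; retry
    }
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
    if s1 == s2 {
        break data; // seqnum unchanged: the copy is consistent
    }
    // seqnum changed: a writer intervened; retry
};
```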
There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the `load` here cannot return an outdated value that would prevent us from noticing a seqnum bump. The reason this is the case: if there is a concurrent `write`, and if any part of `data` reads from that write, then we have a release-acquire pair, so we are guaranteed to see at least the first `fetch_add` from `write`, and thus we will definitely see a version conflict. OTOH, if `s1` reads-from some second `fetch_add` in `write`, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)
(This is related to this comment by @cbeuw.)
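For readers following along, here is a sketch of the writer shape this argument assumes (single-writer case; `write_data` stands in for the hypothetical per-byte atomic release store):

```rust
fn write(&self, value: Data) {
    // First bump: seqnum becomes odd, signalling a write in progress.
    self.seq.fetch_add(1, Relaxed);
    // If any byte of a concurrent read_data(Acquire) reads from this
    // release store, the reader is guaranteed to also see the first bump.
    write_data(&self.data, value, Release);
    // Second bump: seqnum becomes even again. If the reader's first seq
    // load reads from this release RMW, the full data is visible to it.
    self.seq.fetch_add(1, Release);
}
```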
Yeah, exactly. This is why people sometimes ask for a "release-load" operation. This second load operation needs to happen "after" the `read_data()` part, but the usual (incorrect) `read_data` implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the `read_data()` operation.
Under a reordering-based atomic model (as CPUs use), a release load makes sense and works. Under the C11 model, release loads don't really work unless they are also RMWs (`fetch_add(0)`).
Yeah, the famous seqlock paper discusses "read-don't-modify-write" operations.
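Concretely, a read-don't-modify-write replaces the reader's final relaxed load with an RMW that leaves the value unchanged, which under C11 can carry release semantics. A sketch, continuing the reader loop above:

```rust
// Instead of `self.seq.load(Relaxed)` as the final check, a "release
// load" can be expressed as a no-op RMW (usually far more expensive
// than the fence/ordering-based formulation):
let s2 = self.seq.fetch_add(0, Release);
```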
> while the second one is basically a memory fence followed by a series of `AtomicU8::store`s.
> Except the implementation can be much more efficient.
> The implementation is allowed to load/store the bytes in any order,
> and doesn't have to operate on individual bytes.
The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.
I was originally thinking this would be specified as a series of AtomicU8
load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.
For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.
I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.
The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.
A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?
I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.
In the C++ paper they are basically as:
for (size_t i = 0; i < count; ++i) { reinterpret_cast<char*>(dest)[i] = atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed); } atomic_thread_fence(order);
and
atomic_thread_fence(order); for (size_t i = 0; i < count; ++i) { atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store( reinterpret_cast<char*>(source)[i], memory_order::relaxed); }
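For comparison, the store direction rendered in Rust might look like this sketch (assuming `dest` points to bytes that are only ever accessed atomically and `src` is valid for `count` reads):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// The fence comes first for stores: a release fence followed by relaxed
// per-byte stores, mirroring the C++ formulation above.
unsafe fn atomic_store_memcpy(dest: *const AtomicU8, src: *const u8, count: usize, order: Ordering) {
    fence(order); // `order` would be Release for a release store
    for i in 0..count {
        (*dest.add(i)).store(src.add(i).read(), Ordering::Relaxed);
    }
}
```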
> A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

> In the C++ paper they are basically specified as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example, changing a 4-byte atomic acquire memcpy to an `AtomicU32` acquire load would not be correct (even if we know everything is initialized and aligned etc.). Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access. (See the sketch below.)
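To illustrate that last point, here is a hand-inlined sketch (shrunk to a single byte) of why a fence-based acquire memcpy is observably stronger than an access-based acquire load:

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU8, Ordering};

static OTHER: AtomicU32 = AtomicU32::new(0);

// A fence-based acquire "memcpy" of one byte, inlined by hand.
fn fence_based_acquire_read(data: &AtomicU8) -> (u32, u8) {
    let x = OTHER.load(Ordering::Relaxed); // unrelated relaxed load
    let b = data.load(Ordering::Relaxed);  // the "memcpy" body
    fence(Ordering::Acquire);              // the trailing fence
    // The acquire fence upgrades *both* earlier relaxed loads: if either
    // one read from a release store, that store now synchronizes-with us.
    // A plain `data.load(Acquire)` would only give that guarantee for
    // `data`, not for OTHER, so the two specs are not interchangeable.
    (x, b)
}
```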
Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total modification order, but races are allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.
Well, I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.
Yeah, this is the use case - treating shared-memory IPC as a security boundary. IIUC that's what @RalfJung was responding to by saying that it's not well-defined since UB is a whole-program property. (I'm sure Ralf will push back and clarify that "not well-defined" is not an accurate characterization of what he said, but I've learned not to try to capture the subtleties 😛)
When reasoning about security, I would argue there are other ways to prove that the boundary is solid: if you can prove that for any malicious program M, there is a non-malicious program N that does the exact same shared memory operations at the hardware level, and that your program has no UB when combined with N, then you don't even need to consider M, because they are literally the same program. For x86 this seems plausible, since relaxed atomics generally compile down to no additional synchronization. I.e. the step from abstract machine to actual hardware is not injective, and if two different programs in the abstract machine translate to the same actual program, then it's clearly indistinguishable which program was used, because they are identical.
@Diggsey: Rust cannot be fully specified in terms of an abstract machine. Rust is a systems programming language, and that means that it must also make guarantees about how the abstract machine is implemented in terms of the concrete machine that the hardware and OS actually implement.
Can we please move shared-memory IPC with atomics off of this RFC thread? As I said, this isn't specific to atomic-per-byte memcpy at all, so this is really the wrong place for that discussion.
What is the current state of / blocker for progress on this RFC? Today I ran across another case where I needed this. It doesn't seem like much has happened since the end of summer.
@VorpalBlade This is still stuck on the details of the API. See #3301 (comment)
Why not have multiple store methods (perhaps not all 6, but enough to cover the use cases)? They could dispatch to the same underlying intrinsic internally. It isn't like Rust doesn't already do this in the standard library: `foo`, `foo_mut`, `unchecked_foo`, etc. Though perhaps coming up with suitable names will be just as difficult.
Because that would just result in confusion and unexpected behaviour. E.g. it's unclear what reasonable behaviour would be for types that need to be dropped.
What if the only option was:

```rust
pub fn store(&self, value: &MaybeUninit<T>, ordering: Ordering);
```

and, to make storing a copy more ergonomic:

```rust
impl<T> MaybeUninit<T> {
    // maybe have ?Sized bound? icr if that works with unions
    pub const fn from_ref(v: &T) -> &Self {
        // Safety: &Self can't be written to, so this works
        unsafe { &*(v as *const T as *const Self) }
    }
}
```

That way, if you want to store a copy of some type, you just use: … And if you want …, do remember that atomic …
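Presumably the elided usage is something along these lines (hypothetical, matching the `store` signature suggested above):

```rust
// Hypothetical: storing a copy of `value` via the single store method.
atomic.store(MaybeUninit::from_ref(&value), Ordering::Release);
```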
What about only providing the intrinsic as an …
The intrinsic seems more fundamental to me than the API around it.
I'd like to suggest an alternative approach to solving the same problem (which I was thinking of suggesting before I saw this thread): an unsafe intrinsic (which I think of as …)
By combining this with an acquire-ordered load before doing a …

This should also be very easy to implement; it's basically just an assembly-level load instruction that's "opaque" to the compiler, preventing it from performing optimisations based on knowledge of what address is being loaded. (I think it can be implemented as a load instruction written with inline assembly, where the compiler has to assume the instruction could place arbitrary bits into the returned value, because it can't see that it's a load instruction.) If there is no race, then the load instruction will load the pointer value. If there is a race, then the load instruction might or might not return useful data, but it will load some sequence of bits, which is valid to store in a …

This approach seems to be more powerful than requiring …
This is sufficient for synchronization, but not for functions like …
@ais523 Allowing racy reads on non-atomic accesses without UB has some very non-trivial consequences, and would be a huge departure from some of the fundamental principles that the C++ memory model (which we inherit) is based on. This paper explores this a bit by having two languages, where the first has full UB on read-write races but the second makes them return "poison", similar to what you suggested. We should not do this unless either C++ also does it, or we are ready to make our memory model independent from that of C++ (with all the consequences that entails, e.g. making it impossible to use atomic operations on memory shared with C++ code).
You could hardly be further from the truth here. ;) Remember that "implementing" any change to the concurrency memory model requires making sure that the model even still makes any sense and supports all the desired optimizations, which typically requires months of work by an expert (and there are very few experts able to do that kind of work; I am not one of them). Suggesting to "just" change something fundamental about the concurrency memory model is like suggesting to "just" change some detail about a rocket engine. These are non-trivial pieces of engineering, and you can't "just" change anything about them without great care.
@RalfJung: I agree that changing the memory model is a bad idea. My suggestion is designed to avoid needing to change the memory model, by confining the racy reads to a particular intrinsic/function that the compiler can't optimise around (and thus can't exploit the fact that the read would be undefined behaviour if done normally), and whose observable behaviour always matches something that could be done in the existing memory model. I agree that "very easy to implement" is quite different from "very easy to prove correct"! Nonetheless, I don't think this is too hard to prove correct on the basis of "the executable output by the compiler must match the behaviour of the source program". The idea is that, from the compiler's point of view, …
The compiler cannot take advantage of the "maybe the pointer isn't read" case to, e.g., move reads and writes around in a way that would stop the read working, because it doesn't know whether or not the opaque function reads the pointer, and has to assume (in any case where it can't prove a race exists) that there might be no race and the function might be reading the pointer. The compiler also cannot take advantage of the "maybe the pointer is read" case to assume no race and optimise on that basis, again because it doesn't know whether or not the opaque function reads the pointer; if the function chose to ignore the pointer on that call, there would be no race, and thus there would be no optimisation-enabling UB for it to exploit.

Another way to think about it is to imagine that we have a magic function that lets us know whether or not a read could race (or access mutably borrowed memory), and:

```rust
unsafe fn read_racy<T>(ptr: *const T) -> MaybeUninit<T> {
    if the_read_will_race(ptr) {
        MaybeUninit::uninit()
    } else {
        unsafe { core::ptr::read(ptr as *const MaybeUninit<T>) }
    }
}
```
Assuming the existence of … Although the function in question can't be implemented in Rust, due to there being no working … As for the paper you linked, it's basically discussing "what would happen if the memory model allowed any read to race with writes, producing an undefined value rather than undefined behaviour?", and its conclusion was "you would miss optimisations". By confining reads that can race with writes to a particular function/intrinsic, you avoid the missed optimisations in the code in general. The compiler will optimise less around a …
@RalfJung What if one was okay with these operations being opaque to the optimizer? That would allow them to desugar to …
@ais523 you are proposing to add an operation that does not currently exist in the memory model and that cannot be implemented with the operations that do exist. That is changing (more specifically, extending) the memory model.
So the situation here is basically that we have two implementations of `read_racy`:
a) an implementation that is entirely legal in the memory model and that has the desired semantics, but which is a hypothetical implementation that would be near-impossible to write in practice due to requiring a race-predicting function
b) an implementation that does the same thing and can be implemented in practice, but contains operations outside the memory model
and the trick is that the function is opaque to the compiler so that it can't tell which we're using. The compiler has to take into account the possibility that the function might be implemented as a), so it can't make any optimisations that would cause function a) to break or otherwise exploit the potential for races in any way. That means that if we actually implement the function as b) (in a way that is opaque to the compiler), everything will still work correctly because the compiler can't do anything with it that it couldn't do with function a).
I think it's reasonable to question whether this method of implementing things is legitimate, i.e. "the compiler can't distinguish between this function and an implementation of the same function that doesn't break the memory model, and thus must always compile the function correctly even if its internal implementation breaks the memory model, because there's no way it could vary its behaviour between the two cases". To me, this is one of those things that obviously works in practice, but adding the principle itself may be something of a large step to take.
@DemiMarie it is definitely not legal to do this with inline asm; the requirements for inline asm blocks are not met. And in fact, there are optimizations which are incompatible with the existence of an operation to do non-atomic reads where races are not full UB, see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.
This is not a reliable way to build a compiler -- so no, this is not legitimate. You must present a consistent formal semantics and show that it has all the desired properties, and the compiler must implement those semantics. Hand-waving something involving "this is opaque and hence …" does not suffice (unless you can produce a proper proof of correctness of your reasoning principle, of course). This is the reasoning we are applying to inline asm blocks, and to ensure soundness they are subject to a tight restriction, which means they are not suited to add the operation you are proposing. Also, I think this is getting off-topic for this RFC. Given how terrible GitHub is at threading, we should keep discussion here focused on the proposed new primitive, and explore possible alternatives elsewhere (a separate issue, a thread on IRLO, a topic on Zulip).
Note the paper says they decided that step wasn't allowed in LLVM: …
Yes, LLVM does not use the C++ memory model; they have their own. In the LLVM memory model, data races are not UB; they behave like …
You can definitely do an "inline assembly memcpy" from a raw pointer to e.g. a stack variable. I am quite sure that: …
AFAICT, LLVM at least pretends it lowers a C read into an "UB on race" read, but in addition to that, it supports a "poison on race" read [and you can turn "poison on race" into "non-deterministic but stable bytes on race" via freeze], where LLVM is allowed to convert a program with a conditional "UB on race" read into a program with an unconditional "poison on race" read and a conditional use. I don't see a contradiction in that. I personally believe that a good internal IR needs to have all of "UB on race", "poison on race", and "nondet but stable on race" reads, but I don't see why a semantics for Rust (as opposed to an internal IR) needs to have "poison on race". This RFC argues that the surface language needs to have "nondet but stable on race" as well, though not "poison on race"; I am still not sure if there is a need for poison in the surface semantics.
If that memory is also accessed outside inline asm blocks, then no, you cannot do that.
~~Why not? If the access is not racy, then this is well-defined code that returns the right value. If the access is racy, then it's the same as the assembly pulling the numbers from the environment, which is also well defined.~~ And now I realize angelic nondeterminism has weird interactions with malloc, which shouldn't matter here, but which are hard to prove don't matter.
Though there is a fairly strong reason that inline assembly "works" for seqlocks: you could have a "justification" of "suspend all other threads; if the seqlock is even, take a snapshot; if the seqlock is odd, return nondeterministic bytes; unsuspend all other threads", and if you consider "suspend all other threads" to be a valid operation (AFAICT it is), it will do the same thing as an AtomicPerByte memcpy [of course, this requires the inline asm to be able to access the seqlock, but this normally holds].
I do think we'll need to have "load a value from memory. If it's racy, return something nondeterministic" loads. Does the mere existence of these loads actually break any optimizations? And if they exist, inline asm has a "joker" behavior of picking the right way of doing it.
Figure 3 in the paper only shows that you have to pick whether any single load is "nondeterministic on race" or "UB on race", not that you can't have both. There are a bunch of optimizations you can do with "UB on race" loads but can't do with "nondeterministic on race" loads, like duplicating/deduplicating them, but AFAICT you can't do them with run-once asm blocks either, and the mere existence of "nondeterministic on race" loads does not make them less valid.
There is no AM operation that can check if an access is racy, so this is not a valid line of reasoning. You cannot use inline assembly to extend the AM with new operations, as correctness proofs involving the AM can and do rely on knowing the full set of operations that can be performed by arbitrary AM code. Correctness of optimizations assumes that all code accessing AM-visible state is Rust code; therefore inline asm can only perform actions on the AM state that can also be performed by Rust code. See this link I already posted above for more context.
Rendered