Remove zerocopy from rand #1579

dhardy · 2025-02-06T12:20:22Z

Added a CHANGELOG.md entry

Summary

Replace the zerocopy dependency with unsafe code (the number of unsafe instances rises from 12 to 17).

Add benchmarks for some SIMD / wide types.

Remove two #[inline(never)] attributes which were apparently motivated by benchmark results, but which did more harm than good with the new benches.

Motivation

Make the dependency on zerocopy optional #1574: zerocopy is "a big crate with a huge amount of unsafe code".
I've also seen some chatter about compile-time increases in rand v0.9 due to now depending on two versions of zerocopy.

I'm not a big fan of this, but together with #1575 it removes the dependency on zerocopy v0.8, so it is probably an improvement.

Project Safe Transmute

If this project lands safe transmute support into the standard library, we would of course want to use that.

Details

Replacing zerocopy::transmute! with core::mem::transmute is easy and results in identical code generation (tested with StdRng and SmallRng); this reverts a change in #1349.

Replacing the fill impls is more complex but I believe acceptable; this reverts a change in #1502.

In both cases, this would have resulted in a usage of unsafe in a macro where safety depends on a type passed by the macro caller. In the first case I decided to inline the three macro usages while in the second I prefixed the macro name with unsafe_.

Benchmark results

$ cargo bench --bench simd --features simd_support -- --baseline master 
   Compiling rand v0.9.0 (/home/dhardy/projects/rand/rand)
   Compiling rand_distr v0.5.0 (/home/dhardy/projects/rand/rand/rand_distr)
   Compiling benches v0.1.0 (/home/dhardy/projects/rand/rand/benches)
    Finished `bench` profile [optimized] target(s) in 1.38s
     Running benches/simd.rs (target/release/deps/simd-2905efe84e67fa8e)
random_simd/u128        time:   [1.8751 ns 1.8831 ns 1.8948 ns]
                        change: [-0.1321% +0.6261% +1.4131%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
random_simd/m128i       time:   [1.9753 ns 1.9790 ns 1.9833 ns]
                        change: [+5.4631% +5.6551% +5.8561%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe
random_simd/m256i       time:   [3.7588 ns 3.7755 ns 3.7931 ns]
                        change: [-0.0698% +0.3828% +0.7685%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe
random_simd/m512i       time:   [6.8739 ns 6.8901 ns 6.9097 ns]
                        change: [+0.1511% +0.3741% +0.6309%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
random_simd/u64x2       time:   [1.9767 ns 1.9817 ns 1.9875 ns]
                        change: [-72.129% -72.012% -71.890%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
random_simd/u32x4       time:   [3.9506 ns 3.9572 ns 3.9651 ns]
                        change: [-50.352% -50.035% -49.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  9 (9.00%) high severe
random_simd/u32x8       time:   [3.7498 ns 3.7598 ns 3.7717 ns]
                        change: [-0.0915% +0.3002% +0.8262%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe
random_simd/u16x8       time:   [3.7647 ns 3.7792 ns 3.7953 ns]
                        change: [-0.0710% +0.6785% +1.3454%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
random_simd/u8x16       time:   [3.7806 ns 3.7950 ns 3.8118 ns]
                        change: [+1.1070% +1.5527% +2.1092%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

Unfinished business?

Generation of the Simd and m128i etc. types should be equivalent, but the code paths differ: the Simd impls currently use fill to avoid more unsafe code here.

Notice from the above that u32x4, u16x8 and u8x16 are the same size as u128 and m128i but cost about twice as much to generate here. This indicates the fill code may be sub-optimal.

Additionally, the m128i impl performed even worse when transmuting a u128 value (~4.3 ns, or +130%) which, as far as I can tell, is purely because the u128 value is returned via rax, rdx while the __m128i value is returned via rdx, r10 (with rax equal to the struct address). I don't understand this.


mitsuhiko · 2025-02-06T22:40:07Z

Replacing zerocopy::transmute! with core::mem::transmute is easy and results in identical code generation (tested with StdRng and SmallRng); this reverts a change in #1349.

For those cases where you just call zerocopy::transmute! you could still use zerocopy in CI. You could declare an optional dependency on zerocopy and have a macro that switches between the zerocopy transmute (for CI and tests) and the stdlib one. That way you still get the verification that zerocopy provides, but only in CI.
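A minimal sketch of what such a switching macro might look like. The feature name `zerocopy_check` and the macro name `transmute_bytes` are made up for illustration and are not part of rand or aHash; with the feature disabled, the block below compiles without zerocopy.

```rust
// Hypothetical feature `zerocopy_check`: enabled only in CI/tests, where
// zerocopy verifies at compile time that the transmute is allowed.
#[cfg(feature = "zerocopy_check")]
macro_rules! transmute_bytes {
    ($e:expr) => {
        zerocopy::transmute!($e)
    };
}

// Default build: no zerocopy dependency, plain core::mem::transmute.
#[cfg(not(feature = "zerocopy_check"))]
macro_rules! transmute_bytes {
    ($e:expr) => {
        // SAFETY: this macro must only be used with source and target types
        // of equal size where every initialized byte sequence of the source
        // is a valid representation of the target. The checked build above
        // is what verifies this for each call site.
        unsafe { core::mem::transmute($e) }
    };
}

/// Example call site: reinterpret 8 bytes as a native-endian u64.
pub fn u64_from_bytes(buf: [u8; 8]) -> u64 {
    transmute_bytes!(buf)
}
```

Note that the fallback arm still hides an unsafe block inside a macro whose soundness depends on the types at the call site, so the checked configuration would need to run in CI for every target that matters.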

I have been proposing this for ahash: tkaitchuck/aHash#253

I'm not sure if this is a great idea, but I think it's a compromise that has some value.

joshlf · 2025-02-12T00:54:32Z

If this project lands safe transmute support into the standard library, we would of course want to use that.

I should clarify that Project Safe Transmute will likely never replace zerocopy/bytemuck, but just replace their derives (zerocopy-derive and bytemuck-derive). Some very limited functionality may exist directly in the standard library, but we think of Safe Transmute as mostly being a building block that makes it easier to write sound unsafe code, not a building block that permits you to avoid writing unsafe code entirely. I suspect this doesn't change the calculus here, but I figured it was worth mentioning.

josephlr · 2025-02-28T08:25:12Z

src/distr/integer.rs

+        // x86 is little endian so no need for conversion
+
+        // SAFETY: both source and result types are valid for all values.
+        unsafe { core::mem::transmute(buf) }


I was wondering if here (and elsewhere) we should be using the various unaligned-load intrinsics:

_mm_loadu_si128

_mm256_loadu_si256

_mm512_loadu_si512

instead of a raw transmute. However, they seem to produce identical assembly (AVX2 256-bit Godbolt example), so I guess it's just a matter of taste.

josephlr · 2025-02-28T08:28:38Z

src/rng.rs

+// This macro is unsafe to call: target types must support transmute from
+// random bits (i.e. all bit representations are valid).
+macro_rules! unsafe_impl_fill {


Would it be possible for the various integer types to implement a "Safe to mem::transmute from raw bytes" sealed marker trait, and then implement this via a generic impl instead of a macro?
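A rough sketch of the sealed-marker idea, to make the question concrete. The names `FillableFromBytes`, `private::Sealed` and `fill_with` are hypothetical, and a closure stands in for `rng.fill_bytes` so the example does not depend on rand. Whether a blanket impl like this fits rand's existing Fill impls is not something this sketch settles.

```rust
use core::{mem, slice};

mod private {
    // Sealed: downstream crates cannot name this trait, so they cannot
    // implement `FillableFromBytes` for their own types.
    pub trait Sealed {}
}

/// Marker for types that may be overwritten with arbitrary bytes.
///
/// # Safety
///
/// Implementors must have no padding bytes, and every initialized byte
/// sequence of length `size_of::<Self>()` must be a valid `Self`.
pub unsafe trait FillableFromBytes: private::Sealed + Sized {}

macro_rules! impl_fillable {
    ($($t:ty),* $(,)?) => {$(
        impl private::Sealed for $t {}
        // SAFETY: primitive integers have no padding and no invalid
        // bit patterns.
        unsafe impl FillableFromBytes for $t {}
    )*};
}
impl_fillable!(u8, u16, u32, u64, u128, i8, i16, i32, i64, i128);

/// Generic replacement for a per-type macro: view `dest` as bytes and let
/// `fill_bytes` (standing in for `rng.fill_bytes`) overwrite them.
pub fn fill_with<T: FillableFromBytes>(dest: &mut [T], fill_bytes: impl FnOnce(&mut [u8])) {
    let len = mem::size_of_val(dest);
    let ptr = dest.as_mut_ptr().cast::<u8>();
    // SAFETY: `ptr` is valid for reads and writes of `len` bytes because it
    // covers exactly the memory of `dest`, which is exclusively borrowed for
    // the duration of this call; `u8` has alignment 1; `T: FillableFromBytes`
    // guarantees the bytes are initialized and that any bytes written form
    // valid `T`s.
    let bytes = unsafe { slice::from_raw_parts_mut(ptr, len) };
    fill_bytes(bytes);
}
```

With this shape, the unsafe reasoning lives in one generic function plus one `unsafe impl` per type, rather than in each macro expansion.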

joshlf · 2025-02-28T08:35:14Z

src/distr/integer.rs

+        rng.fill_bytes(&mut buf);
+        // x86 is little endian so no need for conversion
+
+        // SAFETY: both source and result types are valid for all values.


Two things to note:

This should say that any bit-valid [u8; N] is also a bit-valid __m128i ("all values" technically doesn't preclude uninitialized bytes, which are unsound to transmute to a __m128i). It should also say that they're the same size: mem::transmute wouldn't permit a size mismatch to compile, but transmuting a value of a smaller type to one of a larger type would implicitly leave the trailing bytes uninitialized.

Being pedantic, this should provide a citation for the claim that any byte values are valid for __m128i; feel free to rip off our safety comment for this: https://github.com/google/zerocopy/blob/dccbbcd8714eebeeac794ca5dd1ba1cbecc09ea2/src/impls.rs#L815-L879

RalfJung · 2025-02-28 (edited)

If we want to be super pedantic and use MiniRust terminology, "value" refers to high-level concepts like mathematical integers, not to raw byte sequences stored in memory. Being "valid" is a property of byte sequences; invalid byte sequences do not even correspond to a value, and therefore there are no "invalid values". The key for transmute safety is that all representations (i.e., byte sequences representing a value) of the source are also representations of the target.

That's how I would phrase it, anyway. We haven't officially adopted this terminology I think, and we probably talk about "invalid values" in many places.
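For illustration, a safety comment along the lines suggested above might read as follows. This sketch uses u128 rather than __m128i so that it compiles on any target, and the function names are made up; for __m128i the comment would additionally need the citation mentioned earlier in this thread.

```rust
use core::mem;

/// Reinterpret 16 bytes as a native-endian u128 via transmute.
pub fn u128_from_bytes(buf: [u8; 16]) -> u128 {
    // SAFETY: `[u8; 16]` and `u128` have the same size (16 bytes), and every
    // initialized 16-byte sequence is a representation of some `u128`.
    // `buf` is a fully initialized array, so no uninitialized bytes are
    // transmuted.
    unsafe { mem::transmute::<[u8; 16], u128>(buf) }
}

/// The safe equivalent that the standard library offers for integers.
pub fn u128_from_bytes_safe(buf: [u8; 16]) -> u128 {
    u128::from_ne_bytes(buf)
}
```

For plain integers the safe `from_ne_bytes` makes the transmute unnecessary; the unsafe version is shown only to demonstrate the comment wording.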

joshlf · 2025-02-28T08:36:55Z

src/rng.rs

+// This macro is unsafe to call: target types must support transmute from
+// random bits (i.e. all bit representations are valid).


To be precise: In order to call this for a type, T, it must be the case that mem::transmute::<[u8; size_of::<T>()], T>(...) is sound for all values. Note that this precludes uninitialized bytes.

Also, to be pedantic, this should be a doc comment with a # Safety section.
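As a sketch of that shape (not the macro from this PR): the trait `FromFilledBytes` and the macro `unsafe_impl_from_filled` below are hypothetical, and a closure stands in for `rng.fill_bytes`. The point is only the doc comment with a `# Safety` section on the macro and a matching SAFETY comment at the call site.

```rust
use core::mem;

/// Hypothetical trait: build a value from bytes written by `fill_bytes`.
pub trait FromFilledBytes: Sized {
    fn from_filled(fill_bytes: impl FnOnce(&mut [u8])) -> Self;
}

/// Implements `FromFilledBytes` for each listed type.
///
/// # Safety
///
/// For every type `T` passed, it must be the case that
/// `mem::transmute::<[u8; size_of::<T>()], T>(b)` is sound for every fully
/// initialized array `b`. Uninitialized bytes are excluded by construction.
macro_rules! unsafe_impl_from_filled {
    ($($t:ty),+ $(,)?) => {$(
        impl FromFilledBytes for $t {
            fn from_filled(fill_bytes: impl FnOnce(&mut [u8])) -> Self {
                // Zero-initialized, so every byte is initialized even if
                // `fill_bytes` writes nothing.
                let mut buf = [0u8; mem::size_of::<$t>()];
                fill_bytes(&mut buf);
                // SAFETY: the macro caller guarantees that any initialized
                // `[u8; size_of::<T>()]` may be transmuted to `T`; the sizes
                // match by construction of `buf`.
                unsafe { mem::transmute::<[u8; mem::size_of::<$t>()], $t>(buf) }
            }
        }
    )+};
}

// SAFETY: primitive integers have no padding and no invalid bit patterns,
// so every initialized byte array of the right size is a valid integer.
unsafe_impl_from_filled!(u16, u32, u64, u128);
```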

joshlf · 2025-02-28T08:38:43Z

src/rng.rs

+                        slice::from_raw_parts_mut(self.as_mut_ptr()
+                            as *mut u8,
+                            mem::size_of_val(self)


I'll let @RalfJung chime in to confirm, but I believe this is unsound. In particular, you aren't allowed to interleave uses of raw pointers and references to the same object. You can fix this by moving the mem::size_of_val(self) above the call to self.as_mut_ptr().

RalfJung · 2025-02-28

Yeah, size_of_val(self) does a reborrow, so doing that between the creation of the self.as_mut_ptr() pointer and its use is questionable. SB (Stacked Borrows) accepts this due to a terrible hack that I am trying to phase out as it causes other problems. TB (Tree Borrows) actually properly accepts this.
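The reordering suggested above could look like this. It is a standalone sketch on `&mut [u32]` with a made-up function name, not the code from the PR.

```rust
use core::{mem, slice};

/// View a mutable slice of u32 as its underlying bytes.
pub fn as_bytes_mut(words: &mut [u32]) -> &mut [u8] {
    // Compute the length first: `size_of_val` reborrows `words`, and that
    // reborrow should not happen between creating the raw pointer and
    // using it.
    let len = mem::size_of_val(words);
    let ptr = words.as_mut_ptr().cast::<u8>();
    // SAFETY: `ptr` points to `len` initialized bytes owned by `words`,
    // which stays exclusively borrowed for the lifetime of the returned
    // slice; `u8` has alignment 1; `u32` has no padding and accepts any
    // bit pattern, so arbitrary writes through the byte slice leave
    // `words` valid.
    unsafe { slice::from_raw_parts_mut(ptr, len) }
}
```

After `ptr` is created, nothing touches `words` again until the byte slice is built, so the raw pointer and the reference are not interleaved.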

joshlf · 2025-02-28T08:42:10Z

src/rng.rs

            fn fill<R: Rng + ?Sized>(&mut self, rng: &mut R) {
                if self.len() > 0 {
-                    rng.fill_bytes(self.as_mut_bytes());
+                    rng.fill_bytes(unsafe {


Safety comment? You'll need to justify that you've upheld this safety contract. You may find the following to be a helpful rephrasing of similar requirements: https://doc.rust-lang.org/1.82.0/std/ptr/index.html#pointer-to-reference-conversion

I'd recommend doing #![deny(clippy::undocumented_unsafe_blocks)] for the whole crate.

joshlf · 2025-02-28T08:43:30Z

src/rng.rs

+unsafe_impl_fill!(u16, u32, u64, u128,);
+unsafe_impl_fill!(i8, i16, i32, i64, i128,);


Safety comments?

You may want to copy from what we've done in zerocopy (specifically for FromBytes).

dhardy added 6 commits February 6, 2025 10:17

Add simd benchmark

8fab522

Results show that some Simd types are 2-4 times as expensive as expected

Remove #[inline(never)] statements on Fill::fill

4ccd0c0

Results in a few minor regressions and two large improvements in benchmarks: -72% time for u64x2, -50% for u32x4.

Replace zerocopy::transmute! with unsafe transmute

8ec4cf4

Code gen is identical and benchmarks unaffected.

Replace zerocopy::IntoBytes::as_mut_bytes with unsafe slice::from_raw_parts_mut

0d27d3f

Code gen appears mostly equivalent, though it affects inlining of u32x4 gen with SmallRng. Benchmarks are not significantly affected.

Remove zerocopy dependency

80b8d95

CHANGELOG

b81c644

dhardy requested a review from josephlr February 6, 2025 12:20

dhardy mentioned this pull request Feb 27, 2025

Make the dependency on zerocopy optional #1574

Open

josephlr reviewed Feb 28, 2025


joshlf suggested changes Feb 28, 2025
