
Critical issues before stabilization #364


Open
calebzulawski opened this issue Sep 10, 2023 · 56 comments

Comments

@calebzulawski
Member

calebzulawski commented Sep 10, 2023

I wanted to put together a more technical list of issues to be solved before we can stabilize (see rust-lang/rust#86656).

There are many more important issues like performance on some targets, or important missing functions, but IMO those don't prevent stabilization. In this list I've trimmed down the issues to major API-related issues that can't be changed, or are difficult to change, after stabilization.

Issues to be solved

  • Restricting the number of lanes. In my opinion, the LaneCount<N>: SupportedLaneCount bound makes the API exceptionally cumbersome. I've also had some trouble writing generics around it, particularly when making functions that change the number of lanes (the trait solver sometimes simply fails). Adding the bound often feels like boilerplate, and I found myself looking for ways to launder the bound, like adding unnecessary const parameters (a sketch of this boilerplate follows the list). Making this a post-monomorphization error (I found Design meeting: Non-local errors (aka post-monomorphization errors) lang-team#195 helpful) might be the way to go, or perhaps there's a way to make a const fn panic. Cons: the trait bound is very explicit, and hiding the error states could possibly do more harm than good when identifying the source of a build failure.
  • Mask element type. I'm not confident that the mask type for Simd<f32, N> should be Simd<i32, N> (see Change Mask element to match Simd element #322 for discussion). I think it would be much more straightforward if the types simply matched. Cons: this would require an extra cast when using i32 masks for f32 vectors or similar, and makes implementing From<Mask> for Mask impossible (since pointer masks must be generic Mask<*const T, N>, maybe all pointers could use the same mask element?).
  • Swizzle functions. They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.
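
A minimal sketch of the lane-count boilerplate the first bullet is describing (illustrative only, assuming current nightly std::simd paths and names that are not from this issue): even a trivial helper that changes the lane count has to prove every count involved is supported.

#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

// Copy the lanes of `v` into the low lanes of a wider vector, filling the
// rest with `fill`. Both N and M need their own SupportedLaneCount bound.
fn widen_with<T, const N: usize, const M: usize>(v: Simd<T, N>, fill: T) -> Simd<T, M>
where
    T: SimdElement + Copy,
    LaneCount<N>: SupportedLaneCount,
    LaneCount<M>: SupportedLaneCount,
{
    assert!(M >= N, "output must have at least as many lanes as the input");
    let mut out = [fill; M];
    out[..N].copy_from_slice(v.as_array());
    Simd::from_array(out)
}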

Non-issues, but things that should be done

  • Are we happy with the traits the way they are? e.g. SimdPartialEq, SimdInt.
  • The API should be partitioned into multiple unstable features that can be stabilized independently.

Updates (after filing this issue)

Lane count

  • I tried condensing LaneCount and SimdElement into a single bound Simd<T, N>: Supported. This doesn't work well for a variety of reasons. One example: scatter/gather use Simd<T, N>, Simd<*const T, N>, and Simd<usize, N>. Each of these would need its own bound, rather than sharing the one LaneCount bound for N as they do today (the current shared-bound shape is sketched below).
  • I recommend either keeping the bounds as they are now (LaneCount and SimdElement) or turning LaneCount into a non-local error.
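
For reference, this is roughly the shape of the current gather signature (heavily simplified; not the exact std::simd declaration, and the body here is a scalar stand-in): the value vector, the index vector, and the result all reuse the single LaneCount<N> bound, which is exactly what a combined Simd<T, N>: Supported form would lose.

#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

// One `LaneCount<N>` bound covers Simd<T, N>, Simd<usize, N>, and the result.
fn gather_or_sketch<T, const N: usize>(slice: &[T], idxs: Simd<usize, N>, or: Simd<T, N>) -> Simd<T, N>
where
    T: SimdElement + Copy,
    LaneCount<N>: SupportedLaneCount,
{
    let mut out = or.to_array();
    for (o, i) in out.iter_mut().zip(idxs.to_array()) {
        if let Some(v) = slice.get(i) {
            *o = *v;
        }
    }
    Simd::from_array(out)
}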

Swizzles

I tried the following swizzle code. It requires the incomplete adt_const_params feature. Even with this feature enabled, it's impossible to implement functions like reverse, because the INDEX const argument can't be computed from the surrounding generic parameters (like N).

    pub fn swizzle<const M: usize, const INDEX: &'static [usize]>(self) -> Simd<T, M>
    where
        LaneCount<M>: SupportedLaneCount,
    {
        // SAFETY: `self` is a vector and the swizzle index is a const array of u32
        unsafe {
            intrinsics::simd_shuffle(
                self,
                self,
                const {
                    assert!(M == INDEX.len(), "`M` must equal the length of `INDEX`");
                    let mut r = [0; M];
                    let mut i = 0;
                    while i < M {
                        assert!(
                            INDEX[i] < N,
                            "indices must be less than the input vector length"
                        );
                        r[i] = INDEX[i] as u32;
                        i += 1;
                    }
                    r
                },
            )
        }
    }
@BurntSushi
Member

w00t! Very excited to see this. :D :D :D

Are we happy with the traits the way they are? e.g. SimdPartialEq, SimdInt.

Is there high level docs anywhere describing how the API is laid out? Like, a guided tour of everything? I realize that's a tall order to ask for before the API is set in stone because it takes work to stabilize. But I'm asking because as of right now, I find the API to be very overwhelming. There are a lot of traits and upon seeing them, I immediately wonder how they all fit together. And in particular, why they exist. That is, if the API were less trait heavy, what would we lose?

Looking at the names of the traits, I can make some guesses as to what some of them are for. And it looks plausible that they could all mostly be explained (at a high level) without too much exposition.

I have opinions on the names of the traits too, but I think that fun can be saved for later. :P

I find the number of concrete types to be large but not overwhelming. At a quick glance, I can immediately understand what they mean and what connects them. While I do have some light SIMD background, I'm confident that the large number of concrete types could be explained by a ~paragraph in the crate docs.

The API should be partitioned into multiple unstable features that can be stabilized independently.

Do you have a 50,000-foot view of what the partitioning might look like? I don't mean to put you on the spot and ask for a detailed proposal, but just a general feeling of what you think might make sense.

Swizzle functions. They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.

I agree that missing swizzles is a rather large limitation. For me personally, unless there is a language feature that you're pretty confident is right at the horizon, or if there's a small language feature that you can get the lang team on board with quickly, I would go to war with the army you have when everything else is ready to stabilize. But this is just a general sense of things personally, and I'm sure there is a deeper analysis that could be done here. e.g., "If Rust had language feature Foo, then the swizzle API could be written like this which would make things simpler and/or unlock new use cases."

@programmerjake
Member

e.g., "If Rust had language feature Foo, then the swizzle API could be written like this which would make things simpler and/or unlock new use cases."

that would be: if LLVM had dynamic swizzles (using the shuffle LLVM IR instruction), then we could just have the swizzle pattern be a function argument and rely on inlining and const-propagation, like we do with atomic orderings (technically we could do that to some extent in rustc, but it really should be in LLVM).
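
To make that concrete, here is a hand-rolled sketch of the shape being wished for (names and the fallback body are illustrative, not the std::simd definition): the indices are an ordinary runtime argument, and, like Ordering in the atomics API, inlining plus const-propagation would be expected to turn a constant argument into a compile-time shuffle, while a real implementation would lower to a hardware table-lookup/shuffle where available.

#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// Hypothetical dynamic swizzle: each output lane i takes `v[idxs[i] % N]`.
fn swizzle_dyn_sketch<const N: usize>(v: Simd<u8, N>, idxs: Simd<u8, N>) -> Simd<u8, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    let src = v.to_array();
    let mut out = [0u8; N];
    for (o, &i) in out.iter_mut().zip(idxs.as_array()) {
        *o = src[usize::from(i) % N];
    }
    Simd::from_array(out)
}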

@calebzulawski
Member Author

Is there high level docs anywhere describing how the API is laid out?

Nope, that's definitely something we should have. Personally, I think we could use more submodules to make the division a little clearer. Each of these could be documented independently, with a bit of guidance in the main std::simd module:

  • Simd, Mask, and the various aliases in std::simd
  • num containing SimdInt, SimdFloat, etc
  • ptr containing SimdConstPtr and SimdMutPtr
  • cmp containing SimdPartialEq etc
  • swizzle containing Swizzle, Which, etc

That is, if the API were less trait heavy, what would we lose?

The original implementation had no traits. This results in two related issues. To implement recip, you need impl Simd<f32, N> and impl Simd<f64, N> and end up with two identical (but separate in docs) recip functions. For integers this is even more verbose and distracting. The second issue is that you now have overlapping functions, e.g. recip (for floats), count_ones (for ints) and wrapping_add (for pointers) all implemented on the Simd type. With traits, we were able to put generic vector functions on Simd and leave anything element-type-specific to a trait.
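
As a concrete illustration of that split (a minimal sketch using current nightly paths, not an excerpt from the crate): the float-only operations live on SimdFloat, so recip and sqrt are written once and the same call sites work for every float element type and lane count.

#![feature(portable_simd)]
use std::simd::prelude::*;
use std::simd::{LaneCount, SupportedLaneCount};

// `sqrt` and `recip` come from the `SimdFloat` trait; the identical body
// also works for `Simd<f64, N>` without a second inherent impl.
fn rsqrt_all<const N: usize>(v: Simd<f32, N>) -> Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    v.sqrt().recip()
}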

The API should be partitioned into multiple unstable features that can be stabilized independently.

Do you have a 50,000-foot view of what the partitioning might look like?

Offhand, I would say something like:

  • portable_simd_types: just Simd, Mask, and a handful of generic functions (these types implement Add, Sub, etc)
  • portable_simd_swizzle if we want to split off swizzling for later
  • portable_simd_scatter_gather: scatter/gather is another fairly complicated API we might want to hold back
  • portable_simd_traits: for all of the traits above, once we're comfortable with how they're divided up etc

I agree that missing swizzles is a rather large limitation. For me personally, unless there is a language feature that you're pretty confident is right at the horizon, or if there's a small language feature that you can get the lang team on board with quickly, I would go to war with the army you have when everything else is ready to stabilize.

One option might be to implement a set of swizzle2<I0, I1>, swizzle4<I0, I1, I2, I3>, etc. functions that are not quite as powerful as we'd like (impossible to make generic over N), and hold off on stabilizing the completely generic interface until we can do something like fn swizzle<const N: usize, const I: [usize; N]>() or perhaps something even better.
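
A rough sketch of what that fallback could look like (a hypothetical free function, not a proposed API): the indices are plain const parameters, so no const-generic arrays are involved, but each width needs its own function.

#![feature(portable_simd)]
use std::simd::{Simd, SimdElement};

// Hypothetical fixed-arity swizzle; not generic over the lane count.
pub fn swizzle4<T, const I0: usize, const I1: usize, const I2: usize, const I3: usize>(
    v: Simd<T, 4>,
) -> Simd<T, 4>
where
    T: SimdElement,
{
    let a = v.to_array();
    Simd::from_array([a[I0], a[I1], a[I2], a[I3]])
}

// e.g. reversing a 4-lane vector: swizzle4::<f32, 3, 2, 1, 0>(v)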

@RazrFalcon

I hope it will be released soon. I've tried using it in one of my crates (linebender/tiny-skia#88) and it mostly works. There are some glitches which I haven't debugged yet, but I don't think they are critical.
Performance is all over the place on AArch64 (haven't tested x86 yet). Some stuff became faster, some slower. Maybe a porting bug, maybe imperfect codegen. But I'm fine with small performance regressions as long as I can ditch manual, unsafe SIMD code.

I personally don't use swizzles, so those functions are not important to me.

@thomcc
Member

thomcc commented Sep 18, 2023

That is, if the API were less trait heavy, what would we lose?

Realistically we'd probably gain things. As it is, there are a number of APIs that have been hard to add due to the need to make them fit into the trait-heavy design.

That said, my stance here (that we should not have gone with the traits, as it's led to a cumbersome and inflexible design) is not a popular one, so YMMV.

@calebzulawski
Member Author

As far as restricting the number of lanes goes, I tried combining the bound into something like:

pub struct Simd<T, const N: usize>(...) where Self: Supported;

In theory, this seemed like a good compromise between the current verbose bounds and using a non-local error. In practice, it wasn't really possible to implement this. It turns out we use the separation between the element bound (SimdElement) and the lane count bound (SupportedLaneCount) in a fair number of places. For example, functions that use multiple vector types, such as scatter/gather, require the bound for each type, which actually ends up more verbose. (Currently, they all share a lane count bound). Even worse, functions that only use another vector type internally (like Simd<usize, ...> in functions that take pointer vectors) need the extra bounds and leak implementation details.

After this experiment, I think the best option is to add two new compiler attributes like #[portable_simd_max_elements(N)] and #[portable_simd_require_power_of_2_elements], which error when unsupported lane counts are used.

@programmerjake
Member

could we just use:

impl<T: ..., const N: usize> Simd<T, N> {
    const VALID: () = {
        let mut valid = N > 0 && N < 64;
        if !cfg!(feature = "all_lane_counts") { // feature name iirc
            valid &= N.is_power_of_two();
        }
        assert!(valid, "invalid lane count");
    };
    pub fn each_method(self, ...) -> ... {
        Self::VALID; // use so it triggers a post-mono error
        ...
    }
}

actually, no, we can't, since just copying a Simd type requires a supported lane count.

@calebzulawski
Member Author

Also, it would be easy to accidentally forget

@scottmcm
Member

since just copying a Simd type requires a supported lane count.

Any chance we'd be willing to say that we're ok allowing the type to exist for non-PoT lane counts, but then just not offering anything other than Copy and the array conversions with them? Then it's fine that we don't have a good way to do + or shuffle for Simd<f32, 21>, but we don't force it to be bound everywhere just to exist...

(But I guess that exposes layout questions that we don't want to decide just yet, even if we left them explicitly unspecified.)

it would be easy to accidentally forget

We could maybe have a -Z flag to ICE on non-power-of-two vectors in the codegen, or something, to help catch it.

@programmerjake
Member

since just copying a Simd type requires a supported lane count.

Any chance we'd be willing to say that we're ok allowing the type to exist for non-PoT lane counts, but then just not offering anything other than Copy and the array conversions with them?

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

@calebzulawski
Member Author

Also, removing the bounds from Simd but not impl Simd doesn't help us much, I think. In my experience, the bounds are no problem when writing code with "concrete" SIMD types, but painful when trying to do anything generic. If functions still require them, those generics will still be difficult.

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

There is precedent with [T; HUGE] erroring, so I think we may be able to do the same if we can remove the non-PoT bound as well.

@programmerjake
Member

we already have a feature flag to enable non-pot lane counts, the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big. or Simd<T, 0> which is a pain therefore not allowed (though I'm ok with adding it for the consistency benefits).

the other issue I forgot is that imo non-pot lane counts have the wrong layout and that should change: #319

@scottmcm
Member

scottmcm commented Sep 24, 2023

the main issue imo is that someone could try to use Simd<T, 1234567> which should error for being too big

This one I feel strongly isn't a problem, since [T; isize::MAX as usize] has exactly the same issue, and I don't think that being a non-local error has really ever been a problem. (I elaborate more on that train of thought in rust-lang/rust#104087 (comment) )

So long as using Simd like this gives the same kind of error as we get for arrays in https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a8752e42c5601f574d948ff3677a7e0e, that seems entirely acceptable to me. (Saying that with my lang hat on, but not speaking for the team, since I don't think we've discussed Simd in particular.)

or Simd<T, 0> which is a pain therefore not allowed

I would love to have even a hacky solution to this. as_chunks is blocked on exactly the same thing (rust-lang/rust#74985).

lcnr had some thoughts on doing something like this in https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/How.20essential.20is.20the.20compile-time.20check.20for.20empty.20arrays.3F/near/248387965. Maybe someone could pick that up to deal with the zero problem for a while? If it was easier, Simd<T, 0> feels like another of those places where we might do "the type is well-formed but doesn't implement most of the traits".

but painful when trying to do anything generic

I suppose that's another possibility here -- explicitly decide to have v1 of stabilizing this not allow writing generic code over them, but still let people use specific versions when implementing stuff.

That seems like it'd still be pretty useful, since using f32x4 in the implementations of stuff is still much easier even if I can't be generic over stuff yet.

@programmerjake
Member

Also, removing the bounds from Simd but not impl Simd doesn't help us much, I think.

My proposal of using Simd::CHECK would only have type bounds, all lane count bounds would be post-mono errors only.

@sammysheep
Contributor

sammysheep commented Sep 25, 2023

I love the current state of this project and am grateful for the hard work being done and the general friendliness/helpfulness of the team.

However, I will be a bit of a contrarian and say that swizzles are important to me, particularly dynamic ones. I've also used the APIs with the const swizzles at length, and it's workable: you can use const fn to generate special index patterns provided you are willing to write some boilerplate to make it ergonomic.

I'm not saying swizzles should stop stabilization per se, I'm a very happy nightly user.

I'll also be contrary and say that it's okay to abstract more functionality for very common use cases. I think some of this is already done, but I wouldn't be sad, for example, to see a function to map bytes using dynamic swizzles. The newbie user (like me) may not realize the hardware limitations (the usual size of look-up tables, for example) of certain operations and so creating a well-designed function for a common use case is helpful to people like me.

@workingjubilee
Member

workingjubilee commented Oct 2, 2023

I agree that a satisfying story for swizzles is important.

Regarding traits, I initially proposed it as a way to reduce the "multiplication of entities" problem so that it's easier to follow things. I have begun to think a different approach may be appropriate (increased experience in API design changes your opinions about API design, who would've thought?), but I believe we should probably fork further discussion about that into its own issues/threads.

@jdahlstrom

jdahlstrom commented Oct 11, 2023

My 2c is that if Rust's primitive types cannot currently be abstracted over with traits (other than Add etc), it's not necessary to be able to do it for SIMD types either at first. Indeed it would be strange (and slightly amusing) if a workaround for the lack of numeric traits was to wrap your numbers in x1 SIMD and use SIMD traits instead…

@calebzulawski
Member Author

My 2c is that if Rust's primitive types cannot currently be abstracted over with traits (other than Add etc), it's not necessary to be able to do it for SIMD types either at first. Indeed it would be strange (and slightly amusing) if a workaround for the lack of numeric traits was to wrap your numbers in x1 SIMD and use SIMD traits instead…

Personally, I don't see a way around using traits. This was one of the first issues we hashed out, I'm sure there is more discussion available in zulip. SIMD types are not just numbers and not just arrays, they act like both simultaneously. Like arrays, SIMD vectors have container-like operations that are only concerned with their memory layout and are indifferent to their element type. This includes splatting, indexing, swizzles, scatter, gather, conversions to and from arrays and slices.

To undo this means that you will have 14 vector types that each implement these functions separately, and there will be no way in the future to ever treat a vector like an element-agnostic container. A comparison would be like if [i32; 4] and [f64; 4] were totally independent types that couldn't both be used as [T; 4].

Using traits allowed us to treat vectors as containers (which they are) while also giving them element-specific behavior. The element-specific functions can't be implemented directly on Simd for each element type (they end up having name collisions and the docs are impossible to read), so any alternative to traits would need to account for this.
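
To make the container/element split concrete, here is a minimal sketch (my example, using current nightly paths and one of the existing lane-rearranging methods): this function only moves lanes around, so it stays generic over any SimdElement and belongs with the container-like operations on Simd itself.

#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

// Element-agnostic: works for floats, integers, and pointers alike, because
// it never looks at lane values, only at their positions.
fn interleave_pair<T, const N: usize>(a: Simd<T, N>, b: Simd<T, N>) -> (Simd<T, N>, Simd<T, N>)
where
    T: SimdElement,
    LaneCount<N>: SupportedLaneCount,
{
    a.interleave(b)
}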

@RalfJung
Member

RalfJung commented Dec 11, 2023

Can we add "documentation and review of the intrinsics' safety requirements" to the list? Currently every time I want to implement one of these in Miri I have to dig through PRs and then usually still go ask people. Intrinsics are language extensions so they should be reviewed by t-opsem before stabilization (and ideally, a draft documentation is created when the intrinsic is implemented, so that one has some starting point for "intended behavior" vs "actual behavior" and checking that they are the same).

I also think the intrinsics' declarations should be moved into libcore so that they can be equipped with doc comments that live in the same repository as the codegen and Miri implementations that turn those doc comments into reality.

@calebzulawski
Member Author

Yes, I intend on doing just that.

@calebzulawski
Member Author

I opened #381 to track that

@anderspapitto

I strongly support checking lane count post-monomorphization. It's quite a hassle to carry around bounds proving that not only N but also N * 2, N / 4, etc are all valid lane counts (and I hit some ICEs along the way rust-lang/rust#126443)

@programmerjake
Member

It's quite a hassle to carry around bounds proving that not only N but also N * 2, N / 4, etc are all valid lane counts

if you want to use a function fn f<const N: usize>(a: Simd<u8, N>) -> Simd<u8, { 2 * N }>, you'll need some bounds regardless of whether we handle element-count limiting post-monomorphization or not, since those bounds are needed for any const-generic type, e.g. for arrays:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=6320749ac371c3aef86ad1c1316d4cdd

@programmerjake
Member

programmerjake commented Jul 3, 2024

related conversation about min_generic_const_args maybe gaining support for N.div_ceil(8) which would allow us to calculate bit vector sizes from lane counts: https://rust-lang.zulipchat.com/#narrow/stream/260443-project-const-generics/topic/Layout.20in.20const.20parameters/near/448743588

@kytans

kytans commented Jul 15, 2024

Why not support any number of lanes, by using arrays of native SIMD vectors as large as needed, and masking for operations that need it?

More in detail:

  • Simd<T, N> is valid for any T and N and is equivalent to [T; N] except that it has a platform-specific alignment (that is of course >= the alignment of T). EDIT: maybe it's better to implement as an array of the native SIMD type, although they are semantically the same thing
  • Simd<T, N> where T is a primitive type gets add/sub/mul/div/etc. operations that use SIMD intrinsics where available, using the largest vector size on the platform <= N (which will result in the compiler usually allocating it in SIMD registers)
  • [T; N] where T is a primitive type also gets add/sub/mul/div/etc. operations that use unaligned SIMD operations where available; ideally, the compiler or LLVM should use aligned SIMD operations instead if it can prove alignment or if it is a local variable whose alignment can be increased
  • Operations between Simd<T, N> and [T; N] are also supported
  • Ideally all operations that any SIMD instruction set has are supported and are provided portably; if any operation is not natively supported, it will use the best sequence of SIMD operations or will be performed on scalars one by one
  • A bunch of #[cfg] APIs are provided to query the potential arch SIMD capabilities, and an API is provided to query the actual capabilities at run-time

@calebzulawski
Member Author

The simple answer is that the size limitations are not related to native SIMD register sizes, but codegen backend encoding limitations. We further restrict to even smaller sizes that we are confident operate correctly and optimize well. Our design allows for increasing the maximum in the future.

@kytans

kytans commented Jul 15, 2024

More precisely, doing something vaguely like this:

struct Simd<T: SimdArray<N>, N>([<T as SimdArray<N>>::Type; <T as SimdArray<N>>::Length]);

default impl<T, N> SimdArray<N> for T {
  type Type = T;
  const Length: usize = N;
}

// for AVX
impl<N> SimdArray<N> for f32
  where N <= 4 {
  type Type = __m128;
  const Length: usize = 1;
}

impl<N> SimdArray<N> for f32
  where N > 4 {
  type Type = __m256;
  const Length: usize = (N + 7) / 8;
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where T: SimdArray<N, Type = __m128>
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..T::Length {
        self[i] = addps(self[i], b[i])
     }
   }
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where T: SimdArray<N, Type = __m256>
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..T::Length {
        self[i] = vaddps(self[i], b[i])
     }
   }
}

Or something like this:

#[align(simd)] // where the compiler chooses the alignment
struct Simd<T, N>([T; N]);

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where N <= 4
{
   fn add_assign(&mut self, b: Self) {
        transmute::<_, &mut __m128>(&mut self)[0] = addps(transmute::<_, &mut __m128>(&self)[0], transmute::<_, &__m128>(&b)[0])
   }
}

impl AddAssign<Simd<f32, N>> for Simd<f32, N>
where N > 4
{
   fn add_assign(&mut self, b: Self) {
     for i in 0..(N + 7)/8 {
        transmute::<_, &mut __m256>(&mut self)[i] = vaddps(transmute::<_, &mut __m256>(&self)[i], transmute::<_, &__m256>(&b)[i])
     }
   }
}

Or alternatively the compiler itself could do all this, i.e. you could fix the codegen backend to support arbitrary sizes.

@programmerjake
Member

i.e. you could fix the codegen backend to support arbitrary sizes.

Simd is translated directly to LLVM IR vector types. Theoretically LLVM supports arbitrary sizes, but it really only expects you to use native vector sizes and generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop). LLVM is also quite a bit more likely to encounter bugs for weird vector sizes. The general idea is that people using Simd should use a smallish size and, if they need to process more data, write an explicit loop into their algorithm.

@calebzulawski
Member Author

The codegen limitations are something like 2^16 max elements, which is large enough that I doubt it's necessary to support anything larger. Something like this would make it possible to exceed that limit but I'm not sure it's worth the added complexity. We will likely be able to lift the current limit to something much larger if we can get reliable non-power-of-two-length codegen, but that's proving problematic for now.

@programmerjake
Member

generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop)

e.g. terrible code LLVM generates for a Simd<f32, 400> addition: https://llvm.godbolt.org/z/6r6EfjjfE

@kytans

kytans commented Jul 16, 2024

I guess the downside of my proposal is that it allows the user to write "a + b + c" which will result in two loops, and unless the backend optimizer manages to fuse the loops, that code will be much worse than a single loop since it unnecessarily writes and reads the intermediate result to memory.

So maybe another option to solve the LaneCount<N>: SupportedLaneCount issue is to leave the design as-is, but specify that users should generally never use Simd directly, and instead use an API providing a "SIMD array", "SIMD Vec" and "SIMD iterator" on top of it, which could either be the standard library or a 3rd party crate.

The iterator-based design could still implement arithmetic on iterators (a + b would be syntax sugar for a.zip(b).map(|(x, y)| x + y)), so you could write something like (a.iter() + b.iter() + c.iter()).collect(), and in principle one could even have it be (a + b + c).collect() (where a + b would be syntax sugar for a.iter() + b.iter()).

Ideally, the compiler would gain support for generic closures, which would allow passing a polymorphic closure to the iterator combinator and choosing the vector size and instruction set at runtime.

This suggests a different change to the design, though: a Simd<T, N, ISA> type. That would, for example, allow distinct types that use either AVX2 or AVX10/256 (otherwise the polymorphic-closure scheme can't differentiate between them, since the vector length is the same), and would also allow using either instructions from ISA extensions or a replacement sequence.

@andy-thomason

generates terrible code for very large vector types (e.g. generating 10000 vector add instructions instead of a loop)

e.g. terrible code LLVM generates for a Simd<f32, 400> addition: https://llvm.godbolt.org/z/6r6EfjjfE

As always, this depends on your choice of CPU. If Rust used target-cpu=native by default...

https://llvm.godbolt.org/z/xjsndxc1b

I agree that we are constantly fighting with dreadful LLVM codegen, regardless of the target.
My libm library was plagued by the need to generate vroundps, which LLVM is incapable of emitting.

@programmerjake
Member

As always, this depends on your choice of CPU. If Rust used target-cpu=native by default...

all that does is increase the upper limit for somewhat reasonable code; if you use a 4000-element vector you are back to terrible code: https://llvm.godbolt.org/z/Wf9o5jT67

@anderspapitto

I'm using a personal arbitrary-length wrapper layer. A couple notes

  • making this API is strictly harder than making an api for fixed-length native arrays (without simd) which is already a massive pain. E.g. writing functions like these requires propagating N * 2, N / 4, etc. bounds through your entire codebase, and simd can only make it harder.
fn repeat_twice<const N: usize>(x: [u32; N]) -> [u32; N * 2] { unsafe { mem::transmute_copy(&[x, x]) } }
fn reinterpret_u16_as_u64<const N: usize>(x: [u16; N]) -> [u64; N / 4] { unsafe { mem::transmute_copy(&x) } }
  • "bitmask" apis need overhaul because u64 is no longer guaranteed to fit all the bits - you want something like [u8; N / 8] which again requires carrying around the N / 8 bound

  • some instructions inherently cannot be scaled up to arbitrary lengths, e.g. _mm512_conflict_epi32 , which does a quadratic number of comparisons (roughly, every lane compared with every other lane). I guess such operations could have additional bounds on them without affecting the rest of the api?

  • if a vector is size 48, should it be processed as a chunk of size 32 and then a chunk of size 16, or three chunks of size 16? Should the user be able to control this choice? If the user can choose, then maybe the representation should instead be e.g. [[u32; CHUNK_SIZE]; N / CHUNK_SIZE] with user-controlled CHUNK_SIZE where LaneCount<CHUNK_SIZE>: SupportedLaneCount

@hsivonen
Member

They're difficult to use and not a very good API, but Rust doesn't currently allow for anything much better. We could hold off stabilizing arbitrary swizzles, but that's a big limitation.

Considering that shuffles/swizzles with compile-time-constant lane indices have worked in Rust since at least 2015 without substantial change to the API shape, I think it's well past time to stabilize the API shape that exists.

So I think it makes sense to stabilize the simd_swizzle! macro now; if an API that's perceived to be better becomes possible later, another API can be introduced alongside simd_swizzle! at that point.

@calebzulawski
Member Author

Portable SIMD has taken a substantially different approach to swizzles than e.g. the x86 shuffle intrinsics. We really wanted to avoid a magic const argument that predates const generics and it's important to be able to generate the control mask with a const fn. This has resulted in a rather clunky but functional Swizzle trait.
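
For readers unfamiliar with it, this is roughly how the Swizzle trait is used today (the exact signatures have shifted across nightly versions, so treat this as a sketch; the Rotate type is my example): the index pattern is an associated const, which can itself be computed by const code rather than written out by hand.

#![feature(portable_simd)]
use std::simd::{Simd, Swizzle};

// Rotate the four lanes left by OFFSET; the control mask is computed in a
// const context.
struct Rotate<const OFFSET: usize>;

impl<const OFFSET: usize> Swizzle<4> for Rotate<OFFSET> {
    const INDEX: [usize; 4] = {
        let mut idx = [0; 4];
        let mut i = 0;
        while i < 4 {
            idx[i] = (i + OFFSET) % 4;
            i += 1;
        }
        idx
    };
}

fn rotate_left_1(v: Simd<u32, 4>) -> Simd<u32, 4> {
    Rotate::<1>::swizzle(v)
}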

It should be safe to stabilize simd_swizzle! (at least once other things are stabilized), but my biggest concern is that we may need to then keep two swizzle implementations around, because the macro is fairly sensitive to the type deduction of its implementation.

@RalfJung
Member

Why is this even using a trait? If that's just to work around making the final argument const, then I think this can be expressed much better with a const { ... } block.

However, that would still expose the underlying simd_shuffle intrinsic fairly directly, so -- we'd have to be very careful here with stabilization.

@calebzulawski
Member Author

We can't expose the intrinsic directly because we do some manipulation of the const indices before passing them into the intrinsic--passing it via an associated const allows it to be used in regular rust since it isn't bypassing the usual type system.

@RalfJung
Member

You should be able to do the same manipulation inside the const { ... } block.

@calebzulawski
Member Author

I think I'm following... but do we want more const args? I think I even saw an issue somewhere encouraging const args to be removed in an edition change or something, I'm not sure where they stand right now.

@RalfJung
Member

Oh I see, the problem is that the macro shouldn't call simd_shuffle directly but wrap it in a function, and then the question is how does the function receive the index list as a constant. Ideally it would be a const generic but those are quite limited...

@calebzulawski
Member Author

calebzulawski commented Sep 14, 2024

I think it's possible to write a macro something like the following (I have a prototype):

fn some_swizzle<T, const LEN: usize, const PARAM1: usize, const PARAM2: usize>(x: Simd<T, LEN>) -> Simd<T, LEN> {
    simd_swizzle!(x,
        const { /* some expression using LEN, PARAM1, PARAM2 */ },
        len = LEN,
        const_parameters = (PARAM1, PARAM2),
    )
}

But is something like this any better than implementing the Swizzle trait? I'm having trouble imagining an idiomatic interface for this macro, but it does take away the const-generics type-inference magic you otherwise need to understand to implement the trait. Notably, const_parameters can be any outer parameter, not just one from the function the macro is called in.

@murl-digital

hi all, i landed here after doing some research into why stabilizing portable simd is stuck (i have a usecase where i need complex numbers, and writing a portable simd friendly replacement for num_complex is proving to be a nightmare)

based on what you said in rust-lang/rust#86656, the biggest blocker is the supported lane count trait bound. i'm having trouble following the discussion, but based on what i've been able to understand, a better solution than what you have is simply not possible with things currently in the language, if you want to limit the lane count to anything below 64.

have there been any ideas that i'm not aware of, or is this just deadlocked on "the current solution sucks and we can't think of anything better"?

@programmerjake
Member

afaict basically the current situation sucks and we're waiting on improvements to the compiler so we can improve our API, we don't want to stabilize a bad API when we know it can be changed to be better in the future (hopefully soon).

@programmerjake
Member

programmerjake commented May 26, 2025

if you want to limit the lane count to anything below 64

also the plan is to have the limit be much larger than 64, iirc in the range of a few hundreds up to tens of thousands.

@murl-digital

afaict basically the current situation sucks and we're waiting on improvements to the compiler so we can improve our API, we don't want to stabilize a bad API when we know it can be changed to be better in the future (hopefully soon).

that's totally understandable, is there anything i can do to help move things along, or at least some compiler tracking issues i can follow?

@programmerjake
Member

is there anything i can do to help move things along, or at least some compiler tracking issues i can follow?

idk, maybe @workingjubilee knows?

@calebzulawski
Member Author

I think what needs to be done is provide an attribute (or possibly lang item) that reads the lane count and emits errors for sizes we don't want to support. I tried this previously but wasn't happy with my solution, which worked, but still allowed you to name invalid types as long as you didn't actually use them. It's a bit difficult finding where in the compiler to properly implement it.

@murl-digital

at least to my untrained eyes that sounds like something a proc macro could do, but i suppose it makes sense that only the compiler can emit whatever code enforces the lane count/type restrictions. i can't really think of anything that isn't incredibly hacky or specific, which i imagine the lang and compiler teams wouldn't be too happy about

@programmerjake
Member

at least to my untrained eyes that sounds like something a proc macro could do, but i suppose it makes sense that only the compiler can emit whatever code enforces the lane count/type restrictions. i can't really think of anything that isn't incredibly hacky or specific, which i imagine the lang and compiler teams wouldn't be too happy about

iirc the general plan is to implement the restriction similarly to the restriction on not having [u8; usize::MAX], where the type layout code is what errors (and maybe more things, but generally only after all the types have been resolved and generics substituted with their actual types aka. post-monomorphization).

@camel-cdr

camel-cdr commented Jun 8, 2025

I think the current API encourages non-portable SIMD over portable SIMD.

The default should be "give me a SIMD vector of the specified type corresponding to the SIMD register size".
Instead, the current API and documentation encourage writing code based on a fixed number of elements encoded in the source code.

I'm happy there is an effort to standardize a portable SIMD API, and this one looks good in terms of functionality; I just think the API should be portable first, non-portable second (maybe even discouraged by the API design).

The examples all assume a vector length of 128 bits. Looking at the top Google results for articles on portable SIMD, not one shows an implementation that can compile to multiple vector lengths.

Actually, that wasn't completely correct: while "Portable SIMD Programming in Rust" also exclusively uses fixed-size types in all its code examples, it at least tells you at the end how to query the native SIMD width. But this further exemplifies that the current API makes non-portable SIMD easy and portable SIMD harder; otherwise the examples would have been vector-length-agnostic.

This does not stand up to the self-imposed goal of "This module provides a SIMD implementation that is fast and predictable on any target".

You should never want to use types like f32x4. Yes, there are certain cases where a problem doesn't scale beyond four elements of parallelism, but this isn't the case for most usages.

If most people end up writing fixed-size SIMD code that, like most examples, targets 128-bit vectors, then we will end up in situations where the hand-optimized SIMD code is slower than autovectorized code, because the autovectorized code can take advantage of the full SIMD width.


Comments on beginners-guide.md

A SIMD vector has a fixed size, known at compile time.

This is not true, unless we are specifically talking about this library and not the general definition/use of this term in regard to SIMD.
I understand that this "portable" SIMD library sees targeting variable-length SIMD ISAs as out of scope, but those ISAs exist and also fall under the umbrella of the term SIMD.

On most architectures, the vector has to be pushed out of the SIMD register onto the stack, then an individual lane is accessed while it's on the stack

"Most architectures" here really means only x86; NEON, SVE, RVV, and AltiVec all have mechanisms to extract single elements from vector registers.

While 128-bit SIMD is the most common, there's also 64-bit, 256-bit, and even 512-bit on the newest CPUs.

SiFive X390 (1024-bit RVV), Andes AX45MPV/AX46MPV (1024/2048-bit RVV), Akeana 1200 (2048-bit RVV), Vitrivius (16384-bit RVV)

When using SIMD, you should be familiar with the CPU feature set that you're targeting

You should, but that kind of goes against the entire point of portable SIMD.

The things mentioned here shouldn't really matter for portable SIMD. More important are other things, like the number of vector registers available (32 on most architectures, 16 from SSE through AVX2) and a rough idea of which instructions are common to most SIMD ISAs and which may need to be emulated with multiple instructions by portable SIMD.

When doing 128-bit operations it just uses two 64-bit registers as a single 128-bit register.

This is only true of the arm32 version of NEON; AArch64 has 32 128-bit vector registers.

Comments on examples

Let's start with dot_product.rs.

Firstly I didn't see any mention of the fact that the results of the SIMD implementations differ from the scalar one due to the way floating point arithmetic works.
If Rust had a -ffast-math equivalent, then the compiler would be able to autovectorize the scalar code better than any of the shown SIMD implementations.
Not only could it use the biggest supported vector length (while the shown code only targets 128-bit vectors), it could also unroll the loop with multiple accumulators, which is required to get full performance on modern OoO cores.

  • dot_prod_simd_0: It wasn't mentioned prior that reductions on SIMD registers are usually slow and this only implies it.
  • dot_prod_simd_2: This randomly changes the fold to for_each for no apparent reason.
  • dot_prod_simd_3: Is this even correct? How does as_rchunks know your lane count?
  • dot_prod_simd_4: I don't understand what the comment about allocating 1 XMM register is supposed to tell us.
  • dot_prod_simd_5: This looks unfinished.

The nbody.rs implementation shows why thinking in terms of a fixed vector length can sometimes be detrimental (assuming a large N_BODIES, which would be realistic in real applications).

The hot loop in energy() contains reductions and a serial sqrt, and the hot loop in advance() needs to do a lot of splat operations; on top of that, advance() isn't even fully vectorized and needs a large temporary buffer.
Also, both only utilize 3 out of VLEN/64 available lanes and both only access part of the elements in the bodies array, which means you are touching unnecessary cache lines.

The proper way to vectorize something like this is using SoA (structure of arrays), as follows: https://godbolt.org/z/3qzKTvEv9

#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;

struct Bodies {
	double *x, *y, *z, *vx, *vy, *vz, *mass;
	size_t count;
};

double energy(const Bodies b) {
	const hn::ScalableTag<double> d;
	auto e = hn::Set(d, 0);
	double e_kinetic = 0;  // scalar accumulator: adding a broadcast to `e` would count the term once per lane
	for (size_t i = 0; i < b.count; ++i) {
		double sq = b.vx[i]*b.vx[i] + b.vy[i]*b.vy[i] + b.vz[i]*b.vz[i];
		e_kinetic += 0.5*b.mass[i]*sq;
		auto ix = hn::Set(d, b.x[i]);
		auto iy = hn::Set(d, b.y[i]);
		auto iz = hn::Set(d, b.z[i]);
		auto imass = hn::Set(d, b.mass[i]);
		for (size_t j = i+1; j < b.count; j += hn::Lanes(d)) {
			auto dx = hn::Sub(ix, hn::Load(d, b.x+j));
			auto dy = hn::Sub(iy, hn::Load(d, b.y+j));
			auto dz = hn::Sub(iz, hn::Load(d, b.z+j));
			auto mass = hn::Load(d, b.mass+j);
			auto dsq = hn::MulAdd(dx, dx, hn::MulAdd(dy, dy, hn::Mul(dz,dz)));
			e = hn::Sub(e, hn::Div(hn::Mul(imass, mass), hn::Sqrt(dsq)));
		}
		// TODO: handle tail
	}
	return e_kinetic + hn::ReduceSum(d, e);
}

void advance(Bodies b, double dt) {
	const hn::ScalableTag<double> d;
	for (size_t i = 0; i < b.count; ++i) {
		auto ix = hn::Set(d, b.x[i]);
		auto iy = hn::Set(d, b.y[i]);
		auto iz = hn::Set(d, b.z[i]);
		auto imass = hn::Set(d, b.mass[i]);
		auto accvx = hn::Set(d, 0);
		auto accvy = hn::Set(d, 0);
		auto accvz = hn::Set(d, 0);
		for (size_t j = i+1; j < b.count; j += hn::Lanes(d)) {
			auto dx = hn::Sub(ix, hn::Load(d, b.x+j));
			auto dy = hn::Sub(iy, hn::Load(d, b.y+j));
			auto dz = hn::Sub(iz, hn::Load(d, b.z+j));
			auto dsq = hn::MulAdd(dx, dx, hn::MulAdd(dy, dy, hn::Mul(dz,dz)));
			auto mag = hn::Div(hn::Set(d, dt), hn::Mul(dsq, hn::Sqrt(dsq)));
			auto mj = hn::Mul(mag, hn::Load(d, b.mass+j));
			auto mi = hn::Mul(mag, imass);
			accvx = hn::NegMulAdd(dx, mj, accvx);
			accvy = hn::NegMulAdd(dy, mj, accvy);
			accvz = hn::NegMulAdd(dz, mj, accvz);
			hn::Store(hn::MulAdd(dx, mi, hn::Load(d, b.vx+j)), d, b.vx+j);
			hn::Store(hn::MulAdd(dy, mi, hn::Load(d, b.vy+j)), d, b.vy+j);
			hn::Store(hn::MulAdd(dz, mi, hn::Load(d, b.vz+j)), d, b.vz+j);
		}
		b.vx[i] += hn::ReduceSum(d, accvx);
		b.vy[i] += hn::ReduceSum(d, accvy);
		b.vz[i] += hn::ReduceSum(d, accvz);
		// TODO: handle tail
	}
}

(Sorry for writing C++, but I'm not that familiar with rust)

This scales perfectly with vector length, has no reductions in the inner loop and doesn't iterate over sparse memory.
Notice also how easy it was to write this in a vector length agnostic way using the Google Highway library.
This even supports scalable vector architectures like SVE and RVV, where the vector length isn't known at compile time.

spectral_norm.rs looks fine, apart from not being implemented in a scalable fashion. The a() function could also easily be vectorized, but this likely doesn't change much with this implementation, which uses just two lanes.

matrix_inversion.rs is quite decent, although I would've liked to see a comment mentioning that it's also possible to do multiple 4x4 matrix inversions in parallel if your vector registers can hold a multiple of four elements.

@murl-digital

murl-digital commented Jun 8, 2025

Yes, there are certain cases where a problem doesn't scale beyond four elements of parallelism, but this isn't the case for most usages.

I'll happily present myself as an example of a case where a problem doesn't scale beyond a fixed amount of elements. In real-time audio, we only care about a maximum amount of things (max 2 channels for stereo, max 16 voices for a polyphonic synth, etc). My current usecase is calculating filter banks in parallel. My current prototype has 8 voices, with 8 bandpass filters per voice. The way I chose to implement this was using 8 element wide 32 bit float vectors, and it provides perfectly acceptable performance.

I'll admit my case may be unusual, and I think it's fair to say you know way more about the low level details than I do, but for me, and my usecase, conceptualizing a vector as an array that operates in parallel is more than enough. I haven't had to get into vendor specifics in my case.

@camel-cdr

camel-cdr commented Jun 8, 2025

I'll happily present myself as an example of a case where a problem doesn't scale beyond a fixed amount of elements. In real-time audio, we only care about a maximum amount of things (max 2 channels for stereo, max 16 voices for a polyphonic synth, etc). My current usecase is calculating filter banks in parallel. My current prototype has 8 voices, with 8 bandpass filters per voice. The way I chose to implement this was using 8 element wide 32 bit float vectors, and it provides perfectly acceptable performance.

Thanks for the example.

I'm not saying to make fixed size code impossible, but the first approach should be to try to make it scale and if that doesn't work resort to a fixed solution.

If you don't care about other vector lengths, then you must know your target and are probably better off writing platform-specific intrinsics.

I'm not familiar with your use case; do you have a sample? (The code can be non-vectorized.)

One thing to remember is that you can always do a multiple of your fixed-size problem at once. E.g. JPEG decode has an 8x8 16-bit IDCT step that does operations on 8 rows, transposes, and does further operations on the transposed rows. Each row is stored in a separate vector register and this works very well for 128-bit vectors. At first glance it doesn't scale beyond that, until you realize that you don't need to do just one IDCT to decode a JPEG. You can implement a function that does N=VLEN/128 IDCTs at once instead and gain performance that way. I briefly mentioned this in my comment on the 4x4 transpose example.

Maybe this isn't possible for your problem, but I'd love to take a look.

@calebzulawski
Member Author

This API does not require using fixed sizes, at least not in the sense that you are describing. One way to make the size variable is to use a generic, e.g. fn foo<const N: usize>(vector: Simd<f32, N>) and dispatch however you'd like.
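
A minimal sketch of that shape (illustrative only; the function names, the 16-versus-8 split, and the wide flag stand in for whatever dispatch policy you choose, e.g. a cfg or runtime feature detection paired with #[target_feature]):

#![feature(portable_simd)]
use std::simd::prelude::*;
use std::simd::{LaneCount, SupportedLaneCount};

// The kernel is written once, generic over the lane count.
fn sum_squares<const N: usize>(chunk: &[f32; N]) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    let v = Simd::from_array(*chunk);
    (v * v).reduce_sum()
}

// The caller picks a width per call site.
fn dispatch(chunk: &[f32; 16], wide: bool) -> f32 {
    if wide {
        sum_squares::<16>(chunk)
    } else {
        let (lo, hi) = chunk.split_at(8);
        sum_squares::<8>(lo.try_into().unwrap()) + sum_squares::<8>(hi.try_into().unwrap())
    }
}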

The API you are describing is mostly impossible to implement in Rust today without significant compiler changes, and probably more difficult to implement than the rest of this project altogether. Imagine the following:

#[target_feature(enable = "sse4.1")]
fn sse(vector: NativeSimd<f32>) {}

#[target_feature(enable = "avx2")]
fn avx(vector: NativeSimd<f32>) { sse(vector) }

#[target_feature(enable = "avx512f")]
fn avx512(vector: NativeSimd<f32>) { avx(vector) }

What should happen here? Does the size of the vector change based on the enabled features? Do the layout/features need to be part of the function ABI?

With some simpler improvements to the language, it should be possible to improve the ergonomics of this, e.g. a macro best_simd_size!() that reads the target features and returns a const usize:

const N: usize = best_simd_size!();
type Vector = Simd<f32, N>;

@murl-digital

Maybe this isn't possible for your problem, but I'd love to take a look.

sure, you can email me or reach out to me on another channel, i don't want to clog up the discussion here
