Critical issues before stabilization #364
Comments
w00t! Very excited to see this. :D :D :D
Are there high-level docs anywhere describing how the API is laid out? Like, a guided tour of everything? I realize that's a tall order to ask for before the API is set in stone, because it takes work to stabilize. But I'm asking because, as of right now, I find the API to be very overwhelming. There are a lot of traits, and upon seeing them, I immediately wonder how they all fit together, and in particular, why they exist. That is, if the API were less trait-heavy, what would we lose?

Looking at the names of the traits, I can make some guesses as to what some of them are for. And it looks plausible that they could all mostly be explained (at a high level) without too much exposition. I have opinions on the names of the traits too, but I think that fun can be saved for later. :P

I find the number of concrete types to be large but not overwhelming. At a quick glance, I can immediately understand what they mean and what connects them. While I do have some light SIMD background, I'm confident that the large number of concrete types could be explained by a ~paragraph in the crate docs.
Do you have a 50,000-foot view of what the partitioning might look like? I don't mean to put you on the spot and ask for a detailed proposal, but just a general feeling of what you think might make sense.
I agree that missing swizzles is a rather large limitation. For me personally, unless there is a language feature that you're pretty confident is right at the horizon, or if there's a small language feature that you can get the lang team on board with quickly, I would go to war with the army you have when everything else is ready to stabilize. But this is just a general sense of things personally, and I'm sure there is a deeper analysis that could be done here. e.g., "If Rust had language feature Foo, then the swizzle API could be written like this, which would make things simpler and/or unlock new use cases."
that would be: if LLVM had dynamic swizzles (using the …)
Nope, that's definitely something we should have. Personally, I think we could use more submodules to make the division a little clearer. Each of these could be documented independently, with a bit of guidance in the main module documentation.
The original implementation had no traits, which resulted in two related issues. To implement …
Offhand, I would say something like: …
One option might be to implement a slightly less powerful set of …
I hope it will be released soon. I've tried using it in one of my crates (linebender/tiny-skia#88) and it mostly works. There are some glitches which I haven't debugged yet, but I don't think they are critical. I personally do not use …
Realistically we'd probably gain things. As it is, there are a number of APIs that have been hard to add due to the need to make them fit into the trait-heavy design. That said, my stance here (that we should not have gone with the traits, as it's led to a cumbersome and inflexible design) is not a popular one, so YMMV.
As far as restricting the number of lanes goes, I tried combining the bound into something like: …

In theory, this seemed like a good compromise between the current verbose bounds and using a non-local error. In practice, it wasn't really possible to implement this. It turns out we use the separation between the element bound (`SimdElement`) and the lane count bound (`LaneCount`) …

After this experiment, I think the best option is to add two new compiler attributes like …
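For readers less familiar with the crate, a minimal sketch of the verbosity being discussed (nightly `portable_simd`); the `Supported` bound from the comment above is the hypothetical merged form, not an existing trait:

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// Today: every generic function must carry the lane-count bound explicitly.
fn double<const N: usize>(x: Simd<f32, N>) -> Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    x + x
}

// The merged form tried in the experiment above (hypothetical, does not compile):
// fn double<const N: usize>(x: Simd<f32, N>) -> Simd<f32, N>
// where
//     Simd<f32, N>: Supported,
// { ... }
```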
could we just use:

```rust
impl<T: ..., const N: usize> Simd<T, N> {
    const VALID: () = {
        let mut valid = N > 0 && N < 64;
        if !cfg!(all_lane_counts_iirc) {
            valid &= N.is_power_of_two();
        }
        assert!(valid, "invalid lane count");
    };

    pub fn each_method(self, ...) -> ... {
        Self::VALID; // use the const so it triggers a post-monomorphization error
        ...
    }
}
```

actually, no, we can't, since just copying a …
Also, it would be easy to accidentally forget the `Self::VALID` reference in a new method.
Any chance we'd be willing to say that we're ok allowing the type to exist for non-PoT lane counts, but then just not offering anything other than …?

(But I guess that exposes layout questions that we don't want to decide just yet, even if we left them explicitly unspecified.)
We could maybe have a …
we already have a feature flag to enable non-pot lane counts; the main issue imo is that someone could try to use …
Also, removing the bounds from …
There is precedent with …
the other issue I forgot is that imo non-pot lane counts have the wrong layout, and that should change: #319
This one I feel strongly isn't a problem, since …

So long as using …
I would love to have even a hacky solution to this. lcnr had some thoughts on doing something like this in https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/How.20essential.20is.20the.20compile-time.20check.20for.20empty.20arrays.3F/near/248387965. Maybe someone could pick that up to deal with the zero problem for a while? If it were easier, …
I suppose that's another possibility here: explicitly decide that v1 of stabilizing this doesn't allow writing generic code over them, but still lets people use specific versions when implementing stuff. That seems like it'd still be pretty useful, since using …
My proposal of using …
I love the current state of this project and am grateful for the hard work being done and the general friendliness/helpfulness of the team. However, I will be a bit of a contrarian and say that swizzles are important to me, particularly dynamic ones. I've also used the APIs with the …

I'm not saying swizzles should stop stabilization per se; I'm a very happy nightly user.

I'll also be contrary and say that it's okay to abstract more functionality for very common use cases. I think some of this is already done, but I wouldn't be sad, for example, to see a function to map bytes using dynamic swizzles. The newbie user (like me) may not realize the hardware limitations of certain operations (the usual size of look-up tables, for example), and so a well-designed function for a common use case is helpful to people like me.
I agree that a satisfying story for swizzles is important. Regarding traits, I initially proposed it as a way to reduce the "multiplication of entities" problem so that it's easier to follow things. I have begun to think a different approach may be appropriate (increased experience in API design changes your opinions about API design, who would've thought?), but I believe we should probably fork further discussion about that into its own issues/threads.
My 2c is that if Rust's primitive types cannot currently be abstracted over with traits (other than …
Personally, I don't see a way around using traits. This was one of the first issues we hashed out; I'm sure there is more discussion available in Zulip.

SIMD types are not just numbers and not just arrays; they act like both simultaneously. Like arrays, SIMD vectors have container-like operations that are only concerned with their memory layout and are indifferent to their element type. This includes splatting, indexing, swizzles, scatter, gather, and conversions to and from arrays and slices. To undo this means that you will have 14 vector types that each implement these functions separately, and there will be no way in the future to ever treat a vector like an element-agnostic container. A comparison would be like if …

Using traits allowed us to treat vectors as containers (which they are) while also giving them element-specific behavior. The element-specific functions can't be implemented directly …
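To make the container argument concrete, here is a small sketch (nightly `portable_simd`) of element-agnostic code that is only possible because the container-like surface is shared across element types; the round-trip through an array is for illustration, not optimal codegen:

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

// Works uniformly for f32, i64, usize, ... because to_array/from_array are
// container operations, independent of any element-specific behavior.
fn rotate_one<T, const N: usize>(v: Simd<T, N>) -> Simd<T, N>
where
    T: SimdElement,
    LaneCount<N>: SupportedLaneCount,
{
    let mut arr = v.to_array();
    arr.rotate_left(1);
    Simd::from_array(arr)
}
```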
Can we add "documentation and review of the intrinsics' safety requirements" to the list? Currently every time I want to implement one of these in Miri I have to dig through PRs and then usually still go ask people. Intrinsics are language extensions so they should be reviewed by t-opsem before stabilization (and ideally, a draft documentation is created when the intrinsic is implemented, so that one has some starting point for "intended behavior" vs "actual behavior" and checking that they are the same). I also think the intrinsics' declarations should be moved into libcore so that they can be equipped with doc comments that live in the same repository as the codegen and Miri implementations that turn those doc comments into reality. |
Yes, I intend on doing just that.
I opened #381 to track that.
I strongly support checking lane count post-monomorphization. It's quite a hassle to carry around bounds proving that not only …
if you want to use a function …
related conversation about …
Why not support any number of lanes, by using arrays of native SIMD vectors as large as needed, and masking for operations that need it? In more detail: …
The simple answer is that the size limitations are not related to native SIMD register sizes, but to codegen backend encoding limitations. We further restrict to even smaller sizes that we are confident operate correctly and optimize well. Our design allows for increasing the maximum in the future.
More precisely, doing something vaguely like this: …
Or something like this: …
Or alternatively, the compiler itself could do all this, i.e. you could fix the codegen backend to support arbitrary sizes.
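For illustration, a minimal sketch of the "array of native-width vectors" idea (nightly `portable_simd`; `WideSimd` and the fixed chunk width of 8 are hypothetical choices, and the masking for tail elements mentioned above is omitted):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// A "wide" vector built from CHUNKS chunks of an already-supported width.
struct WideSimd<const CHUNKS: usize> {
    chunks: [Simd<f32, 8>; CHUNKS],
}

impl<const CHUNKS: usize> WideSimd<CHUNKS> {
    fn add(self, rhs: Self) -> Self {
        let mut out = self.chunks;
        for (o, r) in out.iter_mut().zip(rhs.chunks) {
            *o += r; // each chunk lowers to one (or a few) native-width ops
        }
        Self { chunks: out }
    }
}
```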
The codegen limitations are something like 2^16 max elements, which is large enough that I doubt it's necessary to support anything larger. Something like this would make it possible to exceed that limit, but I'm not sure it's worth the added complexity. We will likely be able to lift the current limit to something much larger if we can get reliable non-power-of-two-length codegen, but that's proving problematic for now.
e.g. the terrible code LLVM generates for a …
I guess the downside of my proposal is that it allows the user to write "a + b + c", which will result in two loops, and unless the backend optimizer manages to fuse the loops, that code will be much worse than a single loop, since it unnecessarily writes and reads the intermediate result to memory. So maybe another option to solve the …

The iterator-based design could still implement arithmetic on iterators (…).

Ideally, the compiler would gain support for generic closures, which would allow specifying a polymorphic closure to the iterator combinator, which would allow choosing at runtime the vector size and instruction set to use. This would suggest a different change to the design though, which is to have a Simd<T, N, ISA>. That would allow, for example, having different types that use either AVX2 or AVX10/256 (since otherwise the polymorphic closure scheme can't differentiate between them, as the vector length is the same), and would also allow using either instructions in ISA extensions or a replacement sequence.
As always, this depends on your choice of CPU. If Rust used target-cpu=native by default... https://llvm.godbolt.org/z/xjsndxc1b

I agree that we are constantly fighting with dreadful LLVM codegen, regardless of the target.
all that does is increase the upper limit for somewhat reasonable code; if using a 4000-element vector, you go back to terrible code: https://llvm.godbolt.org/z/Wf9o5jT67
I'm using a personal arbitrary-length wrapper layer. A couple of notes: …
Considering that shuffles/swizzles with compile-time-constant lane indices have worked since at least 2015 without substantial change to the API shape, I think it's well past time to stabilize the API shape that exists. So I think it makes sense to stabilize the …
Portable SIMD has taken a substantially different approach to swizzles than e.g. the x86 shuffle intrinsics. We really wanted to avoid a magic …

It should be safe to stabilize …
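For reference, a minimal example of the compile-time swizzle surface that exists on nightly today (exact module paths and macro details may still shift before stabilization):

```rust
#![feature(portable_simd)]
use std::simd::{simd_swizzle, Simd};

fn main() {
    let v = Simd::from_array([10, 11, 12, 13]);
    // Indices are compile-time constants; an out-of-range index is a
    // compile error rather than UB, unlike a raw shuffle intrinsic.
    let r = simd_swizzle!(v, [3, 2, 1, 0]);
    assert_eq!(r.to_array(), [13, 12, 11, 10]);
}
```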
Why is this even using a trait? If that's just to work around making the final argument const, then I think this can be expressed much better with a …

However, that would still expose the underlying …
We can't expose the intrinsic directly because we do some manipulation of the const indices before passing them into the intrinsic; passing it via an associated const allows it to be used in regular Rust, since it isn't bypassing the usual type system.
You should be able to do the same manipulation inside the …
I think I'm following... but do we want more const args? I think I even saw an issue somewhere encouraging const args to be removed in an edition change or something; I'm not sure where they stand right now.
Oh I see, the problem is that the macro shouldn't call …
I think it's possible to write a macro something like the following (I have a prototype):

```rust
fn some_swizzle<T, const LEN: usize, const PARAM1: usize, const PARAM2: usize>(
    x: Simd<T, LEN>,
) -> Simd<T, LEN> {
    simd_swizzle!(
        x,
        const { /* some expression using LEN, PARAM1, PARAM2 */ },
        len = LEN,
        const_parameters = (PARAM1, PARAM2),
    )
}
```

But is something like this any better than implementing the Swizzle trait? I'm having trouble imagining an idiomatic interface for this macro, but it takes away the need to understand the const generics type inference magic you need to understand to implement it. Notably, …
hi all, i landed here after doing some research into why stabilizing portable simd is stuck (i have a usecase where i need complex numbers, and writing a portable-simd-friendly replacement for num_complex is proving to be a nightmare). based on what you said in rust-lang/rust#86656, the biggest blocker is the supported lane count trait bound. i'm having trouble following the discussion, but based on what i've been able to understand, a better solution than what you have is simply not possible with things currently in the language, if you want to limit the lane count to anything below 64. have there been any ideas that i'm not aware of, or is this just deadlocked on "the current solution sucks and we can't think of anything better"?
afaict, basically the current situation sucks and we're waiting on improvements to the compiler so we can improve our API; we don't want to stabilize a bad API when we know it can be changed to be better in the future (hopefully soon).
also the plan is to have the limit be much larger than 64, iirc in the range of a few hundred up to tens of thousands.
that's totally understandable; is there anything i can do to help move things along, or at least some compiler tracking issues i can follow?
idk, maybe @workingjubilee knows?
I think what needs to be done is provide an attribute (or possibly lang item) that reads the lane count and emits errors for sizes we don't want to support. I tried this previously but wasn't happy with my solution, which worked, but still allowed you to name invalid types as long as you didn't actually use them. It's a bit difficult finding where in the compiler to properly implement it.
at least to my untrained eyes that sounds like something a proc macro could do, but i suppose it makes sense that only the compiler can emit whatever code enforces the lane count/type restrictions. i can't really think of anything that isn't incredibly hacky or specific, which i imagine the lang and compiler teams wouldn't be too happy about
iirc the general plan is to implement the restriction similarly to the restriction on not having …
I think the current API encourages non-portable SIMD over portable SIMD. The default should be "give me a SIMD vector of the specified type corresponding to the SIMD register size". I'm happy there is an effort to standardize a portable SIMD API, and this one looks good in terms of functionality; I just think the API should be portable first, non-portable second (maybe even discouraged by the API design).

The examples all assume a vector length of 128 bits. Looking at the top Google results for articles on portable SIMD, not one shows an implementation that can compile to multiple vector lengths.
Actually, that wasn't completely correct: while "Portable SIMD Programming in Rust" also exclusively uses fixed-size types in all code examples, it at the very least tells you at the end how to query the native SIMD width. But this exemplifies even more that the current API makes non-portable SIMD easy and portable SIMD harder; otherwise, the examples would've been vector-length-agnostic. This does not stand up to the self-imposed goal of "This module provides a SIMD implementation that is fast and predictable on any target".

You should never want to use types like f32x4. Yes, there are certain cases where a problem doesn't scale beyond four elements of parallelism, but this isn't the case for most usages. If most people end up writing fixed-size SIMD code that, like most examples, targets 128-bit vectors, then we will end up in situations where the hand-optimized SIMD code is slower than autovectorized code, because autovectorized code can take advantage of the full SIMD width.

Comments on beginners-guide.md
This is not true, unless we are specifically talking about this library and not the general definition/use of this term in regard to SIMD.
Most architectures, aka only x86. NEON, SVE, RVV, and AltiVec have mechanisms to extract single elements from vector registers.
SiFive X390 (1024-bit RVV), Andes AX45MPV/AX46MPV (1024/2048-bit RVV), Akeana 1200 (2048-bit RVV), Vitrivius (16384-bit RVV)
You should, but that kind of goes against the entire point of portable SIMD. The things mentioned here shouldn't really matter for portable SIMD. More important are other things, like the number of vector registers available (32 on most architectures, 16 on SSE through AVX2) and a rough idea of which instructions are common to most SIMD ISAs and which may need to be emulated with multiple instructions by portable SIMD.
This is only on the arm32 version of NEON; aarch64 has 32 128-bit vector registers.

Comments on examples

Let's start with dot_product.rs. Firstly, I didn't see any mention of the fact that the results of the SIMD implementations differ from the scalar one due to the way floating-point arithmetic works.
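A tiny example of that divergence, since float addition is not associative (nightly; module paths per the current `std::simd` layout):

```rust
#![feature(portable_simd)]
use std::simd::{f32x2, num::SimdFloat};

fn main() {
    let xs = [1.0e16f32, 1.0, -1.0e16, 1.0];
    // Scalar left-to-right: ((1e16 + 1) - 1e16) + 1 == 1.0 (the 1 is absorbed).
    let scalar: f32 = xs.iter().sum();
    // Two-lane SIMD pairs elements differently: (1e16 - 1e16) + (1 + 1) == 2.0.
    let v = f32x2::from_array([xs[0], xs[1]]) + f32x2::from_array([xs[2], xs[3]]);
    let simd = v.reduce_sum();
    assert_ne!(scalar, simd);
}
```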
The nbody.rs implementation shows why thinking in terms of a fixed vector length can sometimes be detrimental (assuming a large N_BODIES, which would be realistic in real applications). The hot loop in …

The proper way to vectorize something like this is using SoA, as follows: https://godbolt.org/z/3qzKTvEv9

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

struct Bodies {
    double *x, *y, *z, *vx, *vy, *vz, *mass;
    size_t count;
};

double energy(const Bodies b) {
    const hn::ScalableTag<double> d;
    auto e = hn::Set(d, 0);
    double kinetic = 0;
    for (size_t i = 0; i < b.count; ++i) {
        double sq = b.vx[i]*b.vx[i] + b.vy[i]*b.vy[i] + b.vz[i]*b.vz[i];
        // Accumulate the scalar kinetic term separately; adding it to every
        // lane of `e` would count it Lanes(d) times after the reduction.
        kinetic += 0.5 * b.mass[i] * sq;
        auto ix = hn::Set(d, b.x[i]);
        auto iy = hn::Set(d, b.y[i]);
        auto iz = hn::Set(d, b.z[i]);
        auto imass = hn::Set(d, b.mass[i]);
        // Pairwise interactions start at j = i+1.
        for (size_t j = i+1; j < b.count; j += hn::Lanes(d)) {
            auto dx = hn::Sub(ix, hn::Load(d, b.x+j));
            auto dy = hn::Sub(iy, hn::Load(d, b.y+j));
            auto dz = hn::Sub(iz, hn::Load(d, b.z+j));
            auto mass = hn::Load(d, b.mass+j);
            auto dsq = hn::MulAdd(dx, dx, hn::MulAdd(dy, dy, hn::Mul(dz, dz)));
            e = hn::Sub(e, hn::Div(hn::Mul(imass, mass), hn::Sqrt(dsq)));
        }
        // TODO: handle tail
    }
    return kinetic + hn::ReduceSum(d, e);
}

void advance(Bodies b, double dt) {
    const hn::ScalableTag<double> d;
    for (size_t i = 0; i < b.count; ++i) {
        auto ix = hn::Set(d, b.x[i]);
        auto iy = hn::Set(d, b.y[i]);
        auto iz = hn::Set(d, b.z[i]);
        auto imass = hn::Set(d, b.mass[i]);
        auto accvx = hn::Set(d, 0);
        auto accvy = hn::Set(d, 0);
        auto accvz = hn::Set(d, 0);
        for (size_t j = i+1; j < b.count; j += hn::Lanes(d)) {
            auto dx = hn::Sub(ix, hn::Load(d, b.x+j));
            auto dy = hn::Sub(iy, hn::Load(d, b.y+j));
            auto dz = hn::Sub(iz, hn::Load(d, b.z+j));
            auto dsq = hn::MulAdd(dx, dx, hn::MulAdd(dy, dy, hn::Mul(dz, dz)));
            auto mag = hn::Div(hn::Set(d, dt), hn::Mul(dsq, hn::Sqrt(dsq)));
            auto mj = hn::Mul(mag, hn::Load(d, b.mass+j));
            auto mi = hn::Mul(mag, imass);
            accvx = hn::NegMulAdd(dx, mj, accvx);
            accvy = hn::NegMulAdd(dy, mj, accvy);
            accvz = hn::NegMulAdd(dz, mj, accvz);
            hn::Store(hn::MulAdd(dx, mi, hn::Load(d, b.vx+j)), d, b.vx+j);
            hn::Store(hn::MulAdd(dy, mi, hn::Load(d, b.vy+j)), d, b.vy+j);
            hn::Store(hn::MulAdd(dz, mi, hn::Load(d, b.vz+j)), d, b.vz+j);
        }
        b.vx[i] += hn::ReduceSum(d, accvx);
        b.vy[i] += hn::ReduceSum(d, accvy);
        b.vz[i] += hn::ReduceSum(d, accvz);
        // TODO: handle tail
    }
}
```

(Sorry for writing C++, but I'm not that familiar with Rust.) This scales perfectly with vector length, has no reductions in the inner loop, and doesn't iterate over sparse memory.

spectral_norm.rs looks fine, apart from not being implemented in a scalable fashion.

The matrix_inversion.rs is quite decent, although I would've liked to see a comment mentioning that it's also possible to do multiple 4x4 matrix inversions in parallel if your vector registers can hold a multiple of four elements.
I'll happily present myself as an example of a case where a problem doesn't scale beyond a fixed number of elements. In real-time audio, we only care about a maximum number of things (max 2 channels for stereo, max 16 voices for a polyphonic synth, etc.). My current use case is calculating filter banks in parallel. My current prototype has 8 voices, with 8 bandpass filters per voice. The way I chose to implement this was using 8-element-wide 32-bit float vectors, and it provides perfectly acceptable performance. I'll admit my case may be unusual, and I think it's fair to say you know way more about the low-level details than I do, but for me and my use case, conceptualizing a vector as an array that operates in parallel is more than enough. I haven't had to get into vendor specifics in my case.
Thanks for the example. I'm not saying to make fixed-size code impossible, but the first approach should be to try to make it scale, and if that doesn't work, resort to a fixed solution. If you don't care about other vector lengths, then you must know your target and are probably better off writing platform-specific intrinsics. I'm not familiar with your use case; do you have a sample? (the code can be non-vectorized)

One thing to remember is that you can always do a multiple of your fixed-size problem at once. E.g. jpeg decode has an 8x8 16-bit IDCT step that does operations on 8 rows, transposes, and does further operations on the transposed rows. Each row is stored in a separate vector register, and this works very well for 128-bit vectors. At first glance it doesn't scale beyond that, until you realize that you don't need to do just one IDCT to decode a jpeg. You can implement a function that does N=VLEN/128 IDCTs at once instead and gain performance that way. I briefly mentioned this in my comment on the 4x4 transpose example. Maybe this isn't possible for your problem, but I'd love to take a look.
This API does not require using fixed sizes, at least not in the sense that you are describing. One way to make the size variable is to use a generic, e.g. …

The API you are describing is mostly impossible to implement in Rust today without significant compiler changes, and probably more difficult to implement than the rest of this project altogether. Imagine the following:

```rust
#[target_feature(enable = "sse4.1")]
fn sse(vector: NativeSimd<f32>) {}

#[target_feature(enable = "avx2")]
fn avx(vector: NativeSimd<f32>) { sse(vector) }

#[target_feature(enable = "avx512f")]
fn avx512(vector: NativeSimd<f32>) { avx(vector) }
```

What should happen here? Does the size of the vector change based on the enabled features? Do the layout/features need to be part of the function ABI?

With some simpler improvements to the language, it should be possible to improve the ergonomics of this, e.g. a macro:

```rust
const N: usize = best_simd_size!();
type Vector = Simd<f32, N>;
```
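As a concrete sketch of the generic approach mentioned above (nightly `portable_simd`; the kernel and its names are illustrative, with callers choosing `N` per target, e.g. `scale::<4>` for SSE or `scale::<8>` for AVX2):

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// Multiply a slice in place, N lanes at a time, with a scalar tail.
fn scale<const N: usize>(data: &mut [f32], factor: f32)
where
    LaneCount<N>: SupportedLaneCount,
{
    let factor_v = Simd::<f32, N>::splat(factor);
    let mut chunks = data.chunks_exact_mut(N);
    for chunk in &mut chunks {
        let v = Simd::<f32, N>::from_slice(chunk) * factor_v;
        chunk.copy_from_slice(v.as_array());
    }
    for x in chunks.into_remainder() {
        *x *= factor; // scalar tail
    }
}
```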
sure, you can email me or reach out to me on another channel; i don't want to clog up the discussion here
I wanted to put together a more technical list of issues to be solved before we can stabilize (see rust-lang/rust#86656).
There are many other important issues, like performance on some targets or important missing functions, but IMO those don't prevent stabilization. In this list I've trimmed things down to the major API-related issues that can't be changed, or are difficult to change, after stabilization.
Issues to be solved
1. The `LaneCount<N>: SupportedLaneCount` bound makes the API exceptionally cumbersome. I've also had some trouble writing generics around it, particularly when making functions that change the number of lanes (the trait solver sometimes simply fails). Adding the bound often feels like boilerplate, and I found myself looking for ways to launder the bound, like adding unnecessary const parameters. Making this a post-monomorphization error (I found Design meeting: Non-local errors (aka post-monomorphization errors) lang-team#195 helpful) might be the way to go, or perhaps there's a way to make a const fn panic. Cons: the trait bound is very explicit, and hiding the error states could possibly do more harm than good when identifying the source of a build failure.
2. Whether the mask for `Simd<f32, N>` should keep using element `i32` or match the vector element (see Change Mask element to match Simd element #322 for discussion). I think it would be much more straightforward if the types simply matched. Cons: this would require an extra cast when using `i32` masks for `f32` vectors or similar, and makes implementing `From<Mask> for Mask` impossible (since pointer masks must be generic `Mask<*const T, N>`; maybe all pointers could use the same mask element?).

Non-issues, but things that should be done
- Traits such as `SimdPartialEq` and `SimdInt` …

Updates (after filing this issue)
Lane count
I tried merging `LaneCount` and `SimdElement` into a single bound `Simd<T, N>: Supported`. This doesn't work well for a variety of reasons. One example: scatter/gather use `Simd<T, N>`, `Simd<*const T, N>`, and `Simd<usize, N>`. Each of these would need its own bound, rather than using one `LaneCount` bound, since they all share `N`. The remaining options are keeping the existing bounds (`LaneCount` and `SimdElement`) or turning `LaneCount` into a non-local error.
into a non-local error.Swizzles
I tried the following swizzle code. It requires the incomplete `adt_const_params` feature:

…

Even with this feature enabled, it's impossible to implement functions like `reverse`, because const generics can't access the generic const parameters.
, because const generics can't access the generic const parameters.The text was updated successfully, but these errors were encountered: