Better SIMD shuffles

I'm trying to do the following in AVX2 using intrinsics: shift `x` one byte to the right, while shifting the rightmost byte of `y` into the leftmost byte of `x`. This is best done using two instructions: `vperm2i128` followed by `vpalignr`. However, `simd_shuffle32` generates four instructions: `vmovdqa` (to load a constant), `vpblendvb`, then `vperm2i128` and `vpalignr`. Here is a full example, which can be compiled with `rustc -O -C target_feature=+avx2 --crate-type=lib --emit=asm shuffle.rs`.

```
#![feature(platform_intrinsics, repr_simd)]

#[allow(non_camel_case_types)]
#[repr(simd)]
pub struct u8x32(u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8);

extern "platform-intrinsic" {
    fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
}

pub fn right_shift_1(left: u8x32, right: u8x32) -> u8x32 {
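    // Indices 0..=31 select bytes of `left`, 32..=63 select bytes of `right`:
    // the result is the last byte of `left` followed by the first 31 bytes of `right`.
    unsafe { simd_shuffle32(left, right, [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62]) }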
}
```
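
To make the index list concrete, here is a scalar sketch of what the shuffle selects, together with a byte-level model of `vperm2i128` and `vpalignr`. It is not part of the original reproduction: the function names (`shuffle32_model`, `perm2x128_model`, `alignr_model`, `two_instruction_model`) are made up for illustration, it assumes indices 0..=31 refer to `left` and 32..=63 to `right`, and it models only the parts of each instruction that matter here.

```rust
/// Scalar model of a 32-lane, two-input shuffle. Indices 0..=31 pick from
/// `left`, indices 32..=63 pick from `right` (larger indices would panic).
fn shuffle32_model(left: [u8; 32], right: [u8; 32], idx: [u32; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        let i = i as usize;
        *o = if i < 32 { left[i] } else { right[i - 32] };
    }
    out
}

/// Scalar model of `vperm2i128`. Each nibble of `imm` selects a 128-bit half:
/// 0 = low half of `a`, 1 = high half of `a`, 2 = low half of `b`,
/// 3 = high half of `b`. The zeroing bits of the immediate are not modelled.
fn perm2x128_model(a: [u8; 32], b: [u8; 32], imm: u8) -> [u8; 32] {
    let pick = |sel: u8| -> [u8; 16] {
        let (src, off) = match sel & 3 {
            0 => (&a, 0),
            1 => (&a, 16),
            2 => (&b, 0),
            _ => (&b, 16),
        };
        let mut half = [0u8; 16];
        half.copy_from_slice(&src[off..off + 16]);
        half
    };
    let mut out = [0u8; 32];
    out[..16].copy_from_slice(&pick(imm & 0xf));
    out[16..].copy_from_slice(&pick(imm >> 4));
    out
}

/// Scalar model of `vpalignr` for a byte shift `n` below 16. In each 128-bit
/// lane, `a` (high) is concatenated with `b` (low), the 32-byte value is
/// shifted right by `n` bytes, and the low 16 bytes are kept.
fn alignr_model(a: [u8; 32], b: [u8; 32], n: usize) -> [u8; 32] {
    let mut out = [0u8; 32];
    for lane in 0..2 {
        let base = lane * 16;
        for i in 0..16 {
            let j = i + n; // byte position inside the 32-byte concatenation
            out[base + i] = if j < 16 { b[base + j] } else { a[base + j - 16] };
        }
    }
    out
}

/// One operand order for the two-instruction sequence that, in this model,
/// reproduces the index list [31, 32, ..., 62].
fn two_instruction_model(left: [u8; 32], right: [u8; 32]) -> [u8; 32] {
    // 0x21: low lane = high half of `left`, high lane = low half of `right`.
    let mid = perm2x128_model(left, right, 0x21);
    // Per lane: top byte of `mid`, followed by the low 15 bytes of `right`.
    alignr_model(right, mid, 15)
}
```

In this model, `two_instruction_model(left, right)` agrees with `shuffle32_model` for the index list `[31, 32, ..., 62]`, which is the sense in which two instructions are enough.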

This might be considered a bug in LLVM, in the sense that it's generating a sub-optimal shuffle. However, I think it should be addressed in rustc, because if I know what the right sequence of instructions is then I shouldn't have to hope that LLVM can generate it. Moreover, it's possible to get the right code from clang (compile with `clang -emit-llvm -mavx2 -O -S shuffle.c`):

```
#include <immintrin.h>
__m256i right_shift_1(__m256i left, __m256i right)
{
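    /* 33 == 0x21: low lane = high half of left, high lane = low half of right. */
    __m256i new_left = _mm256_permute2x128_si256(left, right, 33);
    /* Per lane: top byte of new_left, then the low 15 bytes of right.
       This operand order matches the shuffle indices [31, 32, ..., 62]. */
    return _mm256_alignr_epi8(right, new_left, 15);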
}
```
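
On a toolchain that provides `std::arch`, the same pair of instructions can be requested directly from Rust. This is only a sketch under assumptions: the wrapper names `right_shift_1_avx2` and `right_shift_1_bytes` are hypothetical, the operand order is chosen to reproduce the index list `[31, 32, ..., 62]` from the Rust example above, and the `Option` return is just one way to fall back when AVX2 is not available at run time.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    __m256i, _mm256_alignr_epi8, _mm256_loadu_si256, _mm256_permute2x128_si256,
    _mm256_storeu_si256,
};

/// Hypothetical wrapper: the explicit two-instruction sequence.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn right_shift_1_avx2(left: __m256i, right: __m256i) -> __m256i {
    // 0x21: low lane = high half of `left`, high lane = low half of `right`.
    let mid = _mm256_permute2x128_si256::<0x21>(left, right);
    // Per lane: top byte of `mid`, followed by the low 15 bytes of `right`.
    _mm256_alignr_epi8::<15>(right, mid)
}

/// Hypothetical byte-array front end. Returns `None` when the target is not
/// x86_64 or the CPU does not report AVX2 at run time.
pub fn right_shift_1_bytes(left: [u8; 32], right: [u8; 32]) -> Option<[u8; 32]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            let mut out = [0u8; 32];
            // Safety: AVX2 was detected above, and the unaligned load/store
            // intrinsics read and write exactly 32 bytes of these arrays.
            unsafe {
                let l = _mm256_loadu_si256(left.as_ptr() as *const __m256i);
                let r = _mm256_loadu_si256(right.as_ptr() as *const __m256i);
                let v = right_shift_1_avx2(l, r);
                _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, v);
            }
            return Some(out);
        }
    }
    None
}
```

Whether the compiler keeps exactly `vperm2i128` and `vpalignr` in the final assembly is still up to LLVM, so the output of `--emit=asm` is worth checking.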

A possibly interesting observation is that the unoptimized LLVM IR from clang contains a call to the `llvm.x86.avx2.vperm2i128` intrinsic followed by a `shufflevector` instruction, while the optimized LLVM IR from clang contains two `shufflevector` instructions. To try to get the same output from rustc, I first patched it to support `llvm.x86.avx2.vperm2i128`. After modifying `right_shift_1` to use the new intrinsic, I got rustc to emit `llvm.x86.avx2.vperm2i128` followed by a `shufflevector` in its unoptimized IR. However, the optimized LLVM IR from rustc still contains a single `shufflevector`, and it still ends up producing the bad asm.

I think this means that the fault lies in some optimization pass that rustc runs and clang doesn't, but I haven't had time to investigate it yet.


Better SIMD shuffles #36624
