-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Better SIMD shuffles #36624
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Comments
This commit adds a copy of the Teddy searcher that works on AVX2. We don't attempt to reuse any code between them just yet, and instead just copy & paste and tweak parts of it to work on 32 bytes instead of 16. (Some parts were trickier than others. For example, @jneem figured out how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR instruction, which we borrow here.) Overall, AVX2 provides a nice bump in performance.
This still occurs even when using the specific intrinsics in |
Can reproduce, will try to fill in an LLVM bug for this. Once the LLVM bug is filled and recognized as a real bug we could workaround this in EDIT: reported https://bugs.llvm.org/show_bug.cgi?id=36933 |
@rustbot modify labels: +A-simd |
This was solved upstream, it seems. I twiddled this a bit to simplify reading it (for me, anyways): #![feature(platform_intrinsics, repr_simd)]
#[allow(non_camel_case_types)]
#[repr(simd)]
pub struct u8x32([u8; 32]);
extern "platform-intrinsic" {
fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
}
pub fn right_shift_1(left: u8x32, right: u8x32) -> u8x32 {
const IDX: [u32; 32] = {
let mut a = [31u32; 32];
let mut n: u32 = 0;
while n < 32 {
a[n as usize] += n;
n +=1;
}
a
};
unsafe { simd_shuffle32(left, right, IDX) }
} Output is now (Rust-Godbolt): example::right_shift_1:
mov rax, rdi
vmovdqa ymm0, ymmword ptr [rdx]
vperm2i128 ymm1, ymm0, ymmword ptr [rsi], 3
vpalignr ymm0, ymm0, ymm1, 15
vmovdqa ymmword ptr [rdi], ymm0
vzeroupper
ret This seems "fixed" insofar as the vblend is gone, the vmovdqa and such are a somewhat unavoidable feature atm of our lack of using register-passing. So further improvements require large systemic changes in the compiler. For comparison, clang emits (Clang-Godbolt): right_shift_1(long long __vector(4), long long __vector(4)): # @right_shift_1(long long __vector(4), long long __vector(4))
vperm2i128 ymm0, ymm0, ymm1, 33 # ymm0 = ymm0[2,3],ymm1[0,1]
vpalignr ymm0, ymm0, ymm1, 1 # ymm0 = ymm1[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], ymm0[0], ymm1[17,18,19,20,21,22,23,24,25,26,27,28,29,30,31], ymm0[16]
ret |
I'm trying to do the following in AVX2 using intrinsics: shift
x
one byte to the right, while shifting the rightmost byte ofy
into the leftmost byte ofx
. This is best done using two instructions:vperm2i128
followed byvpalignr
. However,simd_shuffle32
generates four instructions:vmovdqa
(to load a constant),vpblendvb
, thenvperm2i128
andvpalignr
. Here is a a full example, which may be compiled withrustc -O -C target_feature=+avx2 --crate-type=lib --emit=asm shuffle.rs
.This might be considered a bug in LLVM, in the sense that it's generating a sub-optimal shuffle. However, I think it should be addressed in rustc, because if I know what the right sequence of instructions is then I shouldn't have to hope that LLVM can generate it. Moreover, it's possible to get the right code from clang (compile with
clang -emit-llvm -mavx2 -O -S shuffle.c
):A possibly interesting observation is that the unoptimized LLVM IR from clang contains a
llvm.x86.avx2.vperm2i128
intrinsic followed by ashufflevector
. The optimized LLVM IR from clang contains twoshufflevector
intrinsics. In order to try to get the same output from rustc, I first patched it to supportllvm.x86.avx2.vperm2i128
. After modifyingright_shift_1
to use the new intrinsic, I got rustc to producellvm.x86.avx2.vperm2i128
followed by ashufflevector
. However, the optimized LLVM IR from rustc still produces a singleshufflevector
, and it still ends up producing the bad asm.I think this means that the fault is from some optimization pass in rustc that isn't in clang, but I haven't had time to investigate it yet...
The text was updated successfully, but these errors were encountered: