Slice::contains generates suboptimal assembly code #88204
Comments
@rustbot label +I-slow |
It can be made vectorizer-friendly by using |
It looks like something changed in one of the latest nightly versions (I retested this on I can't figure out why the generated assembly when using arrays is so different from the assembly generated when using borrowed slices. |
Update to LLVM 13. #87570 |
On the current nightly available in godbolt ( The only weird thing I found was on the assembly generated by the I would like to know if there is a reason for this discrepancy, and if a series of PS: rustc 1.55 vs current nightly for
pub fn test(slice: &[u64; 16], val: u64) -> bool {
slice[0] == val
|| slice[1] == val
|| slice[2] == val
|| slice[3] == val
|| slice[4] == val
|| slice[5] == val
|| slice[6] == val
|| slice[7] == val
|| slice[8] == val
|| slice[9] == val
|| slice[10] == val
|| slice[11] == val
|| slice[12] == val
|| slice[13] == val
|| slice[14] == val
|| slice[15] == val
}
rustc_1_55::test:
vmovq xmm0, rsi
vpbroadcastq ymm0, xmm0
vpcmpeqq ymm1, ymm0, ymmword ptr [rdi + 96]
vpcmpeqq ymm2, ymm0, ymmword ptr [rdi + 64]
vpcmpeqq ymm3, ymm0, ymmword ptr [rdi + 32]
vpcmpeqq ymm0, ymm0, ymmword ptr [rdi]
vpackssdw ymm1, ymm2, ymm1
vpackssdw ymm0, ymm0, ymm3
vpermq ymm1, ymm1, 216
vpermq ymm0, ymm0, 216
vpackssdw ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
test eax, -1431655766
setne al
vzeroupper
ret
nightly::test:
mov al, 1
cmp qword ptr [rdi], rsi
je .LBB0_16
cmp qword ptr [rdi + 8], rsi
je .LBB0_16
cmp qword ptr [rdi + 16], rsi
je .LBB0_16
cmp qword ptr [rdi + 24], rsi
je .LBB0_16
cmp qword ptr [rdi + 32], rsi
je .LBB0_16
cmp qword ptr [rdi + 40], rsi
je .LBB0_16
cmp qword ptr [rdi + 48], rsi
je .LBB0_16
cmp qword ptr [rdi + 56], rsi
je .LBB0_16
cmp qword ptr [rdi + 64], rsi
je .LBB0_16
cmp qword ptr [rdi + 72], rsi
je .LBB0_16
cmp qword ptr [rdi + 80], rsi
je .LBB0_16
cmp qword ptr [rdi + 88], rsi
je .LBB0_16
cmp qword ptr [rdi + 96], rsi
je .LBB0_16
cmp qword ptr [rdi + 104], rsi
je .LBB0_16
cmp qword ptr [rdi + 112], rsi
je .LBB0_16
cmp qword ptr [rdi + 120], rsi
sete al
.LBB0_16:
ret |
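The nightly listing above jumps to the exit label after each comparison (`je .LBB0_16`), which mirrors the short-circuit semantics of `||` in the source. As a minimal sketch, the same check can be written so that every comparison is evaluated unconditionally and the results are combined with the non-short-circuiting `|` operator. The function names `test_short_circuit` and `test_eager` are hypothetical and not from the thread, and nothing here asserts what a given rustc or LLVM version emits for either form; that should be checked in godbolt.

```rust
/// Short-circuit form: stops at the first element equal to `val`
/// (same observable behavior as the chain of `||` above).
pub fn test_short_circuit(slice: &[u64; 16], val: u64) -> bool {
    slice.iter().any(|&e| e == val)
}

/// Eager form (illustrative assumption): every comparison is evaluated
/// and the results are OR-ed together with `|`, so there is no early exit.
pub fn test_eager(slice: &[u64; 16], val: u64) -> bool {
    slice.iter().fold(false, |acc, &e| acc | (e == val))
}
```

Both functions return the same result for every input; only the evaluation order differs. For instance, with an array of sixteen zeros, both return true for `val = 0` and false for `val = 1`.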
If you use |
Could this be linked to #83623, in connection with the update to LLVM 13? The code vectorizes when using rustc 1.53-1.55 but not on the current nightly. |
Given val: u8, slice: &[u8; 8] and arr: [u8; 8], I expected the following statements to compile down to the same thing:
However, the resulting assembly differs quite a lot:
Statement a compiles down to a loop, checking one element at a time, except for T = u8|i8 and N < 16, where it instead takes the fast path of memchr, which gets optimized a little bit better.
Statement b compiles down to an unrolled loop, checking one element at a time in a branchless fashion. Most of the time it doesn't produce any SIMD instructions.
Statement c always compiles down to a loop, checking one element at a time, except for T = u8|i8 and N >= 16, where it instead calls memchr_general_case.
Statement d always compiles down to a few branchless SIMD instructions for any primitive type used and any array size.
Because the slice/array size is known at compile time and the type checker guarantees that every caller respects it, I expected the compiler to take this into account while optimizing the resulting assembly. However, this information seems to be lost at some point when using the contains method.
arr.contains(&val) and slice.contains(&val) are simplified as arr.as_ref().iter().any(|e| *e == val) and slice.iter().any(|e| *e == val), if I'm not mistaken (which is weird, because for some N and T they don't yield the same assembly). The compiler does not seem to be able to unroll this case.
godbolt links for
T=u8; N=8
T=u16; N=8
T=u32; N=8
T=u64; N=8
T=u8; N=16
T=u16; N=16
T=u32; N=16
T=u64; N=16
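The claimed equivalence between contains and iter().any() can at least be checked for behavior (not for generated code) with a small sketch. The function names below are hypothetical, the types follow the T=u8; N=8 case from the report, and how the standard library actually implements contains internally may vary between versions.

```rust
/// `contains` called through a borrowed fixed-size array; the call
/// resolves to the slice method via unsized coercion.
pub fn via_contains_ref(slice: &[u8; 8], val: u8) -> bool {
    slice.contains(&val)
}

/// `contains` called on an owned array.
pub fn via_contains_arr(arr: [u8; 8], val: u8) -> bool {
    arr.contains(&val)
}

/// The iterator form that the report says `contains` reduces to.
pub fn via_any(slice: &[u8; 8], val: u8) -> bool {
    slice.iter().any(|e| *e == val)
}
```

All three return the same boolean for any input, so any difference seen in godbolt is purely a code-generation difference rather than a semantic one.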