
AVX512 code generated for i32 array sum is worse than code by clang 5 #48287


Closed
Djuffin opened this issue Feb 17, 2018 · 6 comments
Labels

A-SIMD  Area: SIMD (Single Instruction Multiple Data)
C-enhancement  Category: An issue proposing an enhancement or a PR with one.
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Djuffin commented Feb 17, 2018

Demo: https://godbolt.org/g/vqB6oj

I tried this code:

pub struct v {
    val: [i32; 16],
}

pub fn test(a: v, b: v) -> v {
    let mut res = v { val: [0; 16] };

    for i in 0..16 {
        res.val[i] = a.val[i] + b.val[i];
    }
    return res;
}

Compiled it with
rustc --crate-type=lib -C opt-level=3 -C target-cpu=skylake-avx512 --emit asm test.rs

I expected to see this happen:

  vmovdqu32 zmm0, zmmword ptr [rsp + 72]
  vpaddd zmm0, zmm0, zmmword ptr [rsp + 8]
  vmovdqu32 zmmword ptr [rdi], zmm0
  mov rax, rdi
  vzeroupper
  ret

Instead, this happened:

	movq	$0, 56(%rsp)
	vmovdqu	(%rdx), %ymm0
	vpaddd	(%rsi), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rsp)
	movl	32(%rdx), %eax
	addl	32(%rsi), %eax
	movl	%eax, 32(%rsp)
	movl	36(%rdx), %eax
	addl	36(%rsi), %eax
	movl	%eax, 36(%rsp)
	movl	40(%rdx), %eax
	addl	40(%rsi), %eax
	movl	%eax, 40(%rsp)
	movl	44(%rdx), %eax
	addl	44(%rsi), %eax
	movl	%eax, 44(%rsp)
	movl	48(%rdx), %eax
	addl	48(%rsi), %eax
	movl	%eax, 48(%rsp)
	movl	52(%rdx), %eax
	addl	52(%rsi), %eax
	movl	%eax, 52(%rsp)
	movl	56(%rdx), %eax
	addl	56(%rsi), %eax
	movl	%eax, 56(%rsp)
	movl	60(%rdx), %eax
	addl	60(%rsi), %eax
	movl	%eax, 60(%rsp)
	vmovdqu	(%rsp), %ymm0
	vmovdqu	32(%rsp), %ymm1
	vmovdqu	%ymm1, 32(%rdi)
	vmovdqu	%ymm0, (%rdi)
	movq	%rdi, %rax
	addq	$64, %rsp
	retq

Meta

~$ rustc --version --verbose
rustc 1.24.0 (4d90ac38c 2018-02-12)
binary: rustc
commit-hash: 4d90ac38c0b61bb69470b61ea2cccea0df48d9e5
commit-date: 2018-02-12
host: x86_64-unknown-linux-gnu
release: 1.24.0
LLVM version: 4.0
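
As an aside, the same loop can be written over iterators, which removes the index-based bounds checks by construction and often gives LLVM's vectorizer an easier time. A minimal sketch (the reference-taking signature and the `main` driver are illustrative additions, not from the issue):

```rust
// Illustrative rewrite of the issue's loop using zipped iterators.
// Unlike the original, `test` takes references instead of by-value args.
pub struct V {
    val: [i32; 16],
}

pub fn test(a: &V, b: &V) -> V {
    let mut res = V { val: [0; 16] };
    // Lockstep iteration: no `res.val[i]` indexing, so no bounds checks.
    for (r, (x, y)) in res.val.iter_mut().zip(a.val.iter().zip(b.val.iter())) {
        *r = x + y;
    }
    res
}

fn main() {
    let a = V { val: [1; 16] };
    let b = V { val: [2; 16] };
    let c = test(&a, &b);
    println!("{}", c.val.iter().sum::<i32>()); // 16 * 3 = 48
}
```

Whether this changes the emitted assembly depends on the compiler version; it is a style that tends to help, not a guaranteed fix.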

@matthiaskrgr (Member)

Funny: when I change 16 to 17 in the Rust code

pub struct v {
    val: [i32; 17],
}

pub fn test(a: v, b: v) -> v {
    let mut res = v { val: [0; 17] };

    for i in 0..17 {
        res.val[i] = a.val[i] + b.val[i];
    }
    return res;
}

I get

example::test:
  push rbp
  mov rbp, rsp
  sub rsp, 72
  mov dword ptr [rbp - 8], 0
  mov qword ptr [rbp - 16], 0
  vmovdqu32 zmm0, zmmword ptr [rdx]
  vpaddd zmm0, zmm0, zmmword ptr [rsi]
  vmovdqu32 zmmword ptr [rbp - 72], zmm0
  mov eax, dword ptr [rdx + 64]
  add eax, dword ptr [rsi + 64]
  mov dword ptr [rbp - 8], eax
  mov dword ptr [rdi + 64], eax
  vmovdqu ymm0, ymmword ptr [rbp - 72]
  vmovdqu ymm1, ymmword ptr [rbp - 40]
  vmovdqu ymmword ptr [rdi + 32], ymm1
  vmovdqu ymmword ptr [rdi], ymm0
  mov rax, rdi
  add rsp, 72
  pop rbp
  ret

Is this closer to the clang instructions?

@nagisa (Member) commented Feb 17, 2018

The referenced issue #48293 has a better explanation of what is happening.

@AronParker (Contributor)

I was just about to post this issue here; good thing someone else already did. Clang only produces this "good" code for C++, not for C. On Reddit, people came to the conclusion that this is due to copy elision (in particular, return value optimization), which is done in C++ but apparently not in C or Rust.
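
If copy elision is indeed the culprit, one way to sidestep it on the Rust side is to write through a `&mut` out-parameter, so there is no by-value return copy to elide in the first place. A sketch under that assumption (the `add_into` name and signature are illustrative, not from the thread):

```rust
// Illustrative out-parameter variant: the caller owns the destination,
// so the function writes results in place instead of returning by value.
pub struct V {
    val: [i32; 16],
}

pub fn add_into(out: &mut V, a: &V, b: &V) {
    for i in 0..16 {
        out.val[i] = a.val[i] + b.val[i];
    }
}

fn main() {
    let a = V { val: [5; 16] };
    let b = V { val: [7; 16] };
    let mut out = V { val: [0; 16] };
    add_into(&mut out, &a, &b);
    println!("{}", out.val[0] + out.val[15]); // 12 + 12 = 24
}
```

This trades API ergonomics for control over where the result lives; whether it actually removes the extra stack traffic still depends on the optimizer.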

@jonas-schievink (Contributor)

this is due to copy elision (in particular return value optimization)

In that case, #47954 might help, right?

@pietroalbini added the I-slow, C-enhancement, T-compiler, and A-SIMD labels on Feb 20, 2018
@GodTamIt

This no longer seems to be a problem with the latest versions of both rustc and clang: https://gcc.godbolt.org/z/c4187cno3

@nikic (Contributor) commented Feb 19, 2022

And it looks like this has been the case for quite a while already, since Rust 1.52.

Worth mentioning that LLVM intentionally does not use 512-bit vectors here by default: on Skylake-class AVX-512 CPUs it prefers 256-bit vectors, in part because 512-bit instructions can trigger frequency downclocking.
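
For code that wants explicit SIMD regardless of the autovectorizer's width preference, the stable `core::arch` intrinsics are an option. A hedged sketch using the 256-bit `_mm256_add_epi32` path (matching LLVM's preferred width here) behind runtime feature detection, with a portable scalar fallback; the `add16` function names are illustrative:

```rust
// Illustrative explicit-SIMD version with a scalar fallback.
#[cfg(target_arch = "x86_64")]
fn add16(a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    if is_x86_feature_detected!("avx2") {
        // Safe to call: we just verified AVX2 is available at runtime.
        unsafe { add16_avx2(a, b) }
    } else {
        add16_scalar(a, b)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add16_avx2(a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    use std::arch::x86_64::*;
    let mut out = [0i32; 16];
    // Two 256-bit lanes of eight i32s cover the whole 16-element array.
    for i in (0..16).step_by(8) {
        let va = _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i);
        let vs = _mm256_add_epi32(va, vb);
        _mm256_storeu_si256(out.as_mut_ptr().add(i) as *mut __m256i, vs);
    }
    out
}

#[cfg(not(target_arch = "x86_64"))]
fn add16(a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    add16_scalar(a, b)
}

fn add16_scalar(a: &[i32; 16], b: &[i32; 16]) -> [i32; 16] {
    let mut out = [0i32; 16];
    for i in 0..16 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let a = [3i32; 16];
    let b = [4i32; 16];
    println!("{}", add16(&a, &b).iter().sum::<i32>()); // 16 * 7 = 112
}
```

The intrinsics themselves (`_mm256_loadu_si256`, `_mm256_add_epi32`, `_mm256_storeu_si256`) are the real stable `std::arch::x86_64` API; everything else is a sketch.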

@nikic closed this as completed Feb 19, 2022