
optimize simd widening mul #1247

Merged: 1 commit into master from TheIronBorn-patch-1 on Aug 13, 2022

Conversation

TheIronBorn
Collaborator

stdsimd allows types larger than 512 bits, so we can avoid the slow __mulddi3 path. If/when Simd<u128> arrives, we can use it for 64-bit lanes as well.

@dhardy
Member

dhardy commented Aug 11, 2022

I didn't see discussion on what Simd<u16, 64> etc. means given the lack of CPU support. Are we simply deferring to a portable-simd software implementation? Since LaneCount<LANES>: SupportedLaneCount is independent of T and the CPU feature set, I suppose this must be the case.

Did you run any benchmarks?

@TheIronBorn
Collaborator Author

Right, sorry. The compiler is smart enough to split it into two u16x32 operations (or even four u16x16 operations if only 256-bit registers are available), i.e.:

vpmullw	%zmm1, %zmm4, %zmm1
vpmovzxbw	32(%r8), %zmm4
vpmullw	%zmm2, %zmm4, %zmm2
vpxor	%xmm4, %xmm4, %xmm4
vpunpckhbw	%zmm4, %zmm0, %zmm5
vpunpckhbw	%zmm4, %zmm3, %zmm16
vpmullw	%zmm5, %zmm16, %zmm5
vpsrlw	$8, %zmm5, %zmm5
vpunpcklbw	%zmm4, %zmm0, %zmm0
vpunpcklbw	%zmm4, %zmm3, %zmm3
vpmullw	%zmm0, %zmm3, %zmm0
vpsrlw	$8, %zmm0, %zmm0
vpackuswb	%zmm5, %zmm0, %zmm0
vpmovwb	%zmm1, %ymm1
vpmovwb	%zmm2, %ymm2
vinserti64x4	$1, %ymm2, %zmm1, %zmm1

There's a chance it won't work on other architectures, though those architectures might not have 512-bit registers anyway.

And benchmarks:

test cast_wmul_u16x32    ... bench:     255,482 ns/iter (+/- 115,946) = 16417 MB/s
test cast_wmul_u32x16    ... bench:     287,282 ns/iter (+/- 10,553) = 14599 MB/s
test cast_wmul_u8x64     ... bench:     295,669 ns/iter (+/- 25,442) = 14185 MB/s
test mulddi3_wmul_u16x32 ... bench:     364,887 ns/iter (+/- 11,275) = 11494 MB/s
test mulddi3_wmul_u32x16 ... bench:     488,590 ns/iter (+/- 17,975) = 8584 MB/s
test mulddi3_wmul_u8x64  ... bench:     749,078 ns/iter (+/- 46,821) = 5599 MB/s

measured with this (macro-generated) benchmark:

#[bench]
fn $fnn(b: &mut Bencher) {
    let x = <$ty>::splat(7);
    let y = <$ty>::splat(3);

    b.iter(|| {
        let mut accum = <$ty>::default();
        for _ in 0..RAND_BENCH_N {
            // no unrolling, so it's similar to gen_range without the overhead
            let (h, l) = test::black_box(x).$wmul_type(test::black_box(y));
            accum += h;
            accum += l;
        }
        accum
    });
    b.bytes = size_of::<$ty>() as u64 * 2 * RAND_BENCH_N;
}

Performing two multiplications instead of four is easily going to be faster.

@TheIronBorn TheIronBorn merged commit 9dd97b4 into master Aug 13, 2022
@newpavlov newpavlov deleted the TheIronBorn-patch-1 branch May 22, 2024 02:16