Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Work around swap_bytes on WebAssembly #222

Open
CryZe opened this issue Mar 12, 2024 · 3 comments
Open

Work around swap_bytes on WebAssembly #222

CryZe opened this issue Mar 12, 2024 · 3 comments

Comments

@CryZe
Copy link

CryZe commented Mar 12, 2024

Unfortunately neither u128 nor swap_bytes are supported directly by WebAssembly. So both implementations of folded_multiply are very slow.

I think an algorithm that takes both u64 values, turns them into a v128 vector and then does a bunch of swizzling and vector multiplications and co. probably would be the much faster solution. Here a Godbolt link with a little sketch:

https://rust.godbolt.org/z/jGGhYjGs8

I don't have enough knowledge about how to verify the quality, so I decided to not directly open a PR and instead first discuss the feasibility.

@tkaitchuck
Copy link
Owner

Do you have a good way to benchmark this? If so open a PR for that, and I'll play with it to make sure that quality is there.

@alexcrichton
Copy link

I ran across this and thought I might be able to offer some small insight, but in the end not a huge amount. I personally use wasmtime for local benchmarking but I'm biased because I work on it too. Using that what I did was to put this in ~/.cargo/config.toml:

[target.wasm32-wasip1]
runner = 'wasmtime --dir .'

then I ran:

$ cargo bench --bench ahash -- fallback --save-baseline native
...
$ cargo bench --bench ahash --target wasm32-wasip1 -- fallback --baseline native

and got:

Output of cargo bench
Gnuplot not found, using plotters backend
Benchmarking fallbackhash/u8
Benchmarking fallbackhash/u8: Warming up for 3.0000 s
Benchmarking fallbackhash/u8: Collecting 100 samples in estimated 5.0000 s (683M iterations)
Benchmarking fallbackhash/u8: Analyzing
fallbackhash/u8         time:   [1.1904 ns 1.1941 ns 1.1982 ns]
                        change: [+170.14% +170.58% +171.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u16
Benchmarking fallbackhash/u16: Warming up for 3.0000 s
Benchmarking fallbackhash/u16: Collecting 100 samples in estimated 5.0000 s (658M iterations)
Benchmarking fallbackhash/u16: Analyzing
fallbackhash/u16        time:   [1.5567 ns 1.5578 ns 1.5590 ns]
                        change: [+209.63% +210.00% +210.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u32
Benchmarking fallbackhash/u32: Warming up for 3.0000 s
Benchmarking fallbackhash/u32: Collecting 100 samples in estimated 5.0000 s (587M iterations)
Benchmarking fallbackhash/u32: Analyzing
fallbackhash/u32        time:   [2.4698 ns 2.4707 ns 2.4720 ns]
                        change: [+389.02% +389.89% +390.66%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u64
Benchmarking fallbackhash/u64: Warming up for 3.0000 s
Benchmarking fallbackhash/u64: Collecting 100 samples in estimated 5.0000 s (396M iterations)
Benchmarking fallbackhash/u64: Analyzing
fallbackhash/u64        time:   [1.5749 ns 1.5755 ns 1.5763 ns]
                        change: [+210.19% +211.21% +212.21%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u128
Benchmarking fallbackhash/u128: Warming up for 3.0000 s
Benchmarking fallbackhash/u128: Collecting 100 samples in estimated 5.0001 s (205M iterations)
Benchmarking fallbackhash/u128: Analyzing
fallbackhash/u128       time:   [1.9035 ns 1.9046 ns 1.9060 ns]
                        change: [+206.84% +207.75% +208.60%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/strings
Benchmarking fallbackhash/strings: Warming up for 3.0000 s
Benchmarking fallbackhash/strings: Collecting 100 samples in estimated 5.0003 s (37M iterations)
Benchmarking fallbackhash/strings: Analyzing
fallbackhash/strings    time:   [134.09 ns 134.11 ns 134.14 ns]
                        change: [+75.955% +76.203% +76.449%] (p = 0.00 < 0.05)
                        Performance has regressed.
...

showing that, as expected, performance isn't as good on wasm likely due to the lack of specialized instructions. I decided to take a look at a hot loop by profiling with perf with Wasmtime and ran:

$ perf record -k mono wasmtime --profile jitdump --dir . target/wasm32-wasip1/release/deps/ahash-c5fdbc98849e7f48.wasm --bench --baseline native 'fallback.*str' --profile-time 10

Using perf to record JIT-generated code is "weird" so the report command looks like:

$ perf inject --jit -i perf.data -o perf.jit-data && perf report -i perf.jit-data 

which showed 99% of runtime spent in this loop

       │441:┌─→mov     %eax,%ecx                                                                                                                                                                                                                                                                                               ▒
       │    │  mov     0x0(%r13,%rcx,1),%rcx                                                                                                                                                                                                                                                                                   ▒
       │    │  lea     0x8(%rax),%r11d                                                                                                                                                                                                                                                                                         ▒
  8.97 │    │  mov     0x0(%r13,%r11,1),%r15                                                                                                                                                                                                                                                                                   ▒
       │    │  mov     $0xfffffff0,%r11d                                                                                                                                                                                                                                                                                       ▒
  0.00 │    │  add     %r11d,%r8d                                                                                                                                                                                                                                                                                              ▒
       │    │  add     $0x10,%eax                                                                                                                                                                                                                                                                                              ▒
  9.19 │    │  xor     criterion::bencher::Bencher<M,%rcx                                                                                                                                                                                                                                                                      ▒
       │    │  mov     %rcx,%r11                                                                                                                                                                                                                                                                                               ▒
 10.43 │    │  bswap   %r11                                                                                                                                                                                                                                                                                                    ▒
       │    │  mov     %r15,%r14                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     criterion::bencher::Bencher<M,%r14                                                                                                                                                                                                                                                                      ▒
  8.76 │    │  imul    %r14,%r11                                                                                                                                                                                                                                                                                               ▒
       │    │  bswap   %r11                                                                                                                                                                                                                                                                                                    ▒
       │    │  movabs  $0x452821e638d01376,%r14                                                                                                                                                                                                                                                                                ▒
  7.50 │    │  add     %r14,%r10                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     %r10,%r11                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     criterion::bencher::Bencher<M,%r15                                                                                                                                                                                                                                                                      ▒
  4.66 │    │  bswap   %r15                                                                                                                                                                                                                                                                                                    ▒
       │    │  imul    %rcx,%r15                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     %r15,%r11                                                                                                                                                                                                                                                                                               ▒
  7.76 │    │  rorx    $0x29,%r11,%r10                                                                                                                                                                                                                                                                                         ▒
       │    ├──cmp     $0x10,%r8d                                                                                                                                                                                                                                                                                              ▒
  8.22 │    └──ja      441                                                                                                                                                                                                                                                                                                     ▒

which, given the three byte swaps, I believe is this function, the one in question here. Now interestingly Wasmtime is generating bswap instructions here despite WebAssembly not having a byte swap instruction. This is because Wasmtime specifically has pattern matching for LLVM-generated sequences of byte swaps to turn them into native bswap instructions. That means that this may not be a great baseline for WebAssembly engines that don't have this optimization.

Nevertheless if I replace folded_multiple with the version from @CryZe's Godbolt and recompile with RUSTFLAGS=-Ctarget-feature=+simd128 and run this in Wasmtime I get:

Gnuplot not found, using plotters backend
Benchmarking fallbackhash/strings
Benchmarking fallbackhash/strings: Warming up for 3.0000 s
Benchmarking fallbackhash/strings: Collecting 100 samples in estimated 5.0000 s (35M iterations)
Benchmarking fallbackhash/strings: Analyzing
fallbackhash/strings    time:   [142.70 ns 142.74 ns 142.79 ns]
                        change: [+87.155% +87.389% +87.577%] (p = 0.00 < 0.05)
                        Performance has regressed.

(it was 70-ish% before)

and the hot loop is now:

       │3eb:┌─→mov      %eax,%edx                                                                                                                                                                                                                                                                                              ▒
       │    │  mov      (%r14,%rdx,1),%rdx                                                                                                                                                                                                                                                                                     ▒
  0.55 │    │  lea      0x8(%rax),%r8d                                                                                                                                                                                                                                                                                         ▒
  5.86 │    │  mov      (%r14,%r8,1),%r12                                                                                                                                                                                                                                                                                      ▒
       │    │  mov      $0xfffffff0,%r8d                                                                                                                                                                                                                                                                                       ▒
       │    │  add      %r8d,%esi                                                                                                                                                                                                                                                                                              ▒
  0.60 │    │  add      $0x10,%eax                                                                                                                                                                                                                                                                                             ▒
  6.26 │    │  xor      criterion::bencher::Bencher<M,%rdx                                                                                                                                                                                                                                                                     ▒
  0.62 │    │  xor      criterion::bencher::Bencher<M,%r12                                                                                                                                                                                                                                                                     ▒
  6.63 │    │  vmovq    %rdx,%xmm1                                                                                                                                                                                                                                                                                             ▒
       │    │  vpinsrq  $0x1,%r12,%xmm1,%xmm1                                                                                                                                                                                                                                                                                  ▒
  0.58 │    │  vpaddusb 0x31a(%rip),%xmm1,%xmm2        # 7c0 <criterion::bencher::Bencher<M>::iter::h5aab6de6fe2515cc+0x740>                                                                                                                                                                                                   ▒
  7.22 │    │  vpshufb  %xmm2,%xmm6,%xmm2                                                                                                                                                                                                                                                                                      ▒
  0.00 │    │  vpsrlq   $0x20,%xmm1,%xmm3                                                                                                                                                                                                                                                                                      ▒
       │    │  vpmuludq %xmm2,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
  0.58 │    │  vpsrlq   $0x20,%xmm2,%xmm4                                                                                                                                                                                                                                                                                      ▒
  7.36 │    │  vpmuludq %xmm4,%xmm1,%xmm4                                                                                                                                                                                                                                                                                      ▒
       │    │  vpaddq   %xmm4,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
       │    │  vpsllq   $0x20,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
  0.61 │    │  vpmuludq %xmm2,%xmm1,%xmm1                                                                                                                                                                                                                                                                                      ▒
  8.34 │    │  vpaddq   %xmm3,%xmm1,%xmm1                                                                                                                                                                                                                                                                                      ▒
  0.00 │    │  vpextrq  $0x1,%xmm1,%rdx                                                                                                                                                                                                                                                                                        ▒
  0.56 │    │  movabs   $0x452821e638d01376,%r8                                                                                                                                                                                                                                                                                ◆
  7.56 │    │  add      %r8,%rcx                                                                                                                                                                                                                                                                                               ▒
       │    │  xor      %rcx,%rdx                                                                                                                                                                                                                                                                                              ▒
  0.57 │    │  vpextrq  $0x0,%xmm1,%rcx                                                                                                                                                                                                                                                                                        ▒
  6.34 │    │  xor      %rcx,%rdx                                                                                                                                                                                                                                                                                              ▒
       │    │  rorx     $0x29,%rdx,%rcx                                                                                                                                                                                                                                                                                        ▒
       │    ├──cmp      $0x10,%esi                                                                                                                                                                                                                                                                                             ▒
  4.45 │    └──ja       3eb                                                                                                                                                                                                                                                                                                    ▒

where this shows that vector instructions are used and geing generated. On x64 though I don't think the lowerings are that great, notably vpaddusb I think is a fallback when a better shuffle otherwise doesn't match.

Sorry if that was a bit more information than desired, but this is at least an example of benchmarking/comparison in a wasm runtime. Benchmarking in a browser-based runtime I think would be more difficult since you'd have to set up a runner/WASI implementation in JS. Those exist, but will require setup.

@alexcrichton
Copy link

Also feel free to file this in the "too much information" category but the compiler explorer gist here uses u8x16_swizzle and that's what vpaddusb comes from and switching to i8x16_relaxed_swizzle (not in stable Rust yet) gets the native-to-wasm slowdown down to +78.206% for this one benchmark, which is much closer to the original version as-is in tree today as +76.203%, so I think simd could still be suitable to use here.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants