Work around swap_bytes on WebAssembly #222
Comments
Do you have a good way to benchmark this? If so, open a PR for that, and I'll play with it to make sure that the quality is there.
I ran across this and thought I might be able to offer some small insight, but in the end not a huge amount. I personally use wasmtime for local benchmarking, but I'm biased because I work on it too. Using that, what I did was to put this in my Cargo configuration:

[target.wasm32-wasip1]
runner = 'wasmtime --dir .'

then I ran:
and got: Output of cargo bench
showing that, as expected, performance isn't as good on wasm likely due to the lack of specialized instructions. I decided to take a look at a hot loop by profiling with
Using
which showed 99% of runtime spent in this loop
which, given the three byte swaps, I believe is this function, the one in question here. Now interestingly Wasmtime is generating Nevertheless if I replace
(it was 70-ish% before) and the hot loop is now:
where this shows that vector instructions are being generated and used. On x64, though, I don't think the lowerings are that great, notably

Sorry if that was a bit more information than desired, but this is at least an example of benchmarking/comparison in a wasm runtime. Benchmarking in a browser-based runtime would, I think, be more difficult, since you'd have to set up a runner/WASI implementation in JS. Those exist, but will require setup.
Also, feel free to file this in the "too much information" category, but the compiler explorer gist here uses
Unfortunately, neither u128 nor swap_bytes is supported directly by WebAssembly, so both implementations of folded_multiply are very slow.

I think an algorithm that takes both u64 values, turns them into a v128 vector, and then does a bunch of swizzling, vector multiplications, and so on would probably be the much faster solution. Here is a Godbolt link with a little sketch:
https://rust.godbolt.org/z/jGGhYjGs8
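For readers without the crate source at hand, here is a minimal sketch of what the two folded_multiply variants discussed above look like, as far as I understand them. The function names `folded_multiply_u128` and `folded_multiply_swap` are made up for this illustration, and the crate's actual code may differ in detail. This is not the vector version from the Godbolt link.

```rust
/// Variant A (illustrative name): widen both inputs to u128, multiply,
/// then XOR the low and high 64-bit halves together.
/// WebAssembly has no 128-bit integer type, so the compiler has to
/// expand this into several 64-bit operations.
pub fn folded_multiply_u128(s: u64, by: u64) -> u64 {
    let result = (s as u128).wrapping_mul(by as u128);
    (result as u64) ^ ((result >> 64) as u64)
}

/// Variant B (illustrative name): only 64-bit wrapping multiplies,
/// mixed with three byte swaps. WebAssembly has no byte-swap
/// instruction, so each swap_bytes becomes a shift-and-mask sequence.
pub fn folded_multiply_swap(s: u64, by: u64) -> u64 {
    let b1 = s.wrapping_mul(by.swap_bytes());
    let b2 = s.swap_bytes().wrapping_mul(!by);
    b1 ^ b2.swap_bytes()
}
```

The two variants do not produce the same values for the same inputs, so they are alternatives rather than equivalent implementations.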
I don't have enough knowledge about how to verify the quality, so I decided to not directly open a PR and instead first discuss the feasibility.
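One way to sidestep part of the quality question would be a replacement that returns exactly the same value as the u128 variant, since it could then be checked against it directly. The sketch below is a scalar stand-in for that idea, not the vector code from the Godbolt link: it builds the 128-bit product from four 32x32 to 64-bit partial products, which is the kind of widening multiply I would expect a v128 version to rely on. Both function names are hypothetical.

```rust
/// Reference: fold the full 128-bit product (illustrative name).
pub fn folded_multiply_reference(s: u64, by: u64) -> u64 {
    let result = (s as u128) * (by as u128);
    (result as u64) ^ ((result >> 64) as u64)
}

/// Same result without u128 or swap_bytes, using 32-bit limbs
/// (illustrative name).
pub fn folded_multiply_limbs(s: u64, by: u64) -> u64 {
    const MASK: u64 = 0xffff_ffff;
    let (s_lo, s_hi) = (s & MASK, s >> 32);
    let (b_lo, b_hi) = (by & MASK, by >> 32);

    // Four partial products; each fits in 64 bits because both
    // factors are below 2^32.
    let ll = s_lo * b_lo;
    let lh = s_lo * b_hi;
    let hl = s_hi * b_lo;
    let hh = s_hi * b_hi;

    // Sum of everything that lands in bits 32..64, including carries
    // out of the lowest partial product. At most 3 * (2^32 - 1).
    let mid = (ll >> 32) + (lh & MASK) + (hl & MASK);

    // Low and high 64-bit halves of the 128-bit product.
    let low = (ll & MASK) | (mid << 32);
    let high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);

    low ^ high
}
```

Comparing `folded_multiply_limbs` against `folded_multiply_reference` over a set of inputs checks only that the outputs match. Whether this is actually faster on a given wasm runtime would still need to be benchmarked.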