Attempt safe and fast buffer allocation #14
Conversation
…es real-world performance by a couple percent.
Sorry about the confusing benchmarks. Usage is as follows:
That will print a table like this:
Depending on the kind of input file, this change is indeed not going to make a difference, because the buffer is recycled and immediately allocated with the right size. To test the worst-case impact, you can replace this line to always pass in a clean buffer.
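The specific line referred to above is not shown here, but the idea can be sketched as follows. This is a hypothetical harness, assuming Claxon's public blocks() / read_next_or_eof() / into_buffer() API; it is not code from the PR itself.

```rust
// Sketch of the two benchmark variants: recycling the decode buffer
// (typical case) versus handing the decoder a clean Vec every frame
// (worst case for allocation and zeroing).
fn decode_reusing_buffer(path: &str) -> Result<(), claxon::Error> {
    let mut reader = claxon::FlacReader::open(path)?;
    let mut frames = reader.blocks();
    // Typical case: after the first frame the recycled buffer already has
    // the right length, so allocation (and zeroing) happens only once.
    let mut buffer = Vec::new();
    while let Some(block) = frames.read_next_or_eof(buffer)? {
        buffer = block.into_buffer();
    }
    Ok(())
}

fn decode_clean_buffer(path: &str) -> Result<(), claxon::Error> {
    let mut reader = claxon::FlacReader::open(path)?;
    let mut frames = reader.blocks();
    // Worst case: pass a clean Vec for every frame, forcing a fresh
    // allocation (and, with this change, zero-initialization) each time.
    while let Some(_block) = frames.read_next_or_eof(Vec::new())? {
        // The block's buffer is dropped instead of being recycled.
    }
    Ok(())
}
```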
This is interesting... when I run Claxon again now on the same system, I get about 1.4 times the libflac running time too, even on Claxon 0.3.0. One thing that always stood out was that although Claxon was close to libflac in running time, it was executing many more instructions, and indeed on my Raspberry Pi it was not within a factor of 2 of libflac, if I recall correctly. So if Claxon was almost as fast as libflac, it was because the CPU was able to deal with those extra instructions, not because it did as little work as libflac. It might be that we are seeing the effects of Meltdown/Spectre/Foreshadow mitigations here.
I'd say that a rustc performance regression is a lot more realistic, e.g. some loops not being vectorized in the latest versions even though they used to be vectorized in earlier ones. I've tried to use
This is the cause; I just ran it on Rust 1.13.0 and it's consistent with the value I measured before (I got 1.16 ± 0.09 now). I'll dig a bit deeper and file a bug upstream.
On rustc 1.29.0 (aa3ca1994 2018-09-11) I get 1.51x the running time of the system libflac (Ubuntu 16.04, package version 1.3.1-4) with
Actually, scratch that. I get 1.51x on 1.28 too, and even more on 1.27; it's back to 1.51x on 1.25. So probably not a regression. Which files did you use for the comparison that gave 1.10x?
I use five files from my personal library; unfortunately these cannot be shared, but the ones from archive.org should give similar results. It might depend a lot on your CPU though. I am using a Skylake i7-6700HQ. The Rust issue here contains a bit more information: rust-lang/rust#53833.
I'm testing on an AMD Vishera CPU. That might be the reason why I'm getting different results.
I've attempted to rewrite the buffer allocation using appends to Vec so that memory zeroing would not be needed at all. But apparently I've messed up some math somewhere and the output is no longer correct, and without unit tests on the individual decoding functions it is hard to track down. For reference, the version with appending to Vec can be found here: https://github.com/Shnatsel/claxon/tree/safe-and-sliceless
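A minimal sketch of the contrast being attempted here, with hypothetical function names rather than Claxon's actual internals: writing into a pre-zeroed buffer by index touches each element twice, while appending writes each element once but changes the index arithmetic.

```rust
// Illustrative only; `decode_sample` stands in for the real per-sample logic.
fn decode_into_zeroed(n: usize) -> Vec<i32> {
    // `vec![0; n]` obtains zeroed memory from the allocator/OS...
    let mut buffer = vec![0i32; n];
    for i in 0..n {
        // ...and each element is then written a second time here.
        buffer[i] = decode_sample(i);
    }
    buffer
}

fn decode_by_appending(n: usize) -> Vec<i32> {
    // Appending writes each element exactly once, but any index math that
    // relied on the buffer already having length `n` must be reworked.
    let mut buffer = Vec::with_capacity(n);
    for i in 0..n {
        buffer.push(decode_sample(i));
    }
    buffer
}

// Stand-in for the real per-sample decoding logic.
fn decode_sample(i: usize) -> i32 {
    i as i32
}
```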
I did some benchmarking. The performance impact is slight but significant: in the typical case it leads to a 0.03% increase in decode time, and for an adversarial decoding setup it can be as much as 0.2%. I think that's small enough to be worth the effort. I implemented it myself in fec2467; see also the commit message for full benchmark details. Thanks for getting this discussion started!
This should be the fastest way to safely initialize an intermediate buffer.
vec![value; length] desugars into Vec::from_elem(), which has a fast path for value 0 that requests zeroed memory from the OS, and the OS zeroes memory on demand. Addresses #13.

I have converted the benchmarking suite to Criterion to get more accurate benchmarks, specifically so that I would be able to reliably detect <2% changes in performance, which is impossible with regular cargo bench. It also has an added bonus of working on stable Rust. I can open a PR for that if you'd like to have that in master.
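To illustrate the trade-off described above, here is a minimal sketch (names are illustrative, not taken from the Claxon source) contrasting the unsafe uninitialized-buffer pattern with the safe vec![0; len] replacement:

```rust
/// Old-style pattern this kind of change removes: the buffer is exposed
/// uninitialized until the decoder fills it, which is unsound if any
/// element is read before being written.
fn uninitialized_buffer(len: usize) -> Vec<i32> {
    let mut buffer = Vec::with_capacity(len);
    unsafe { buffer.set_len(len); }
    buffer
}

/// Safe replacement: `vec![0; len]` lowers to `Vec::from_elem`, whose
/// zero fast path can request already-zeroed pages that the OS provides
/// lazily on first touch.
fn zeroed_buffer(len: usize) -> Vec<i32> {
    vec![0; len]
}
```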
According to the testsamples.rs benchmark, as well as measuring the execution time of the bench_decode binary with hyperfine, this change makes no difference in terms of performance. I have tested this both on rustc 1.28.0 (9634041f0 2018-07-30) and rustc 1.30.0-nightly (d767ee116 2018-08-15) with identical results.

However, the performance difference between Claxon master and libflac 1.3.1 is 1.44x, not the 1.13x advertised in README.md; it is possible that on my machine the execution is bottlenecked elsewhere, so I would not notice a performance degradation caused by this PR. Please run this through your benchmarking setup before merging.