Attempt safe and fast buffer allocation #14
Conversation
…es real-world performance by a couple percent.
Sorry about the confusing benchmarks. Usage is as follows:
That will print a table like this:
Depending on the kind of input file, this change is indeed not going to make a difference, because the buffer is recycled and immediately allocated with the right size. To test the worst-case impact, you can replace this line to always pass in a clean buffer.
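The specific line referred to above is not shown here, but the idea can be sketched as follows. This is a hypothetical harness, assuming Claxon's public blocks() / read_next_or_eof() / into_buffer() API; it is not code from the PR itself.

```rust
// Sketch of the two benchmark variants: recycling the decode buffer
// (typical case) versus handing the decoder a clean Vec every frame
// (worst case for allocation and zeroing).
fn decode_reusing_buffer(path: &str) -> Result<(), claxon::Error> {
    let mut reader = claxon::FlacReader::open(path)?;
    let mut frames = reader.blocks();
    // Typical case: after the first frame the recycled buffer already has
    // the right length, so allocation (and zeroing) happens only once.
    let mut buffer = Vec::new();
    while let Some(block) = frames.read_next_or_eof(buffer)? {
        buffer = block.into_buffer();
    }
    Ok(())
}

fn decode_clean_buffer(path: &str) -> Result<(), claxon::Error> {
    let mut reader = claxon::FlacReader::open(path)?;
    let mut frames = reader.blocks();
    // Worst case: pass a clean Vec for every frame, forcing a fresh
    // allocation (and, with this change, zero-initialization) each time.
    while let Some(_block) = frames.read_next_or_eof(Vec::new())? {
        // The block's buffer is dropped instead of being recycled.
    }
    Ok(())
}
```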
This is interesting... when I run Claxon again now on the same system, I get about 1.4 times the libflac running time too, even on Claxon 0.3.0. One thing that always stood out was that although Claxon was close to libflac in running time, it was executing many more instructions, and indeed on my Raspberry Pi it was not within a factor of 2 of libflac, if I recall correctly. So if Claxon was almost as fast as libflac, it was because the CPU was able to deal with those extra instructions, not because it did as little work as libflac. It might be that we are seeing the effects of Meltdown/Spectre/Foreshadow mitigations here.
I'd say that a rustc performance regression is a lot more realistic, e.g. some loops not being vectorized in the latest versions even though they used to be vectorized in earlier ones. I've tried to use
This is the cause; I just ran it on Rust 1.13.0 and it's consistent with the value I measured before (I got 1.16 ± 0.09 now). I'll dig a bit deeper and file a bug upstream.
On rustc 1.29.0 (aa3ca1994 2018-09-11) I get 1.51x the running time of the system libflac (Ubuntu 16.04, package version 1.3.1-4) with
Actually, scratch that. I get 1.51x on 1.28 too, and even more on 1.27; it's back to 1.51x on 1.25. So probably not a regression. Which files did you use for the comparison that gave 1.10x?
I use five files from my personal library; unfortunately these cannot be shared, but the ones from archive.org should give similar results. It might depend a lot on your CPU though. I am using a Skylake i7-6700HQ. The Rust issue here contains a bit more information: rust-lang/rust#53833.
I'm testing on an AMD Vishera CPU. That might be the reason why I'm getting different results.
I've attempted to rewrite the buffer allocation using appends to Vec so that memory zeroing would not be needed at all. But apparently I've messed up some math somewhere and the output is no longer correct, and without unit tests on the individual decoding functions it is hard to track down. For reference, the version with appending to Vec can be found here: https://github.com/Shnatsel/claxon/tree/safe-and-sliceless
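A minimal sketch of the contrast being attempted here, with hypothetical function names rather than Claxon's actual internals: writing into a pre-zeroed buffer by index touches each element twice, while appending writes each element once but changes the index arithmetic.

```rust
// Illustrative only; `decode_sample` stands in for the real per-sample logic.
fn decode_into_zeroed(n: usize) -> Vec<i32> {
    // `vec![0; n]` obtains zeroed memory from the allocator/OS...
    let mut buffer = vec![0i32; n];
    for i in 0..n {
        // ...and each element is then written a second time here.
        buffer[i] = decode_sample(i);
    }
    buffer
}

fn decode_by_appending(n: usize) -> Vec<i32> {
    // Appending writes each element exactly once, but any index math that
    // relied on the buffer already having length `n` must be reworked.
    let mut buffer = Vec::with_capacity(n);
    for i in 0..n {
        buffer.push(decode_sample(i));
    }
    buffer
}

// Stand-in for the real per-sample decoding logic.
fn decode_sample(i: usize) -> i32 {
    i as i32
}
```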
I did some benchmarking. The performance impact is slight but significant: in the typical case it leads to a 0.03% increase in decode time, and for an adversarial decoding setup it can be as much as 0.2%. I think that's small enough to be worth the effort. I implemented it myself in fec2467; see also the commit message for full benchmark details. Thanks for getting this discussion started!
This should be the fastest way to safely initialize an intermediate buffer.
vec![value; length] desugars into Vec::from_elem(), which has a fast path for value 0 that requests zeroed memory from the OS, and the OS zeroes memory on demand. Addresses #13.

I have converted the benchmarking suite to Criterion to get more accurate benchmarks, specifically so that I would be able to reliably detect <2% changes in performance, which is impossible with regular cargo bench. It also has an added bonus of working on stable Rust. I can open a PR for that if you'd like to have that in master.
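To illustrate the trade-off described above, here is a minimal sketch (names are illustrative, not taken from the Claxon source) contrasting the unsafe uninitialized-buffer pattern with the safe vec![0; len] replacement:

```rust
/// Old-style pattern this kind of change removes: the buffer is exposed
/// uninitialized until the decoder fills it, which is unsound if any
/// element is read before being written.
fn uninitialized_buffer(len: usize) -> Vec<i32> {
    let mut buffer = Vec::with_capacity(len);
    unsafe { buffer.set_len(len); }
    buffer
}

/// Safe replacement: `vec![0; len]` lowers to `Vec::from_elem`, whose
/// zero fast path can request already-zeroed pages that the OS provides
/// lazily on first touch.
fn zeroed_buffer(len: usize) -> Vec<i32> {
    vec![0; len]
}
```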
According to the testsamples.rs benchmark, as well as measuring the execution time of the bench_decode binary with hyperfine, this change makes no difference in terms of performance. I have tested this both on rustc 1.28.0 (9634041f0 2018-07-30) and rustc 1.30.0-nightly (d767ee116 2018-08-15) with identical results.

However, the performance difference between Claxon master and libflac 1.3.1 is 1.44x, not the 1.13x advertised in README.md; it is possible that on my machine the execution is bottlenecked elsewhere, so I would not notice a performance degradation caused by this PR. Please run this through your benchmarking setup before merging.