35% performance regression in generated code since 1.24 #53833

ruuda · 2018-08-30T19:45:53Z

Summary

Claxon, when compiled with Rust 1.25.0 or later, takes 1.36 times as long to run as a version compiled with Rust 1.23.0. When compiled with Rust 1.24.0, it takes 1.45 times as long to run.

It looks like there was a severe regression in generated code in 1.24.0. 1.25.0 improved a bit again, but is still significantly worse than 1.23.0.

Steps to reproduce

Prepare:

git clone https://github.com/ruuda/claxon --branch v0.4.1
cd claxon/testsamples
./populate.sh
cp p2.flac p3.flac p4.flac extra
cd ..
mkdir rust_bench

You can also copy your own files into testsamples/extra if you happen to have some lying around.

Then, with a Rust 1.23.0 toolchain, or after changing this line to use cargo +1.23.0:

tools/benchmark.sh rust_bench/23

Then, with a Rust 1.25.0 toolchain, or after updating the script to use cargo +1.25.0:

tools/benchmark.sh rust_bench/25
tools/compare_benches.r rust_bench/23_*_all.dat rust_bench/25_*_all.dat

Output in my case:

| p10 |  18.0 ± 0.3 ns    | 1.370 ± 0.033 |
| p50 |  18.9 ± 0.4 ns    | 1.363 ± 0.043 |
| p90 |  21.1 ± 2.1 ns    | 1.388 ± 0.181 |
| μ   |  19.4 ± 0.7 ns    | 1.359 ± 0.065 |
| τ   |  65.9 ± 3.6 MiB/s | 0.736 ± 0.053 |

The numbers in the rightmost column show the running time of the benchmark compiled with Rust 1.25.0 relative to the running time of the benchmark compiled with Rust 1.23.0.

For Rust 1.23.0 vs Rust 1.24.0 I get these results:

| p10 |  19.8 ± 0.2 ns    | 1.501 ± 0.028 |
| p50 |  20.5 ± 0.2 ns    | 1.473 ± 0.036 |
| p90 |  21.4 ± 0.4 ns    | 1.413 ± 0.120 |
| μ   |  20.7 ± 0.2 ns    | 1.449 ± 0.048 |
| τ   |  61.8 ± 3.3 MiB/s | 0.690 ± 0.049 |

For Rust 1.23.0 vs Rust 1.28.0 I get these results:

| p10 |  18.3 ± 0.2 ns    | 1.392 ± 0.026 |
| p50 |  18.9 ± 0.1 ns    | 1.360 ± 0.031 |
| p90 |  19.7 ± 0.4 ns    | 1.301 ± 0.111 |
| μ   |  19.1 ± 0.2 ns    | 1.339 ± 0.043 |
| τ   |  66.9 ± 3.8 MiB/s | 0.747 ± 0.055 |

For Rust 1.29.0-beta.1 and 1.30.0-nightly (3edb355 2018-08-03) I get similar results.

To show that the setup works, this is Rust 1.13.0 vs Rust 1.23.0, which shows no significant difference:

| p10 |  13.2 ± 0.2 ns    | 0.989 ± 0.022 |
| p50 |  13.9 ± 0.3 ns    | 0.993 ± 0.027 |
| p90 |  15.2 ± 1.3 ns    | 1.036 ± 0.088 |
| μ   |  14.3 ± 0.4 ns    | 1.013 ± 0.034 |
| τ   |  89.6 ± 4.2 MiB/s | 0.988 ± 0.067 |

And Rust 1.25.0 vs Rust 1.28.0 does not show a significant difference either:

| p10 |  18.3 ± 0.2 ns    | 1.016 ± 0.021 |
| p50 |  18.9 ± 0.1 ns    | 0.998 ± 0.024 |
| p90 |  19.7 ± 0.4 ns    | 0.937 ± 0.097 |
| μ   |  19.1 ± 0.2 ns    | 0.985 ± 0.037 |
| τ   |  66.9 ± 3.8 MiB/s | 1.015 ± 0.080 |

Details

Claxon is a decoder for the flac audio format; the benchmark decodes every file in testsamples/extra 5 times and collects statistics about the duration.

The benchmark is compiled with -C target_cpu=native.

I use a Skylake i7.

vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping	: 3
microcode	: 0xc6
cpu MHz		: 3394.795
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 5186.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

The text was updated successfully, but these errors were encountered:

est31 · 2018-08-30T21:04:13Z

1.24.0 release notes

rustc now uses 16 codegen units by default for release builds. For the fastest builds, utilize codegen-units=1.

May that's the cause? Could you try with one codegen unit?

As for the improvement in 1.25.0, you might thank the LLVM update for it :).

ruuda · 2018-08-31T14:15:12Z

May that's the cause? Could you try with one codegen unit?

Wow, the difference is enormous!

With codegen-units=1, I measure no difference between 1.23.0 and 1.24.0:

| p10 |  13.2 ± 0.2 ns    | 1.003 ± 0.022 |
| p50 |  13.8 ± 0.2 ns    | 0.996 ± 0.026 |
| p90 |  14.4 ± 0.3 ns    | 0.953 ± 0.081 |
| μ   |  13.9 ± 0.2 ns    | 0.974 ± 0.032 |
| τ   |  92.0 ± 4.7 MiB/s | 1.027 ± 0.071 |

1.25.0 is even a bit faster than 1.23.0 now (due to the LLVM upgrade?):

| p10 |  12.9 ± 0.2 ns    | 0.979 ± 0.020 |
| p50 |  13.5 ± 0.2 ns    | 0.973 ± 0.027 |
| p90 |  14.1 ± 0.3 ns    | 0.932 ± 0.080 |
| μ   |  13.6 ± 0.2 ns    | 0.949 ± 0.033 |
| τ   |  94.4 ± 4.8 MiB/s | 1.054 ± 0.072 |

Although it looks like that advantage has been lost again. 1.26.0 was still fast:

| p10 |  12.9 ± 0.2 ns    | 0.979 ± 0.020 |
| p50 |  13.5 ± 0.2 ns    | 0.971 ± 0.026 |
| p90 |  14.1 ± 0.5 ns    | 0.933 ± 0.083 |
| μ   |  13.6 ± 0.2 ns    | 0.949 ± 0.033 |
| τ   |  94.4 ± 4.8 MiB/s | 1.054 ± 0.073 |

But 1.27.0 again performs similar to 1.23.0.

| p10 |  13.2 ± 0.3 ns    | 1.004 ± 0.028 |
| p50 |  14.1 ± 0.4 ns    | 1.019 ± 0.035 |
| p90 |  14.9 ± 0.4 ns    | 0.984 ± 0.085 |
| μ   |  14.2 ± 0.3 ns    | 0.995 ± 0.038 |
| τ   |  90.0 ± 3.7 MiB/s | 1.005 ± 0.063 |

1.28.0 performs similarly to 1.27.0.

Thanks for the suggestion @est31! Does that resolve this issue, or is the 1.26 – 1.27 difference still worth investigating?

est31 · 2018-08-31T14:29:35Z

There's a tracking issue for codegen unit regressions in #47745 . I think all the issues it links to are kept open.

About the 1.26 -> 1.27 regression, it could probably be narrowed down using rustup and nightlies, and/or cargo-bisect-rustc. Not sure though whether it's worth the effort.

Since Rust 1.24.0, the number of codegen units is no longer 1, even for release builds. This speeds up builds slightly, but it reduces the quality of the generated code tremendously. To get decent performance out of Claxon, we need to explicitly pass -C codegen-units=1 as RUSTFLAGS. Update the benchmarks to do so, and add a note in the readme. Thanks to est31 for pointing this out. See also rust-lang/rust#53833.

pnkfelix · 2018-11-29T14:22:44Z

triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. but also self-assigning since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.

estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Aug 30, 2018

nagisa mentioned this issue Aug 31, 2018

codegen-units + ThinLTO is not as good as codegen-units = 1 #47745

Open

ruuda mentioned this issue Sep 15, 2018

Attempt safe and fast buffer allocation ruuda/claxon#14

Closed

BusyJay mentioned this issue Sep 27, 2018

Efficient Codecs tikv/tikv#3629

Merged

breezewish mentioned this issue Sep 29, 2018

Observe the influence of codegen-unit bug tikv/tikv#3640

Closed

anp mentioned this issue Oct 5, 2018

add claxon anp/lolbench#40

Open

est31 mentioned this issue Oct 28, 2018

38x performance regression in lewton since rust 1.26 #55446

Closed

kg88 mentioned this issue Nov 7, 2018

CMake 3.5 doesn't work with grpcio tikv/tikv#3748

Closed

pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 8, 2018

pnkfelix added the P-medium Medium priority label Nov 29, 2018

brson mentioned this issue Jan 31, 2019

Adjust codegen-units in optimized builds tikv/tikv#4164

Open

Enselic added the C-bug Category: This is a bug. label Nov 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

35% performance regression in generated code since 1.24 #53833

35% performance regression in generated code since 1.24 #53833

ruuda commented Aug 30, 2018

est31 commented Aug 30, 2018

ruuda commented Aug 31, 2018

est31 commented Aug 31, 2018

pnkfelix commented Nov 29, 2018

35% performance regression in generated code since 1.24 #53833

35% performance regression in generated code since 1.24 #53833

Comments

ruuda commented Aug 30, 2018

Summary

Steps to reproduce

Details

est31 commented Aug 30, 2018

ruuda commented Aug 31, 2018

est31 commented Aug 31, 2018

pnkfelix commented Nov 29, 2018