Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

35% performance regression in generated code since 1.24 #53833

Open
ruuda opened this issue Aug 30, 2018 · 4 comments
Open

35% performance regression in generated code since 1.24 #53833

ruuda opened this issue Aug 30, 2018 · 4 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ruuda
Copy link
Contributor

ruuda commented Aug 30, 2018

Summary

Claxon, when compiled with Rust 1.25.0 or later, takes 1.36 times as long to run as a version compiled with Rust 1.23.0. When compiled with Rust 1.24.0, it takes 1.45 times as long to run.

It looks like there was a severe regression in generated code in 1.24.0. 1.25.0 improved a bit again, but is still significantly worse than 1.23.0.

Steps to reproduce

Prepare:

git clone https://github.com/ruuda/claxon --branch v0.4.1
cd claxon/testsamples
./populate.sh
cp p2.flac p3.flac p4.flac extra
cd ..
mkdir rust_bench

You can also copy your own files into testsamples/extra if you happen to have some lying around.

Then, with a Rust 1.23.0 toolchain, or after changing this line to use cargo +1.23.0:

tools/benchmark.sh rust_bench/23

Then, with a Rust 1.25.0 toolchain, or after updating the script to use cargo +1.25.0:

tools/benchmark.sh rust_bench/25
tools/compare_benches.r rust_bench/23_*_all.dat rust_bench/25_*_all.dat

Output in my case:

| p10 |  18.0 ± 0.3 ns    | 1.370 ± 0.033 |
| p50 |  18.9 ± 0.4 ns    | 1.363 ± 0.043 |
| p90 |  21.1 ± 2.1 ns    | 1.388 ± 0.181 |
| μ   |  19.4 ± 0.7 ns    | 1.359 ± 0.065 |
| τ   |  65.9 ± 3.6 MiB/s | 0.736 ± 0.053 |

The numbers in the rightmost column show the running time of the benchmark compiled with Rust 1.25.0 relative to the running time of the benchmark compiled with Rust 1.23.0.

For Rust 1.23.0 vs Rust 1.24.0 I get these results:

| p10 |  19.8 ± 0.2 ns    | 1.501 ± 0.028 |
| p50 |  20.5 ± 0.2 ns    | 1.473 ± 0.036 |
| p90 |  21.4 ± 0.4 ns    | 1.413 ± 0.120 |
| μ   |  20.7 ± 0.2 ns    | 1.449 ± 0.048 |
| τ   |  61.8 ± 3.3 MiB/s | 0.690 ± 0.049 |

For Rust 1.23.0 vs Rust 1.28.0 I get these results:

| p10 |  18.3 ± 0.2 ns    | 1.392 ± 0.026 |
| p50 |  18.9 ± 0.1 ns    | 1.360 ± 0.031 |
| p90 |  19.7 ± 0.4 ns    | 1.301 ± 0.111 |
| μ   |  19.1 ± 0.2 ns    | 1.339 ± 0.043 |
| τ   |  66.9 ± 3.8 MiB/s | 0.747 ± 0.055 |

For Rust 1.29.0-beta.1 and 1.30.0-nightly (3edb355 2018-08-03) I get similar results.

To show that the setup works, this is Rust 1.13.0 vs Rust 1.23.0, which shows no significant difference:

| p10 |  13.2 ± 0.2 ns    | 0.989 ± 0.022 |
| p50 |  13.9 ± 0.3 ns    | 0.993 ± 0.027 |
| p90 |  15.2 ± 1.3 ns    | 1.036 ± 0.088 |
| μ   |  14.3 ± 0.4 ns    | 1.013 ± 0.034 |
| τ   |  89.6 ± 4.2 MiB/s | 0.988 ± 0.067 |

And Rust 1.25.0 vs Rust 1.28.0 does not show a significant difference either:

| p10 |  18.3 ± 0.2 ns    | 1.016 ± 0.021 |
| p50 |  18.9 ± 0.1 ns    | 0.998 ± 0.024 |
| p90 |  19.7 ± 0.4 ns    | 0.937 ± 0.097 |
| μ   |  19.1 ± 0.2 ns    | 0.985 ± 0.037 |
| τ   |  66.9 ± 3.8 MiB/s | 1.015 ± 0.080 |

Details

Claxon is a decoder for the flac audio format; the benchmark decodes every file in testsamples/extra 5 times and collects statistics about the duration.

The benchmark is compiled with -C target_cpu=native.

I use a Skylake i7.

vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping	: 3
microcode	: 0xc6
cpu MHz		: 3394.795
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 5186.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:
@estebank estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Aug 30, 2018
@est31
Copy link
Member

est31 commented Aug 30, 2018

1.24.0 release notes

rustc now uses 16 codegen units by default for release builds. For the fastest builds, utilize codegen-units=1.

May that's the cause? Could you try with one codegen unit?

As for the improvement in 1.25.0, you might thank the LLVM update for it :).

@ruuda
Copy link
Contributor Author

ruuda commented Aug 31, 2018

May that's the cause? Could you try with one codegen unit?

Wow, the difference is enormous!

With codegen-units=1, I measure no difference between 1.23.0 and 1.24.0:

| p10 |  13.2 ± 0.2 ns    | 1.003 ± 0.022 |
| p50 |  13.8 ± 0.2 ns    | 0.996 ± 0.026 |
| p90 |  14.4 ± 0.3 ns    | 0.953 ± 0.081 |
| μ   |  13.9 ± 0.2 ns    | 0.974 ± 0.032 |
| τ   |  92.0 ± 4.7 MiB/s | 1.027 ± 0.071 |

1.25.0 is even a bit faster than 1.23.0 now (due to the LLVM upgrade?):

| p10 |  12.9 ± 0.2 ns    | 0.979 ± 0.020 |
| p50 |  13.5 ± 0.2 ns    | 0.973 ± 0.027 |
| p90 |  14.1 ± 0.3 ns    | 0.932 ± 0.080 |
| μ   |  13.6 ± 0.2 ns    | 0.949 ± 0.033 |
| τ   |  94.4 ± 4.8 MiB/s | 1.054 ± 0.072 |

Although it looks like that advantage has been lost again. 1.26.0 was still fast:

| p10 |  12.9 ± 0.2 ns    | 0.979 ± 0.020 |
| p50 |  13.5 ± 0.2 ns    | 0.971 ± 0.026 |
| p90 |  14.1 ± 0.5 ns    | 0.933 ± 0.083 |
| μ   |  13.6 ± 0.2 ns    | 0.949 ± 0.033 |
| τ   |  94.4 ± 4.8 MiB/s | 1.054 ± 0.073 |

But 1.27.0 again performs similar to 1.23.0.

| p10 |  13.2 ± 0.3 ns    | 1.004 ± 0.028 |
| p50 |  14.1 ± 0.4 ns    | 1.019 ± 0.035 |
| p90 |  14.9 ± 0.4 ns    | 0.984 ± 0.085 |
| μ   |  14.2 ± 0.3 ns    | 0.995 ± 0.038 |
| τ   |  90.0 ± 3.7 MiB/s | 1.005 ± 0.063 |

1.28.0 performs similarly to 1.27.0.

Thanks for the suggestion @est31! Does that resolve this issue, or is the 1.26 – 1.27 difference still worth investigating?

@est31
Copy link
Member

est31 commented Aug 31, 2018

There's a tracking issue for codegen unit regressions in #47745 . I think all the issues it links to are kept open.

About the 1.26 -> 1.27 regression, it could probably be narrowed down using rustup and nightlies, and/or cargo-bisect-rustc. Not sure though whether it's worth the effort.

ruuda added a commit to ruuda/claxon that referenced this issue Aug 31, 2018
Since Rust 1.24.0, the number of codegen units is no longer 1, even for
release builds. This speeds up builds slightly, but it reduces the
quality of the generated code tremendously. To get decent performance
out of Claxon, we need to explicitly pass -C codegen-units=1 as
RUSTFLAGS. Update the benchmarks to do so, and add a note in the readme.

Thanks to est31 for pointing this out.

See also rust-lang/rust#53833.
@pnkfelix pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 8, 2018
@pnkfelix
Copy link
Member

triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. but also self-assigning since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.

@pnkfelix pnkfelix added the P-medium Medium priority label Nov 29, 2018
@Enselic Enselic added the C-bug Category: This is a bug. label Nov 18, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants