Optimize zstd decompression by another x% #2689
Conversation
Awesome PR & analysis!
This is looking good to me! I've done some initial measurements and they look good. I will spend a bit more time measuring, then we should be good to merge this, with the change to make it C89 compliant.
Note that on my machine (i9-9900K) gcc-11 has some alignment issues that cause a ~10% regression. But, changing the alignment to .p2align 4 fixes that. This is a known issue, see:
zstd/lib/decompress/zstd_decompress_block.c
Lines 1128 to 1175 in 18d02cb
/* Align the decompression loop to 32 + 16 bytes.
 *
 * zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
 * speed swings based on the alignment of the decompression loop. This
 * performance swing is caused by parts of the decompression loop falling
 * out of the DSB. The entire decompression loop should fit in the DSB,
 * when it can't we get much worse performance. You can measure if you've
 * hit the good case or the bad case with this perf command for some
 * compressed file test.zst:
 *
 *     perf stat -e cycles -e instructions -e idq.all_dsb_cycles_any_uops \
 *         -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
 *
 * If you see most cycles served out of the MITE you've hit the bad case.
 * If you see most cycles served out of the DSB you've hit the good case.
 * If it is pretty even then you may be in an okay case.
 *
 * This issue has been reproduced on the following CPUs:
 * - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
 *             Use Instruments->Counters to get DSB/MITE cycles.
 *             I never got performance swings, but I was able to
 *             go from the good case of mostly DSB to half of the
 *             cycles served from MITE.
 * - Coffeelake: Intel i9-9900k
 * - Coffeelake: Intel i7-9700k
 *
 * I haven't been able to reproduce the instability or DSB misses on any
 * of the following CPUs:
 * - Haswell
 * - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
 * - Skylake
 *
 * If you are seeing performance stability this script can help test.
 * It tests on 4 commits in zstd where I saw performance change.
 *
 * https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
 */
__asm__(".p2align 6");
__asm__("nop");
__asm__(".p2align 5");
__asm__("nop");
# if __GNUC__ >= 9
    /* better for gcc-9 and gcc-10, worse for clang and gcc-8 */
    __asm__(".p2align 3");
# else
    __asm__(".p2align 4");
# endif
#endif
Thanks, let me try out alignment for other gcc versions.
Not all Intel CPUs have this problem, it seems to only be the newer consumer CPUs. I haven't seen the same on Skylake server CPUs, but it may be present in some of the newer ones. You can use the perf commands referenced in the comments to see if you're running into the DSB issues.
After several rounds of evaluation, I agree gcc-11 is worse; however, gcc-9 and gcc-10 are better with .p2align 3. Adjusted to that.
Thanks for doing that! The alignment issues are super annoying, and make it very hard to get consistent measurements, since any small change to the decoder can cause a 10% drop in performance. I'll dedicate some time this week to measure performance across the various machines I have, then we can get this merged.
Just FYI, Intel released an Optimization Reference Manual yesterday: https://www.phoronix.com/scan.php?page=news_item&px=Intel-ORM-Code-Samples
Hi, @terrelln, were you able to take a look?
Yeah, I ran some benchmarks on Friday & today. I still see a consistent gcc-11 regression of ~4% on my i9-9900k (down from ~10%). But, I have confirmed the clang speedup. It looks like gcc-11 is hitting the bad case of falling out of the DSB both before & after this patch. Comparing v1.4.8 and v1.5.0 on gcc-11 there is a 5% regression, and I can't fix it by messing with the alignment.

I still have to measure on two more machines, both of which don't have alignment issues. I want to accept this PR because the changes make sense, and performance parity with clang would be great. But the fact that it isn't a pure win, and that there are already gcc regressions, makes it tricky.

I'm collecting the results of my benchmarks, and I will share them here when I'm done. In the meantime I'm going to continue trying to fiddle with the code to make gcc happy.
I also was not able to remove the 3-4% performance difference for gcc-11. I have several options:
The patch is extremely valuable for other configurations and for us specifically; it would be unfortunate to abandon it.
Hi, @terrelln, I would like to know how we should proceed with this PR. Were you able to test on other hardware?
Sorry for the radio silence. I've been a bit busy, and kicking this task down the road because it's a bit of a puzzler. But, I think I may have found a solution to our gcc-11 problems:

diff --git a/lib/decompress/zstd_decompress_block.c b/lib/decompress/zstd_decompress_block.c
index e5391d66..9acf2f74 100644
--- a/lib/decompress/zstd_decompress_block.c
+++ b/lib/decompress/zstd_decompress_block.c
@@ -977,10 +977,8 @@ ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
U32 const ll0 = (llDInfo->baseValue == 0);
if (LIKELY((ofBits == 0))) {
offset = seqState->prevOffset[ll0];
- if (UNLIKELY(ll0)) {
- seqState->prevOffset[1] = seqState->prevOffset[0];
- seqState->prevOffset[0] = offset;
- }
+ seqState->prevOffset[1] = seqState->prevOffset[!ll0];
+ seqState->prevOffset[0] = offset;
} else {
offset = ofBase + ll0 + BIT_readBitsFast(&seqState->DStream, 1);
{ size_t temp = (offset==3) ? seqState->prevOffset[0] - 1 : seqState->prevOffset[offset];

I've found that gcc-11 and clang-12 both do well with this patch. Please try it out and let me know if it works well in your tests. If not, I think we go ahead and merge this PR as-is. Our current internal gcc-9 and clang-9 are both neutral with this PR.
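As a side note on the diff above, here is a minimal standalone sketch (hypothetical names, not zstd source) of why the two unconditional stores behave the same as the conditional swap:

#include <assert.h>
#include <stddef.h>

/* Branchy original vs the branchless variant from the patch above; "prev"
 * stands in for seqState->prevOffset. */
static void update_branchy(size_t prev[2], int ll0, size_t* offset)
{
    *offset = prev[ll0];
    if (ll0) {                 /* conditional swap */
        prev[1] = prev[0];
        prev[0] = *offset;
    }
}

static void update_branchless(size_t prev[2], int ll0, size_t* offset)
{
    *offset = prev[ll0];
    prev[1] = prev[!ll0];      /* no-op store when ll0 == 0 */
    prev[0] = *offset;         /* no-op store when ll0 == 0 */
}

int main(void)
{
    for (int ll0 = 0; ll0 <= 1; ++ll0) {
        size_t a[2] = {111, 222}, b[2] = {111, 222};
        size_t offA, offB;
        update_branchy(a, ll0, &offA);
        update_branchless(b, ll0, &offB);
        assert(offA == offB && a[0] == b[0] && a[1] == b[1]);
    }
    return 0;
}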
I tried the patch, it does slightly better for clang and a little bit better for gcc-11, yet I am still able to reproduce some regression for server gcc-11: now it is 2% instead of 4%. Anyway, I am good with this, thanks a lot, I appreciate the effort.

I do remember I tried something similar but probably instead of ...
Glad to hear that you've found a slight improvement as well. I think we should merge this PR with the small gcc-11 regression, since it brings large benefits for clang. And hopefully we'll find a way to improve gcc down the line.
Yeah, that is what I tried first and found it to be worse as well. I didn't dig into why one works and the other doesn't, but my guess is that something (either the CPU or the compiler) doesn't like the potentially aliased stores, or it is just noise, and there is no good reason why it is worse.
Hi @animalize, thanks for the benchmark! We'd accept patches that improve Visual Studio's performance without hurting clang/gcc performance or code legibility. However, VS is not a primary target, and we do not actively measure / optimize VS performance. We already have more than enough trouble trying to keep clang/gcc performance high. Adding another compiler to that list makes the problem much harder. If anything, we'd prefer to drop a "supported performance" compiler rather than add one. However, we do fully support VS for correctness.
Hi, it's me again, now with more significant numbers, graphs, and explanations.
TL;DR: We see a 3% improvement for silesia.tar on server processors with clang-12, and around 6% on laptops. Other clang versions also benefit a lot (see below). GCC is 2-3% faster on laptops and 1-2% slower on servers. At Google we see a significant improvement on our internal data like text logs and protobufs; the gains vary from 5% to 9%, which for us translates into thousands of cores. This also eliminates most of the perf gap between gcc and clang, and we are more than happy to show the improvements and the methodology behind them.
After the 1.5.0 update we saw significant improvements and were interested in some more. We decided to precompile zstd with GCC and saw that the compression speed was more or less on par with the clang build. However, decompression differed by 4-7% pretty stably, and we saw an opportunity there. Having a precompiled assembly or using GCC is not sufficiently stable for us, so we decided to understand where the gap comes from. You can see it even in silesia.tar: the gap for compression level 8 (our default) is around 5.6% for a laptop processor and around 4% for servers.

Idea 1. Mark more conditions as likely and unlikely
In ZSTD_decodeSequence (one of the most consuming decompression functions) there are 3 main if statements. The GCC-annotated objdump for zstd_decompress_block is here, and the clang one is here. You should look for the ZSTD_decompressSequences_bmi2 function, starting with ZSTD_seqSymbol const ofDInfo = seqState->stateOffb.table.
Some key observations that we found:
- GCC already does a great job of handling the conditions for mlBits and llBits, jumping pretty far away if they are not zero.
- Separately, when ofBits > 1, GCC figures out for some reason that this condition holds more often and jumps on the negation of the condition.
- GCC, maybe mistakenly, maybe for a reason, correctly decides which paths to keep close together, and it turns out it gains some percent from that.
- Maybe GCC analyzed the tables: LL_bits and ML_bits contain lots of zeros, while OF_bits is more likely to have values greater than 1. However, I don't believe GCC is that smart, because these tables are located far away from the actual code.
- We gathered some profiles, and clang's decisions about laying out blocks do not line up with GCC's; this can result in some wasted instruction fetch.
- For llBits and mlBits the gaps look slightly scary and avoidable.
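As a rough illustration (not the actual patch) of what annotating these branches looks like in C, assuming GCC/Clang-style __builtin_expect — zstd ships its own LIKELY/UNLIKELY macros, and the real branches in ZSTD_decodeSequence are more involved:

#include <stddef.h>

/* Sketch of expectation hints; zstd defines similar LIKELY/UNLIKELY macros. */
#if defined(__GNUC__)
#  define LIKELY(x)   (__builtin_expect(!!(x), 1))
#  define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical decoder fragment: offsets usually need extra bits (ofBits > 1),
 * while literal-length / match-length extra bits are usually absent, so the
 * compiler is told to lay out the code with those paths as the fall-through. */
static size_t count_extra_bits(unsigned ofBits, unsigned llBits, unsigned mlBits)
{
    size_t nbBits = 0;
    if (LIKELY(ofBits > 1))
        nbBits += ofBits;          /* hot path: read a full offset */
    if (UNLIKELY(llBits > 0))
        nbBits += llBits;          /* cold path: literal-length extra bits */
    if (UNLIKELY(mlBits > 0))
        nbBits += mlBits;          /* cold path: match-length extra bits */
    return nbBits;
}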
Of course, we gathered data on how often these conditions hit. See the dump. Some terminology:
- OF0 is the condition where ofBits == 0, OF1 is when ofBits == 1, and OFMore1 is when ofBits > 1; Ratio OF is OFMore1/(OF0 + OF1).
- LL0 is llBits == 0, LL1 is when llBits > 0; Ratio LL is LL0/LL1, and the same for ML.
- LLD corresponds to another branch in the code, if (LIKELY(!ll0)) {: when it is taken, LLD1 gets +1, otherwise LLD0; the ratio is LLD1/LLD0.

So all ratios correspond to the GCC decisions (likeliness for ofBits > 1, unlikeliness for {ml,ll}Bits > 0).

The dump contains first the name of the dataset, then the compression level, then the ratio. We checked silesia and other datasets, for example the ones snappy uses. As you may see, ratios below 2 are super rare for all data. 32 out of 41 of the 0.something values correspond to LLD, which is already marked as LIKELY; OF is below 1 only for gaviota. Lower than 2 is also pretty rare. Overall, the average ratio is around 205: 297 for LL, 137 for ML, 90 for OF, 294 for LLD. The likeliness also does not seem to depend on the compression level. Anyway, it is very reasonable to mark these specific branches as likely/unlikely. Good to know that OF1 was always zero; it looks like that case is not currently produced by the compression side, however, fuzzing hits that line.

After that, clang improved by 2% for silesia and we already saw gains. Unfortunately, GCC decided to move blocks around and performance degraded, so we decided to apply the likeliness hints only for clang.
Idea 2. BMI2 codegen
As BMI2 instructions are evolving and zstd has a dynamic dispatch for these things, we noticed that it accesses some memory a little more often, and this was the static table BIT_mask, which stores (1 << i) - 1 for all i from 0 to 31. And the code was something like this: the and, add, mov sequence seemed suspicious to me, as did using the r11 register for static memory. I remembered the bzhi instruction (BZHI — Zero High Bits Starting with Specified Bit Position), which just drops the unwanted high bits. The gut feeling was right, the codegen is much better with bmi2: https://gcc.godbolt.org/z/cac5nr5hs

Clang produces optimal code. LLVM-MCA clearly shows it is better than the mask: even if we consider all memory accesses to be L1-cached, it is still better in the number of instructions and worst-case latencies.

We changed from the mask approach and the _bmi2 functions improved in their codegen. When bmi2 is disabled, for example because of old hardware (this can only happen pre-Haswell), we consider such hardware not important enough for performance to justify the complexity of dispatching BIT_getMiddleBits to BIT_getMiddleBits_bmi2 and so on. I don't recall any comparable instructions for ARM and others, which is why I put the ifdef on x86_64 platforms only.

It improved things by another 0.3-0.4%, though that is hard to estimate; it certainly does not make things worse.
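For context, a minimal standalone sketch of the two approaches (simplified names, assuming x86-64 with BMI2 available at compile time; an illustration, not the exact zstd change):

#include <immintrin.h>   /* _bzhi_u64; compile with -mbmi2 */
#include <stdint.h>

/* Mask-table variant, roughly what the generic code does: shift to the start
 * bit, then AND with a mask loaded from static memory. */
static const uint32_t BIT_mask_sketch[32] = {
    0, 1, 3, 7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF,
    0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF, 0x1FFFF, 0x3FFFF, 0x7FFFF,
    0xFFFFF, 0x1FFFFF, 0x3FFFFF, 0x7FFFFF, 0xFFFFFF, 0x1FFFFFF,
    0x3FFFFFF, 0x7FFFFFF, 0xFFFFFFF, 0x1FFFFFFF, 0x3FFFFFFF, 0x7FFFFFFF };

static uint64_t getMiddleBits_mask(uint64_t bitContainer, uint32_t start, uint32_t nbBits)
{
    /* the mask load is the extra memory access mentioned above */
    return (bitContainer >> start) & BIT_mask_sketch[nbBits];
}

/* BMI2 variant: bzhi zeroes all bits at positions >= nbBits,
 * so the table lookup disappears from the hot path. */
static uint64_t getMiddleBits_bzhi(uint64_t bitContainer, uint32_t start, uint32_t nbBits)
{
    return _bzhi_u64(bitContainer >> start, nbBits);
}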
Idea 3. Preamble
After the changes above, the stack memory used for the variables defined in the preamble started to live its own life in GCC, and to a lesser extent in clang.
Storing nextState in eax and moving it back onto the stack is not the best policy. After the changes, even though some other variables were still pushed onto the stack, it became slightly faster and allowed removing the ifdef for gcc and clang on WithDInfo. I bet alignment played a role, or the stack address comes into the cache a little bit earlier. Either way, changing where variables are placed is a good idea, because compilers already unpack all structures within the definitions and use them for updating the FSE state. After that, performance stabilized for almost all compilers.
I also moved
to
It saved a couple of instructions for the stack spill, and registers were reused after decodeSequence in execSequence, which is a good thing to have.
Idea 4. Previous patch
I moved
to
Because gcc-9 had bad perf for lazy2-compressed files. It seemed like a slight performance win for clang, and it reasoned a bit better about block fusion.
Results
I was manually testing silesia.tar with all compression levels from 0 to 22.
And then these results were aggregated. Perf data for laptop, perf data for server.
Benchmark graphs were attached for each configuration:
- clang 12 laptop
- clang 12 production server
- gcc 10 laptop
- gcc 10 production server (note it became slightly worse)
- clang 11 laptop
- clang 11 server
- clang 10 laptop
- clang 10 server
- clang 9 laptop
- clang 9 server
- gcc 8 laptop
- gcc 8 server
- gcc 9 laptop
- gcc 9 server
- GCC vs Clang without patch, laptop
- GCC vs Clang with patch, laptop
- GCC vs Clang without patch, xeon
- GCC vs Clang with patch, xeon
Snappy benchmark (level 8, window log 20, our defaults at Google):
And a couple of internal workloads which we ran for quite some time showed a +5-6% improvement.