Optimize zstd decompression by another x% #2689
Conversation
Awesome PR & analysis!
This is looking good to me! I've done some initial measurements and they look good. I will spend a bit more time measuring, then we should be good to merge this, with the change to make it C89 compliant.
Note that on my machine (i9-9900K) gcc-11 has some alignment issues that cause a ~10% regression. But, changing the alignment to .p2align 4 fixes that. This is a known issue, see:
zstd/lib/decompress/zstd_decompress_block.c
Lines 1128 to 1175 in 18d02cb
/* Align the decompression loop to 32 + 16 bytes.
 *
 * zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
 * speed swings based on the alignment of the decompression loop. This
 * performance swing is caused by parts of the decompression loop falling
 * out of the DSB. The entire decompression loop should fit in the DSB,
 * when it can't we get much worse performance. You can measure if you've
 * hit the good case or the bad case with this perf command for some
 * compressed file test.zst:
 *
 *     perf stat -e cycles -e instructions -e idq.all_dsb_cycles_any_uops \
 *         -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
 *
 * If you see most cycles served out of the MITE you've hit the bad case.
 * If you see most cycles served out of the DSB you've hit the good case.
 * If it is pretty even then you may be in an okay case.
 *
 * This issue has been reproduced on the following CPUs:
 * - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
 *             Use Instruments->Counters to get DSB/MITE cycles.
 *             I never got performance swings, but I was able to
 *             go from the good case of mostly DSB to half of the
 *             cycles served from MITE.
 * - Coffeelake: Intel i9-9900k
 * - Coffeelake: Intel i7-9700k
 *
 * I haven't been able to reproduce the instability or DSB misses on any
 * of the following CPUs:
 * - Haswell
 * - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
 * - Skylake
 *
 * If you are seeing performance stability this script can help test.
 * It tests on 4 commits in zstd where I saw performance change.
 *
 * https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
 */
__asm__(".p2align 6");
__asm__("nop");
__asm__(".p2align 5");
__asm__("nop");
# if __GNUC__ >= 9
    /* better for gcc-9 and gcc-10, worse for clang and gcc-8 */
    __asm__(".p2align 3");
# else
    __asm__(".p2align 4");
# endif
#endif
Thanks, let me try out alignment for other gcc versions.
Not all Intel CPUs have this problem, it seems to only be the newer consumer CPUs. I haven't seen the same on Skylake server CPUs, but it may be present in some of the newer ones. You can use the perf commands referenced in the comments to see if you're running into the DSB issues.
After several rounds of evaluation, I agree gcc-11 is worse; however, gcc-9 and gcc-10 are better with .p2align 3. Adjusted to that.
Thanks for doing that! The alignment issues are super annoying, and make it very hard to get consistent measurements, since any small change to the decoder can cause a 10% drop in performance. I'll dedicate some time this week to measure performance across the various machines I have, then we can get this merged.
Just FYI, Intel released an Optimization Reference Manual yesterday: https://www.phoronix.com/scan.php?page=news_item&px=Intel-ORM-Code-Samples
Hi, @terrelln, were you able to take a look?
Yeah, I ran some benchmarks on Friday & today. I still see a consistent gcc-11 regression of ~4% on my i9-9900k (down from ~10%). But, I have confirmed the clang speedup. It looks like gcc-11 is hitting the bad case of falling out of the DSB both before & after this patch. Comparing v1.4.8 and v1.5.0 on gcc-11 there is a 5% regression, and I can't fix it by messing with the alignment.

I still have to measure on two more machines, both of which don't have alignment issues. I want to accept this PR because the changes make sense, and performance parity with clang would be great. But the fact that it isn't a pure win, and that there are already gcc regressions, makes it tricky.

I'm collecting the results of my benchmarks, and I will share them here when I'm done. In the meantime I'm going to continue trying to fiddle with the code to make gcc happy.
I also was not able to remove the 3-4% performance difference for gcc-11. I have several options:
The patch is extremely valuable for other configurations and for us specifically; it would be unfortunate to abandon it.
Hi, @terrelln, I would like to know how we should proceed with this PR. Were you able to test on other hardware?
Sorry for the radio silence. I've been a bit busy, and kicking this task down the road because it's a bit of a puzzler. But, I think I may have found a solution to our gcc-11 problems:

diff --git a/lib/decompress/zstd_decompress_block.c b/lib/decompress/zstd_decompress_block.c
index e5391d66..9acf2f74 100644
--- a/lib/decompress/zstd_decompress_block.c
+++ b/lib/decompress/zstd_decompress_block.c
@@ -977,10 +977,8 @@ ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
U32 const ll0 = (llDInfo->baseValue == 0);
if (LIKELY((ofBits == 0))) {
offset = seqState->prevOffset[ll0];
- if (UNLIKELY(ll0)) {
- seqState->prevOffset[1] = seqState->prevOffset[0];
- seqState->prevOffset[0] = offset;
- }
+ seqState->prevOffset[1] = seqState->prevOffset[!ll0];
+ seqState->prevOffset[0] = offset;
} else {
offset = ofBase + ll0 + BIT_readBitsFast(&seqState->DStream, 1);
{ size_t temp = (offset==3) ? seqState->prevOffset[0] - 1 : seqState->prevOffset[offset];

I've found that gcc-11 and clang-12 both do well with this patch. Please try it out and let me know if it works well in your tests. If not, I think we go ahead and merge this PR as-is. Our current internal gcc-9 and clang-9 are both neutral with this PR.
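As a side note on the diff above, here is a minimal standalone sketch (hypothetical names, not zstd source) of why the two unconditional stores behave the same as the conditional swap:

#include <assert.h>
#include <stddef.h>

/* Branchy original vs the branchless variant from the patch above; "prev"
 * stands in for seqState->prevOffset. */
static void update_branchy(size_t prev[2], int ll0, size_t* offset)
{
    *offset = prev[ll0];
    if (ll0) {                 /* conditional swap */
        prev[1] = prev[0];
        prev[0] = *offset;
    }
}

static void update_branchless(size_t prev[2], int ll0, size_t* offset)
{
    *offset = prev[ll0];
    prev[1] = prev[!ll0];      /* no-op store when ll0 == 0 */
    prev[0] = *offset;         /* no-op store when ll0 == 0 */
}

int main(void)
{
    for (int ll0 = 0; ll0 <= 1; ++ll0) {
        size_t a[2] = {111, 222}, b[2] = {111, 222};
        size_t offA, offB;
        update_branchy(a, ll0, &offA);
        update_branchless(b, ll0, &offB);
        assert(offA == offB && a[0] == b[0] && a[1] == b[1]);
    }
    return 0;
}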
I tried the patch, it does slightly better for clang and a little bit better for gcc-11, yet I am still able to reproduce some regression for server gcc-11: now it is 2% instead of 4%. Anyway, I am good with this, thanks a lot, I appreciate the effort.

I do remember I tried something similar but probably instead of ...
Glad to hear that you've found a slight improvement as well. I think we should merge this PR with the small gcc-11 regression, since it brings large benefits for clang. And hopefully we'll find a way to improve gcc down the line.
Yeah, that is what I tried first and found it to be worse as well. I didn't dig into why one works and the other doesn't, but my guess is that something (either the CPU or the compiler) doesn't like the potentially aliased stores, or it is just noise, and there is no good reason why it is worse.
Hi @animalize, thanks for the benchmark! We'd accept patches that improve Visual Studio's performance without hurting clang/gcc performance or code legibility. However, VS is not a primary target, and we do not actively measure / optimize VS performance. We already have more than enough trouble trying to keep clang/gcc performance high. Adding another compiler to that list makes the problem much harder. If anything, we'd prefer to drop a "supported performance" compiler rather than add one. However, we do fully support VS for correctness.
Hi, it's me again, now with more significant numbers, graphs, and explanations.
TL;DR: We see a 3% improvement for silesia.tar on server processors with clang-12, and around 6% on laptops. Other clang versions also benefit a lot (see below). GCC is 2-3% faster on laptops and 1-2% slower on servers. At Google we see a significant improvement on our internal data like text logs and protobufs; the gains vary from 5% to 9%, which for us translates into thousands of cores. This also eliminates most of the perf gap between gcc and clang, and we are more than happy to show the improvements and the methodology behind them.
After the 1.5.0 update we saw significant improvements and were interested in some more. We decided to precompile zstd with GCC and saw that the compression speed was more or less on par with the clang build. However, decompression differed by 4-7% pretty stably, and we saw an opportunity there. Having a precompiled assembly or using GCC is not sufficiently stable for us, so we decided to understand where the gap comes from. You can see it even in silesia.tar: the gap for compression level 8 (our default) is around 5.6% for a laptop processor and around 4% for servers.

Idea 1. Mark more conditions as likely and unlikely
In ZSTD_decodeSequence (one of the most consuming decompression functions) there are 3 main if statements. The GCC-annotated objdump for zstd_decompress_block is here, and the clang one is here. You should look for the ZSTD_decompressSequences_bmi2 function, starting with ZSTD_seqSymbol const ofDInfo = seqState->stateOffb.table.
Some key observations that we found:
- GCC already does a great job of handling the conditions for mlBits and llBits, jumping pretty far away if they are not zero.
- Separately, when ofBits > 1, GCC figures out for some reason that this condition holds more often and jumps on the negation of the condition.
- GCC, maybe mistakenly, maybe for a reason, correctly decides which paths to keep close together, and it turns out it gains some percent from that.
- Maybe GCC analyzed the tables: LL_bits and ML_bits contain lots of zeros, while OF_bits is more likely to have values greater than 1. However, I don't believe GCC is that smart, because these tables are located far away from the actual code.
- We gathered some profiles, and clang's decisions about laying out blocks do not line up with GCC's; this can result in some wasted instruction fetch.
- For llBits and mlBits the gaps look slightly scary and avoidable.
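As a rough illustration (not the actual patch) of what annotating these branches looks like in C, assuming GCC/Clang-style __builtin_expect — zstd ships its own LIKELY/UNLIKELY macros, and the real branches in ZSTD_decodeSequence are more involved:

#include <stddef.h>

/* Sketch of expectation hints; zstd defines similar LIKELY/UNLIKELY macros. */
#if defined(__GNUC__)
#  define LIKELY(x)   (__builtin_expect(!!(x), 1))
#  define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical decoder fragment: offsets usually need extra bits (ofBits > 1),
 * while literal-length / match-length extra bits are usually absent, so the
 * compiler is told to lay out the code with those paths as the fall-through. */
static size_t count_extra_bits(unsigned ofBits, unsigned llBits, unsigned mlBits)
{
    size_t nbBits = 0;
    if (LIKELY(ofBits > 1))
        nbBits += ofBits;          /* hot path: read a full offset */
    if (UNLIKELY(llBits > 0))
        nbBits += llBits;          /* cold path: literal-length extra bits */
    if (UNLIKELY(mlBits > 0))
        nbBits += mlBits;          /* cold path: match-length extra bits */
    return nbBits;
}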
Of course, we gathered data on how often these conditions hit. See the dump. Some terminology:
- OF0 is the condition where ofBits == 0, OF1 is when ofBits == 1, and OFMore1 is when ofBits > 1; Ratio OF is OFMore1/(OF0 + OF1).
- LL0 is llBits == 0, LL1 is when llBits > 0; Ratio LL is LL0/LL1, and the same for ML.
- LLD corresponds to another branch in the code, if (LIKELY(!ll0)) {: when it is taken, LLD1 gets +1, otherwise LLD0; the ratio is LLD1/LLD0.

So all ratios correspond to the GCC decisions (likeliness for ofBits > 1, unlikeliness for {ml,ll}Bits > 0).

The dump contains first the name of the dataset, then the compression level, then the ratio. We checked silesia and other datasets, for example the ones snappy uses. As you may see, ratios below 2 are super rare for all data. 32 out of 41 of the 0.something values correspond to LLD, which is already marked as LIKELY; OF is below 1 only for gaviota. Lower than 2 is also pretty rare. Overall, the average ratio is around 205: 297 for LL, 137 for ML, 90 for OF, 294 for LLD. The likeliness also does not seem to depend on the compression level. Anyway, it is very reasonable to mark these specific branches as likely/unlikely. Good to know that OF1 was always zero; it looks like that case is not currently produced by the compression side, however, fuzzing hits that line.

After that, clang improved by 2% for silesia and we already saw gains. Unfortunately, GCC decided to move blocks around and performance degraded, so we decided to apply the likeliness hints only for clang.
Idea 2. BMI2 codegen
As BMI2 instructions are evolving and zstd has a dynamic dispatch for these things, we noticed that it accesses some memory a little more often, and this was the static table BIT_mask, which stores (1 << i) - 1 for all i from 0 to 31. And the code was something like this: the and, add, mov sequence seemed suspicious to me, as did using the r11 register for static memory. I remembered the bzhi instruction (BZHI — Zero High Bits Starting with Specified Bit Position), which just drops the unwanted high bits. The gut feeling was right, the codegen is much better with bmi2: https://gcc.godbolt.org/z/cac5nr5hs

Clang produces optimal code. LLVM-MCA clearly shows it is better than the mask: even if we consider all memory accesses to be L1-cached, it is still better in the number of instructions and worst-case latencies.

We changed from the mask approach and the _bmi2 functions improved in their codegen. When bmi2 is disabled, for example because of old hardware (this can only happen pre-Haswell), we consider such hardware not important enough for performance to justify the complexity of dispatching BIT_getMiddleBits to BIT_getMiddleBits_bmi2 and so on. I don't recall any comparable instructions for ARM and others, which is why I put the ifdef on x86_64 platforms only.

It improved things by another 0.3-0.4%, though that is hard to estimate; it certainly does not make things worse.
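For context, a minimal standalone sketch of the two approaches (simplified names, assuming x86-64 with BMI2 available at compile time; an illustration, not the exact zstd change):

#include <immintrin.h>   /* _bzhi_u64; compile with -mbmi2 */
#include <stdint.h>

/* Mask-table variant, roughly what the generic code does: shift to the start
 * bit, then AND with a mask loaded from static memory. */
static const uint32_t BIT_mask_sketch[32] = {
    0, 1, 3, 7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF,
    0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF, 0x1FFFF, 0x3FFFF, 0x7FFFF,
    0xFFFFF, 0x1FFFFF, 0x3FFFFF, 0x7FFFFF, 0xFFFFFF, 0x1FFFFFF,
    0x3FFFFFF, 0x7FFFFFF, 0xFFFFFFF, 0x1FFFFFFF, 0x3FFFFFFF, 0x7FFFFFFF };

static uint64_t getMiddleBits_mask(uint64_t bitContainer, uint32_t start, uint32_t nbBits)
{
    /* the mask load is the extra memory access mentioned above */
    return (bitContainer >> start) & BIT_mask_sketch[nbBits];
}

/* BMI2 variant: bzhi zeroes all bits at positions >= nbBits,
 * so the table lookup disappears from the hot path. */
static uint64_t getMiddleBits_bzhi(uint64_t bitContainer, uint32_t start, uint32_t nbBits)
{
    return _bzhi_u64(bitContainer >> start, nbBits);
}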
Idea 3. Preamble
After the changes above, the stack memory used for the variables defined in the preamble started to live its own life in GCC, and to a lesser extent in clang.
Storing nextState in eax and moving it back onto the stack is not the best policy. After the changes, even though some other variables were still pushed onto the stack, it became slightly faster and allowed removing the ifdef for gcc and clang on WithDInfo. I bet alignment played a role, or the stack address comes into the cache a little bit earlier. Either way, changing where variables are placed is a good idea, because compilers already unpack all structures within the definitions and use them for updating the FSE state. After that, performance stabilized for almost all compilers.
I also moved
to
It saved a couple of instructions for the stack spill, and registers were reused after decodeSequence in execSequence, which is a good thing to have.
Idea 4. Previous patch
I moved
to
Because gcc-9 had bad perf for lazy2-compressed files. It seemed like a slight performance win for clang, and it reasoned a bit better about block fusion.
Results
I was manually testing silesia.tar with all compression levels from 0 to 22.
And then these results were aggregated. Perf data for laptop, perf data for server.
Benchmark graphs were attached for each configuration:
- clang 12 laptop
- clang 12 production server
- gcc 10 laptop
- gcc 10 production server (note it became slightly worse)
- clang 11 laptop
- clang 11 server
- clang 10 laptop
- clang 10 server
- clang 9 laptop
- clang 9 server
- gcc 8 laptop
- gcc 8 server
- gcc 9 laptop
- gcc 9 server
- GCC vs Clang without patch, laptop
- GCC vs Clang with patch, laptop
- GCC vs Clang without patch, xeon
- GCC vs Clang with patch, xeon
Snappy benchmark (level 8, window log 20, our defaults at Google):
And a couple of internal workloads which we ran for quite some time showed a +5-6% improvement.