
More complete analysis of AVX512 in both gcc and clang #173

Open
valassi opened this issue Apr 23, 2021 · 50 comments

valassi commented Apr 23, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

(A preliminary observation: the clang vectorization still needs a cross-check, see #172)

A couple of observations on performance:

  • on gcc, '512y' (AVX512VL with only 256-bit ymm registers) is slightly faster than AVX2
  • on clang, AVX2 is faster than 512y
  • in both gcc and clang, '512z' (AVX512 with 512-bit zmm registers) is slower than the two above
  • typically the speedup is almost a factor 4 for double and almost a factor 8 for floats (the ME calculation is very much vectorized)

It would be useful to understand these issues a bit better and see if we can squeeze out something. Two points in particular:

  1. About 512z: the slowdown is around 20-30% with respect to 512y or avx2. It would be good to understand:
  • is this only due to processor speed? (I did a very quick test with perf stat, and I have the impression that my 2.4 GHz slows down to 2.0 GHz in 512z)
  • or is there something else beyond processor speed? (one would need some disassembling)
  • apart from the slowdown, do I understand this completely wrong, or could I in principle get an additional factor 2 in throughput? After all we would be doing operations on 8-double vectors rather than 4-double vectors, so we should need a factor 2 fewer instructions... The part of the code we are looking at is very much vectorized, there is little scalar code (otherwise we would not be seeing almost a factor 4 from AVX2...)
  2. About clang: it would be useful to have a quick look at the symbols in the 512y build with respect to the avx2 build. Compare these log messages:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.206133e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.425824 sec
real    0m3.433s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3004) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.140326e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.435411 sec
real    0m3.443s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 3.714051e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.919521 sec
real    0m3.927s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3524) (512y:    0) (512z: 1193)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431103e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.717100 sec
real    0m3.727s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.757142e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.631310 sec
real    0m3.641s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.698749e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.014051 sec
real    0m4.024s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------

The 'symbols' line comes from this script, and in particular from this line:

cnt512y=$(stripSyms '^v.*dqa(32|64).*(x|y)mm' '^v.*(32|64)x2.*(x|y)mm' '^vpcmpneqq.*(x|y)mm')

The "512y" symbols are 0 in clang and are more than 0 in gcc for 512y (while they are 0 in avx2). Essentially, the idea (expaning on a great suggestion by @sponce) was that 512y is a bit faster than avx2 because some extra symbols (from AVX512VL I guess) are used. I did a sytematic analysis and found that a few symbols matching these regular expressions are probably those that make the difference

cnt512y=$(stripSyms '^v.*dqa(32|64).*(x|y)mm' '^v.*(32|64)x2.*(x|y)mm' '^vpcmpneqq.*(x|y)mm')

In clang, however, there does not seem to be any such extra benefit from AVX512VL. This is also consistent with what was suggested by @hageboeck (see also https://arxiv.org/pdf/2003.12875.pdf), namely that clang is best used at AVX2. The idea here is to do a brief cross-check, and potentially to try newer clang versions. Note by the way that for single precision, 512y is instead very slightly faster than avx2 also in clang, d65129b

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.048762e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734676 sec
real    0m2.742s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.051075e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734637 sec
real    0m2.741s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3204) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 7.302535e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.984331 sec
real    0m2.991s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3928) (512y:    0) (512z: 1872)
-------------------------------------------------------------------------

Low priority... but at least it's documented before I move on.

valassi added the performance ('How fast is it? Make it go faster!') label on Apr 23, 2021

sponce commented Apr 23, 2021

Just a comment on

About 512z, the slowdown is around 20-30% with respect to 512y or avx2. It would be good to understand
- is this only due to processor speed? (I did a very very quick test with perf stat, and I have the impression that my 2.4 GHz slows down to 2.0 GHz in 512z)

Yes, essentially. Of course, as you discuss below, you may gain something back since you can do twice as many operations per cycle, but when you lose 30% on clock speed to start with, you can compute from Amdahl's law that it's quite a challenge for standard code to end up faster. I've actually never seen it happen in HEP.

Concerning the clock frequency reduction when using full AVX512: it depends on your processor, but you have to know that any AVX512 instruction will slow down the clock for roughly the next millisecond, yes, for millions of instructions. So basically you're running at the reduced clock all the time.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
…adgraph5#173)


valassi commented Apr 27, 2021

Hi @sponce thanks a lot for the feedback!

My point is that the numbers I am quoting (MEs/sec) ONLY refer to the part which is vectorized. As an example, here is a run I have just done from the latest master, ce00881:

time ./build.512z/check.exe -p 2048 256 12
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
*******************************************************************************
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-------------------------------------------------------------------------------
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[8]
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
-------------------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 3.839106e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 3.518191e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.209147e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.841612e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.676579e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.397150e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.394192e-01 ,  1.400849e-01 ]  sec
-------------------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.638782e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.788264e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 3.752555e+06                 )  sec^-1
*******************************************************************************
NumMatrixElems(notAbnormal) = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
*******************************************************************************
0a ProcInit :     0.000255 sec
0b MemAlloc :     0.028964 sec
0c GenCreat :     0.000826 sec
0d SGoodHel :     0.000112 sec
1a GenSeed  :     0.000024 sec
1b GenRnGen :     0.320891 sec
2a RamboIni :     0.125858 sec
2b RamboFin :     1.715753 sec
3a SigmaKin :     1.676579 sec
4a DumpLoop :     0.041980 sec
8a CompStat :     0.067752 sec
9a GenDestr :     0.000093 sec
9b DumpScrn :     0.003942 sec
9c DumpJson :     0.000002 sec
TOTAL       :     3.983032 sec
TOTAL (123) :     3.839106 sec
TOTAL  (23) :     3.518191 sec
TOTAL   (1) :     0.320915 sec
TOTAL   (2) :     1.841612 sec
TOTAL   (3) :     1.676579 sec
*******************************************************************************
real    0m3.992s
user    0m3.940s
sys     0m0.051s

The throughput is computed from "TOTAL (3)", which is sigmaKin and does the matrix element calculation. True, there are other things before and afterwards, but the throughput number we are quoting is not affected.

There are three things I was considering in order to understand this better:

  • 1, measure the processor speed, and/or use perf: I have done this, giving some results below
  • 2, look at the objdump for the various functions: I can dump them here, but I know too little assembly to make sense of them
  • 3, look at Intel Advisor, which is the vectorization advisor: this is something that @lfield has suggested he will do (thanks!)

In the commit I mentioned above, I gave in the log some first results from perf stat. They are quite interesting:

On itscrd03.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.238520e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.781936 sec
     2,522,542,287      cycles                    #    2.658 GHz
     3,487,190,786      instructions              #    1.38  insn per cycle
       1.070591449 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.312061e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.151202 sec
    19,150,190,925      cycles                    #    2.676 GHz
    48,624,130,145      instructions              #    2.54  insn per cycle
       7.160649288 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.522426e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.855343 sec
    12,990,550,722      cycles                    #    2.672 GHz
    29,947,264,265      instructions              #    2.31  insn per cycle
       4.864907320 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.558120e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.701845 sec
     9,362,892,708      cycles                    #    2.525 GHz
    16,560,124,475      instructions              #    1.77  insn per cycle
       3.711390559 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.913520e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.570205 sec
     9,047,092,255      cycles                    #    2.529 GHz
    16,496,830,998      instructions              #    1.82  insn per cycle
       3.580131314 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.754219e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.983833 sec
     8,829,184,043      cycles                    #    2.213 GHz
    13,360,526,672      instructions              #    1.51  insn per cycle
       3.993458249 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
-------------------------------------------------------------------------

One sees that

  • there is indeed a clock reduction (from ~2.53 GHz in 512y to ~2.21 GHz in 512z in the perf numbers above)
  • the number of instructions executed drops, probably much less than I was expecting/hoping, but still by quite a bit (unfortunately, here the instructions are for sigmaKin, rambo and the random numbers all together; I am not sure if we can disentangle them)
  • most interestingly, maybe, the number of instructions per cycle also goes down from 512y (your "AVX256") to 512z

Anyway... some more studies to do. I just wanted to dump some info here.

@lfield: I suggest you can take the current master, or that commit I mention above, and try it out. The instructions for the build are in issue #177 (ignore the suggestion at the end about building a single binary - this is where I documented the different binaries we have now)... thanks!

Andrea


valassi commented Apr 27, 2021

Attaching the five objdumps of one of the simplest functions, FFV1_0, from

void FFV1_0( const cxtype_sv F1[], // input: wavefunction1[6]

CPPProcess.o.objdump.FFV1_0.avx2.txt
CPPProcess.o.objdump.FFV1_0.none.txt
CPPProcess.o.objdump.FFV1_0.sse4.txt
CPPProcess.o.objdump.FFV1_0.512y.txt
CPPProcess.o.objdump.FFV1_0.512z.txt

This was from

for avx in none sse4 avx2 512y 512z; do objdump -d -C ./build.$avx/CPPProcess.o | awk -v RS= '/^[[:xdigit:]]+ <MG5_sm::FFV1_0.*/' > CPPProcess.o.objdump.FFV1_0.${avx}.txt; done


sponce commented Apr 28, 2021

From a quick look, you get very typical numbers in terms of frequency reduction and gain in number of instructions. Hence my comment that "it's quite a challenge for standard code to be faster at the end". Note that by "standard" I did not mean "standard HEP", but rather good to excellent code from a HEP point of view :-)

Now, it is clear that there are causes that you can analyze and address, so that you finally gain. But definitely not something that you can do broadly for all your code.


valassi commented Apr 28, 2021

Thanks Sebastien :-)
Let's see: getting this far was "relatively" easy (thanks also to your documentation and help!); there was quite a bit of tedious/boilerplate/test work, but the basic ideas were surprisingly fast to implement. From here onwards, instead, I am not sure any further improvements will be easy.

One thing I would find useful is to get 'perf stat' counters only for the part that is relevant. I opened #190 on this. If you have any suggestions, they are most welcome :-)

Thanks
Andrea


lfield commented Apr 28, 2021

Note that all the results are for the epoch1/cuda/ee_mumu code running on my 4-core NUC with 5 threads, using the parameters 2048 256 12 and the Common Random Numbers method.

Code from a few months ago:

  • Elapsed Time 12.04s
  • Total CPU time 12.72s
  • Time in 3 vectorized loops 1.99s (15.6%)
  • Time in scalar code 10.73s (84.4%)
  • Roofline Plot

Current master:

  • Elapsed Time 8.86s
  • Total CPU time 9.41s
  • Time in 4 vectorized loops 8.82s (93.7%)
  • Time in scalar code 0.59s (6.3%)
  • Roofline Plot

Current master with SSE4:

  • Elapsed Time 5.52s
  • Total CPU time 6.11s
  • Time in 4 vectorized loops 5.49s (89.8%)
  • Time in scalar code 0.62s (10.2%)
  • Roofline Plot

Current master with AVX2:

  • Elapsed Time 4.01s
  • Total CPU time 4.59s
  • Time in 5 vectorized loops 4.00s (87.2%)
  • Time in scalar code 0.59s (12.8%)
  • Roofline Plot

Advisor has the following recommendations:

  1. Use the smallest data type - loop in std::thread::_State_impl<std::thread::_Invoker<std::tuple<CommonRandomNumbers::startGenerateAsync(void, std::vector<std::promise<std::vector<double, std::allocator>>, std::allocator<std::promise<std::vector<double, std::allocator>>>>&, unsigned long, unsigned long, unsigned int, unsigned long)::{lambda(unsigned long)#1}, unsigned int>>>::_M_run
  2. Enable the use of approximate sqrt instructions - loop in _ZN11rambo2toNm015getMomentaFinalEdPKdPDv4_dPdi
  3. Vectorize serialized function(s) inside loop - loop in Proc::calculate_wavefunctions._omp_fn.0
  4. Enable the use of approximate division instructions - loop in Proc::calculate_wavefunctions._omp_fn.0
  5. Enable the use of approximate sqrt instructions - loop in Proc::calculate_wavefunctions._omp_fn.0

I haven't tried avx512 as my NUC doesn't support it. Will try it on some other hardware and post the results.


lfield commented Apr 28, 2021

I also ran the AVX2 build through the Application Performance Snapshot

  • Physical Core Utilization 85.4%
  • Memory Stalls 14.5%
  • Vectorization 73%

"Your application might underutilize the available logical CPU cores because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel® VTune™ Profiler to discover why the CPU is underutilized."


sponce commented Apr 29, 2021

From your numbers, there is a hint on why AVX512 does not perform so well: Amdahl! Indeed your serial code fraction jumps from 6% in the non-vectorized build to 13% in AVX2 and 27% in AVX512. No surprise here. These numbers are already very good actually; 6% is quite low. Now it also means that the hint of the Application Performance Snapshot is correct: you're limited by your parallelization level, or actually vectorization level. And finally it means that "Use the smallest data type", although very good advice, may not bring that much, as it will increase vectorization even more and probably lead to ~50% of the time spent in the serialized part. Still probably a 20% gain, so I would definitely do it! It's only about changing double to float everywhere and checking your results are still correct.

Also I would check how much time you spend in sqrt. There are many ways to improve that:

  • just drop some: it's surprising how often people compare lengths when they could compare their squares!
  • use standard approximations and check the results are not changed
  • use 1/sqrt when possible (much faster than sqrt)
  • use fast 1/sqrt + Newton-Raphson if needed: see all details in paragraph 2.2.2 of Florian's thesis (it's in French, but I know it's not a problem for you); a small sketch of the idea is given below
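For illustration, a minimal standalone sketch of the fast 1/sqrt + Newton-Raphson idea for AVX2 single precision (not code from your project; rsqrt_nr is just an illustrative name): start from the hardware rsqrt approximation (~12 bits) and refine it with one Newton-Raphson step.

#include <immintrin.h>

// y ~= 1/sqrt(x) for 8 packed floats: _mm256_rsqrt_ps gives ~12-bit accuracy,
// one Newton-Raphson step y1 = 0.5*y0*(3 - x*y0*y0) brings it close to full float precision
inline __m256 rsqrt_nr( __m256 x )
{
  const __m256 half  = _mm256_set1_ps( 0.5f );
  const __m256 three = _mm256_set1_ps( 3.0f );
  __m256 y   = _mm256_rsqrt_ps( x );                        // initial approximation y0
  __m256 xyy = _mm256_mul_ps( x, _mm256_mul_ps( y, y ) );   // x*y0*y0
  return _mm256_mul_ps( _mm256_mul_ps( half, y ),           // 0.5*y0*(3 - x*y0*y0)
                        _mm256_sub_ps( three, xyy ) );
}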


valassi commented Apr 29, 2021

Hi, thanks a lot @lfield and @sponce ! Very interesting :-)

OK, one preliminary point: there are three parts in the code: random numbers, rambo (random numbers to momenta), and sigmaKin/calculate_wavefunctions (momenta to matrix elements, e.g. FFV_2_4_3). The third part is the only one I vectorized, and it is generally the largest consumer. The random numbers are either curand on the CPU (if cuda is available) or std on the CPU (if cuda is not available), but in any case not something we optimize. So:

  1. Advisor recommendation 1 is about the std version of the random numbers. Not sure we can do much about it (maybe @hageboeck knows), but in any case it's not really crucial. And I do not expect scalar code here; I assume std/gcc, and also curand, must vectorize in any case. (PS: I had missed this part in Sebastien's advice: if it is just about changing double to float, this is something we could certainly consider... we need double, or even quadruple, precision for some calculations like Feynman diagram loops (issue #191, "Analyse the library dependency (and numerical precision implications) for Feynman diagram loops in NLO"), but for generating random numbers I guess single precision is more than enough, even in production code? I opened #193, "Investigate the possible use of single-precision random numbers also for double-precision madgraph".)
  2. Advisor recommendation 2 on rambo, and also Sebastien's point on the amount of scalar code: yes, I know I should vectorize rambo, and I created #192, "Vectorize rambo". This is actually why I was saying that it would be useful to run perf stat selectively only on the ME part (issue #190, "'perf stat' counters only in the sigmaKin function (ME calculation)"): it is mainly because I know rambo is scalar. However, as not only perf stat but also Advisor and many other tools ultimately see the whole program, in the end it's just easier to vectorize the whole thing...
  3. Advisor recommendations 3 and 4 are interesting to look at; they concern the ME. It would be useful to have some pointers to source code lines. Laurence, does Advisor give that?
  4. The Advisor recommendation about approximate sqrt (recommendation 5), and also Sebastien's advice: again very interesting, something to look at. We actually also have issue #15 ("Investigate how expensive sqrt is for CUDA (registers) and C++ (vectorization etc)") open for similar reasons in the GPU implementation; getting rid of sqrt, if possible, would probably help speed things up there too. I have added a few comments on #15.

Thanks again... quite a few things to consider...


lfield commented Apr 29, 2021

@valassi it says calculate_wavefunctions at HelAmps_sm.cc:175. There are three improvements that could be made here:

  1. Vectorize serialized function(s) inside loop
  2. Enable the use of approximate division instructions
  3. Enable the use of approximate sqrt instructions

Here is a plot for just that function. Note that this is for the AVX2 build.


sponce commented May 3, 2021

we need double, or even quadruple, precision for some calculations

hum... in my experience such a statement is the sign of something more important, to be solved the proper way. Let me explain a typical example I've seen in LHCb. We "need" double precision for matrix inversion, according to popular belief. And indeed, results change if you switch to float. Now, after AMD came into the game, we also got different results there (even with double precision), so I studied the case and found out what should be expected: unstable mathematical code. Unstable in the sense that you compute 1.00...00x - 1 and thus get x with extremely bad precision, if not completely random. Of course switching to double improves x for one iteration, but in this case we had several iterations, leading to garbage whatever happens.

Bottom line: the "need" for doubles (or worse) should be read as "we have unstable code to be fixed". In our case it was matrix diagonalisation for matrices close to unity, and there are good techniques for this.
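As a tiny standalone illustration of this kind of cancellation (not code from the project): computing (1 + x) - 1 for a small x keeps only a handful of significant digits of x in double, and none at all in float.

#include <cstdio>

int main()
{
  const double x  = 1.2345678901234e-9;
  const float  xf = static_cast<float>( x );
  const double dd = ( 1.0  + x  ) - 1.0;  // only ~7 significant digits of x survive
  const float  ff = ( 1.0f + xf ) - 1.0f; // 1.0f + xf rounds to 1.0f, so the result is 0
  std::printf( "x      = %.13e\n", x );
  std::printf( "double = %.13e\n", dd );
  std::printf( "float  = %.13e\n", static_cast<double>( ff ) );
  return 0;
}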


lfield commented May 3, 2021

I ran the avx2, 512y and 512z builds with Intel Advisor and am more confused than enlightened. Firstly even when trying to restrict the code to 1 OMP thread, Intel Advisor states that it used 14 CPU threads. This makes it difficult to understand the roofline plots.

Secondly, the avx2 build vectorized 2 loops while the 512 builds vectorized 3 loops. The 512z build only used AVX512F_512, while the others also used AVX and FMA. Here are some metrics:

  • AVX
    • Elapsed Time: 5.80s
    • CPU time: 5.03s
    • Time in vectorized loops: 48.9%
    • Time in scalar code 51.1%
    • calculate_wavefunctions instruction matrix
      • Memory 35%
      • Compute 33%
      • Mixed 28%
      • Other 4%
  • AVX512y
    • Elapsed Time: 5.59s
    • CPU time: 4.80s
    • Time in vectorized loops: 47.3%
    • Time in scalar code 52.7%
    • calculate_wavefunctions instruction matrix
      • Memory 56%
      • Compute 5%
      • Mixed 6%
      • Other 33%
  • AVX512z
    • Elapsed Time: 6.59s
    • CPU time: 5.83s
    • Time in vectorized loops: 50.8%
    • Time in scalar code 49.2%
    • calculate_wavefunctions instruction matrix
      • Memory 48%
      • Compute 4%
      • Mixed 6%
      • Other 42%


lfield commented May 4, 2021

I was looking at the GPU offload modeling in Advisor using the AVX2 build. It identified 12 regions for potential offloading.

  • 4 in Common Random numbers
  • 2 in Rambo
  • 4 in check.cc
  • 2 in calculate_wavefunctions (HelAmps)

However, although the estimated speedup for the accelerated code is ~18x, the Amdahl's law speedup is 1.8x.
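(For reference, Amdahl's law S = 1/((1-p) + p/s) with a speedup s ≈ 18 on the offloadable regions and an overall S ≈ 1.8 implies an offloadable runtime fraction p ≈ 0.47, i.e. only about half of the total time is spent in the regions that could be offloaded.)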

@hageboeck

Hi Laurence,
the common random numbers were not meant to be fast. They are only for testing the accuracy of the computations with stable random numbers.

For benchmarks, I recommend "native" random generators that can run on a device, like curand for cuda.
If there's nothing like this in SYCL/oneAPI, one could start profiling after the random numbers have been produced.


valassi commented May 4, 2021 via email


lfield commented May 4, 2021

I know that we are not currently focusing on the random number generation, but I mentioned it for completeness. And yes, maybe the other CPU threads are used by the non-OMP code.


valassi commented May 28, 2021

On the AVX512 topic, I just had an extremely interesting discussion with @HadrienG2 (thanks a lot again Hadrien!).

He suggests that one possible explanation of the suboptimal speedup from AVX512 is the processor I am using for tests, an Intel Xeon Silver 4216, which according to Intel's specifications has only one AVX-512 FMA unit (source: https://ark.intel.com/content/www/us/en/ark/products/193394/intel-xeon-silver-4216-processor-22m-cache-2-10-ghz.html). If I understand correctly, this single unit fuses two AVX2 units. In this model, I do not get a x2 speedup from AVX512 with respect to AVX2, because with AVX2 I can use ports 0 and 1 to execute two streams of instructions in parallel (instruction level parallelism), while the fused AVX512 unit occupies both ports 0 and 1 by itself; therefore the x2 speedup from AVX512 is lost because of a 1/2 factor from pushing half as many instructions through the ports (and ON TOP of that I also get the clock slowdown, so it is even slower than AVX2).

Now, Hadrien also suggests that higher end Lake processors have two AVX512 FMA units, because there is an optional second unit on port 5. (Question: but then, do I not get the same issue, that I could use four AVX2 instructions on ports 0,1,5,6, as opposed to only two AVX512 on ports 0+1 and 5+6? just speculating and inventing port6...). Anyway, I will try to find one of these processors, eg https://www.intel.com/content/www/us/en/products/sku/198017/intel-core-i910980xe-extreme-edition-processor-24-75m-cache-3-00-ghz/specifications.html?wapkw=i9-10980XE.

(Aside: a nice diagram of lake architectures: https://en.wikichip.org/w/images/e/ee/skylake_server_block_diagram.svg - this may also be useful to understand simd register pressure eg 16 registers per core in SSE and 32 per core in AVX2)

(Another aside: to understand the assembly of AVX2 vs AVX512, which is another complementary approach to analyse this issue further, Hadrien suggests this book https://www.apress.com/gp/book/9781484240625, Modern X86 Assembly Language Programming - Covers x86 64-bit, AVX, AVX2, and AVX-512 | Daniel Kusswurm)

I will try to find a different more modern lake processor and repeat the test... thanks so much again Hadrien!
Andrea


HadrienG2 commented May 28, 2021

Random clarifications:

Now, Hadrien also suggests that higher end Lake processors have two AVX512 FMA units, because there is an optional second unit on port 5. (Question: but then, do I not get the same issue, that I could use four AVX2 instructions on ports 0,1,5,6, as opposed to only two AVX512 on ports 0+1 and 5+6? just speculating and inventing port6...).

If you look at the bottom of this microarchitecture block diagram, you will see Intel only used the "fuse two 256-bit units into a 512-bit FMA unit" trick on port 0 + port 1. On the higher-end CPUs, port 5 features a genuine 512-bit FMA unit, which will only be used if you're computing 512-bit FMAs (notice the "zmm only" annotation). So if your workload is purely compute-bound, you should be able to run 2x more FMAs per cycle on those CPUs and get the expected 2x FLOPs benefit.

I do not know why Intel do not allow port 5 to execute 256-bit FMAs. I thought it would, and just noticed my mistake while reviewing this comment.

Also, for reasons detailed below, many workloads are bound not by FLOPs but by memory operations, and here again using AVX-512 with its native 512-bit vector width can help by letting you shuffle around more data per cycle between the L1 cache and CPU registers. If you look at the aforementioned microarchitecture diagram, you will see that e.g. ports 2 and 3 can load two 512-bit vectors per cycle, whereas if you use 256-bit vector instructions, they will only load two 256-bit vectors per cycle, i.e. half as much data. If this becomes the limiting factor, then 512-bit instructions can be 2x faster even on lower-end CPUs.

(But this is only the limiting factor if you manage to get the input data to fit in L1 cache. Otherwise you will be limited by the bandwidth of the interconnect between the L1 and L2 cache, which is 64B/512b per cycle and thus fully saturated already when using 256-bit memory loads)

(Aside: a nice diagram of lake architectures: https://en.wikichip.org/w/images/e/ee/skylake_server_block_diagram.svg - this may also be useful to understand simd register pressure eg 16 registers per core in SSE and 32 per core in AVX2)

This diagram does not feature architectural registers, but you will see a handy schematic of these here : https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions . Notice that storage from the 512-bit ZMM registers is reused by the 256-bit YMM and 128-bit XMM registers. Notice also that when used in 256-bit mode, AVX-512 instructions still have access to 16 more YMM registers than classic AVX(2) instructions.

To see how registers come into the picture, you need to know that in the lucky scenario where all your input data is in the L1 cache, each core of an Intel *Lake CPU can do all of the following (and a few extras) in a single CPU cycle:

  1. Execute two 256-bit FMAs / one 512-bit FMA per cycle (lower-end CPUs) or two 256-bit FMAs / two 512-bit FMAs per cycle (high-end CPUs), on data which is already resident in CPU registers.
  2. Transfer two 512-bit or 256-bit vectors from "memory" (actually the L1 cache) to CPU registers.
  3. Transfer one 512-bit or 256-bit vector from a CPU register to "memory" (actually the L1 cache).

Since FMAs have three input operands, it quickly becomes obvious that if you were to load all of these operands from memory to registers every time you want to execute an FMA, you would be limited by performance characteristic number 2 (how many SIMD vectors you can load from memory to registers per CPU cycle) and not performance characteristic number 1 (how many multiply-add floating-point operations you can perform per cycle).

The secret to avoid this outcome is to keep reused operands in CPU registers, rather than constantly reload them from memory. When using higher-level programming models like autovectorization or GCC's vector extension, your compiler does its best to take care of this matter for you. But it is still limited in this caching process by the number of registers that the CPU architecture provides it with.

In AVX2, you have 16 of these, whereas in AVX-512 you have 32 of them (even when using the AVX-512 instruction set to process 256-bit vectors), which gives the compiler twice as much "breathing room" to intelligently cache your data. This is one of the reasons why the AVX-512 instruction set can improve your performance, even when used on 256-bit vectors.

Another reason is that the AVX-512 instruction set is better than legacy AVX at shuffling data around between CPU registers, which comes in handy when your input data is not quite organized the way SIMD instruction sets like it. For example, when working with complex numbers, operating on arrays of std::complex takes preprocessing on the CPU side, which you could avoid if you had one array of real parts and one array of imaginary parts instead. But when you do have to engage in such deinterleaving, AVX-512 instructions can do it faster than legacy AVX instructions.
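To make the layout point concrete, a minimal sketch (not project code, hypothetical type names) of the two complex-number layouts: with std::complex the real and imaginary parts are interleaved in memory and must be shuffled apart before SIMD arithmetic, whereas separate real/imaginary arrays can be loaded straight into vector registers.

#include <complex>
#include <vector>

// AoS: re0, im0, re1, im1, ... - needs deinterleaving (shuffles) before SIMD math
using ComplexAoS = std::vector< std::complex<double> >;

// SoA: all real parts contiguous, all imaginary parts contiguous - SIMD lanes load directly
struct ComplexSoA
{
  std::vector<double> re;
  std::vector<double> im;
};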

TL;DR AVX-512 is a bit of a scam when it comes to number-crunching power, as it only brings you the expected 2x performance benefit on specific high-end CPUs with two 512-bit FMA units. But it has other benefits that can improve the performance of code that's limited by something other than raw FLOPs.


lfield commented Jun 1, 2021

Sorry for the naive question, but isn't AVX512 with 256-bit registers the same as AVX2? Looking at VTune, I can see that the average CPU frequency decreases from 2.4 GHz to 2.2 GHz. I also noticed that I can't see any FMA instructions with the 512z build. From what I can see here, FMA is provided by the AVX512IFMA extension; however, the CPU that I am using does not provide this. It is a Xeon Silver 4216, released in Q2 2019, so reasonably recent.


HadrienG2 commented Jun 1, 2021

Sorry for the naive question but isn't AVX512 with 256-bit registers the same as AVX2?

No, because it has access to AVX-512 specific instructions (more efficient shuffling of data across registers, better ways to mask unwanted operations when control flow diverges... see https://en.wikipedia.org/wiki/AVX-512 for an extended list) and 16 more ymm16..ymm31 architectural 256-bit registers (which allows you or the compiler to cache more frequently used data in registers instead of going through L1 cache loads/stores all the time).

Looking at VTune, I can see that the average CPU frequency decreases from 2.4GHz to 2.2GHz.

This is expected. To prevent overheating, CPUs downclock when wider vectorization is used (or, equivalently, overclock when narrower vectorization is used; take it in whichever sense thou wilt). The rules for CPU frequency scaling are a bit complicated and have changed a lot over the history of Intel processors, but this page is a good starting point if you want to dig into this further: https://en.wikichip.org/wiki/intel/frequency_behavior .

I also noticed that I can't see any FMA instructions with the 512z build. From what I can see here, FMA is provided by the AVX512IFMA extension, however, the CPU that I am using does not provide this. It is a Xeon Silver 4216 release in Q2 2019 so reasonably recent.

As the name suggests, AVX512IFMA is about integer fused multiply-add, which is somewhat niche. From previous discussions, I'm assuming that you're doing floating-point computations here, for which FMA operation should be part of the basic AVX512F ("Foundation") instruction set.

Therefore, if you're not seeing any FMA in your code, there must be another reason. Maybe your code does not contain clear multiply-add sequences; maybe the compiler does not feel like turning them into FMAs, as that can change numerical results (Clang cares about this unless -ffast-math is on; GCC doesn't, on the grounds that changing results in a direction that increases precision is okay)... or it can be a few more exotic things.

In any case, note that the FMA units are also used (albeit obviously less efficiently) for individual add/sub/mul operations. So if your code only uses those, you can still aim for a peak floating-point throughput of two vector add/sub/mul per cycle on CPUs with two FMA units, which is twice less than what you can get with FMA but still very much honorable :)
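As a small standalone illustration of the contraction point (not project code; the exact flags needed depend on the compiler and its version), a plain multiply-add like the one below only shows up as a vfmadd* instruction if the target enables FMA (e.g. -mfma or -march=skylake-avx512) and the compiler is allowed to contract the expression, as discussed above:

// compile e.g. with: g++ -O3 -march=skylake-avx512 -c fma_test.cc && objdump -d fma_test.o | grep fmadd
double muladd( double a, double b, double c )
{
  return a * b + c; // candidate for contraction into a single fused multiply-add
}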


lfield commented Jun 1, 2021

I don't understand then why the 512z build is not using FMA, as it is there in the 512y build. The code and build instructions seem to be identical; the only difference is the addition of -mprefer-vector-width=256 for the 512z build.

It looks like those AVX-512 specific instructions are doing something good as I see a 20% speed-up with the 512y build.


valassi commented Jun 1, 2021

Hi, first of all thanks @HadrienG2 ! I will need some time to digest all this information :-)
But indeed the main point seems to be that the port 5 unit is a real AVX512 unit, and you can only use it for that. I should soon get hold of a Gold Skylake to compare to my previous Silver Skylake; then we'll see...

On the question from @lfield: it took me some time to even be bothered to try out AVX512 with 256-bit width, because I could not understand why it would work. This is a very good suggestion that @sponce had given me, and in the end I tried it and it works! A few technical details:

  • The specific reason why "512y" (AVX512 limited to ymm registers) is faster than AVX2 (AVX2 on the same ymm registers) is that there are a few more instructions executed. If I understand correctly, these are "AVX512VL" instructions, which are specifically meant to backport some newer functionality of AVX512 from zmm to xmm and ymm registers. I have disassembled the code to check the differences, and then I made some scripts with regex to count them. See the links here More complete analysis of AVX512 in both gcc and clang #173 (comment) and the post before it. In my categorization:
AVX2 build
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
AVX512/512y build
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
AVX512/512z build
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)

Those 95 symbols give the speedup over AVX2. This is a very home-made categorization, no guarantee it is exact, but it gives an idea. Those symbols are matched here

cnt512y=$(stripSyms '^v.*dqa(32|64).*(x|y)mm' '^v.*(32|64)x2.*(x|y)mm' '^vpcmpneqq.*(x|y)mm')
i.e. '^v.*dqa(32|64).*(x|y)mm' '^v.*(32|64)x2.*(x|y)mm' '^vpcmpneqq.*(x|y)mm'. In particular, note that these are NOT fma instructions. I mean, there are a lot of fma instructions in all builds, but the difference between AVX2 and 512y is not related to those (example, vfmadd132pd 0x160(%rdi),%ymm1,%ymm9 is in AVX2 and something similar in 512y too)

  • About the clock speed, indeed there is a reduction for AVX512/512z. Thanks for confirming also with vtune! In the end I am checking this using perf. Again from the previous post:
AVX512/512y build
     9,047,092,255      cycles                    #    2.529 GHz
AVX512/512z build
     8,829,184,043      cycles                    #    2.213 GHz

I investigated various options, with help from your colleagues in IT-CM (https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF1783725). I wondered if the cpufreq mechanism would work, but with VMs it is tough. Instead they enabled CPU passthrough on this VM, so I can use perf, and this looks enough.

  • About the actual build of the code: indeed -mprefer-vector-width=256 is added for AVX512/512y, but this is NOT what explains the difference. I imagine that this might make a difference if you relied completely on autovectorization, but I do not: instead, I rely on compiler vector extensions and I explicitly hardcode the vector lengths I want to use. Since -mprefer-vector-width=256 does NOT change the preprocessor macros available in the code (check with the usual g++ -E -dM: you will see that the output is identical with and without -mprefer-vector-width=256), I solve the issue by adding another definition, MGONGPU_PVW512, see the excerpt below (and the simplified sketch after it):
    #if defined __AVX512VL__
    #define MGONGPU_CPPSIMD 1
    #ifdef MGONGPU_PVW512
    const int neppM = 64/sizeof(fptype); // "512z" AVX512 with 512 width (512-bit ie 64-byte): 8 (DOUBLE) or 16 (FLOAT)
    #else
    const int neppM = 32/sizeof(fptype); // "512y" AVX512 with 256 width (256-bit ie 32-byte): 4 (DOUBLE) or 8 (FLOAT) [gcc DEFAULT]
    #endif
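In simplified form (just the idea, not the actual project source), the compiler vector extension approach looks like this: the SIMD width is hardcoded per build via the vector_size attribute, so the compiler emits ymm or zmm instructions for ordinary arithmetic on the vector type, without relying on autovectorization.

typedef double fptype;
#ifdef MGONGPU_PVW512
typedef fptype fptype_v __attribute__ ((vector_size (64))); // "512z": 8 doubles per vector (zmm)
#else
typedef fptype fptype_v __attribute__ ((vector_size (32))); // "512y"/avx2: 4 doubles per vector (ymm)
#endif

// ordinary arithmetic on fptype_v is vectorized element-wise by the compiler
inline fptype_v fpmuladd( const fptype_v a, const fptype_v b, const fptype_v c )
{
  return a * b + c; // emitted as packed mul/add (or fma) on the chosen register width
}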

Cheers Andrea


lfield commented Jun 8, 2021

I haven't tried on AMD but, from what I have seen, I generally agree. AVX512 with 256-bit registers seems to be the best. Linus doesn't like it (article).


valassi commented Jun 8, 2021 via email

@hageboeck

I do not understand your last point on cores: in any case you can use AVX2 and AVX512 on the same number of cores, I see no difference there.

That's about AMD deciding not to go for anything more complicated than AVX2, but putting more cores on a chip instead. I don't know where Intel's top models are, but the Threadrippers have 64 cores on a chip that can each do AVX2. A quick search for Xeon W resulted in 28 cores max.

@HadrienG2

@hageboeck I'd agree with your general sentiment that if, as developers, we got to choose what WLCG put on their grid sites, AMD + AVX2 would be an infinitely better choice than Intel + AVX-512 in pretty much every respect.

But personally, I don't get to choose the hardware, and I know that many grid sites have bought or are still buying Intel chips. So the route that I personally take is "make it work well on AVX2 first, and then make it work well on AVX-512 as well if you have extra time".


hageboeck commented Jun 8, 2021

"make it work well on AVX2 first, and then make it work well on AVX-512 as well if you have extra time".

That's a good approach. I made it compile with AVX512 and stopped.


lfield commented Jun 9, 2021

  • AVX512 with 512 bits has a theoretical factor 2 advantage over AVX512 with 256 bits, and it is this extra factor I am still looking for: I thought maybe Gold vs Silver would give it (2 FMA units against 1), but I see no improvement

I think that due to the dynamic frequency behavior, a theoretical factor 2 is not achievable even for 1 thread. AFAIU, the frequency is for the CPU and not per core. So if you have 16 cores and run just one thread using AVX512, all other cores will be slowed down too. In the example, the frequency drops from 3.5 GHz to 1.9GHz if all cores are used, hence the real performance degrades further as you run more work.


valassi commented Jul 23, 2021

By chance, last week I ran some tests on the CORI HPC at NERSC, see PR #236. This is on a Xeon Gold 6148 CPU, so it is not meant to be very different from the Xeon Gold 6130 that I had tried previously. (See for instance https://colfaxresearch.com/xeon-2017.)

Surprisingly, however, on this system I finally saw a benefit of 512-bit AVX512 "512z" over 256-bit AVX512 "512y". The throughput increase is around 10% for double and around 30% for float. Still less than the nominal factor 2, but taking into account the clock slowdown this looks quite good. Unfortunately I did not have perf, and I forgot to also run my tests with aggressive inlining or LTO (issue #229).

Note also that this CPU is 30-40% faster than those I previously tried. Both my usual Silver 4216 and my test Gold 6130 gave 1.31E6 throughput in scalar mode, while this new Gold 6148 gives 1.73E6 in scalar mode, so there is clearly something else which has changed and could be very relevant.

Apart from the processor speed itself, which might be ~15% higher, the 6148 has a larger L2 cache and a larger L3 cache, see https://versus.com/en/intel-xeon-gold-6130-vs-intel-xeon-gold-6148. It could be interesting to understand better if and how this is relevant in these tests.


srlantz commented Jul 24, 2021

(This is a long email that I wrote to @valassi in response to an inquiry from him. I am posting a slightly redacted version here at his request.)

The mkFit project (https://github.com/trackreco/mkFit, https://trackreco.github.io/) has gained quite a bit of experience with AVX-512 in the context of vectorized, parallelized tracking. And we too have been confronted from time to time with puzzles in the vector speedup results. Apparently you are already aware of one that stumped me for a while, namely the handicap of most Silver and even some Gold Xeon processors that have just one AVX-512 VPU per core instead of two. (Don't worry, the Gold 6130 and 6148 have two.)

My main reaction to [this thread] is that it would be good for you to try the Intel C/C++ compiler instead of gcc. The Intel C/C++ “Classic” Compiler is now free to download through the Intel oneAPI HPC Toolkit. (You’ll need the Base Toolkit as well. Be sure to use icc/icpc rather than icx/icpx, which is based on LLVM.) We find that the Intel classic compilers do a much better job in getting good AVX-512 performance. Note, you do have to set -qopt-zmm-usage=high to enable zmm usage for Xeon.

The question of why gcc does so much worse with AVX-512 is a difficult one to answer, in general. In our case, we think it is partly due to a failure to vectorize sin/cos, which icc does via Intel SVML. There are somewhat subtle reasons for why gcc has trouble with this, having to do with libmvec, glibc, and -ffast-math.

Even with the Intel compilers, however, it is difficult to get full AVX-512 speedup. As you point out, dynamic frequency scaling penalizes AVX-512z relative to AVX2 or AVX-512y when all cores are in use. (Note, this is NOT true for single-core tests.) Therefore, we have recently moved away from AVX-512 entirely, even with Intel compilers, as it seems to provide little benefit when frequency scaling is factored in.

I looked through the comments in [this issue] and I would like to highlight the comments by @sponce. The part that makes vector speedup tough is Amdahl’s Law, which enforces a plateau in the speedup due to the presence of “serial code”. Frequency scaling can turn this plateau into a slowdown, even for code that “to the eye” is CPU-bound. Instruction scheduling may in effect serialize the instruction stream at certain points. For example, format conversions (e.g., double to float) should be avoided as they interrupt the vector pipeline. Other aspects of instruction scheduling may be baked into the hardware and out of the programmer’s control.

I’d like to point out that even the unavoidable act of incrementing a loop counter is “serial code”. The only way to mitigate against it is to unroll the loop to a high degree, making the pipeline as deep as possible before needing to interrupt it for the compare-and-branch of the loop. But the number of registers available for such unrolling and pipelining is not infinite. (It’s possible that good zmm register usage is what makes icc cleverer than gcc.)

I don’t think that Amdahl’s Law is the complete explanation in your case, though, based on your data. Let’s postulate that the 3.5x speedup you observe for AVX2 is limited entirely by Amdahl. If we take S(N) = 1/(1-P+P/N) and solve it for parallel fraction P based on speedup S(N), we get P = (1-1/S(N))/(1-1/N); thus, given S(N)=3.5, we get P=0.95. That implies S(8) should be around 5.9, but even slowing it down by 30% for frequency scaling doesn’t make it less than 3.5. Of course, this approach may be too simplistic to give any real insight.
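
For concreteness, the arithmetic above can be reproduced with a few lines (a self-contained sketch, not project code):

#include <cstdio>
// Amdahl's law: S(N) = 1 / (1 - P + P/N).
// Inverting it gives the parallel fraction: P = (1 - 1/S) / (1 - 1/N).
int main()
{
  const double N = 4.0, S = 3.5;                 // measured: ~3.5x with 4-wide vectors
  const double P = (1.0 - 1.0 / S) / (1.0 - 1.0 / N);
  const double S8 = 1.0 / (1.0 - P + P / 8.0);   // predicted speedup for 8-wide vectors
  std::printf("P = %.3f, S(8) = %.2f\n", P, S8); // P ~ 0.95, S(8) ~ 6.0
  return 0;
}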

To disentangle dynamic frequency scaling from other effects, another thing we have done is to disable Turbo Boost and look at speedup curves with the frequency scaling (mostly) removed. To shut off turbo, you can do this as root: “echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo”. But AVX-512 runs so hot that when all cores are active there is some processor slowdown even with turbo off.

The discussion in the GitHub issue doesn’t say much about how many threads are used in your tests. For sure, hyperthreading is not your friend for CPU-bound codes. But the thread count is integral to understanding the effects of frequency scaling [as the effects are much more pronounced when all cores are active].

By the way, if it is possible for you to pin threads to cores, do so. That can make a huge difference. Unfortunately, our code is multithreaded with TBB and I am not aware of a way to control thread pinning with it.
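
For what it is worth, on Linux a thread can be pinned to a core with pthread_setaffinity_np; a minimal sketch follows (hypothetical helper, not from either project):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU; returns 0 on success, an errno value otherwise.
int pinCurrentThread(int cpu)
{
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}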

Looking at other possible factors, one that doesn’t work (to my mind) is cache size. I believe the L2 and L3 cache per core is actually the same for Gold 6130 and 6148.

I guess overall my recommendation is to try icc and compare its disassembly with that from gcc. If you download the oneAPI toolkits, you will also get access to Intel Advisor. This tool can produce a roofline plot in which the arithmetic intensity is calculated for all the key loops and functions, and each one is represented by its own Gflop/s point. Advisor also lets you drill down into the source to see the instructions generated by each line and analysis of how well each loop was vectorized. The code does not have to be compiled with icc to use the toolkit… oh, looking again at [this issue], I see Laurence is already aware of this tool, great. Another good tool for looking at assembler output from a wide range of compilers is godbolt.org.

In general the discussion in the GitHub thread is extremely well-informed and I don’t know that there is a whole lot more that I can add to it, beyond what I have said above. One correction, though, is that Skylake Xeons can adjust frequency on a per-core basis, I believe.

Anyway, I hope that something in the above brain dump will prove helpful to you.

@valassi

valassi commented Jul 25, 2021

Hi @srlantz , thank you so much again for all your useful feedback! (Copying also @roiser and @oliviermattelaer explicitly for info).

Apart from other points we discussed privately, here's a few points that I note down as some TODO for us in this investigation of AVX512/zmm:

  • Definitely we should try icc again. Last week I moved from icc to the clang-based icx (issue Build and test using the Intel compiler #220), where I got interesting performance, comparable to (or even better than) that of clang. My main problem with icc is that our vectorization now uses compiler vector extensions (CVE, e.g. the vector_size attribute; a minimal sketch is given after this list), and vectorizing in icc would imply a significant software rewrite, for instance with pragma omp simd or with intrinsics (as in your code https://github.com/trackreco/mkFit/blob/devel/Matriplex/Matriplex.h). Eventually I expect icx to beat the performance of icc, but in the meantime trying out icc sounds useful. For the record: we already use fast math everywhere now (issue 1.5x speedup from gcc8 to gcc9 in C++ (__muldc3 overhead in gcc8 - for complex numbers?) #117, PR Fast math in fortran and c++ #128). Also, we have no sin/cos functions, but we do have sqrt: this is why CVEs are enough for us and we do not need complex dependencies like Vc.

  • Your suggestion to use Advisor and produce roofline plots with arithmetic intensity displays is also very useful.

  • You definitely have a point about the interplay of multithreading with these studies, especially about dynamic frequency scaling. The tests I typically take as reference are all single threaded. We should certainly investigate again the effect of vectorization levels when we also have multithreading (possibly an improved MT version better than the current OMP prototype, issue OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 (eemumu, May 2021 software) #196 - where we also started some tests of NUMA pinning). Or another possibility is to study the effect of software vectorization and/or multithreading with Turbo disabled, as you explain in your paper https://arxiv.org/abs/2006.00071.

  • We should understand a bit better how Amdahl applies to us, i.e. how much serial and parallel code we have. We should produce plots like those in your https://arxiv.org/abs/2006.00071, with superposed predictions for some parallel fractions; I find those plots very useful. In any case our parallel/serial ratio is affected by many factors. For instance, we should repeat these tests for both simple (eemumu) and more complex (ggttgg) physics processes (when the latter are vectorized, which is not yet the case, see issue (Cancelled) Manual vectorization of ggtt(g(g))? #241). Also, we should repeat these tests with/without aggressive inlining or LTO (issue Investigate link time optimizations (-flto) and inlining #229).

  • One point that still remains mysterious to me is why we did get AVX512/zmm to beat AVX512/ymm on a Gold 6148 (PR #236), but not on a Gold 6130. Unfortunately we do not have regular access to these machines (especially the Gold 6148), and we cannot really investigate this further for the moment. One thing I just noticed though is that on the Gold 6148 I used gcc10 and not gcc9; I will quickly test if this makes a difference. Interesting in any case that you confirm that for mkFit you did observe a difference between Gold and Silver.
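
For reference, here is a minimal sketch of the compiler-vector-extension approach mentioned in the first bullet above (illustrative names only, not the actual madgraph4gpu types):

// Minimal sketch of gcc/clang compiler vector extensions (CVE); names are hypothetical.
typedef double fptype_v __attribute__((vector_size(4 * sizeof(double)))); // 4 doubles, 256 bits

// Arithmetic operators act element-wise on the whole vector and map to SIMD instructions;
// with -mfma (and fast math) the compiler may contract this into a single vfmadd.
inline fptype_v fma_v(const fptype_v& a, const fptype_v& b, const fptype_v& c)
{
  return a * b + c;
}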

I also found a couple of other links related to your project, very interesting, more food for thought for us:

Well this gives us a lot of things to think about and try out. Thanks again Steve for the feedback! Best, Andrea

@valassi

valassi commented Jul 25, 2021

PS My tests on a Gold 6130 were certainly with gcc9, see PR #209 and commit 7a1c112

@srlantz

srlantz commented Jul 28, 2021

(Again posting comments that were initially made to Andrea via email, with some edits.)

The Matriplex code in mkFit includes implementations of key routines in intrinsics, both AVX-512 and AVX2 (with FMA). When we first started the mkFit project we were interested in Xeon Phi, and at that time, the intrinsics were pretty essential to getting top performance. These days we find that auto-vectorization by the compilers (Intel especially) has caught up to a large extent, so intrinsics no longer confer such a big advantage. Still, I think the code has some interesting uses of the vgather-type intrinsics.

Anyway, this is why we never bothered with vector extensions. Instead, our Matriplex classes define certain data members that are just flat C-style arrays. This provides plenty of opportunities for the compiler to vectorize simple loops in the class’s methods, even without resorting to our intrinsics-based implementations that can be enabled via ifdef’s.
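
A minimal sketch of the flat-array idea described above (illustrative only, not the actual Matriplex code):

// N small matrices of size D1 x D2 stored contiguously, so that simple per-element
// loops read and write consecutive memory and are easy for the compiler to auto-vectorize.
template <typename T, int D1, int D2, int N>
struct MatriplexSketch
{
  T fArray[D1 * D2 * N];

  void add(const MatriplexSketch& a, const MatriplexSketch& b)
  {
    for (int i = 0; i < D1 * D2 * N; ++i)
      fArray[i] = a.fArray[i] + b.fArray[i];
  }
};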

I have no idea how Intel’s C++ compilers will evolve in the future, but as of right now, the icc classic line does far better at code optimization than the icx or clang line, so we are not planning to switch anytime soon. I can send you some timing comparisons for the mkFit code if you are interested.

Yes, I am still interested in approaches like Vc and VCL, due to the desire to have vectorized trig functions when compiling with gcc (unless libmvec turns out to be sufficient).

The idea of compiling in such a way as to produce AVX-512VL instructions is interesting. I think this is what one gets with icc when one uses -xCORE-AVX512 without -qopt-zmm-usage=high (which I have tried), but I will have to confirm.

Finally, if you are doing single-threaded tests, then your results are not explained at all by dynamic frequency scaling. On 1 core, Turbo Boost is able to speed up AVX-512 nearly as much as AVX2 and even SSE.

(from a follow-on message to Andrea...)

I would be particularly interested to know if the performance improvement you saw with AVX-512 on the Gold 6148 was really due to the different hardware, or due to the use of gcc-10 rather than 9 (as you mentioned on GH). A few tests with gcc 10 on your Gold 6130 machine should let you know.

I ought to mention that our usual test machine does in fact have dual Gold 6130 processors... In the past, I have done some performance tests on other Skylake models, including at least one Platinum, and I did see better performance than could be explained purely by the ratio of clock speeds. However, it was not as dramatic as what you observed. So I might lean more toward the gcc-10 explanation in your case.

Note that the OS can make a difference too, in that libc is baked into the OS. I have found that CentOS 7 is based on glibc 2.17, while CentOS 8 has glibc 2.31. This matters because libmvec (which can be linked by gcc >= 8.1) is not available prior to glibc 2.22 (https://sourceware.org/glibc/wiki/libmvec). I don’t know if this makes a difference to your sqrt() or not, or how important that would be in your case.
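
As an aside, this is a minimal example of the kind of loop that gcc can vectorize through libmvec when glibc is recent enough (a sketch, not project code; assumes -O3 -ffast-math and a suitable -march):

#include <cmath>

// With glibc >= 2.22 and gcc >= 8.1, -O3 -ffast-math -march=... can let gcc replace
// the scalar sin calls in this loop with calls to libmvec's vectorized variants.
void sines(double* __restrict out, const double* __restrict in, int n)
{
  for (int i = 0; i < n; ++i)
    out[i] = std::sin(in[i]);
}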

@srlantz

srlantz commented Jul 28, 2021

One update to the above is that I just tried a test with the "512y" idea, i.e., options -march=skylake-avx512 -mprefer-vector-width=256, using gcc 9.2.1. I found that the gcc-based performance of mkFit was more characteristic of previous tests with AVX-512 (i.e., slower) than pure AVX2. This was on a dual Gold 6142 with Turbo Boost enabled, running 64 threads on 32 cores, while handling up to 32 events in parallel. In a particular test scenario, the event loop ran in 14.0 sec. vs. 10.8 sec. for pure AVX2 (-mavx2 -mfma). This result highlights what I was saying about the difference between AVX-512 and AVX2 being far more dramatic with all cores active.

In contrast, the same code compiled with icc runs the same event loop in 7.3 sec. for AVX2, and pretty nearly the same for AVX-512.

@ivorobts

I want to add that the known issue of a significant frequency drop with AVX512z (AVX512 with 512-bit zmm registers), and the resulting performance drop, has been improved in the latest Intel CPUs, e.g. Icelake and the Sunny Cove core microarchitecture.
Please refer to this article for more details: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
Note that the situation is even better with latest Willow Cove Core microarchitecture (available in Tiger Lake client processors).
My recommendation is to run your tests on the latest Icelake processors; e.g. on Intel DevCloud a few such machines are available.

@valassi

valassi commented Sep 6, 2021

Thanks a lot @srlantz and @ivorobts for the info and suggestions! I will definitely put on my todo list some tests with all cores active, and also some tests on IceLake or more recent chips (Intel DevCloud also sounds like a nice suggestion).

I also still need to complete and document some tests that I did on gcc10 two months ago, as I had discussed with Steve.

@valassi

valassi commented Sep 24, 2021

I have made some tests on an Icelake Platinum.
Unfortunately the results (on a single core) are fluctuating enormously, much more than on other chips I tested.

So it is difficult to say anything concrete about AVX512 with 512bit width.
It looks as if there is (as on Cori) some improvement vs AVX512 with 256-bit (512y), but still less than I was expecting.

Again, there is a large effect in my code from using aggressive inlining or not.
And again, I have not used multithreading (mainly because we need to improve its implementation).

@valassi

valassi commented Sep 29, 2021

Hi @lfield I made myself a note (issue #249) about writing down some documentation about how to run our performance tests. Possibly this week or else next week.

@valassi

valassi commented Oct 21, 2021

A brief update on this issue (copying this from the older issue #71 that I just closed).

All results discussed so far in this issue #173 were about the vectorization of the simple physics process e e to mu mu. In epochX (issue #244) I have now backported vectorization to the Python code-generating code, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to the LHC!) ggttgg process, i.e. g g to t t g g (4 particles in the final state instead of two, with QCD rather than QED - more Feynman diagrams and more complex diagrams, hence more CPU/GPU intensive and slower).

The very good news is that I observe similar speedups there, or even slightly better. With respect to basic C++ with no SIMD, I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float.

I also tested more precisely the effect of aggressive inlining (issue #229), mimicking LTO link-time optimization. This seemed to give large performance boosts for the simpler eemumu (for reasons that I had not fully understood), but for the more complex/realistic ggttgg it seems at best irrelevant, if not counterproductive. This is an optional feature, and I will keep it disabled by default.

The details are below. See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto

DOUBLE

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.809004e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.809004e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.490002e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.490002e+03                 )  sec^-1

For double, INLINING does not pay off, either without or with SIMD: it is worse than no inlining. What is interesting is that 512z is better than 512y in that case.

FLOAT

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.793838e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.793838e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.397076e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.397076e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 7775) (512y:   29) (512z:    0)

512z SIMD, INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.391933e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.391933e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 4075) (512y:    7) (512z:39296)

Note en passant that I moved from gcc 9.2 to gcc 10.3 for all these results. But here I am still on a Xeon Silver.

Concerning the specific issue of AVX512 with 512bit width zmm registers ("512z") discussed in this thread #173, the results are essentially unchanged.

  • The fastest option is using 512y (256bit width ymm with AVX512) rather than 512z, for both double and float. In both cases, the fastest option is not to use inlining.
  • For double, without inlining, 512y gives a ~4.2 speedup over no SIMD. Instead 512z is always worse.
  • For float, without inlining, 512y gives a ~7.7 speedup over no SIMD. 512z is worse without inlining, while 512z with inlining gives the same total throughput as 512y without inlining (but I will still prefer 512y without inlining in this case).
  • For our eventual production usage, we should compare these numbers to the Fortran calculation (which seemed initially to be twice as fast as C++ with no SIMD??), also after reviewing all build options like -O3 and fast math. This is planned...

Now that I have ggttgg vectorized, at some point I will rerun the same tests on other machines, including Xeon Platinum or Skylake. I need to document how to run the epochX tests for ggttgg, but it is essentially the same as the epoch1 tests for eemumu.

@valassi

valassi commented Feb 18, 2022

I have some interesting results from tests at the Juwels HPC (thanks @roiser for getting the resources!). These were merged in PR #381. Specifically, the results are here

On jwlogin05.juwels [CPU: Intel(R) Xeon(R) Gold 6148 CPU] [GPU: none]:
(look only at HELINL=0: the other table HELINL=1 with aggressive inlining is just there for reference and has worse results)

*** FPTYPE=d ******************************************************************
+++ REVISION 65730b2 +++
On jwlogin05.juwels [CPU: Intel(R) Xeon(R) Gold 6148 CPU] [GPU: none]:

[gcc 11.2.0] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    --------    --------    --------    --------    --------    
CPP/none    2.17e+06    2.62e+05    3.35e+04    2.39e+03    1.00e+02    
CPP/sse4    4.01e+06    4.36e+05    6.22e+04    4.59e+03    1.80e+02    
CPP/avx2    8.26e+06    8.57e+05    1.34e+05    1.06e+04    4.04e+02    
CPP/512y    8.48e+06    9.51e+05    1.45e+05    1.15e+04    4.51e+02    
CPP/512z    1.10e+07    9.39e+05    2.09e+05    1.96e+04    7.70e+02    

*** FPTYPE=f ******************************************************************
+++ REVISION 65730b2 +++
On jwlogin05.juwels [CPU: Intel(R) Xeon(R) Gold 6148 CPU] [GPU: none]:

[gcc 11.2.0] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    --------    --------    --------    --------    --------    
CPP/none    2.27e+06    2.72e+05    3.38e+04    2.50e+03    1.09e+02    
CPP/sse4    8.23e+06    6.71e+05    1.11e+05    9.42e+03    4.04e+02    
CPP/avx2    1.75e+07    1.52e+06    2.70e+05    2.15e+04    8.18e+02    
CPP/512y    1.88e+07    1.73e+06    2.96e+05    2.28e+04    9.06e+02    
CPP/512z    2.44e+07    1.97e+06    4.35e+05    4.03e+04    1.57e+03    

Look at the most complex physics process considered, gg to ttggg, the last column on the right:

  • in double precision the throughput goes from 1.00E2 to 7.70E2 with AVX512/zmm (x7.7 speedup)
  • in single precision the throughput goes from 1.09E2 to 1.57E3 with AVX512/zmm (x14.4 speedup)

With respect to AVX2 or AVX512/ymm, these throughputs are now essentially a factor 2 better. This seems to confirm @HadrienG2's initial suggestion that these higher-end processors have two FMA units and that this helps exploit zmm computations.

Note that this is on a Juwels Cluster login node! On Juwels Cluster compute nodes, the CPUs are even higher end (Platinum, instead of Gold), so I would expect similar results if not better.

These are still results for single-threaded computations. Clearly when we fill up the machine with several threads the results may change, but this is the first solid evidence I see that speedups close to x8(double) or x16(float) can be achieved. Previously I was always quoting x4(double) and x8(float) from AVX512/ymm. Promising results...

@valassi

valassi commented Feb 20, 2023

I come back to this issue after quite some time. I confirm that I was able to obtain the full theoretical speedup of x8 for doubles and x16 for floats on Intel Gold nodes, which I believe have two FMA units.

The results have been confirmed for one CPU core. When going to several CPU cores, the speedup is lower than a full x8 or x16 - this might be the effect of clock slowdown, although I have not fully investigated that. I keep this issue open for the moment, just with the idea of investigating that.

In any case, it is clear that "512z", i.e. AVX512 with 512-bit zmm registers, is faster than "512y", i.e. AVX512 with 256-bit ymm registers, on Gold nodes with two FMA units. On Silver nodes, conversely, 512y is still faster than 512z (and also faster than avx2).

The results on one core are published at https://doi.org/10.22323/1.414.0212. They also include the effect on a full workflow (where the speedup is - for the moment - lower than x8 and x16 due to Amdahl).

The results on many cores were presented at ACAT https://indico.cern.ch/event/1106990/contributions/4997226/

Thanks again to all of you in this thread for a lot of relevant feedback to get to this point!
