More complete analysis of AVX512 in both gcc and clang #173
Just a comment on
Yes, essentially. Of course, as you discuss afterwards, you may gain back something, since you can do twice as many operations per cycle; but when you initially lose 30% of the clock speed, you can compute from Amdahl's law that it is quite a challenge for standard code to come out faster in the end. I have actually never seen it happen in HEP. Concerning the clock frequency reduction when using full AVX512: it depends on your processor, but you have to know that any AVX512 instruction will slow down the clock for roughly the next millisecond, yes, for millions of instructions. So basically you are running at the slow clock all the time.
…adgraph5#173)

On itscrd03.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.238520e+08 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 0.781936 sec
     2,522,542,287 cycles       # 2.658 GHz
     3,487,190,786 instructions # 1.38 insn per cycle
       1.070591449 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.312061e+06 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 7.151202 sec
    19,150,190,925 cycles       # 2.676 GHz
    48,624,130,145 instructions # 2.54 insn per cycle
       7.160649288 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.522426e+06 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 4.855343 sec
    12,990,550,722 cycles       # 2.672 GHz
    29,947,264,265 instructions # 2.31 insn per cycle
       4.864907320 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.558120e+06 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 3.701845 sec
     9,362,892,708 cycles       # 2.525 GHz
    16,560,124,475 instructions # 1.77 insn per cycle
       3.711390559 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.913520e+06 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 3.570205 sec
     9,047,092,255 cycles       # 2.529 GHz
    16,496,830,998 instructions # 1.82 insn per cycle
       3.580131314 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.754219e+06 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 3.983833 sec
     8,829,184,043 cycles       # 2.213 GHz
    13,360,526,672 instructions # 1.51 insn per cycle
       3.993458249 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
-------------------------------------------------------------------------
Hi @sponce, thanks a lot for the feedback! My point is that the numbers I am quoting (MEs/sec) ONLY refer to the part which is vectorized. To give an example, I have just run from the latest master, ce00881.
The throughput is computed from the "TOTAL(3)" timer, which covers sigmaKin and does the matrix element calculations. True, there are other things before and afterwards, but the throughput number we see is not affected. There are three things that I was considering in order to understand this better.
In the commit I mentioned above, I gave some first results from perf stat in the log. They are quite interesting.
One sees that
Anyway... some more studies to do. I just wanted to dump some info here. @lfield: I suggest you take the current master, or the commit I mention above, and try it out. The instructions for the build are in issue #177 (ignore the suggestion at the end about building a single binary - that is where I documented the different binaries we have now)... thanks! Andrea
Attaching the five objdumps of one of the simplest functions, FFV1_0, from the five builds:
CPPProcess.o.objdump.FFV1_0.avx2.txt
CPPProcess.o.objdump.FFV1_0.none.txt
CPPProcess.o.objdump.FFV1_0.sse4.txt
CPPProcess.o.objdump.FFV1_0.512y.txt
CPPProcess.o.objdump.FFV1_0.512z.txt
This was from
From a quick look, you get very typical numbers in terms of frequency reduction and gain in the number of instructions. Hence my comment that "it's quite a challenge for standard code to be faster at the end". Note that by "standard" I did not mean "standard HEP", but rather good to excellent code from a HEP point of view :-) Now it is clear that there are reasons that you can analyze and address, so that you finally gain. But definitely not something that you can do broadly for all your code.
Thanks Sebastien :-) One thing I would find useful is to get 'perf stat' counters only for the part that is relevant. I opened #190 on this. If you have any suggestions they are most welcome :-) Thanks
Note that all the results are for the epoch1/cuda/ee_mumu code running on my 4-core NUC with 5 threads, using the parameters 2048 256 12 and the Common Random Numbers method. Code from a few months ago:
Current master:
Current master with SSE4:
Current master with AVX2:
Advisor has the following recommendations:
I haven't tried AVX512 as my NUC doesn't support it. I will try it on some other hardware and post the results.
I also ran the AVX2 build through the Application Performance Snapshot:
"Your application might underutilize the available logical CPU cores because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel® VTune™ Profiler to discover why the CPU is underutilized." |
From your numbers, there is a hint of why AVX512 does not perform so well: Amdahl! Indeed, your serial code jumps from 6% in the non-vectorized build to 13% in AVX2 and 27% in AVX512. No surprise here. These numbers are actually already very good; 6% is quite low. Now it also means that the hint from the Application Performance Snapshot is correct: you are limited by your parallelization level, or rather by your vectorization level. And finally it means that "Use the smallest data type", although very good advice, may not bring that much, as it will increase vectorization even further and probably lead to ~50% of the time in the serialized part. Still probably a 20% gain, so I would definitely do it! It's only about changing double to float everywhere and checking that your results are still correct. Also I would check how much time you spend in
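To make the Amdahl argument above explicit (this is only a restatement of the quoted 6%/27% figures, with s denoting the serial fraction of the run and N the vector width):

\[
S(N) = \frac{1}{s + (1-s)/N}, \qquad S_{\max} = \frac{1}{s}
\]
\[
s = 0.06 \;\Rightarrow\; S_{\max} \approx 16.7, \qquad
s_{\mathrm{AVX512}} = 0.27 \;\Rightarrow\; \text{at most another } \tfrac{1}{0.27} \approx 3.7\times \text{ over the AVX512 run.}
\]

In other words, once 27% of the AVX512 run is serial, even an infinitely fast vector part cannot gain more than a further factor ~3.7 over that run, whatever the register width.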
Hi, thanks a lot @lfield and @sponce ! Very interesting :-) Ok one preliminary point: there are three parts in the code, random numbers, rambo (random numbers to momenta), sigmakin/calculate_wavefunction (momenta to matrix elements, eg FFV_2_4_3). The third part is the only one I vectorized and is generally the largest consumer. The random numbers are either curand on CPU (if cuda is available) or std on CPU (if cuda is not available), but anyway not something we optimize. So:
Thanks again... quite a few things to consider... |
@valassi it says calculate_wavefunctions at HelAmps_sm.cc:175. There are three improvements that could be made here:
Here is a plot for just that function. Note that this is for the AVX2 build.
Hum... in my experience such a statement is the sign of something more important that should be solved the proper way. Let me explain with a typical example I've seen in LHCb. We "need" double precision for matrix inversion, according to popular belief. And indeed, the results change if you switch to float. Now, after AMD came into the game, we also got different results there (even with double precision), so I studied the case and found out what should be expected: unstable mathematical code. Unstable in the sense that you compute 1.00...00x - 1 and thus get x with extremely bad precision, if not completely random. Of course switching to double improves x for one iteration, but in this case we had several iterations leading to garbage whatever happens. Bottom line: the "need" for doubles (or worse) should be read as "we have unstable code to be fixed". In our case, it was matrix diagonalisation for matrices close to the unit matrix, and there are good techniques for this.
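As a small illustration of the catastrophic cancellation described above (a generic sketch, not the LHCb or madgraph code):

```cpp
#include <cstdio>

// Computing (1 + x) - 1 cancels the leading digits: the small residual x is
// recovered with very poor relative precision, or lost entirely.
int main() {
  const double x = 1.0e-8;
  const float  f = (1.0f + static_cast<float>(x)) - 1.0f; // 1+x rounds to 1 in float: x is lost completely
  const double d = (1.0 + x) - 1.0;                       // ok here, but double fails the same way for x ~ 1e-17
  std::printf("true x = %.3e, float recovers %.3e, double recovers %.3e\n", x, static_cast<double>(f), d);
  return 0;
}
```

Switching to double only pushes the problem to smaller x; the robust fix is to reformulate the algorithm, as suggested above.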
I ran the avx2, 512y and 512z builds with Intel Advisor and am more confused than enlightened. Firstly, even when trying to restrict the code to 1 OMP thread, Intel Advisor states that it used 14 CPU threads. This makes it difficult to understand the roofline plots. Secondly, the avx2 build vectorized 2 loops while the 512 builds vectorized 3 loops. The 512z build only used AVX512F_512 while the others also used AVX and FMA. Here are some metrics.
I was looking at the GPU offload modeling in Advisor using the AVX2 build. It identified 12 regions for potential offloading.
However, while the estimated speedup for the accelerated code is ~18x, the Amdahl's law speedup is only 1.8x.
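A back-of-the-envelope reading of those two numbers (my own arithmetic, assuming the 1.8x figure comes from applying Amdahl's law to the offloaded regions): if a fraction P of the runtime is accelerated by 18x, then

\[
\frac{1}{(1-P) + P/18} = 1.8 \;\Rightarrow\; P \approx 0.47,
\]

i.e. Advisor appears to estimate that only about half of the runtime sits inside the 12 offloadable regions.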
Hi Laurence, the common random numbers were not meant to be fast. They are only for testing the accuracy of the computations with stable random numbers. For benchmarks, I recommend "native" random generators that can run on a device, like curand for cuda. If there's nothing like this in SYCL/oneAPI, one could start profiling after the random numbers have been produced.
Hi Laurence, I agree with Stephan.
It would actually be interesting to see if you can run on a machine with curand and use that (on the CPU); I assume nvidia has optimized (and vectorized) it. (Or maybe we can get it from cvmfs as an alternative on no-gpu nodes.) See the minimal cuRAND host sketch after this comment.
For rambo, vectorizing/improving it is on my todo list (#192). It is not central to what we do, but it seems to always get in the way of performance measurements. It cannot harm.
The other two suggestions by advisor that you mention are more relevant, "4 in check.cc, 2 in calculate_wavefunctions (HelAmps)". Can you maybe post details?
Thanks Andrea
PS: By the way, in one thread (private? here?) you mentioned that you seem to see more cores used than expected. Maybe the common random numbers and/or curand are responsible for that; they are external libraries and I do not know if they stick to one thread or try to aggressively use all available cores.
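For reference, here is a minimal sketch of using the cuRAND host API on the CPU, as suggested above (illustrative only: the generator choice, seed and buffer size are arbitrary, and this is not the code used in madgraph4gpu):

```cpp
#include <curand.h>   // cuRAND host API (link with -lcurand)
#include <cstdio>
#include <vector>

int main() {
  const size_t n = 2048 * 256;   // arbitrary number of random doubles for this sketch
  std::vector<double> buf(n);
  curandGenerator_t gen;
  // Host generator: the random numbers are produced on the CPU, no GPU required.
  // CURAND_RNG_PSEUDO_MRG32K3A is an arbitrary generator choice for this example.
  curandCreateGeneratorHost(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
  curandSetPseudoRandomGeneratorSeed(gen, 20210504ULL);
  curandGenerateUniformDouble(gen, buf.data(), n);   // uniform doubles in (0,1]
  curandDestroyGenerator(gen);
  std::printf("first value: %f\n", buf[0]);
  return 0;
}
```

Error checking of the curandStatus_t return codes is omitted here for brevity.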
I know that we are not currently focusing on the random number generation, but I mentioned it for completeness; and yes, maybe the other CPU threads are used by the non-OMP code.
On the AVX512 topic, I just had an extremely interesting discussion with @HadrienG2 (thanks a lot again Hadrien!). He suggests that one possible explanation of the suboptimal speedup from AVX512 is the processor I am using for tests, an Intel Xeon Silver 4216, which according to Intel's specifications has only one AVX-512 FMA unit (source: https://ark.intel.com/content/www/us/en/ark/products/193394/intel-xeon-silver-4216-processor-22m-cache-2-10-ghz.html). If I understand correctly, this single unit fuses two AVX2 units. In this model, I do not get a x2 speedup from AVX512 with respect to AVX2, because with AVX2 I can use ports 0 and 1 to execute two streams of instructions in parallel (instruction level parallelism), while with this fused AVX512 the single unit occupies both ports 0 and 1 by itself. The x2 speedup from AVX512 is therefore lost to a 1/2 factor from pushing half as many instructions through the ports (and ON TOP of that I also get the clock slowdown, so it ends up even slower than AVX2).

Now, Hadrien also suggests that higher-end Lake processors have two AVX512 FMA units, because there is an optional second unit on port 5. (Question: but then, do I not get the same issue, namely that I could use four AVX2 instructions on ports 0, 1, 5, 6, as opposed to only two AVX512 instructions on ports 0+1 and 5+6? Just speculating and inventing a port 6...) Anyway, I will try to find one of these processors, e.g. https://www.intel.com/content/www/us/en/products/sku/198017/intel-core-i910980xe-extreme-edition-processor-24-75m-cache-3-00-ghz/specifications.html?wapkw=i9-10980XE.

(Aside: a nice diagram of Lake architectures: https://en.wikichip.org/w/images/e/ee/skylake_server_block_diagram.svg - this may also be useful to understand SIMD register pressure, e.g. 16 registers per core with AVX2 and 32 with AVX-512.)

(Another aside: to understand the assembly of AVX2 vs AVX512, which is a complementary approach to analysing this issue further, Hadrien suggests this book: https://www.apress.com/gp/book/9781484240625, Modern X86 Assembly Language Programming - Covers x86 64-bit, AVX, AVX2, and AVX-512 | Daniel Kusswurm.)

I will try to find a different, more modern Lake processor and repeat the test... thanks so much again Hadrien!
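As a sanity check of the "one fused FMA unit" picture (my own back-of-the-envelope numbers for peak double-precision throughput per core per cycle, counting 2 flops per FMA):

\[
\begin{aligned}
\text{AVX2, ports 0 and 1:}\quad & 2 \times 4 \times 2 = 16 \ \text{flops/cycle} \\
\text{AVX512, one fused unit (ports 0+1):}\quad & 1 \times 8 \times 2 = 16 \ \text{flops/cycle (no gain, plus the clock penalty)} \\
\text{AVX512, two units (ports 0+1 and 5):}\quad & 2 \times 8 \times 2 = 32 \ \text{flops/cycle (the expected factor 2)}
\end{aligned}
\]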
Random clarifications:
If you look at the bottom of this microarchitecture block diagram, you will see Intel only used the "fuse two 256-bit units into a 512-bit FMA unit" trick on port 0 + port 1. On the higher-end CPUs, port 5 features a genuine 512-bit FMA unit, which will only be used if you're computing 512-bit FMAs (notice the "zmm only" annotation). So if your workload is purely compute-bound, you should be able to run 2x more FMAs per cycle on those CPUs and get the expected 2x FLOPs benefit. I do not know why Intel do not allow port 5 to execute 256-bit FMAs. I thought it would, and just noticed my mistake while reviewing this comment.

Also, for reasons detailed below, many workloads are bound not by FLOPs but by memory operations, and here again using AVX-512 with its native 512-bit vector width can help by letting you shuffle around more data per cycle between the L1 cache and CPU registers. If you look at the aforementioned microarchitecture diagram, you will see that e.g. ports 2 and 3 can load two 512-bit vectors per cycle, whereas if you use 256-bit vector instructions, they will only load two 256-bit vectors per cycle, i.e. half as much data. If this becomes the limiting factor, then 512-bit instructions can be 2x faster even on lower-end CPUs. (But this is only the limiting factor if you manage to get the input data to fit in L1 cache. Otherwise you will be limited by the bandwidth of the interconnect between the L1 and L2 cache, which is 64B/512b per cycle and thus fully saturated already when using 256-bit memory loads.)
This diagram does not feature architectural registers, but you will see a handy schematic of these here: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions. Notice that storage from the 512-bit ZMM registers is reused by the 256-bit YMM and 128-bit XMM registers. Notice also that when used in 256-bit mode, AVX-512 instructions still have access to 16 more YMM registers than classic AVX(2) instructions. To see how registers come into the picture, you need to know that in the lucky scenario where all your input data is in the L1 cache, each core of an Intel *Lake CPU can do all of the following (and a few extras) in a single CPU cycle:
Since FMAs have three input operands, it quickly becomes obvious that if you were to load all of these operands from memory to registers every time you want to execute an FMA, you would be limited by performance characteristic number 2 (how many SIMD vectors you can load from memory to registers per CPU cycle) and not by performance characteristic number 1 (how many multiply-add floating-point operations you can perform per cycle). The secret to avoiding this outcome is to keep reused operands in CPU registers rather than constantly reloading them from memory.

When using higher-level programming models like autovectorization or GCC's vector extension, your compiler does its best to take care of this matter for you. But it is still limited in this caching process by the number of registers that the CPU architecture provides it with. In AVX2 you have 16 of these, whereas in AVX-512 you have 32 of them (even when using the AVX-512 instruction set to process 256-bit vectors), which gives the compiler twice as much "breathing room" to intelligently cache your data. This is one of the reasons why the AVX-512 instruction set can improve your performance, even when used on 256-bit vectors.

Another reason is that the AVX-512 instruction set is better than legacy AVX at shuffling data around between CPU registers, which comes in handy when your input data is not quite organized the way SIMD instruction sets like it. For example, when working with complex numbers, operating on arrays of std::complex takes preprocessing on the CPU side, which you could avoid if you had one array of real parts and one array of imaginary parts instead. But when you do have to engage in such deinterleaving, AVX-512 instructions can do it faster than legacy AVX instructions.

TL;DR: AVX-512 is a bit of a scam when it comes to number-crunching power, as it only brings you the expected 2x performance benefit on specific high-end CPUs with two 512-bit FMA units. But it has other benefits that can improve the performance of code that's limited by something other than raw FLOPs.
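To illustrate the "one array of real parts, one array of imaginary parts" point with the compiler vector extensions mentioned above, here is a minimal sketch (the type names fptype_v and cxtype_v are simplified placeholders, not the actual madgraph4gpu types):

```cpp
// GCC/clang vector extension: 4 doubles = 256 bits (use vector_size(64) for 512-bit zmm vectors)
typedef double fptype_v __attribute__((vector_size(32)));

// "Structure of arrays" complex type: real and imaginary parts live in separate SIMD vectors,
// so a complex multiply is plain vertical arithmetic with no deinterleaving shuffles.
struct cxtype_v { fptype_v r, i; };

inline cxtype_v cxmul(const cxtype_v& a, const cxtype_v& b) {
  // (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
  return { a.r * b.r - a.i * b.i, a.r * b.i + a.i * b.r };  // compiles to vmulpd/vfmadd*/vfmsub* on ymm or zmm
}
```

By contrast, an array of interleaved std::complex values would first need shuffle instructions to separate real and imaginary parts, which is exactly the preprocessing cost described above.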
Sorry for the naive question, but isn't AVX512 with 256-bit registers the same as AVX2? Looking at VTune, I can see that the average CPU frequency decreases from 2.4 GHz to 2.2 GHz. I also noticed that I can't see any FMA instructions with the 512z build. From what I can see here, FMA is provided by the AVX512IFMA extension; however, the CPU that I am using does not provide this. It is a Xeon Silver 4216, released in Q2 2019, so reasonably recent.
No, because it has access to AVX-512 specific instructions (more efficient shuffling of data across registers, better ways to mask unwanted operations when control flow diverges... see https://en.wikipedia.org/wiki/AVX-512 for an extended list) and 16 more ymm16..ymm31 architectural 256-bit registers (which allows you or the compiler to cache more frequently used data in registers instead of going through L1 cache loads/stores all the time).
This is expected. To prevent overheating, CPUs downclock when wider vectorization is used (or, equivalently, overclock when narrower vectorization is used; take it in whichever sense thou wilt). The rules for CPU frequency scaling are a bit complicated and have changed a lot over the history of Intel processors, but this page is a good starting point if you want to dig into this further: https://en.wikichip.org/wiki/intel/frequency_behavior .
As the name suggests, AVX512IFMA is about integer fused multiply-add, which is somewhat niche. From previous discussions, I'm assuming that you're doing floating-point computations here, for which the FMA operation should be part of the basic AVX512F ("Foundation") instruction set. Therefore, if you're not seeing any FMA in your code, there must be another reason. Maybe your code does not contain clear multiply-add sequences, or maybe the compiler does not feel like turning them into FMA, as that can change numerical results (Clang cares about this unless you explicitly allow it). In any case, note that the FMA units are also used (albeit obviously less efficiently) for individual add/sub/mul operations. So if your code only uses those, you can still aim for a peak floating-point throughput of two vector add/sub/mul per cycle on CPUs with two FMA units, which is half of what you can get with FMA but still very much honorable :)
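A minimal example of the kind of loop a compiler can contract into vfmadd* instructions (a generic sketch; the flag behaviour is compiler-version dependent, so treat the comments as assumptions to verify):

```cpp
// y[i] = a*x[i] + y[i] is the classic multiply-add candidate.
// gcc typically contracts this into FMA by default (-ffp-contract=fast) once FMA is enabled
// via -mfma or -mavx512f; older clang versions need -ffp-contract=fast (or -ffast-math),
// since contraction changes rounding.
void axpy(int n, double a, const double* __restrict__ x, double* __restrict__ y) {
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];   // expected to become vfmadd213pd/vfmadd231pd on ymm or zmm registers
}
```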
I don't understand then why the 512z build is not using FMA when it is there in the 512y build. The code and build instructions seem to be identical; the only difference is the addition of

It looks like those AVX-512 specific instructions are doing something good, as I see a 20% speed-up with the 512y build.
Hi, first of all thanks @HadrienG2! I will need some time to digest all this information :-) On the question from @lfield: it took me some time to even be bothered to try out AVX512 with 256-bit width, because I could not understand why it would work. This is a very good suggestion that @sponce had given me, and in the end I tried it and it works! A few technical details
Those 95 symbols give the speedup over AVX2. This is a very home-made categorization, no guarantee it is exact, but it gives an idea. Those symbols are matched in madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/simdSymSummary.sh (line 60, commit f604af9), by the regular expressions '^v.*dqa(32|64).*(x|y)mm', '^v.*(32|64)x2.*(x|y)mm' and '^vpcmpneqq.*(x|y)mm'. In particular, note that these are NOT fma instructions. I mean, there are a lot of fma instructions in all builds, but the difference between AVX2 and 512y is not related to those (for example, vfmadd132pd 0x160(%rdi),%ymm1,%ymm9 is in AVX2 and something similar is in 512y too).
I investigated various options, with help from your colleagues in IT-CM (https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF1783725). I wondered if the
|
I haven't tried on AMD, but from what I have seen I generally agree. AVX512 with 256-bit registers seems to be the best. Linus doesn't like it (article).
Hi Stefan,
your summary is good but I would rephrase a few points:
- AVX2 works on Intel (and I take it from you, also on AMD)
- AVX2 works on gcc and clang
- on Intel/gcc, it is still better to use 256-bit AVX512 than (256-bit) AVX2; you get maybe 10% more due to a few instructions from AVX512VL
- about this 256-bit AVX512 on clang, I should recheck the numbers, but I think AVX2 is better on clang (they did not invest much in AVX512), I may be wrong
- AVX512 with 512 bits has a theoretical factor 2 advantage over AVX512 with 256 bits, and it is this extra factor I am still looking for: I thought maybe Gold vs Silver would give it (2 FMA units against 1), but I see no improvement
- if you do NOT get the extra factor 2, it is clear that you are better off with 256-bit AVX512 than with 512-bit AVX512, e.g. because you get a clock slowdown... but if you did get the factor 2, then I think it should compensate for the slower clock
- there is probably something in memory access that is also relevant in the discussion above: for us, when we have epoch3, comparing ggttgg to eemumu will be useful (you are more compute bound and memory access is even less important)
I do not understand your last point on cores: in any case you can use AVX2 and AVX512 on the same number of cores, I see no difference there.
As for your conclusion, maybe rephrase it as "design for 256-bit registers" - but AVX512 can still help there (AVX512VL = on ymm registers, not on zmm).
Andrea
That's about AMD deciding not to go for anything more complicated than AVX2, but to put more cores on a chip instead. I don't know where Intel's top models are, but the Threadrippers have 64 cores on a chip, each of which can do AVX2. A quick search for Xeon W resulted in 28 cores max.
@hageboeck I'd agree with your general sentiment that if, as developers, we got to choose what WLCG put on their grid sites, AMD + AVX2 would be an infinitely better choice than Intel + AVX-512 in pretty much every respect. But personally, I don't get to choose the hardware, and I know that many grid sites have bought or are still buying Intel chips. So the route that I personally take is "make it work well on AVX2 first, and then make it work well on AVX-512 as well if you have extra time". |
That's a good approach. I made it compile with AVX512 and stopped. |
I think that due to the dynamic frequency behavior, a theoretical factor 2 is not achievable even for 1 thread. AFAIU, the frequency is for the CPU and not per core. So if you have 16 cores and run just one thread using AVX512, all other cores will be slowed down too. In the example, the frequency drops from 3.5 GHz to 1.9GHz if all cores are used, hence the real performance degrades further as you run more work. |
By chance, I ran some tests last week on the CORI HPC at NERSC, see PR #236. This is on a Xeon Gold 6148 CPU, so it is not meant to be very different from the Xeon Gold 6130 that I had tried previously. (See for instance https://colfaxresearch.com/xeon-2017.) Surprisingly, however, on this system I finally saw a benefit of 512-bit AVX512 "512z" over 256-bit AVX512 "512y". The throughput increase is around 10% for double and around 30% for float. Still less than the nominal factor 2, but taking into account the clock slowdown this looks quite good. Unfortunately I did not have perf - and I forgot to also run my tests with aggressive inlining or LTO (issue #229). Note also that this CPU is 30-40% faster than those I previously tried. Both my usual Silver 4216 and my test Gold 6130 gave 1.31E6 throughput in scalar mode, while this new Gold 6148 gives 1.73E6 in scalar mode, so there is clearly something else which has changed and could be very relevant. Apart from the processor speed itself, which might be ~15% higher, the 6148 has a large L2 cache and a larger L3 cache, see https://versus.com/en/intel-xeon-gold-6130-vs-intel-xeon-gold-6148. It could be interesting to understand better if and how this is relevant in these tests.
(This is a long email that I wrote to @valassi in response to an inquiry from him. I am posting a slightly redacted version here at his request.)

The mkFit project (https://github.com/trackreco/mkFit, https://trackreco.github.io/) has gained quite a bit of experience with AVX-512 in the context of vectorized, parallelized tracking. And we too have been confronted from time to time with puzzles in the vector speedup results. Apparently you are already aware of one that stumped me for a while, namely the handicap of most Silver and even some Gold Xeon processors that have just one AVX-512 VPU per core instead of two. (Don't worry, the Gold 6130 and 6148 have two.)

My main reaction to [this thread] is that it would be good for you to try the Intel C/C++ compiler instead of gcc. The Intel C/C++ "Classic" Compiler is now free to download through the Intel oneAPI HPC Toolkit. (You'll need the Base Toolkit as well. Be sure to use icc/icpc rather than icx/icpx, which is based on LLVM.) We find that the Intel classic compilers do a much better job in getting good AVX-512 performance. Note, you do have to set -qopt-zmm-usage=high to enable zmm usage for Xeon. The question of why gcc does so much worse with AVX-512 is a difficult one to answer, in general. In our case, we think it is partly due to a failure to vectorize sin/cos, which icc does via Intel SVML. There are somewhat subtle reasons for why gcc has trouble with this, having to do with libmvec, glibc, and -ffast-math.

Even with the Intel compilers, however, it is difficult to get the full AVX-512 speedup. As you point out, dynamic frequency scaling penalizes AVX-512z relative to AVX2 or AVX-512y when all cores are in use. (Note, this is NOT true for single-core tests.) Therefore, we have recently moved away from AVX-512 entirely, even with Intel compilers, as it seems to provide little benefit when frequency scaling is factored in.

I looked through the comments in [this issue] and I would like to highlight the comments by @sponce. The part that makes vector speedup tough is Amdahl's Law, which enforces a plateau in the speedup due to the presence of "serial code". Frequency scaling can turn this plateau into a slowdown, even for code that "to the eye" is CPU-bound. Instruction scheduling may in effect serialize the instruction stream at certain points. For example, format conversions (e.g., double to float) should be avoided as they interrupt the vector pipeline. Other aspects of instruction scheduling may be baked into the hardware and out of the programmer's control. I'd like to point out that even the unavoidable act of incrementing a loop counter is "serial code". The only way to mitigate against it is to unroll the loop to a high degree, making the pipeline as deep as possible before needing to interrupt it for the compare-and-branch of the loop. But the number of registers available for such unrolling and pipelining is not infinite. (It's possible that good zmm register usage is what makes icc cleverer than gcc.)

I don't think that Amdahl's Law is the complete explanation in your case, though, based on your data. Let's postulate that the 3.5x speedup you observe for AVX2 is limited entirely by Amdahl. If we take S(N) = 1/(1-P+P/N) and solve it for the parallel fraction P based on the speedup S(N), we get P = (1-1/S(N))/(1-1/N); thus, given S(4)=3.5, we get P=0.95. That implies S(8) should be around 5.9, but even slowing it down by 30% for frequency scaling doesn't make it less than 3.5.
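Spelling out the arithmetic quoted above in formula form (nothing new, just the same Amdahl's law numbers, with N = 4 for AVX2 and N = 8 for 512-bit AVX-512):

\[
S(N) = \frac{1}{(1-P) + P/N}, \qquad P = \frac{1 - 1/S(N)}{1 - 1/N},
\]
\[
S(4) = 3.5 \;\Rightarrow\; P \approx 0.95 \;\Rightarrow\; S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9 .
\]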
Of course, this approach may be too simplistic to give any real insight. To disambiguate dynamic frequency scaling from other effects, another thing we have done is to disable Turbo Boost and look at speedup curves with the frequency scaling (mostly) removed. To shut off turbo, you can do this as root: "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo". But AVX-512 runs so hot that when all cores are active there is some processor slowdown even with turbo off.

The discussion in the GitHub issue doesn't say much about how many threads are used in your tests. For sure, hyperthreading is not your friend for CPU-bound codes. But the thread count is integral to understanding the effects of frequency scaling [as the effects are much more pronounced when all cores are active]. By the way, if it is possible for you to pin threads to cores, do so. That can make a huge difference. Unfortunately, our code is multithreaded with TBB and I am not aware of a way to control thread pinning with it.

Looking at other possible factors, one that doesn't work (to my mind) is cache size. I believe the L2 and L3 cache per core is actually the same for Gold 6130 and 6148.

I guess overall my recommendation is to try icc and compare its disassembly with that from gcc. If you download the oneAPI toolkits, you will also get access to Intel Advisor. This tool can produce a roofline plot in which the arithmetic intensity is calculated for all the key loops and functions, and each one is represented by its own Gflop/s point. Advisor also lets you drill down into the source to see the instructions generated by each line and analysis of how well each loop was vectorized. The code does not have to be compiled with icc to use the toolkit... oh, looking again at [this issue], I see Laurence is already aware of this tool, great. Another good tool for looking at assembler output from a wide range of compilers is godbolt.org.

In general the discussion in the GitHub thread is extremely well-informed and I don't know that there is a whole lot more that I can add to it, beyond what I have said above. One correction, though, is that Skylake Xeons can adjust frequency on a per-core basis, I believe. Anyway, I hope that something in the above brain dump will prove helpful to you.
Hi @srlantz, thank you so much again for all your useful feedback! (Copying also @roiser and @oliviermattelaer explicitly for info.) Apart from other points we discussed privately, here are a few points that I note down as TODOs for us in this investigation of AVX512/zmm:
I also found a couple of other links related to your project which I found very interesting, more food for thought for us:
Well this gives us a lot of things to think about and try out. Thanks again Steve for the feedback! Best, Andrea |
(Again posting comments that were initially made to Andrea via email, with some edits.)

The Matriplex code in mkFit includes implementations of key routines in intrinsics, both AVX-512 and AVX2 (with FMA). When we first started the mkFit project we were interested in Xeon Phi, and at that time the intrinsics were pretty essential to getting top performance. These days we find that auto-vectorization by the compilers (Intel especially) has caught up to a large extent, so intrinsics no longer confer such a big advantage. Still, I think the code has some interesting uses of the vgather-type intrinsics. Anyway, this is why we never bothered with vector extensions. Instead, our Matriplex classes define certain data members that are just flat C-style arrays. This provides plenty of opportunities for the compiler to vectorize simple loops in the class's methods, even without resorting to our intrinsics-based implementations that can be enabled via ifdef's.

I have no idea how Intel's C++ compilers will evolve in the future, but as of right now, the icc classic line does far better at code optimization than the icx or clang line, so we are not planning to switch anytime soon. I can send you some timing comparisons for the mkFit code if you are interested.

Yes, I am still interested in approaches like VC and VCL due to the desire to have vectorized trig functions when compiling with gcc (unless libmvec turns out to be sufficient). The idea of compiling in such a way as to produce AVX-512VL instructions is interesting. I think this is what one gets with icc when one uses -xCORE-AVX512 without -qopt-zmm-usage=high (which I have tried), but I will have to confirm.

Finally, if you are doing single-threaded tests, then your results are not explained at all by dynamic frequency scaling. On 1 core, Turbo Boost is able to speed up AVX-512 nearly as much as AVX2 and even SSE.

(from a follow-on message to Andrea...)

I would be particularly interested to know if the performance improvement you saw with AVX-512 on the Gold 6148 was really due to the different hardware, or due to the use of gcc-10 rather than 9 (as you mentioned on GH). A few tests with gcc 10 on your Gold 6130 machine should let you know. I ought to mention that our usual test machine does in fact have dual Gold 6130 processors... In the past, I have done some performance tests on other Skylake models, including at least one Platinum, and I did see better performance than could be explained purely by the ratio of clock speeds. However, it was not as dramatic as what you found in your observations. So I might lean more toward the gcc-10 explanation in your case.

Note that the OS can make a difference too, in that libc is baked into the OS. I have found that CentOS 7 is based on glibc 2.17, while CentOS 8 has glibc 2.31. This matters because libmvec (which can be linked by gcc >= 8.1) is not available prior to glibc 2.22 (https://sourceware.org/glibc/wiki/libmvec). I don't know if this makes a difference to your sqrt() or not, or how important that would be in your case.
One update to the above is that I just tried a test with the "512y" idea, i.e., the options -march=skylake-avx512 -mprefer-vector-width=256, using gcc 9.2.1. I found that the gcc-based performance of mkFit was more characteristic of previous tests with AVX-512 (i.e., slower) than of pure AVX2. This was on a dual Gold 6142 with Turbo Boost enabled, running 64 threads on 32 cores, while handling up to 32 events in parallel. In a particular test scenario, the event loop ran in 14.0 sec. vs. 10.8 sec. for pure AVX2 (-mavx2 -mfma). This result highlights what I was saying about the difference between AVX-512 and AVX2 being far more dramatic with all cores active. In contrast, the same code compiled with icc runs the same event loop in 7.3 sec. for AVX2, and pretty nearly the same for AVX-512.
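For reference, a minimal autovectorization test kernel together with the flag sets quoted in this thread (the 512-bit gcc spelling is my assumption and should be adapted to the compiler version at hand; inspecting the object file with objdump -d then shows whether ymm or zmm registers are used):

```cpp
// Build variants (as quoted above; adjust for your compiler):
//   gcc "avx2":  g++  -O3 -mavx2 -mfma                                    -c kernel.cc
//   gcc "512y":  g++  -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c kernel.cc
//   gcc "512z":  g++  -O3 -march=skylake-avx512 -mprefer-vector-width=512 -c kernel.cc   (assumed spelling)
//   icc (zmm):   icpc -O3 -xCORE-AVX512 -qopt-zmm-usage=high              -c kernel.cc
void maddloop(int n, const double* __restrict__ a, const double* __restrict__ b, double* __restrict__ c) {
  for (int i = 0; i < n; ++i)
    c[i] += a[i] * b[i];   // simple FMA-friendly loop for the compiler to autovectorize
}
```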
I want to add that the known issue of a significant frequency drop with AVX512z (AVX512 with 512-bit zmm registers), and the resulting performance drop, was improved in the latest Intel CPUs, e.g. Icelake and the Sunny Cove core microarchitecture.
Thanks a lot @srlantz and @ivorobts for the info and suggestions! I will definitely put on my todo list some tests with all cores active, and also some tests on IceLake or more recent chips (Intel DevCloud also sounds like a nice suggestion). I also still need to complete and document some tests that I did on gcc10 two months ago, as I had discussed with Steve. |
I have made some tests on an Icelake Platinum. So it is difficult to say anything concrete about AVX512 with 512bit width. Again, there is a large effect in my code from using aggressive inlining or not. |
A brief update on this issue (copying this from the older issue #71 that I just closed). All results discussed so far in this issue #173 were about the vectorization of the simple physics process e e to mu mu. In epochX (issue #244) I have now backported vectorization to the Python code-generating code, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to LHC!) ggttgg process, i.e. g g to t t g g (4 particles in the final state instead of two, with QCD rather than QED - more Feynman diagrams and more complex diagrams, hence more CPU/GPU intensive and slower). The very good news is that I observe similar speedups there, or even slightly better. With respect to basic C++ with no SIMD, I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float. I also tested more precisely the effect of aggressive inlining (issue #229), mimicking LTO link time optimization. This seemed to give large performance boosts for the simpler eemumu (for reasons that I had not fully understood), but for the more complex/realistic ggttgg it seems at most irrelevant, if not counterproductive. This was an optional feature, and I will keep it disabled by default. The details are below. See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto

DOUBLE
For double, INLINING does not pay off, neither without nor with SIMD; it is worse than no inlining. What is interesting is that 512z is better than 512y in that case.

FLOAT
By the way, note en passant that I moved from gcc9.2 to gcc10.3 for all these results. But here I am still on a Xeon Silver. Concerning the specific issue of AVX512 with 512bit width zmm registers ("512z") discussed in this thread #173, the results are essentially unchanged.
Now that I have ggttgg vectorized, at some point I will rerun the same tests on other machines, including Xeon Platinum or Skylake. I need to document how to run the epochX tests for ggttgg, but it is essentially the same as the epoch1 tests for eemumu. |
I have some interesting results from tests at the Juwels HPC (thanks @roiser for getting the resources!). These were merged in PR #381. Specifically, the results are here
Looking at the most complex physics process considered, gg to ttggg (the last column on the right):
With respect to AVX2 or AVX512/ymm, these throughputs are now essentially a factor 2 better. This seems to confirm @HadrienG2's initial suggestion that these higher-end processors have two FMA units and that this helps in exploiting zmm computations. Note that this is on a Juwels Cluster login node! On Juwels Cluster compute nodes, the CPUs are even higher-end (Platinum instead of Gold), so I would expect similar results if not better. These are still results for single-threaded computations. Clearly, when we fill up the machine with several threads the results may change, but this is the first solid evidence I see that speedups close to x8 (double) or x16 (float) can be achieved. Previously I was always quoting x4 (double) and x8 (float) from AVX512/ymm. Promising results...
I come back to this issue after quite some time. I confirm that I was able to obtain the full theoretical speedup, x8 for doubles and x16 for floats, on Intel Gold nodes, which I believe have two FMA units. The results have been confirmed for one CPU core. When going to several CPU cores, the speedup is lower than a full x8 or x16 - this might be the effect of clock slowdown, although I have not fully investigated that. I keep this issue open for the moment, just with the idea of investigating that. In any case, it is clear that "512z", i.e. AVX512 with 512-bit zmm registers, is faster than "512y", i.e. AVX512 with ymm registers, when on Gold nodes with two FMA units. On Silver nodes, conversely, 512y is still faster than 512z (and also faster than avx2). The results on one core are published at https://doi.org/10.22323/1.414.0212. They also include the effect on a full workflow (where the speedup is - for the moment - lower than x8 and x16 due to Amdahl). The results on many cores were presented at ACAT: https://indico.cern.ch/event/1106990/contributions/4997226/. Thanks again to all of you in this thread for a lot of relevant feedback to get to this point!
This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.
(A preliminary observation: the clang vectorization still needs a cross-check, see #172)
A couple of observations on performance
It would be useful to understand these issues a bit better and see if we can squeeze out something. Two points in particular:
The 'symbols' line comes from this script, and in particular from this line: madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/simdSymSummary.sh (line 60, commit dfcc0f9).
The "512y" symbols are 0 in clang and are more than 0 in gcc for 512y (while they are 0 in avx2). Essentially, the idea (expanding on a great suggestion by @sponce) was that 512y is a bit faster than avx2 because some extra symbols (from AVX512VL I guess) are used. I did a systematic analysis and found that a few symbols matching these regular expressions are probably those that make the difference
In clang, however, there does not seem to be any such extra benefit from AVX512VL. This is also consistent with what was suggested by @hageboeck (see also https://arxiv.org/pdf/2003.12875.pdf), namely that clang is best used with AVX2. The idea here is to do a brief cross-check, and potentially try newer clang versions. Note by the way that for single precision, instead, 512y is very slightly faster than avx2 also in clang, d65129b
Low priority... but at least it's documented before I move on.