Validate clang-style "no cxtype ref" vectorization and use it as default #172
Comments
Compare the commit logs of these two commits:
With cxtype_ref, gcc/double
Without cxtype_ref, gcc/double
The relevant lines are
They should be strictly identical, not just statistically compatible.
A small comment: I was so sure that this should not make a difference in the 'none' implementation that I did not print out the tag "[cxtype_ref=YES]" or "[cxtype_ref=NO]" in that case. Maybe better to add it. Well, one of the things to cross-check...
This is peculiar. I cannot reproduce it. I went back to 4d6870d, which for gcc was giving 1.372113e-02; I now instead get the expected 1.371706e-02... I also checked the same commit with clang, and there I do get 1.372113e-02 (and the printout says clang, so it's not a mismatch in the compiler printout). Did I mix random numbers from two compilers?... In any case, with the current latest master da19d3c, I get the expected 1.371706e-02 on gcc and 1.372113e-02 on clang.
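To make the distinction between "strictly identical" and "statistically compatible" concrete, here is a minimal sketch using the two mean values quoted in this thread, with the errors printed in the logs. The `Measurement` struct and the helper names are hypothetical, and the pull assumes the two samples are independent, which is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for a "mean +- error" value as printed in the logs
struct Measurement
{
  double mean;
  double error;
};

// Strict identity: the printed values must match exactly
inline bool strictlyIdentical( const Measurement& a, const Measurement& b )
{
  return a.mean == b.mean && a.error == b.error;
}

// Difference in units of the combined standard error
// (assumes independent samples, for illustration only)
inline double pull( const Measurement& a, const Measurement& b )
{
  const double sigma = std::sqrt( a.error * a.error + b.error * b.error );
  assert( sigma > 0 );
  return std::fabs( a.mean - b.mean ) / sigma;
}

// Values quoted in this thread (MeanMatrixElemValue, GeV^0)
const Measurement resultA{ 1.371706e-02, 3.270315e-06 };
const Measurement resultB{ 1.372113e-02, 3.270608e-06 };
```

With these inputs `strictlyIdentical( resultA, resultB )` is false while `pull( resultA, resultB )` is a little under one, so a purely statistical comparison would not flag the difference.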
…is invalid!) The problem was that with clang11/12 I cannot use CUDA, hence I was using common random numbers. With clang10 I get the usual results.
On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.216996e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.534132 sec
real 0m7.546s
=Symbols in CPPProcess.o= (~sse4: 1326) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.355185e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.935344 sec
real 0m1.272s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.591017e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.842237 sec
real 0m4.854s
=Symbols in CPPProcess.o= (~sse4: 3607) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.121843e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.654099 sec
real 0m3.666s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3023) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.101228e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.639943 sec
real 0m3.651s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2735) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.731910e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.072267 sec
real 0m4.084s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3524) (512y: 0) (512z: 1164)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [clang 10.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.206297e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.639024 sec
real 0m7.650s
=Symbols in CPPProcess.o= (~sse4: 1238) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470097e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.743197 sec
real 0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
…madgraph5#172) With clang11 I must use common random numbers and I get a different result. The performance is the same as clang10. This is the new clang 11.1 (issue madgraph5#182).
On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.224632e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.318761 sec
real 0m7.331s
=Symbols in CPPProcess.o= (~sse4: 1259) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.650708e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.619855 sec
real 0m4.629s
=Symbols in CPPProcess.o= (~sse4: 3608) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.143089e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.441790 sec
real 0m3.451s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3005) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.118344e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.454231 sec
real 0m3.464s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2727) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.716717e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.919487 sec
real 0m3.929s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3552) (512y: 0) (512z: 1196)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [clang 11.1.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.225500e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.380646 sec
real 0m7.390s
=Symbols in CPPProcess.o= (~sse4: 1166) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
…madgraph5#172) With clang12 I must use common random numbers and I get a different result. The performance in the new clang 12.0 (issue madgraph5#182) is worse than in clang 10 or 11. Note that 512y is very slightly better than avx2 with clang12. It is very slightly worse in clang10 and clang11. Will probably use 512y as default also in clang for simplicity?
On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.227871e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.514820 sec
real 0m7.525s
=Symbols in CPPProcess.o= (~sse4: 1234) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.588040e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.794895 sec
real 0m4.805s
=Symbols in CPPProcess.o= (~sse4: 3664) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.458991e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.780732 sec
real 0m3.791s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3307) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.465177e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.777371 sec
real 0m3.787s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2983) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.349914e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.251994 sec
real 0m4.262s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3978) (512y: 0) (512z: 1183)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [clang 12.0.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.221702e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.531229 sec
real 0m7.541s
=Symbols in CPPProcess.o= (~sse4: 1120) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.379180e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.732928 sec
real 0m1.025s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
…5#182)" This reverts commit 50c12e5. Keep AVX2 on clang for the moment as this is actually faster than gcc! Latest baseline performance on gcc with cxtype_ref (issue madgraph5#172):
On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306401e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.185372 sec
real 0m7.195s
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.265629e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.742860 sec
real 0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.504491e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.867292 sec
real 0m4.878s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.592850e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.656376 sec
real 0m3.666s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.916156e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.581543 sec
real 0m3.591s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.705860e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.008144 sec
real 0m4.018s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147166e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.840836 sec
real 0m7.850s
=Symbols in CPPProcess.o= (~sse4: 567) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.417572e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.735049 sec
real 0m1.027s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
This is completely understood now: there is no bug. The problem is that clang11 and clang12 are not supported by cuda11, so in that case I build with common random numbers, which of course give different physics results. I checked that the first 32 random numbers were different and that the curand seeds were not used; then it was obvious... It is not clear why I saw it in gcc at some point; maybe I built it with my usual "export CUDA_HOME=invalid" hack that I need on clang 11 and 12.
About the second issue, whether the clang version can be used in production also for gcc: this is now validated. One could use that version. However, it is very slightly slower (a few per mille), and I like the original operator[] idea. I will keep it as is for the moment.
En passant, I have validated the latest cvmfs installs of clang11.1 and clang12.0 in issue #182. In PR #187 I have committed a few tests and minor patches.
This can be closed. Not a bug.
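The check on the first 32 random numbers can be sketched as follows. This is only an illustration: the two standard-library engines are stand-ins for the curand and common-random generators (which are not reproduced here), and `firstRandoms` and `countDifferences` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// First n numbers from a given engine, scaled by the engine's maximum value.
// The engines used with this helper are stand-ins, not the project's generators.
template<class Engine>
std::vector<double> firstRandoms( unsigned seed, std::size_t n )
{
  Engine eng( seed );
  std::vector<double> out;
  out.reserve( n );
  for( std::size_t i = 0; i < n; ++i )
    out.push_back( static_cast<double>( eng() ) / static_cast<double>( Engine::max() ) );
  return out;
}

// Number of positions at which the two sequences are not bitwise equal
inline std::size_t countDifferences( const std::vector<double>& a, const std::vector<double>& b )
{
  assert( a.size() == b.size() );
  std::size_t ndiff = 0;
  for( std::size_t i = 0; i < a.size(); ++i )
    if( a[i] != b[i] ) ++ndiff;
  return ndiff;
}
```

Calling `countDifferences( firstRandoms<std::mt19937>( seed, 32 ), firstRandoms<std::minstd_rand>( seed, 32 ) )` gives a nonzero count, whereas the same engine with the same seed gives zero. Different input sequences are enough to explain different mean matrix element values.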
See additional comments in #1004. There were issues in the bracket implementation on gcc14.2 (now fixed), so one can ask whether we should use the 'clang' no-bracket version also in gcc. I still prefer to keep the bracket version in gcc for now.
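For readers unfamiliar with the two styles, here is a rough sketch of the difference between a "bracket" interface, where operator[] returns a cxtype_ref proxy, and a "no-bracket" interface with explicit accessors. It is not the code in mgOnGpuVectors.h: a plain array stands in for the compiler SIMD vector type, and `fptype_v`, `cxtype_v`, `neppV`, `set` and `get` are names chosen here as assumptions for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

constexpr std::size_t neppV = 4; // assumed width: 4 doubles per vector, as for AVX2

// Stand-in for a compiler SIMD vector type (plain array, to keep the sketch portable)
struct fptype_v
{
  double v[neppV];
};

// Proxy holding references to the real and imaginary parts of one vector element
class cxtype_ref
{
public:
  cxtype_ref( double& r, double& i ) : m_real( r ), m_imag( i ) {}
  cxtype_ref& operator=( const std::complex<double>& c )
  {
    m_real = c.real();
    m_imag = c.imag();
    return *this;
  }
  operator std::complex<double>() const { return { m_real, m_imag }; }
private:
  double& m_real;
  double& m_imag;
};

// Vector of complex numbers stored as separate real and imaginary vectors
class cxtype_v
{
public:
  // "Bracket" style: element access through a reference proxy
  cxtype_ref operator[]( std::size_t i )
  {
    assert( i < neppV );
    return cxtype_ref( m_real.v[i], m_imag.v[i] );
  }
  // "No-bracket" style: explicit accessors, no references to vector elements
  void set( std::size_t i, const std::complex<double>& c )
  {
    assert( i < neppV );
    m_real.v[i] = c.real();
    m_imag.v[i] = c.imag();
  }
  std::complex<double> get( std::size_t i ) const
  {
    assert( i < neppV );
    return { m_real.v[i], m_imag.v[i] };
  }
private:
  fptype_v m_real{};
  fptype_v m_imag{};
};
```

Writing `a[i] = c` and `b.set( i, c )` stores the same values, so any difference between the two styles should show up only in performance, not in the physics results.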
This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.
There are presently two slightly different vectorisation implementations
In both implementations (take double with AVX2, i.e. 4 doubles per vector, as an example)
A small difference between the two implementations is the following
madgraph4gpu/epoch1/cuda/ee_mumu/src/mgOnGpuVectors.h
Line 70 in dfcc0f9
However:
So, this issue is just about understanding if there is a bug and where. Maybe I just read the results in the wrong way and there is no issue.