Validate clang-style "no cxtype ref" vectorization and use it as default #172

valassi · 2021-04-23T16:16:49Z

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

There are presently two slightly different vectorisation implementations:

the original one developed on gcc
a recent one where I had to tweak a few things for clang

In both implementations (taking double with AVX2, i.e. 4 doubles per vector, as an example):

floats are 4-vectors FFFF
complex numbers are implemented as two float vectors RRRRIIII (and not as RIRIRIRI)
compiler vector extensions (those of gcc, or those of clang) are used for vectors
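As a minimal sketch of the RRRRIIII layout described above (the type and function names here are hypothetical, chosen for illustration, and are not taken from mgOnGpuVectors.h), a vector of four complex numbers can be held as two compiler-extension vectors, one for the real parts and one for the imaginary parts:

```cpp
#include <cstddef>

// Hypothetical names for illustration; only the RRRRIIII layout comes from the text above.
// A vector of 4 doubles using the gcc/clang vector extension (32 bytes, i.e. AVX2 width).
typedef double fptype_v_sketch __attribute__( ( vector_size( 32 ) ) );

// Four complex numbers stored as RRRRIIII: one vector of real parts, one of imaginary parts.
struct cxtype_v_sketch
{
  fptype_v_sketch m_real; // RRRR
  fptype_v_sketch m_imag; // IIII
};

// Complex multiplication of 4 pairs at once. With the RRRRIIII layout this needs only
// element-wise vector arithmetic, with no shuffling between real and imaginary lanes.
// The output is passed by reference (out must not alias a or b).
inline void cxmul_sketch( const cxtype_v_sketch& a, const cxtype_v_sketch& b, cxtype_v_sketch& out )
{
  out.m_real = a.m_real * b.m_real - a.m_imag * b.m_imag;
  out.m_imag = a.m_real * b.m_imag + a.m_imag * b.m_real;
}
```

With an interleaved RIRIRIRI layout the same product would need lane permutations, which is presumably one reason to prefer two separate vectors.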

A small difference between the two implementations is the following:

the original gcc version introduces a "cxtype_ref" class that is only a wrapper around two non-const float references, so that cxtype_v[0] returns a reference to the first R and the first I in RRRRIIII; this is the operator[]:

madgraph4gpu/epoch1/cuda/ee_mumu/src/mgOnGpuVectors.h

Line 70 in dfcc0f9

cxtype_ref operator[]( size_t i ) const { return cxtype_ref( m_real[i], m_imag[i] ); }
in clang this is not possible because "non-const reference cannot bind to vector element"
initially I tried to use in clang an operator[] returning a pair of values rather than a pair of references; however, this led to wrong results (only in the testxxx tests, not in the eemumu ME averages?)... the issue is that in some places it was still used as one would use a non-const reference (and I was surprised that the code built at all)
anyway, in the end I realised that this operator[] was really only needed in a minimal number of places, so I built a clang implementation where there is no such operator[]
now all the tests pass, and actually this implementation can also work on gcc, and it is much simpler, so I'd like to use it as default (removing cxtype_ref)
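One plausible way a by-value operator[] can build and still give wrong results is sketched below. This is an illustration under assumptions, not the actual madgraph4gpu code: the struct names, the get/set accessors and the use of std::complex are all hypothetical. In C++, assigning to a temporary of class type is legal, so a write through a by-value operator[] compiles but is silently lost.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical names throughout; std::complex stands in for the scalar complex type.
typedef double fptype_v4_sketch __attribute__( ( vector_size( 32 ) ) ); // 4 doubles
typedef std::complex<double> cxtype_sketch;

// Variant A: operator[] returns a pair of VALUES, i.e. a temporary copy.
struct cxtype_v_byvalue
{
  fptype_v4_sketch m_real, m_imag;
  cxtype_sketch operator[]( std::size_t i ) const { return cxtype_sketch( m_real[i], m_imag[i] ); }
};

// Variant B: no operator[] at all, only explicit element reads and writes.
struct cxtype_v_noref
{
  fptype_v4_sketch m_real, m_imag;
  cxtype_sketch get( std::size_t i ) const { return cxtype_sketch( m_real[i], m_imag[i] ); }
  void set( std::size_t i, const cxtype_sketch& c )
  {
    m_real[i] = c.real(); // writing one vector element is accepted by both gcc and clang
    m_imag[i] = c.imag();
  }
};

// Returns true if a write through variant A leaves the vector unchanged.
inline bool byValueWriteIsLost()
{
  const fptype_v4_sketch zero = { 0., 0., 0., 0. };
  cxtype_v_byvalue v;
  v.m_real = zero;
  v.m_imag = zero;
  v[0] = cxtype_sketch( 1., 2. ); // compiles, but assigns to a temporary: v is not modified
  return v.m_real[0] == 0. && v.m_imag[0] == 0.;
}
```

Variant B makes every write explicit, so code that was relying on reference semantics no longer compiles by accident; that is in the spirit of removing operator[] altogether.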

However:

I observed that the average ME that is printed out differs depending on whether this implementation is used or not
What is really puzzling is that this is true also for the 'none' implementation (where there are no vectors at all, no compiler vector extensions, no need for any such thing)

So, this issue is just about understanding if there is a bug and where. Maybe I just read the results in the wrong way and there is no issue.

valassi · 2021-04-23T16:23:08Z

Compare the commit logs of these two commits

With cxtype_ref, gcc/double
8edae31

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

Without cxtype_ref, gcc/double
4d6870d

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.003754 sec
real    0m7.011s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.148541 sec
real    0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.705370 sec
real    0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The relevant lines are

MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0

They should be strictly identical, not just statistically compatible
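To make the distinction concrete, here is a small sketch (the helper name is hypothetical, and treating the two errors as independent is an assumption made only for illustration) that expresses the difference between the two means in units of their combined uncertainty:

```cpp
#include <cmath>

// Hypothetical helper: |mean1 - mean2| divided by the quadrature sum of the two errors.
// Assumes, for illustration only, that the two uncertainties are independent.
inline double pullSketch( double mean1, double err1, double mean2, double err2 )
{
  return std::fabs( mean1 - mean2 ) / std::sqrt( err1 * err1 + err2 * err2 );
}
```

Calling pullSketch( 1.371706e-02, 3.270315e-06, 1.372113e-02, 3.270608e-06 ) gives a value below one standard deviation, so the two results are statistically compatible. A build that differs only in the vectorisation wrapper but uses the same random numbers is nevertheless expected to reproduce the same printed value, which is why the mismatch is suspicious.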

valassi · 2021-04-23T16:57:40Z

A small comment: I was so sure that this should not make a difference in the 'none' implementation that I did not print out the tag "[cxtype_ref=YES]" or "[cxtype_ref=NO]" in that case. Maybe better to add it. Well, one of the things to cross-check...

valassi · 2021-04-27T14:17:29Z

This is peculiar. I cannot reproduce it.

I went back to 4d6870d, which for gcc was giving 1.372113e-02; I now instead get the expected 1.371706e-02... I also checked the same commit with clang, and there I do get 1.372113e-02 (and the printout says clang, so it's not a mismatch in the compiler printout).

Did I mix random numbers from two compilers?...

In any case, with the current latest master da19d3c, I get the expected 1.371706e-02 on gcc and 1.372113e-02 on clang.

…ph5#172)

…is invalid!) The problem was that with clang11/12 I cannot use CUDA, hence I was using common random numbers. With clang10 I get the usual results.

On itscrd70.cern.ch:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.216996e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.534132 sec
real    0m7.546s
=Symbols in CPPProcess.o= (~sse4: 1326) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.355185e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.935344 sec
real    0m1.272s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.591017e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.842237 sec
real    0m4.854s
=Symbols in CPPProcess.o= (~sse4: 3607) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.121843e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.654099 sec
real    0m3.666s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3023) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.101228e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.639943 sec
real    0m3.651s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2735) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.731910e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.072267 sec
real    0m4.084s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3524) (512y:    0) (512z: 1164)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.206297e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.639024 sec
real    0m7.650s
=Symbols in CPPProcess.o= (~sse4: 1238) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470097e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.743197 sec
real    0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------

…madgraph5#172) With clang11 I must use common random numbers and I get a different result. The performance is the same as clang10. This is the new clang 11.1 (issue madgraph5#182).

On itscrd70.cern.ch:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.224632e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.318761 sec
real    0m7.331s
=Symbols in CPPProcess.o= (~sse4: 1259) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.650708e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.619855 sec
real    0m4.629s
=Symbols in CPPProcess.o= (~sse4: 3608) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.143089e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.441790 sec
real    0m3.451s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3005) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.118344e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.454231 sec
real    0m3.464s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.716717e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.919487 sec
real    0m3.929s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3552) (512y:    0) (512z: 1196)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.225500e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.380646 sec
real    0m7.390s
=Symbols in CPPProcess.o= (~sse4: 1166) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

…madgraph5#172) With clang12 I must use common random numbers and I get a different result. The performance in the new clang 12.0 (issue madgraph5#182) is worse than 10 or 11. Note that 512y is very slightly better than avx2 with clang12. It is very slightly worse in clang10 and clang11. Will probably use 512y as default also in clang for simplicity?

On itscrd70.cern.ch:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.227871e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.514820 sec
real    0m7.525s
=Symbols in CPPProcess.o= (~sse4: 1234) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.588040e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.794895 sec
real    0m4.805s
=Symbols in CPPProcess.o= (~sse4: 3664) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.458991e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.780732 sec
real    0m3.791s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3307) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.465177e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.777371 sec
real    0m3.787s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2983) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.349914e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.251994 sec
real    0m4.262s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3978) (512y:    0) (512z: 1183)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.221702e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.531229 sec
real    0m7.541s
=Symbols in CPPProcess.o= (~sse4: 1120) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.379180e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.732928 sec
real    0m1.025s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------

…5#182)" This reverts commit 50c12e5. Keep AVX2 on clang for the moment as this is actually faster than gcc!

Latest baseline performance on gcc with cxtype_ref (issue madgraph5#172):

On itscrd70.cern.ch:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306401e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.185372 sec
real    0m7.195s
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.265629e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.742860 sec
real    0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.504491e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.867292 sec
real    0m4.878s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.592850e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.656376 sec
real    0m3.666s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.916156e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.581543 sec
real    0m3.591s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.705860e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.008144 sec
real    0m4.018s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147166e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.840836 sec
real    0m7.850s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.417572e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.735049 sec
real    0m1.027s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------

…ph5#172 madgraph5#182)

valassi · 2021-04-27T15:45:41Z

This is completely understood now: there is no bug. The problem is that clang11 and clang12 are not supported by cuda11, so in that case I build with common random numbers, which of course give different physics. I checked that the first 32 random numbers were different and that the curand seeds are not used; then it was obvious... It is not clear why I saw it in gcc at some point; maybe I built it with my usual "export CUDA_HOME=invalid" hack that I need on clang 11 and 12.
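The effect can be sketched with the C++ standard library alone. In the illustration below std::mt19937 and std::minstd_rand are stand-ins (an assumption of this sketch, not the generators actually used) for the curand and "common random" sequences, and the function names are hypothetical. The same generator with the same seed reproduces the sample bit by bit, while a different generator gives a different sample even with the same seed, and hence a different mean.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws n numbers in [0,1] from a generator of type Gen seeded with `seed`.
// Gen is a stand-in: neither engine here is claimed to match curand or the common random code.
template<typename Gen>
std::vector<double> drawSketch( unsigned seed, std::size_t n )
{
  Gen gen( seed );
  std::vector<double> out;
  out.reserve( n );
  for( std::size_t i = 0; i < n; ++i )
    out.push_back( double( gen() - Gen::min() ) / double( Gen::max() - Gen::min() ) );
  return out;
}

// Plain average of a sample (summed in a fixed order, so it is reproducible bit by bit).
inline double meanSketch( const std::vector<double>& v )
{
  double sum = 0.;
  for( std::size_t i = 0; i < v.size(); ++i ) sum += v[i];
  return v.empty() ? 0. : sum / double( v.size() );
}
```

Comparing the first 32 numbers of two runs, as done above, is a quick way to tell whether two builds are sampling the same events before comparing any physics output.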

About the second issue, whether the clang version can be used in production also for gcc, this is now validated. One could use that version. However, it is very slightly slower (a few permille), and I like the original operator[] idea. I will keep it as is for the moment.

En passant, I have validated the latest cvmfs installs of clang11.1 and clang12.0 in issue #182.

In PR #187 I have committed a few tests and minor patches.

This can be closed. Not a bug.

valassi · 2024-09-18T08:47:19Z

See additional comments in #1004. There were issues in the bracket implementation on gcc14.2 (now fixed), so one can ask whether we should use the 'clang' no-bracket version also in gcc. I still prefer to keep the bracket version in gcc for now.

valassi added the bug Something isn't working label Apr 23, 2021

valassi self-assigned this Apr 23, 2021

valassi mentioned this issue Apr 23, 2021

More complete analysis of AVX512 in both gcc and clang #173

Open

valassi mentioned this issue Apr 26, 2021

Port to clang 11 and clang12 #182

Closed

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021

[klas3] add debug printouts for issue madgraph5#172

e2043e6

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021

[klas3] comment out debug printouts for issue madgraph5#172

97b8ea9

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021

[klas3] printout if COMMON random are used (relevant for issue madgra…

741ffc5

…ph5#172)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021

[klas3] finally keep no-cxtype_ref as default on clang (issues madgra…

db5db67

…ph5#172 madgraph5#182)

valassi mentioned this issue Apr 27, 2021

Clarify issues with cxtype_ref on gcc/clang + Test clang11.1 and clang12.0 #187

Merged

valassi closed this as completed Apr 27, 2021

valassi added question Further information is requested and removed bug Something isn't working labels Apr 27, 2021

valassi mentioned this issue Sep 18, 2024

cxtype_ref problem for some gcc versions #1004

Closed
