Validate clang-style "no cxtype ref" vectorization and use it as default #172

Closed
valassi opened this issue Apr 23, 2021 · 5 comments
Assignees: valassi
Labels: question (Further information is requested)

valassi (Member) commented Apr 23, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

There are presently two slightly different vectorisation implementations:

  • the original one developed on gcc
  • a recent one where I had to tweak a few things for clang

In both implementations (take double precision with AVX2, i.e. 4 doubles per vector, as an example):

  • floating-point numbers are 4-vectors FFFF
  • complex numbers are implemented as two floating-point vectors RRRRIIII (and not as RIRIRIRI)
  • compiler vector extensions (those of gcc, or those of clang) are used for the vectors (see the sketch below)
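
For illustration, here is a minimal sketch of this layout (not the actual madgraph4gpu code, just the idea) for double precision with AVX2, using the gcc/clang compiler vector extensions; the names fptype_v and cxtype_v mirror those used in the project, but the details are simplified:

  typedef double fptype; // double precision in this example

  // gcc/clang compiler vector extension: 4 doubles per vector with AVX2 (256 bits), i.e. FFFF
  typedef fptype fptype_v __attribute__ ((vector_size (4 * sizeof(fptype))));

  // A "vector" of 4 complex numbers stored as two fptype_v, i.e. RRRRIIII (not RIRIRIRI)
  struct cxtype_v
  {
    fptype_v m_real; // RRRR: the 4 real parts
    fptype_v m_imag; // IIII: the 4 imaginary parts
  };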

A small difference between the two implementations is the following:

  • the original gcc version introduces a "cxtype_ref" class that is only a wrapper around two non-const floating-point references, so that cxtype_v[0] returns a reference to one R and one I within the RRRRIIII layout: this is the operator[],
    cxtype_ref operator[]( size_t i ) const { return cxtype_ref( m_real[i], m_imag[i] ); }
  • in clang this is not possible because a "non-const reference cannot bind to vector element"
  • initially I tried, in clang, an operator[] returning a pair of values rather than a pair of references; however, this led to wrong results (only in the testxxx tests, not in the eemumu ME averages??)... the issue is that in some places it was still being used as one would use a non-const reference (and I was surprised that the code built at all)
  • anyway, in the end, I realised that this operator[] was really only needed in a minimal number of places, so I built a clang implementation where there is no such operator[]
  • now all the tests pass; this implementation also works on gcc and it is much simpler, so I'd like to use it as the default, removing cxtype_ref (see the sketch below)
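
As an illustration of the difference, here is a hedged sketch (again, not the project's actual code; names and details are simplified) of the reference-returning operator[] that gcc accepts but clang rejects, and of the clang alternative that simply drops it:

  #include <complex>
  #include <cstddef>

  typedef double fptype;
  typedef std::complex<fptype> cxtype; // scalar complex type (illustrative)
  typedef fptype fptype_v __attribute__ ((vector_size (4 * sizeof(fptype)))); // 4 doubles (AVX2)

  // Wrapper around two non-const references into the RRRR and IIII vectors (gcc-only approach)
  class cxtype_ref
  {
  public:
    cxtype_ref( fptype& r, fptype& i ) : m_real( r ), m_imag( i ) {}
    cxtype_ref& operator=( const cxtype& c ) { m_real = c.real(); m_imag = c.imag(); return *this; } // write through the references
    operator cxtype() const { return cxtype( m_real, m_imag ); } // read access
  private:
    fptype& m_real;
    fptype& m_imag;
  };

  // Vector of 4 complex numbers stored as RRRRIIII
  struct cxtype_v
  {
    fptype_v m_real; // RRRR
    fptype_v m_imag; // IIII
  #ifndef __clang__
    // gcc accepts this: a vector element can bind to a non-const reference,
    // so "v[i] = cxtype( re, im )" writes into m_real and m_imag
    cxtype_ref operator[]( std::size_t i ) { return cxtype_ref( m_real[i], m_imag[i] ); }
  #else
    // clang rejects the gcc line above ("non-const reference cannot bind to vector element").
    // A value-returning operator[] would still compile when used on the left of "=",
    // but would only modify a temporary (hence the wrong results mentioned above);
    // the clang implementation therefore has no operator[] at all, and the few places
    // that needed write access assign to m_real[i] and m_imag[i] directly.
  #endif
  };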

However:

  • I observed that the average ME that is printed out is different whether this implementation is used or not
  • What is really puzzling is that this is true also for the 'none' implementation (where there are no vectors at all, no compiler vector extensions, no need for any such thing)

So, this issue is just about understanding if there is a bug and where. Maybe I just read the results in the wrong way and there is no issue.

valassi added the 'bug (Something isn't working)' label Apr 23, 2021
valassi self-assigned this Apr 23, 2021
valassi (Member Author) commented Apr 23, 2021

Compare the commit logs of these two commits

With cxtype_ref, gcc/double
8edae31

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

Without cxtype_ref, gcc/double
4d6870d

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.003754 sec
real    0m7.011s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.148541 sec
real    0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.705370 sec
real    0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The relevant lines are

MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0

They should be strictly identical, not just statistically compatible.

valassi (Member Author) commented Apr 23, 2021

A small comment: I was so sure that this should not make a difference in the 'none' implementation that I did not print out the tag "[cxtype_ref=YES]" or "[cxtype_ref=NO]" in that case. Maybe better to add it. Well, one of the things to cross-check...

valassi (Member Author) commented Apr 27, 2021

This is peculiar. I cannot reproduce it.

I went back to 4d6870d, which for gcc was giving 1.372113e-02; I now instead get the expected 1.371706e-02... I also checked the same commit with clang, and there I do get 1.372113e-02 (and the printout says clang, so it's not a mismatch in the compiler printout).

Did I mix random numbers from two compilers?...

In any case, with the current latest master da19d3c, I get the expected 1.371706e-02 on gcc and 1.372113e-02 on clang.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
…is invalid!)

The problem was that with clang11/12 I cannot use CUDA, hence I was
using common random numbers. With clang10 I get the usual results.

On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.216996e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.534132 sec
real    0m7.546s
=Symbols in CPPProcess.o= (~sse4: 1326) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.355185e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.935344 sec
real    0m1.272s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.591017e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.842237 sec
real    0m4.854s
=Symbols in CPPProcess.o= (~sse4: 3607) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.121843e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.654099 sec
real    0m3.666s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3023) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.101228e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.639943 sec
real    0m3.651s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2735) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.731910e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.072267 sec
real    0m4.084s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3524) (512y:    0) (512z: 1164)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 10.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.206297e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.639024 sec
real    0m7.650s
=Symbols in CPPProcess.o= (~sse4: 1238) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470097e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.743197 sec
real    0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
…madgraph5#172)

With clang11 I must use common random numbers and I get a different result.

The performance is the same as clang10. This is the new clang 11.1 (issue madgraph5#182)

On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.224632e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.318761 sec
real    0m7.331s
=Symbols in CPPProcess.o= (~sse4: 1259) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.650708e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.619855 sec
real    0m4.629s
=Symbols in CPPProcess.o= (~sse4: 3608) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.143089e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.441790 sec
real    0m3.451s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3005) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.118344e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.454231 sec
real    0m3.464s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.716717e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.919487 sec
real    0m3.929s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3552) (512y:    0) (512z: 1196)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.225500e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.380646 sec
real    0m7.390s
=Symbols in CPPProcess.o= (~sse4: 1166) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
…madgraph5#172)

With clang12 I must use common random numbers and I get a different result.

The performance with the new clang 12.0 (issue madgraph5#182) is worse than with clang 10 or 11.

Note that 512y is very slightly better than avx2 with clang12.
It is very slightly worse in clang10 and clang11.
Will probably use 512y as default also in clang for simplicity?

On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.227871e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.514820 sec
real    0m7.525s
=Symbols in CPPProcess.o= (~sse4: 1234) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.588040e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.794895 sec
real    0m4.805s
=Symbols in CPPProcess.o= (~sse4: 3664) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.458991e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.780732 sec
real    0m3.791s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3307) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.465177e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.777371 sec
real    0m3.787s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2983) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.349914e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.251994 sec
real    0m4.262s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3978) (512y:    0) (512z: 1183)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 12.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.221702e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.531229 sec
real    0m7.541s
=Symbols in CPPProcess.o= (~sse4: 1120) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.379180e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.732928 sec
real    0m1.025s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 27, 2021
…5#182)"

This reverts commit 50c12e5.

Keep AVX2 on clang for the moment as this is actually faster than gcc!

Latest baseline performance on gcc with cxtype_ref (issue madgraph5#172):

On itscrd70.cern.ch:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306401e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.185372 sec
real    0m7.195s
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.265629e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.742860 sec
real    0m1.035s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.504491e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.867292 sec
real    0m4.878s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.592850e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.656376 sec
real    0m3.666s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.916156e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.581543 sec
real    0m3.591s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.705860e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.008144 sec
real    0m4.018s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147166e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.840836 sec
real    0m7.850s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.417572e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.735049 sec
real    0m1.027s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
-------------------------------------------------------------------------
valassi (Member Author) commented Apr 27, 2021

This is completely understood now, there is no bug. The problem is that clang11 and clang12 are not supported by cuda11, so in that case I build with common random numbers, which of course give different physics. I checked that the first 32 random numbers were different and that the curand seeds are not used, so then it was obvious... It is not clear why I saw it in gcc at some point; maybe I built with my usual "export CUDA_HOME=invalid" hack that I need on clang 11 and 12.
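
For context, a minimal sketch of the kind of build-time switch involved (illustrative only: the macro name MGONGPU_COMMONRAND_ONHOST and the function below are simplified assumptions, not the repo's exact code). The point is simply that the host-side 'common' engine and curand produce different sequences, hence different mean MEs, without either being wrong:

  #include <random>
  #include <vector>

  // Fill rnd with uniform numbers in [0,1). Which engine is used depends on the build:
  // with clang11/12, CUDA cannot be built, so a host-side "common" engine replaces curand.
  void generateRandomNumbers( std::vector<double>& rnd )
  {
  #ifdef MGONGPU_COMMONRAND_ONHOST
    // Common random numbers: a plain C++ engine on the host (fixed seed, illustrative)
    std::mt19937_64 engine( 20210427 );
    std::uniform_real_distribution<double> dist( 0., 1. );
    for ( auto& r : rnd ) r = dist( engine );
  #else
    // Default build: curand generates the numbers on the GPU (not shown here).
    // The two engines produce different sequences for the same number of events,
    // so the MeanMatrixElemValue differs even though neither result is wrong.
    (void)rnd;
  #endif
  }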

About the second issue, whether the clang version can be used in production also for gcc: this is now validated, and one could use that version. However, it is very slightly slower (a few per mille), and I like the original operator[] idea. I will keep things as they are for the moment.

En passant, I have validated the latest cvmfs installs of clang11.1 and clang12.0 in issue #182.

In PR #187 I have committed a few tests and minor patches.

This can be closed. Not a bug.

valassi closed this as completed Apr 27, 2021
valassi added the 'question (Further information is requested)' label and removed the 'bug (Something isn't working)' label Apr 27, 2021
valassi (Member Author) commented Sep 18, 2024

See additional comments in #1004. There were issues in the bracket implementation on gcc14.2 (now fixed), so one can ask whether we should use the 'clang' no-bracket version also in gcc. I still prefer to keep the bracket version in gcc for now.
