
Large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code: disable OMP multithreading #561

Closed
valassi opened this issue Dec 10, 2022 · 11 comments · Fixed by #562


valassi commented Dec 10, 2022

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser I have an interesting one.

I am making progress in the integration of random color and helicity. For the moment I have essentially completed the rebasing of my madgraph4gpu cudacpp patches on top of Olivier's latest upstream code (i.e. moving from the nuvecMLM to the vecMLM branch). I have two ongoing MRs on upstream mg5amcnlo and two MRs on madgraph4gpu; I will give the details.

I am rerunning my usual set of tests. The interesting, puzzling finding is the following: the Fortran ME calculation is now a factor 4 faster than it used to be. I do not think that Fortran is now magically vectorizing with SIMD (this should be checked with objdump), I would rather imagine that the algorithm for the Fortran ME calculation has changed. It may also be that I am doing the bookkeeping wrong, now that random color/helicity have moved elsewhere, but again I do not think this is the issue, as the OVERALL time taken by Fortran seems a factor 4 faster.

Note that

  • the cross section produced by Fortran with the same random seeds has now changed
  • the cudacpp implementation also produces this new cross section, which agrees with the new Fortran

This is in itself good news, as speedups are always good, but it significantly reduces the apparent advantage of C++ vectorization...

I think that we need to understand this quite well before we give the code to the experiments. In particular, it would be important to do some physics validation of the old Fortran against the new Fortran.

Another thing that would be useful to test is whether changing the vector size has any effect.

Thanks for any feedback! Andrea

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 10, 2022
STARTED AT Fri Dec  9 20:32:57 CET 2022
ENDED   AT Sat Dec 10 00:35:04 CET 2022

*** NB! A large performance speedup appears in Fortran MEs! madgraph5#561 ***

valassi commented Dec 10, 2022

See for instance

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp> git diff f8117c408 f4329daeb tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 
...
85c85
< 8192 ! Number of events in a single Fortran iteration (nb_page_loop)
---
> 8192 ! Number of events in a single Fortran iteration (VECSIZE_USED)
94c94
<  [XSECTION] nb_page_loop = 8192
---
>  [XSECTION] VECSIZE_USED = 8192
98,102c98,102
<  [XSECTION] Cross section = 2.158e-07 [2.1583423593436162E-007] fbridge_mode=0
<  [UNWEIGHT] Wrote 97 events (found 1199 events)
<  [COUNTERS] PROGRAM TOTAL          : 1228.1925s
<  [COUNTERS] Fortran Overhead ( 0 ) :    5.0023s
<  [COUNTERS] Fortran MEs      ( 1 ) : 1223.1902s for    90112 events => throughput is 7.37E+01 events/s
---
>  [XSECTION] Cross section = 2.136e-07 [2.1358448797976056E-007] fbridge_mode=0
>  [UNWEIGHT] Wrote 84 events (found 1181 events)
>  [COUNTERS] PROGRAM TOTAL          :  322.0246s
>  [COUNTERS] Fortran Overhead ( 0 ) :    4.9707s
>  [COUNTERS] Fortran MEs      ( 1 ) :  317.0539s for    90112 events => throughput is 2.84E+02 events/s
...
137c137
< 8192 ! Number of events in a single C++ or CUDA iteration (nb_page_loop)
---
> 8192 ! Number of events in a single C++ or CUDA iteration (VECSIZE_USED)
146c146
<  [XSECTION] nb_page_loop = 8192
---
>  [XSECTION] VECSIZE_USED = 8192
150,154c150,154
<  [XSECTION] Cross section = 2.158e-07 [2.1583423593436168E-007] fbridge_mode=1
<  [UNWEIGHT] Wrote 97 events (found 1199 events)
<  [COUNTERS] PROGRAM TOTAL          : 1572.6206s
<  [COUNTERS] Fortran Overhead ( 0 ) :  116.8344s
<  [COUNTERS] CudaCpp MEs      ( 2 ) : 1455.7863s for    90112 events => throughput is 6.19E+01 events/s
---
>  [XSECTION] Cross section = 2.136e-07 [2.1358448797976064E-007] fbridge_mode=1
>  [UNWEIGHT] Wrote 84 events (found 1181 events)
>  [COUNTERS] PROGRAM TOTAL          : 1568.8477s
>  [COUNTERS] Fortran Overhead ( 0 ) :  115.3590s
>  [COUNTERS] CudaCpp MEs      ( 2 ) : 1453.4886s for    90112 events => throughput is 6.20E+01 events/s
158c158
< OK! xsec from fortran (2.1583423593436162E-007) and cpp (2.1583423593436168E-007) differ by less than 2E-14 (2.220446049250313e-16)
---
> OK! xsec from fortran (2.1358448797976056E-007) and cpp (2.1358448797976064E-007) differ by less than 2E-14 (4.440892098500626e-16)

I am starting to wonder if there is an issue with helicity filtering. Maybe I made a mess with this somewhere? I had already seen something similar in the past (#419), because Fortran uses LIMHEL>0 while cudacpp uses LIMHEL=0. Maybe this is what I lost in backporting my changes...?

@valassi valassi changed the title Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code (LIMHEL issue again?) Dec 10, 2022

valassi commented Dec 10, 2022

NB: I think the issue was already present in the MR that I already merged,
#559

I will not revert that, but I should try to fix it in the second madgraph4gpu MR, which is not yet merged and still WIP.


valassi commented Dec 10, 2022

Hmm, anyway, it is unlikely that LIMHEL is the issue: in the latest code I do have LIMHEL=0.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad> grep -i limhel . -r
./SubProcesses/P1_gg_ttx/auto_dsig1.f:        IF( LIMHEL.NE.0 ) THEN
./SubProcesses/P1_gg_ttx/auto_dsig1.f:          WRITE(6,*) 'ERROR! The cudacpp bridge only supports LIMHEL=0'
./SubProcesses/P1_gg_ttx/matrix1.f:                IF (DABS(TS(I)).GT.ANS*LIMHEL/NCOMB) THEN
./SubProcesses/P1_gg_ttx/matrix1.f:     $         .GT.ANS*LIMHEL/NCOMB)) THEN
./Source/genps.inc:      REAL*8 LIMHEL
./Source/genps.inc:c     PARAMETER(LIMHEL=1e-8) ! ME threshold for helicity filtering (Fortran default)
./Source/genps.inc:      PARAMETER(LIMHEL=0) ! ME threshold for helicity filtering (force Fortran to mimic cudacpp, see #419)
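For illustration, the filtering condition visible in the matrix1.f grep above (keep a helicity combination only if its contribution exceeds ANS*LIMHEL/NCOMB) can be sketched in C++ as follows; the function name and container types are hypothetical, not taken from the actual generated code:

```cpp
#include <cmath>
#include <vector>

// Sketch of the LIMHEL helicity filter: given the per-helicity ME
// contributions ts[] and the total ans, keep helicity i only if
// |ts[i]| > ans * limhel / ncomb. With limhel = 0 (the cudacpp
// convention), any helicity with a non-zero contribution is kept.
std::vector<int> filterHelicities( const std::vector<double>& ts,
                                   double ans, double limhel )
{
  const int ncomb = static_cast<int>( ts.size() );
  std::vector<int> kept;
  for( int i = 0; i < ncomb; ++i )
    if( std::fabs( ts[i] ) > ans * limhel / ncomb ) kept.push_back( i );
  return kept;
}
```

Note that with limhel=0 the threshold is exactly zero, so helicities whose contribution is identically zero are still filtered out; a positive limhel additionally drops tiny but non-zero contributions, which is what could change the cross section.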

@oliviermattelaer

Do you generate your code with group_subprocesses=False?


valassi commented Dec 10, 2022

Hi Olivier, thanks. Hmm, no idea about group_subprocesses=False: it is not a parameter I use at all, so I imagine I keep the defaults. It is not something I explicitly changed, at least.

I am trying to debug this the good old way, just comparing the code, for gg_tt.mad. There is not that much that changed, actually...

One thing I noticed (by mistake: I was looking for code changes and the diff gave me LHE changes) is that the produced LHE files are different. The momenta, at least for the first events, are the same, as the random seeds are the same, but there are some values that differ, for example:

diff -r ../../../gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx/events.lhe ../../SubProcesses/P1_gg_ttx/events.lhe
3,6c3,6
<          21   -1    0    0  501  502  0.00000000000E+00  0.00000000000E+00  0.63830146788E+02  0.63830146788E+02  0.00000000000E+00 0.  1.
<          21   -1    0    0  502  503 -0.00000000000E+00 -0.00000000000E+00 -0.51807786912E+03  0.51807786912E+03  0.00000000000E+00 0.  1.
<           6    1    1    2  501    0  0.49855660492E+02  0.22722068674E+02 -0.20832108497E+03  0.27627622722E+03  0.17300000000E+03 0.  1.
<          -6    1    1    2    0  503 -0.49855660492E+02 -0.22722068674E+02 -0.24592663737E+03  0.30563178868E+03  0.17300000000E+03 0.  1.
---
>          21   -1    0    0  503  502  0.00000000000E+00  0.00000000000E+00  0.63830146788E+02  0.63830146788E+02  0.00000000000E+00 0. -1.
>          21   -1    0    0  501  503 -0.00000000000E+00 -0.00000000000E+00 -0.51807786912E+03  0.51807786912E+03  0.00000000000E+00 0.  1.
>           6    1    1    2  501    0  0.49855660492E+02  0.22722068674E+02 -0.20832108497E+03  0.27627622722E+03  0.17300000000E+03 0. -1.
>          -6    1    1    2    0  502 -0.49855660492E+02 -0.22722068674E+02 -0.24592663737E+03  0.30563178868E+03  0.17300000000E+03 0.  1.

What are those 0, 501, 502, 503? Those are the values that differ.

I am comparing

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> ./madevent < /tmp/avalassi/input_ggtt_x1_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0263s for     8192 events => throughput is 3.11E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> ./madevent < /tmp/avalassi/input_ggtt_x1_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0800s for     8192 events => throughput is 1.02E+05 events/s
...
cat /tmp/avalassi/input_ggtt_x1_fortran 
0 ! Fortran bridge mode (CppOnly=1, FortranOnly=0, BothQuiet=-1, BothDebug=-2)
8192 ! Number of events in a single Fortran iteration (VECSIZE_USED)
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)


valassi commented Dec 10, 2022

Hmm, I read somewhere that those 501/502 etc. are the famous color info. But then this means that the new (vecMLM) and old (nuvecMLM) code produce LHE files with different color info starting from the same random numbers?...

I mean, I understand that the way the color is computed is different, but I was imagining that the result would be the same? Or maybe the same random numbers are used for the momenta, but not for the choice of color, precisely because the algorithm had to be changed to offload it to the GPU?

@zeniheisser

As you've noticed, the differing values are colour charges (if you need to check LHE info, the original paper goes over all the fundamentals pretty well). Since colour is chosen before the ME calculation in vecMLM (whereas nuvecMLM does a full colour summation), I don't think this is too strange. @oliviermattelaer, do you agree?
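For reference, each particle line in an LHE file has 13 whitespace-separated fields, and the values that differ in the diff above sit in columns 5 and 6 (the colour-flow labels). A minimal parsing sketch, with hypothetical struct and function names:

```cpp
#include <sstream>
#include <string>

// Sketch: the 13 columns of an LHE particle line, in order:
// PDG id, status, two mother indices, two colour-flow labels
// (the 501/502/503 values seen in the diff), the four-momentum
// (px, py, pz, E), mass, lifetime, and spin.
struct LheParticle
{
  int idup, istup, mothup1, mothup2, icolup1, icolup2;
  double px, py, pz, e, m, vtimup, spinup;
};

LheParticle parseLheLine( const std::string& line )
{
  LheParticle p;
  std::istringstream is( line );
  is >> p.idup >> p.istup >> p.mothup1 >> p.mothup2 >> p.icolup1 >> p.icolup2
     >> p.px >> p.py >> p.pz >> p.e >> p.m >> p.vtimup >> p.spinup;
  return p;
}
```

So in the diff above the momenta (columns 7-10) are identical, and only the colour labels and spins differ between the two runs.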

@oliviermattelaer

It does make sense that the color is now different, since before the choice of color was done only for the event that passed the first unweighting, while now this is done for all events.

Cheers,

Olivier


valassi commented Dec 11, 2022

Thanks Zenny and Olivier!

Ok, anyway, I think I understood it now. It is not a problem of color or helicity: the issue is multithreading.

I understood it by simply running 'perf stat' on madevent, as a first step towards profiling. I noticed that the old code had a user time lower than the elapsed time, as I thought it should be, while the new code has a lower elapsed time but a user time that is higher than its elapsed time, and similar to that of the old code. This made me think of multithreading.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,740.99 msec task-clock:u              #    1.265 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             2,016      page-faults:u             #    0.539 K/sec                  
     9,784,971,216      cycles:u                  #    2.616 GHz                    
    23,180,298,587      instructions:u            #    2.37  insn per cycle         
     3,874,149,295      branches:u                # 1035.593 M/sec                  
        15,354,038      branch-misses:u           #    0.40% of all branches        

       2.957658345 seconds time elapsed

       3.675880000 seconds user
       0.070997000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.2512s for    90112 events => throughput is 3.59E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,580.76 msec task-clock:u              #    0.995 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,537      page-faults:u             #    0.429 K/sec                  
     9,379,513,277      cycles:u                  #    2.619 GHz                    
    24,067,620,536      instructions:u            #    2.57  insn per cycle         
     4,075,392,160      branches:u                # 1138.137 M/sec                  
        16,073,585      branch-misses:u           #    0.39% of all branches        

       3.597123223 seconds time elapsed

       3.520746000 seconds user
       0.062897000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.8814s for    90112 events => throughput is 1.02E+05 events/s

Indeed, it seems that OMP multithreading works reasonably well in the new Fortran version (which is good news!), while it seems to do nothing at all in the old version.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.6804s for    90112 events => throughput is 1.32E+05 events/s
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=4 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.2496s for    90112 events => throughput is 3.61E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.8811s for    90112 events => throughput is 1.02E+05 events/s
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=4 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.8811s for    90112 events => throughput is 1.02E+05 events/s

The new code actually does go 30% faster even with only one thread, but this is something I can understand if it does color/helicity in an improved way, or maybe because I do not count it in; this I need to check.

Ok, so this is understood: now I only need to make sure that the comparisons in my madgraph4gpu scripts use single-threaded Fortran. Maybe I will add the same hack I have in the standalone cudacpp executables, where an unset OMP_NUM_THREADS means 1 thread rather than "all the threads you have".
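That hack (treating an unset OMP_NUM_THREADS as 1 thread, instead of the OpenMP runtime default of all available cores) can be sketched like this; a minimal illustration with a hypothetical function name, not the actual ompnumthreads code:

```cpp
#include <cstdlib>
#include <string>

// Sketch: choose the OMP thread count, defaulting to 1 when
// OMP_NUM_THREADS is not set (the OpenMP runtime would otherwise
// typically default to all available cores). The real code would
// then pass this value to omp_set_num_threads().
int ompNumThreadsDefaultOne()
{
  const char* env = std::getenv( "OMP_NUM_THREADS" );
  if( !env || std::string( env ).empty() ) return 1; // unset => 1 thread
  const int n = std::atoi( env );
  return ( n > 0 ) ? n : 1; // fall back to 1 on invalid values
}
```

With this default, single-threaded Fortran/cudacpp throughput comparisons stay apples-to-apples unless the user explicitly asks for more threads.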


valassi commented Dec 11, 2022

For completeness, here is the new code with multithreading disabled, profiled through perf:

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,353.96 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,991      page-faults:u             #    0.594 K/sec                  
     8,777,393,997      cycles:u                  #    2.617 GHz                    
    23,094,920,632      instructions:u            #    2.63  insn per cycle         
     3,844,578,748      branches:u                # 1146.279 M/sec                  
        15,674,701      branch-misses:u           #    0.41% of all branches        

       3.368614834 seconds time elapsed

       3.291551000 seconds user
       0.065010000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.6802s for    90112 events => throughput is 1.32E+05 events/s

@valassi valassi changed the title Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code (LIMHEL issue again?) Large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code: disable OMP multithreading Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022
…ans 1 thread" feature in fortran madevent

Previously this was hardcoded only inside the body of check_sa.cc, move it to ompnumthreads.h/cc

This should remove the ~factor x4 speedup observed in fortran between nuvecMLM and vecMLM madgraph5#561
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022

valassi commented Dec 11, 2022

I have changed madevent in madgraph4gpu to use 1 thread by default, as in check_sa.cc; this is in the upcoming MR #562. I have also added printouts of the number of threads in the tmad scripts.

This can be closed now.

@valassi valassi closed this as completed Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
… fix build error about missing intel_fast_copy

This and previous timermap/driver/ompnumthreads fixes for openmp are part of the madgraph5#561 patch
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023