
Large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code: disable OMP multithreading #561

Closed
valassi opened this issue Dec 10, 2022 · 11 comments · Fixed by #562


valassi commented Dec 10, 2022

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser I have an interesting one.

I am making progress in the integration of random color and helicity. For the moment I have essentially completed the rebasing of my madgraph4gpu cudacpp patches on top of Olivier's latest upstream code (i.e. moving from the nuvecMLM to the vecMLM branch). I have two ongoing MRs on upstream mg5amcnlo and two MRs on madgraph4gpu; I will give the details.

I am rerunning my usual set of tests. The interesting, puzzling finding is the following: the Fortran ME calculation is now a factor 4 faster than it used to be. I do not think that Fortran is now magically vectorizing with SIMD (this should be checked with objdump), I would rather imagine that the algorithm for the Fortran ME calculation has changed. It may also be that I am doing the bookkeeping wrong, now that random color/helicity have moved elsewhere, but again I do not think this is the issue, as the OVERALL time taken by Fortran seems a factor 4 faster.

Note that

  • the cross section produced by Fortran with the same random seeds has now changed
  • the cudacpp implementation also produces this new cross section, which agrees with the new Fortran

This is in itself good news, as speedups are always good, but it significantly reduces the apparent advantage of C++ vectorization...

I think that we need to understand this quite well before we give the code to the experiments. In particular, it would be important to do some physics validation of the old Fortran against the new Fortran.

Another thing that would be useful to test is whether changing the vector size has any effect.

Thanks for any feedback! Andrea

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 10, 2022
STARTED AT Fri Dec  9 20:32:57 CET 2022
ENDED   AT Sat Dec 10 00:35:04 CET 2022

*** NB! A large performance speedup appears in Fortran MEs! madgraph5#561 ***

valassi commented Dec 10, 2022

See for instance

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp> git diff f8117c408 f4329daeb tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 
...
85c85
< 8192 ! Number of events in a single Fortran iteration (nb_page_loop)
---
> 8192 ! Number of events in a single Fortran iteration (VECSIZE_USED)
94c94
<  [XSECTION] nb_page_loop = 8192
---
>  [XSECTION] VECSIZE_USED = 8192
98,102c98,102
<  [XSECTION] Cross section = 2.158e-07 [2.1583423593436162E-007] fbridge_mode=0
<  [UNWEIGHT] Wrote 97 events (found 1199 events)
<  [COUNTERS] PROGRAM TOTAL          : 1228.1925s
<  [COUNTERS] Fortran Overhead ( 0 ) :    5.0023s
<  [COUNTERS] Fortran MEs      ( 1 ) : 1223.1902s for    90112 events => throughput is 7.37E+01 events/s
---
>  [XSECTION] Cross section = 2.136e-07 [2.1358448797976056E-007] fbridge_mode=0
>  [UNWEIGHT] Wrote 84 events (found 1181 events)
>  [COUNTERS] PROGRAM TOTAL          :  322.0246s
>  [COUNTERS] Fortran Overhead ( 0 ) :    4.9707s
>  [COUNTERS] Fortran MEs      ( 1 ) :  317.0539s for    90112 events => throughput is 2.84E+02 events/s
...
137c137
< 8192 ! Number of events in a single C++ or CUDA iteration (nb_page_loop)
---
> 8192 ! Number of events in a single C++ or CUDA iteration (VECSIZE_USED)
146c146
<  [XSECTION] nb_page_loop = 8192
---
>  [XSECTION] VECSIZE_USED = 8192
150,154c150,154
<  [XSECTION] Cross section = 2.158e-07 [2.1583423593436168E-007] fbridge_mode=1
<  [UNWEIGHT] Wrote 97 events (found 1199 events)
<  [COUNTERS] PROGRAM TOTAL          : 1572.6206s
<  [COUNTERS] Fortran Overhead ( 0 ) :  116.8344s
<  [COUNTERS] CudaCpp MEs      ( 2 ) : 1455.7863s for    90112 events => throughput is 6.19E+01 events/s
---
>  [XSECTION] Cross section = 2.136e-07 [2.1358448797976064E-007] fbridge_mode=1
>  [UNWEIGHT] Wrote 84 events (found 1181 events)
>  [COUNTERS] PROGRAM TOTAL          : 1568.8477s
>  [COUNTERS] Fortran Overhead ( 0 ) :  115.3590s
>  [COUNTERS] CudaCpp MEs      ( 2 ) : 1453.4886s for    90112 events => throughput is 6.20E+01 events/s
158c158
< OK! xsec from fortran (2.1583423593436162E-007) and cpp (2.1583423593436168E-007) differ by less than 2E-14 (2.220446049250313e-16)
---
> OK! xsec from fortran (2.1358448797976056E-007) and cpp (2.1358448797976064E-007) differ by less than 2E-14 (4.440892098500626e-16)

I am starting to wonder if there is an issue with helicity filtering. Maybe I made a mess with this somewhere? I had already seen something similar in the past (#419), because Fortran uses LIMHEL>0 while cudacpp uses LIMHEL=0. Maybe this is what I lost in backporting my changes...?

@valassi valassi changed the title Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code (LIMHEL issue again?) Dec 10, 2022

valassi commented Dec 10, 2022

NB: I think the issue was already present in the MR that I already merged,
#559

I will not revert that, but I should try to fix it in the second madgraph4gpu MR, which is not yet merged and still WIP.


valassi commented Dec 10, 2022

Hmm, anyway, it is unlikely that LIMHEL is the issue: in the latest code I do have LIMHEL=0.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad> grep -i limhel . -r
./SubProcesses/P1_gg_ttx/auto_dsig1.f:        IF( LIMHEL.NE.0 ) THEN
./SubProcesses/P1_gg_ttx/auto_dsig1.f:          WRITE(6,*) 'ERROR! The cudacpp bridge only supports LIMHEL=0'
./SubProcesses/P1_gg_ttx/matrix1.f:                IF (DABS(TS(I)).GT.ANS*LIMHEL/NCOMB) THEN
./SubProcesses/P1_gg_ttx/matrix1.f:     $         .GT.ANS*LIMHEL/NCOMB)) THEN
./Source/genps.inc:      REAL*8 LIMHEL
./Source/genps.inc:c     PARAMETER(LIMHEL=1e-8) ! ME threshold for helicity filtering (Fortran default)
./Source/genps.inc:      PARAMETER(LIMHEL=0) ! ME threshold for helicity filtering (force Fortran to mimic cudacpp, see #419)
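For illustration, the filtering condition visible in the matrix1.f grep above (keep a helicity combination only if its contribution exceeds ANS*LIMHEL/NCOMB) can be sketched in C++ as follows; the function name and container types are hypothetical, not taken from the actual generated code:

```cpp
#include <cmath>
#include <vector>

// Sketch of the LIMHEL helicity filter: given the per-helicity ME
// contributions ts[] and the total ans, keep helicity i only if
// |ts[i]| > ans * limhel / ncomb. With limhel = 0 (the cudacpp
// convention), any helicity with a non-zero contribution is kept.
std::vector<int> filterHelicities( const std::vector<double>& ts,
                                   double ans, double limhel )
{
  const int ncomb = static_cast<int>( ts.size() );
  std::vector<int> kept;
  for( int i = 0; i < ncomb; ++i )
    if( std::fabs( ts[i] ) > ans * limhel / ncomb ) kept.push_back( i );
  return kept;
}
```

Note that with limhel=0 the threshold is exactly zero, so helicities whose contribution is identically zero are still filtered out; a positive limhel additionally drops tiny but non-zero contributions, which is what could change the cross section.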

@oliviermattelaer

Do you generate your code with group_subprocesses=False?


valassi commented Dec 10, 2022

Hi Olivier, thanks. Hmm, no idea about group_subprocesses=False: it is not a parameter I use at all, so I imagine I keep the defaults. It is not something I explicitly changed, at least.

I am trying to debug this the good old way, just comparing the code, for gg_tt.mad. There is not that much that changed, actually...

One thing I noticed (by mistake: I was looking for code changes and the diff gave me LHE changes) is that the produced LHE files are different. The momenta, at least for the first events, are the same, as the random seeds are the same, but there are some values that differ, for example:

diff -r ../../../gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx/events.lhe ../../SubProcesses/P1_gg_ttx/events.lhe
3,6c3,6
<          21   -1    0    0  501  502  0.00000000000E+00  0.00000000000E+00  0.63830146788E+02  0.63830146788E+02  0.00000000000E+00 0.  1.
<          21   -1    0    0  502  503 -0.00000000000E+00 -0.00000000000E+00 -0.51807786912E+03  0.51807786912E+03  0.00000000000E+00 0.  1.
<           6    1    1    2  501    0  0.49855660492E+02  0.22722068674E+02 -0.20832108497E+03  0.27627622722E+03  0.17300000000E+03 0.  1.
<          -6    1    1    2    0  503 -0.49855660492E+02 -0.22722068674E+02 -0.24592663737E+03  0.30563178868E+03  0.17300000000E+03 0.  1.
---
>          21   -1    0    0  503  502  0.00000000000E+00  0.00000000000E+00  0.63830146788E+02  0.63830146788E+02  0.00000000000E+00 0. -1.
>          21   -1    0    0  501  503 -0.00000000000E+00 -0.00000000000E+00 -0.51807786912E+03  0.51807786912E+03  0.00000000000E+00 0.  1.
>           6    1    1    2  501    0  0.49855660492E+02  0.22722068674E+02 -0.20832108497E+03  0.27627622722E+03  0.17300000000E+03 0. -1.
>          -6    1    1    2    0  502 -0.49855660492E+02 -0.22722068674E+02 -0.24592663737E+03  0.30563178868E+03  0.17300000000E+03 0.  1.

What are those 0, 501, 502, 503? Those are the values that differ.

I am comparing

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> ./madevent < /tmp/avalassi/input_ggtt_x1_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0263s for     8192 events => throughput is 3.11E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> ./madevent < /tmp/avalassi/input_ggtt_x1_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0800s for     8192 events => throughput is 1.02E+05 events/s
...
cat /tmp/avalassi/input_ggtt_x1_fortran 
0 ! Fortran bridge mode (CppOnly=1, FortranOnly=0, BothQuiet=-1, BothDebug=-2)
8192 ! Number of events in a single Fortran iteration (VECSIZE_USED)
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)


valassi commented Dec 10, 2022

Hmm, I read somewhere that those 501/502 etc. are the famous color info. But then this means that the new (vecMLM) and old (nuvecMLM) code produce LHE files with different color info starting from the same random numbers?...

I mean, I understand that the way the color is computed is different, but I was imagining that the result would be the same? Or maybe the same random numbers are used for the momenta, but not for the choice of color, precisely because the algorithm had to be changed to offload it to the GPU?

@zeniheisser

As you've noticed, the differing values are colour charges (if you need to check LHE info, the original paper goes over all the fundamentals pretty well). Since colour is chosen before the ME calculation in vecMLM (whereas nuvecMLM does a full colour summation), I don't think this is too strange. @oliviermattelaer, do you agree?
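For reference, each particle line in an LHE file has 13 whitespace-separated fields, and the values that differ in the diff above sit in columns 5 and 6 (the colour-flow labels). A minimal parsing sketch, with hypothetical struct and function names:

```cpp
#include <sstream>
#include <string>

// Sketch: the 13 columns of an LHE particle line, in order:
// PDG id, status, two mother indices, two colour-flow labels
// (the 501/502/503 values seen in the diff), the four-momentum
// (px, py, pz, E), mass, lifetime, and spin.
struct LheParticle
{
  int idup, istup, mothup1, mothup2, icolup1, icolup2;
  double px, py, pz, e, m, vtimup, spinup;
};

LheParticle parseLheLine( const std::string& line )
{
  LheParticle p;
  std::istringstream is( line );
  is >> p.idup >> p.istup >> p.mothup1 >> p.mothup2 >> p.icolup1 >> p.icolup2
     >> p.px >> p.py >> p.pz >> p.e >> p.m >> p.vtimup >> p.spinup;
  return p;
}
```

So in the diff above the momenta (columns 7-10) are identical, and only the colour labels and spins differ between the two runs.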

@oliviermattelaer

It does make sense that the color is now different, since before the choice of color was done only for the event that passed the first unweighting, while now this is done for all events.

Cheers,

Olivier


valassi commented Dec 11, 2022

Thanks Zenny and Olivier!

Ok, anyway, I think I understood it now. It is not a problem of color or helicity: the issue is multithreading.

I understood it by simply running 'perf stat' on madevent, as a first step towards profiling. I noticed that the old code had a user time lower than the elapsed time, as I thought it should be, while the new code has a lower elapsed time but a user time that is higher than its elapsed time, and similar to that of the old code. This made me think of multithreading.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,740.99 msec task-clock:u              #    1.265 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             2,016      page-faults:u             #    0.539 K/sec                  
     9,784,971,216      cycles:u                  #    2.616 GHz                    
    23,180,298,587      instructions:u            #    2.37  insn per cycle         
     3,874,149,295      branches:u                # 1035.593 M/sec                  
        15,354,038      branch-misses:u           #    0.40% of all branches        

       2.957658345 seconds time elapsed

       3.675880000 seconds user
       0.070997000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.2512s for    90112 events => throughput is 3.59E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,580.76 msec task-clock:u              #    0.995 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,537      page-faults:u             #    0.429 K/sec                  
     9,379,513,277      cycles:u                  #    2.619 GHz                    
    24,067,620,536      instructions:u            #    2.57  insn per cycle         
     4,075,392,160      branches:u                # 1138.137 M/sec                  
        16,073,585      branch-misses:u           #    0.39% of all branches        

       3.597123223 seconds time elapsed

       3.520746000 seconds user
       0.062897000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.8814s for    90112 events => throughput is 1.02E+05 events/s

Indeed, it seems that OMP multithreading works reasonably well in the new Fortran version (which is good news!), while it seems to do nothing at all in the old version.

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.6804s for    90112 events => throughput is 1.32E+05 events/s
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=4 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.2496s for    90112 events => throughput is 3.61E+05 events/s
...
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.8811s for    90112 events => throughput is 1.02E+05 events/s
[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.NEW/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=4 ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1
 [COUNTERS] Fortran MEs      ( 1 ) :    0.8811s for    90112 events => throughput is 1.02E+05 events/s

The new code actually does go 30% faster even with only one thread, but this is something I can understand if it does color/helicity in an improved way, or maybe because I do not count it in; this I need to check.

Ok, so this is understood: now I only need to make sure that the comparisons in my madgraph4gpu scripts use single-threaded Fortran. Maybe I will add the same hack I have in the standalone cudacpp executables, where an unset OMP_NUM_THREADS means 1 thread rather than "all the threads you have".
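That hack (treating an unset OMP_NUM_THREADS as 1 thread, instead of the OpenMP runtime default of all available cores) can be sketched like this; a minimal illustration with a hypothetical function name, not the actual ompnumthreads code:

```cpp
#include <cstdlib>
#include <string>

// Sketch: choose the OMP thread count, defaulting to 1 when
// OMP_NUM_THREADS is not set (the OpenMP runtime would otherwise
// typically default to all available cores). The real code would
// then pass this value to omp_set_num_threads().
int ompNumThreadsDefaultOne()
{
  const char* env = std::getenv( "OMP_NUM_THREADS" );
  if( !env || std::string( env ).empty() ) return 1; // unset => 1 thread
  const int n = std::atoi( env );
  return ( n > 0 ) ? n : 1; // fall back to 1 on invalid values
}
```

With this default, single-threaded Fortran/cudacpp throughput comparisons stay apples-to-apples unless the user explicitly asks for more threads.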


valassi commented Dec 11, 2022

For completeness, here is the new code with multithreading disabled, profiled through perf:

[avalassi@itscrd70 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt.mad.upmaster/SubProcesses/P1_gg_ttx> OMP_NUM_THREADS=1 perf stat ./madevent < /tmp/avalassi/input_ggtt_x10_fortran | tail -1

 Performance counter stats for './madevent':

          3,353.96 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,991      page-faults:u             #    0.594 K/sec                  
     8,777,393,997      cycles:u                  #    2.617 GHz                    
    23,094,920,632      instructions:u            #    2.63  insn per cycle         
     3,844,578,748      branches:u                # 1146.279 M/sec                  
        15,674,701      branch-misses:u           #    0.41% of all branches        

       3.368614834 seconds time elapsed

       3.291551000 seconds user
       0.065010000 seconds sys


 [COUNTERS] Fortran MEs      ( 1 ) :    0.6802s for    90112 events => throughput is 1.32E+05 events/s

@valassi valassi changed the title Understand the large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code (LIMHEL issue again?) Large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code: disable OMP multithreading Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022
…ans 1 thread" feature in fortran madevent

Previously this was hardcoded only inside the body of check_sa.cc, move it to ompnumthreads.h/cc

This should remove the ~factor x4 speedup observed in fortran between nuvecMLM and vecMLM madgraph5#561
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022

valassi commented Dec 11, 2022

I have changed madevent in madgraph4gpu to use 1 thread by default, as in check_sa.cc; this is in the upcoming MR #562. I have also added printouts of the number of threads in the tmad scripts.

This can be closed now.

@valassi valassi closed this as completed Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 11, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
… fix build error about missing intel_fast_copy

This and previous timermap/driver/ompnumthreads fixes for openmp are part of the madgraph5#561 patch
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023