Large (x4) speedup in Fortran MEs with Olivier's random helicity/color vecMLM code: disable OMP multithreading #561
Comments
STARTED AT Fri Dec 9 20:32:57 CET 2022 ENDED AT Sat Dec 10 00:35:04 CET 2022 *** NB! A large performance speedup appears in Fortran MEs! madgraph5#561 ***
See for instance
I am starting to wonder if there is an issue with helicity filtering. Maybe I made a mess with this somewhere? I had already seen this in the past (#419), because Fortran uses LIMHEL>0 while cudacpp uses LIMHEL=0. Maybe this is what I lost in backporting my changes...?
NB: I think the issue was already in the MR that I already merged. I will not unmerge that, but I should try to fix it in the second madgraph4gpu MR, which is not yet merged and still WIP.
Hm, anyway it is unlikely that LIMHEL is the issue: in the latest code I do have LIMHEL=0.
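For context, here is a conceptual sketch of what a LIMHEL-style helicity filter does; this is an illustration of the general idea only, not the actual madevent implementation, and the precise threshold criterion (relative to the summed |ME|^2 over a warm-up sample) is an assumption made here for the example. With LIMHEL=0 every helicity combination is kept.

```cpp
// Illustration only (NOT the MadEvent code): keep a helicity combination
// only if its share of the summed |ME|^2 over a warm-up sample exceeds a
// limhel-like threshold; limhel=0 keeps every combination (no filtering).
#include <cstddef>
#include <vector>

std::vector<int> selectGoodHelicities( const std::vector<double>& me2PerHelicity,
                                       double limhel )
{
  double total = 0.;
  for ( double me2 : me2PerHelicity ) total += me2;
  std::vector<int> goodHel;
  for ( std::size_t ihel = 0; ihel < me2PerHelicity.size(); ++ihel )
    if ( total == 0. || me2PerHelicity[ihel] > limhel * total )
      goodHel.push_back( static_cast<int>( ihel ) );
  return goodHel;
}
```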
Do you generate your code with group_subprocesses=False?
Hi Olivier, thanks. Hm, no idea about group_subprocesses=False. I do not think it is a parameter I use at all, so I imagine I keep the defaults; not something I explicitly changed, at least. I am trying to debug this the good old way, just comparing the code, for gg_tt.mad. There is not that much that changed, actually... One thing I noticed (by mistake: I was looking for code changes and the diff gave me lhe changes) is that the produced lhe files are different. The momenta, at least for the first events, are the same, as the random seeds are the same, but there are some values that differ, for example:
What are those 0, 501, 502, 503? Those are the values that differ. I am comparing
Hm, I read somewhere that those 501/502 etc. are the famous color info. But then this means that the new (vecMLM) and old (nuvecMLM) code produce lhe files with different color info starting from the same random numbers?... I mean, I understand that the way the color is computed is different, but I was imagining that the result would be the same? Or maybe the same random numbers are used for momenta, but not for the choice of color, precisely because the algorithm had to be changed to offload it to GPU?
As you've noticed, the differing values are colour charges (if you need to check LHE info, the original paper goes over all the fundamentals pretty well). Since colour is chosen before the ME calculation in vecMLM (whereas nuvecMLM does a full colour summation), I don't think this is too strange. @oliviermattelaer, do you agree?
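For reference, a minimal sketch of how those columns map onto a Les Houches Event particle line (only the field order follows the accord; the numerical values in the example line below are invented for illustration): the fifth and sixth integers are ICOLUP(1)/ICOLUP(2), the colour and anticolour flow tags such as 501/502, with 0 meaning no colour connection.

```cpp
// Illustration only: parse one LHE <event> particle line and print its
// colour-flow tags. Field order per the Les Houches Accord:
// IDUP ISTUP MOTHUP1 MOTHUP2 ICOLUP1 ICOLUP2 px py pz E m VTIMUP SPINUP.
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  // Invented example: an incoming gluon carrying colour tags 501/502
  const std::string line = " 21 -1 0 0 501 502 0. 0. 250. 250. 0. 0. 9.";
  std::istringstream iss( line );
  long idup; int istup, moth1, moth2, icolup1, icolup2;
  iss >> idup >> istup >> moth1 >> moth2 >> icolup1 >> icolup2;
  std::cout << "PDG id " << idup << ": colour tag " << icolup1
            << ", anticolour tag " << icolup2 << std::endl;
  return 0;
}
```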
It does make sense that the color is now different, since before the choice of color was done only for the event that passed the first un-weighting, and now this is done for all events. Cheers, Olivier
Thanks Zenny and Olivier! Ok, anyway, I think I understood it now. It is not a problem of color or helicity: the issue is multithreading. I understood it by just running 'perf stat' on madevent as a first step towards profiling. I noticed that the old code had a user time lower than the elapsed time, as I thought it should be, while the new code has a lower elapsed time but a user time higher than elapsed, and similar to that of the old code. This made me think of multithreading.
Indeed, it seems that OMP multithreading works reasonably well in the new Fortran version (which is good news!), while it seems to do nothing at all in the old version.
The new code actually does go 30% faster even with only one thread, but this is something I can understand if it handles color/helicity in an improved way - or maybe because I do not count it in; this I need to check. Ok, so this is understood: now I only need to make sure the comparisons in my madgraph4gpu scripts use single-threaded Fortran. Maybe I will add the same hack I have in the SA cudacpp executables, that OMP_NUM_THREADS not set means 1 thread rather than "all the threads you have" (a sketch of that hack follows below).
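For what it's worth, a minimal sketch of what such a hook could look like; the helper name and placement are illustrative and not necessarily what check_sa.cc or ompnumthreads.h/cc actually use. The idea is simply: if OMP_NUM_THREADS is not set in the environment, force a single OpenMP thread instead of letting the runtime default to all available cores.

```cpp
// Illustration only: default to 1 OpenMP thread when OMP_NUM_THREADS is unset,
// instead of the OpenMP runtime default of "all available cores".
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

inline void ompNotSetMeansOneThread()
{
#ifdef _OPENMP
  if ( std::getenv( "OMP_NUM_THREADS" ) == nullptr ) // no explicit user request
    omp_set_num_threads( 1 );                        // run single-threaded
#endif
}
```

Called once early in the program, before any OpenMP parallel region, this makes single-threaded execution the default for timing comparisons while still honouring an explicit OMP_NUM_THREADS setting.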
For completeness, new code with MT disabled, through perf
…ans 1 thread" feature in fortran madevent Previously this was hardcoded only inside the body of check_sa.cc, move it to ompnumthreads.h/cc This should remove the ~factor x4 speedup observed in fortran between nuvecMLM and vecMLM madgraph5#561
…(excluding patch.*)
I have changed madevent in madgraph4gpu to use 1 thread by default, as in check_sa.cc; this is in the upcoming MR #562. I have also added printouts of the number of threads in the tmad scripts. This can be closed now.
… fix build error about missing intel_fast_copy This and previous timermap/driver/ompnumthreads fixes for openmp are part of the madgraph5#561 patch
…om gg_tt.mad (excluding patch.*)
Hi @oliviermattelaer @roiser @hageboeck @zeniheisser I have an interesting one.
I am making progress in the integration of random color and helicity. For the moment I have essentially completed the rebasing of my madgraph4gpu cudacpp patches on top of Olivier's latest upstream code (i.e. moving from the nuvecMLM to the vecMLM branches). I have two ongoing MRs on upstream mg5amcnlo and two MRs on madgraph4gpu; I will give the details.
I am rerunning my usual set of tests. The interesting, puzzling finding is the following: the Fortran ME calculation is now a factor 4 faster than it used to be. I do not think that Fortran is now magically vectorizing with SIMD (this should be checked with objdump); I would rather imagine that the algorithm for the Fortran ME calculation has changed. It may also be that I am doing the "bookkeeping" wrong, now that random color/helicity have moved elsewhere, but again I do not think this is the issue, as the OVERALL time taken by Fortran seems a factor 4 faster.
Note that
This is in itself good news, as speedups are always good, but it also significantly reduces the relative interest of C++ vectorization...
I think that we need to understand this quite well before we give the code to the experiments? It would be especially important to do some physics validation of the old Fortran vs the new Fortran.
Another thing that would be useful to test is whether changing the vector size has any effect.
Thanks for any feedback! Andrea