Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW) #783
Comments
…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)

STARTED AT Sun Oct 29 10:38:58 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Oct 29 11:10:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Oct 29 11:22:39 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sun Oct 29 11:31:39 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sun Oct 29 11:34:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sun Oct 29 11:37:55 PM CET 2023 [Status=0]
…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED AT Mon Oct 30 10:32:51 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Oct 30 10:56:58 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Oct 30 11:06:26 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Mon Oct 30 11:15:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Mon Oct 30 11:18:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Mon Oct 30 11:22:00 PM CET 2023 [Status=0]
…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now disabled by default!*)

STARTED AT Fri Nov 3 10:06:44 AM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov 3 01:30:11 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov 3 01:55:47 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov 3 02:05:25 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov 3 02:08:40 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov 3 02:11:53 PM CET 2023 [Status=0]
…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now again enabled by default*) (or maybe ~1-2% slower on average? anyway, keep OpenMP on as in the past)
I have checked this again. Note that these FPEs only happen in very special corners now. For instance the one above gives
So: this one only crashes for float (not double and not mixed) and only for 512z (not for 512y).
Hi @roiser I added the comment above for you. Via email I got a notification that you had sent a comment here asking if this was fixed, but I do not see it in the web interface. (Did you delete it, or was there a problem in the email-GitHub interface?) Anyway, the issue is still pending at least in some cases. I guess it will need to be debugged case by case. I will try to add this to the testsuite in my PR #792.
…d for push/manual, disabled for PRs) Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds. I will add those in a more complex workflow with one codegen job and several build/test jobs.
I have implemented this in the extended CI, but the issue does not show up there because the default GitHub nodes do not have AVX512...
We will need to run these tests on our own nodes with GPU and AVX512.
… list of physics processes (test madgraph5#783?)
…le - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance
… 3.5.2 - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED AT Thu Nov 9 05:26:21 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Nov 9 05:54:46 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Nov 9 06:05:38 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Nov 9 06:15:05 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Nov 9 06:18:20 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Nov 9 06:21:32 PM CET 2023 [Status=0]
… color choice if channelId==0 (this fixes the FPE madgraph5#783!)

Before this change, I had identified the source of the FPE with gdb:
cd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
make -f cudacpp.mk debug -j
CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe -p 1 8 1
(gdb) run
Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1
...
Program received signal SIGFPE, Arithmetic exception.
mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0(void) () at CPPProcess.cc:1250
1250 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p channelIdC
$1 = -1
…adgraph5#783 (skip evt-by-evt color choice if channelId==0)
I checked that this indeed fixes the tput tests for ggttg:
./tput/teeThroughputX.sh -ggttg -fltonly -makeclean -makej
I believe I have found the source and a fix for (at least part of?) the floating point exceptions here. The fix is committed as part of PR #706. I have debugged this as follows
Essentially the problem is that line 1250 performs a 0/0 division, and it should not be executed at all in the first place! This line belongs to the event-by-event color choice, which should be skipped when channelId==0, and it was not. The fix simply consists in skipping the event-by-event color choice if channelId==0. I checked that this does bypass and fix the FPEs at least for a few cases like nobm_pp_ttW and gg_ttg. I am now rerunning more complete coverage tests. (PS I have no idea why FPEs happen only for some combinations of processes, fp precision and AVX architecture... I guess that those that do not fail are "lucky").
Note, this sounds very much related to what @nscottnichols reported in #611. There is a channelIdC that becomes -1 and this line should not be executed.
Note also that bypassing the color choice is needed in cases such as reweighting (#607).
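To illustrate the idea of the fix, here is a minimal standalone sketch. It is not the actual CPPProcess.cc code: the function `chooseColor`, its arguments and the `-1` "no choice" return value are hypothetical names introduced only for this example, and it assumes that the cumulative color weights are all zero when channelId==0 (which is what a 0/0 at that line suggests).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only, not the generated CPPProcess.cc code).
// Picks a color index from cumulative per-color weights using a uniform random
// number rnd in [0,1), or returns -1 if no color choice is made.
int chooseColor( unsigned int channelId,
                 const std::vector<double>& cumulativeAmp,
                 double rnd )
{
  // Bypass the event-by-event color choice if channelId==0:
  // ASSUMPTION: in this case the weights are all zero, so the ratio below would be 0/0
  if( channelId == 0 ) return -1;
  if( cumulativeAmp.empty() ) return -1;
  const double total = cumulativeAmp.back(); // last entry is the sum over all colors
  for( std::size_t icol = 0; icol < cumulativeAmp.size(); icol++ )
  {
    // Same comparison pattern as the line where gdb reported the SIGFPE
    const bool okcol = rnd < ( cumulativeAmp[icol] / total );
    if( okcol ) return static_cast<int>( icol );
  }
  return -1; // not reached for rnd in [0,1) and a finite positive total
}
```

With the early return in place, a call such as `chooseColor( 0, { 0., 0. }, 0.5 )` never evaluates the division, so no FPE can be raised there even when traps are enabled.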
…ere are no FPEs any more!)

STARTED AT Fri Nov 24 12:07:57 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov 24 02:47:13 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov 24 03:06:19 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov 24 03:16:14 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov 24 03:19:35 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov 24 03:22:53 PM CET 2023 [Status=0]

There used to be eight "Floating Point Exception (CPU)" errors in the logs; they have now all disappeared.
NB1: the code builds ok for HRDCOD=0 (so madgraph5#695 does NOT affect this!), and runTest is ok in all P* (the ref files are there)
NB2: a full tlau test is now also ok on this process, showing that all issues madgraph5#701 madgraph5#733 madgraph5#783 etc have been fixed
tlau/lauX.sh -CPP nobm_pp_ttW.mad
...
Cross-section : 1.276 +- 0.007916 pb
In summary: this process should now be fully usable by ATLAS and other experiments.
This is a followup of #733 and PR #706.
I have now added to all processes via CODEGEN the option to enable FPEs in the check.exe executable. I have rerun this for my usual 78 tests. I find 8 failures out of these, for ggttg (float/mixed) and gqttq (float only).
The fact that this happens for the mixed mode makes me suspect that this is in the color algebra, for ggttg?
On the other hand for gqttq this is maybe in the Feynman diagrams, as it only happens in f and not in m?
Note: the madevent 'tmad' tests for these processes conversely have run ok for me so far. Maybe just because I have not tried enough events? By default FPEs are normally enabled in madevent (by linking fortran and c++). Here the difference is that I have explicitly enabled FPEs in the c++-only check.exe executables.
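As background on why explicitly enabling FPEs matters: by default an IEEE 0/0 division only sets a status flag and silently returns NaN, so the problem stays invisible unless traps are turned on. The sketch below uses only standard `<cfenv>` calls to show that the flag is raised; the helper name `raisesInvalid` is hypothetical and this is not how check.exe is implemented. On glibc, the extension `feenableexcept` can turn such flags into a SIGFPE, but it is not called here to keep the example portable.

```cpp
#include <cfenv>

// Hypothetical helper (illustration only): returns true if computing num/den
// raises the IEEE "invalid operation" flag (as 0/0 does).
bool raisesInvalid( double num, double den )
{
  // volatile forces the division to happen at run time rather than at compile time
  volatile double n = num;
  volatile double d = den;
  std::feclearexcept( FE_ALL_EXCEPT ); // start from a clean set of status flags
  volatile double ratio = n / d;       // 0/0 yields NaN and sets FE_INVALID, without trapping
  (void)ratio;
  return std::fetestexcept( FE_INVALID ) != 0;
}
```

For example, `raisesInvalid( 0., 0. )` reports the flag while `raisesInvalid( 1., 2. )` does not; with traps enabled, the former would instead abort the program with an arithmetic exception like the one seen in the logs.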