Shared libraries + Bridge + Cleaner Makefiles #367

Merged
merged 193 commits into from
Feb 24, 2022

Conversation

valassi
Member

@valassi valassi commented Feb 1, 2022

This is a WIP PR to follow up on

For the moment there is only one commit, fixing the conflicts in #361.

…e shared libraries

Cherry-pick commit 5df671b of 'roiser/sharedlib' (only commit in that branch) into shared (PR madgraph5#361)

Fix conflicts in epochX/cudacpp/gg_tt/SubProcesses/Makefile:
add OMPFLAGS and remove CXXFLAGS and CPPFLAGS when linking cxx_main
@valassi valassi marked this pull request as draft February 1, 2022 16:44
@valassi valassi mentioned this pull request Feb 1, 2022
@valassi
Member Author

valassi commented Feb 1, 2022

The dependency of check.exe and runTest.exe on the shared libraries is not ideal now.

First problem: the executables are no longer statically built, so one needs to set LD_LIBRARY_PATH (OK, maybe that's what we want...)

[avalassi@itscrd70 gcc10.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx> ./check.exe 
./check.exe: error while loading shared libraries: libmodel_sm.so: cannot open shared object file: No such file or directory
[avalassi@itscrd70 gcc10.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe 
./runTest.exe: error while loading shared libraries: libmodel_sm.so: cannot open shared object file: No such file or directory

Second problem, there seem to be some hardcoded paths

[avalassi@itscrd70 gcc10.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx> ldd ../../lib/libmg5amc_cu.so 
        linux-vdso.so.1 =>  (0x00007ffcf6796000)
        ../../lib/libmodel_sm.so (0x00007f6e95b5e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f6e95817000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6e955fb000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f6e953f7000)
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007f6e95222000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f6e94f20000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007f6e95b23000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f6e94b52000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6e95a1f000)
[avalassi@itscrd70 gcc10.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx> ldd ./check.exe 
        linux-vdso.so.1 =>  (0x00007ffcc37b7000)
        ../../lib/libmg5amc_cxx.so (0x00007f7fd9dcb000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f7fd99bf000)
        libmodel_sm.so => not found
        libcurand.so.10 => /usr/local/cuda-10.2/lib64/libcurand.so.10 (0x00007f7fd591c000)
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007f7fd5747000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f7fd5445000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007f7fd9d90000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7fd5229000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f7fd4e5b000)
        ../../lib/libmodel_sm.so (0x00007f7fd9d7d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f7fd9bc3000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f7fd4c53000)
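The unresolved dependency in the `ldd` output above can also be spotted programmatically. The following is only a minimal sketch and is not part of this PR: the helper name `unresolved_libraries` is hypothetical, and the sample text is a shortened excerpt of the `ldd ./check.exe` output shown above.

```python
from typing import List


def unresolved_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like 'libfoo.so => /path/libfoo.so (0x...)' or 'libfoo.so => not found'.
        name, sep, target = line.strip().partition("=>")
        if sep and target.strip() == "not found":
            missing.append(name.strip())
    return missing


# Shortened excerpt of the 'ldd ./check.exe' output quoted above.
SAMPLE = """\
        linux-vdso.so.1 =>  (0x00007ffcc37b7000)
        ../../lib/libmg5amc_cxx.so (0x00007f7fd9dcb000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f7fd99bf000)
        libmodel_sm.so => not found
        ../../lib/libmodel_sm.so (0x00007f7fd9d7d000)
"""

if __name__ == "__main__":
    print(unresolved_libraries(SAMPLE))
```

Entries without `=>` (such as `../../lib/libmodel_sm.so`) are the ones recorded with a relative path at link time; the sketch deliberately ignores them and only lists names the loader could not resolve.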

I would be tempted to have the libmodel embedded inside the libcxx/libcuda. This actually means replicating some symbols, but for the moment I guess we use EITHER one OR the other? Eventually, if we have all of them together, should we have them inside a single library? This is something to be clarified in relation to heterogeneous apps #318

@roiser
Member

roiser commented Feb 2, 2022

Hi @valassi, concerning having one shared library or many: I kept them separate on purpose, as duplicating the symbols didn't make much sense to me. Best

…er and document the various parts of the Makefile
…ve OMPFLAGS, AVX, FPTYPE, HELINL, HRDCOD, RNDGEN
…) - reorder and document the various parts of the Makefile
… - reorder and document the various parts of the Makefile
…of the full .so path

Before the fix:
ldd ../../lib/libmg5amc_cxx.so
        linux-vdso.so.1 =>  (0x00007ffda0f64000)
        ../../lib/libmg5amc_common.so (0x00007f01674e1000)
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007f0167115000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f0166e13000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007f01674a7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f0166a45000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f01672ea000)

After the fix:
ldd ../../lib/libmg5amc_cxx.so
        linux-vdso.so.1 =>  (0x00007ffd79dbc000)
        libmg5amc_common.so => not found
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007f2c55a11000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2c5570f000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007f2c55db5000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2c55341000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2c55be6000)
…tead of the full .so path

Before the fix:
ldd ./check.exe
        linux-vdso.so.1 =>  (0x00007fff76344000)
        ../../lib/libmg5amc_cxx.so (0x00007f1c71c75000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f1c71868000)
        libmg5amc_common.so => not found
        libcurand.so.10 => /usr/local/cuda-10.2/lib64/libcurand.so.10 (0x00007f1c6d7c5000)
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007f1c6d5f0000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f1c6d2ee000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007f1c71c3a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f1c6d0d2000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f1c6cd04000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1c71a6c000)
        libmg5amc_common.so => not found
        librt.so.1 => /lib64/librt.so.1 (0x00007f1c6cafc000)

After the fix:
ldd ./check.exe
        linux-vdso.so.1 =>  (0x00007ffc637c7000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa38f2de000)
        libmg5amc_common.so => not found
        libcurand.so.10 => /usr/local/cuda-10.2/lib64/libcurand.so.10 (0x00007fa38b23b000)
        libmg5amc_cxx.so => not found
        libstdc++.so.6 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libstdc++.so.6 (0x00007fa38f50c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa38af39000)
        libgcc_s.so.1 => /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/lib64/libgcc_s.so.1 (0x00007fa38af1f000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa38ad03000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa38a935000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa38f4e2000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa38a72d000)
…eck, runGcheck (check is used in the CI historically)
./tput/teeThroughputX.sh -flt -hrd -makej -makeclean -eemumu -ggtt -ggttg -ggttgg -ggttggg

This took 3 hours in total including the build (from scratch at least for ggttg and ggttggg)
STARTED AT Thu Feb 24 00:01:58 CET 2022
ENDED   AT Thu Feb 24 02:58:09 CET 2022
…en and all 5 processes)

The CI was giving errors as follows
https://github.com/madgraph5/madgraph4gpu/runs/5316173040?check_suite_focus=true

./check.exe: error while loading shared libraries: libmg5amc_common.so: cannot open shared object file: No such file or directory
./fcheck.exe: error while loading shared libraries: libmg5amc_common.so: cannot open shared object file: No such file or directory

./check.exe --common -p 2 32 2
./fcheck.exe 2 32 2
Avg ME (C++/C++)    =
/bin/sh: 1: [: unexpected operator
/bin/sh: 1: [: unexpected operator
Avg ME (F77/C++)    =
  File "<string>", line 1
    me1=; me2=; reldif=abs((me2-me1)/me1); print('Relative difference =', reldif); ok = reldif <= 2E-4; print ( '%s (relative difference %s 2E-4)' % ( ('OK','<=') if ok else ('ERROR','>') ) ); import sys; sys.exit(0 if ok else 1)
        ^
SyntaxError: invalid syntax
make: *** [Makefile:630: cmpFcheck] Error 1
…ke commands in the github CI

This will avoid CI failures in float tests that would look for the default double builds.
The issue was introduced when I added a dependency of the check target on all.$(TAG)
… processes)

The CI was failing on the self-hosted GPU nodes with the following errors
https://github.com/madgraph5/madgraph4gpu/runs/5316634564?check_suite_focus=true
./check.exe --common -p 2 32 2
./fcheck.exe 2 32 2
Avg ME (C++/C++)    = 1.215805e-02
Avg ME (F77/C++)    = 1.2158051820303455E-002
/bin/bash: python: command not found
make: *** [Makefile:640: cmpFcheck] Error 127
…ge (must install python on the node!)

Revert "[shared] replace 'python' by 'python3' in Makefile (codegen and all 5 processes)"
This reverts commit ed9e5e8.
… all 5 processes)

"yum install python39" has been executed on the CI, but I still get
/bin/bash: python: command not found
…the only one with active developments

(although I am not 100% certain that the jobs will be executed in this order...)
@valassi
Member Author

valassi commented Feb 24, 2022

Ok all tests are finally succeeding in the CI, after fixing a few CI related issues.

Amongst the latest changes in this PR in the last few days:

  • I implemented a test (also run through the CI make check) that computes an ME average for a fixed small number of events both through the C++ standalone executable (check or gcheck) and through the Fortran+Bridge standalone executable (fcheck or fgcheck) and compares the outputs using some basic python. The results have to agree within a given tolerance.
  • In this test, the Fortran always uses double precision, but I also implemented the option of having the MEs in CUDA/C++ in single precision (see Use single precision or even half precision #5 and Single precision average ME is not the same for CUDA and C++ in single-precision (ggttgg and eemumu) #212)
  • The tolerance had to be adapted, because especially with float precision the results of the two chains can differ quite a lot. I think it is now 2E-4, which is already huge. This is enough for float ggttgg tests, but would fail float ggttggg tests, where the difference is even larger. The root causes of this are probably a combination of the fact that momenta are converted from float to double and back to float in the Fortran version, and maybe also of the more complex (and hopefully more precise) way in which the C++ executable computes ME averages using EventStatistics.h
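The comparison described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in the Makefile: the function name `compare_me_averages` is hypothetical, the formula and the 2E-4 tolerance are taken from the one-liner quoted in the CI log earlier in this PR, and the sample values are the two averages printed in the later CI log. The explicit check for empty inputs is an assumption about how one might avoid the `SyntaxError` seen when the executables failed to run.

```python
def compare_me_averages(me_cpp: str, me_fortran: str, tolerance: float = 2e-4):
    """Compare two ME averages given as strings; return (ok, relative_difference)."""
    # Guard against empty values, e.g. when check.exe or fcheck.exe failed to start.
    if not me_cpp.strip() or not me_fortran.strip():
        raise ValueError("missing ME average: did the executable run?")
    me1 = float(me_cpp)      # C++ (or CUDA) standalone average
    me2 = float(me_fortran)  # Fortran+Bridge average; float() accepts 'E-002' exponents
    reldif = abs((me2 - me1) / me1)
    return reldif <= tolerance, reldif


if __name__ == "__main__":
    # Values as printed in the CI log quoted in this PR.
    ok, reldif = compare_me_averages("1.215805e-02", "1.2158051820303455E-002")
    print("Relative difference =", reldif)
    print("%s (relative difference %s 2E-4)" % (("OK", "<=") if ok else ("ERROR", ">")))
    raise SystemExit(0 if ok else 1)
```

Raising an error on empty input makes a failed executable show up as a clear message rather than as a Python syntax error in the generated one-liner.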

I am just running a few more tests to commit some logs, but otherwise this is ready to go

@valassi valassi marked this pull request as ready for review February 24, 2022 10:56
@valassi valassi changed the title WIP: Shared libraries + Bridge + Cleaner Makefiles Shared libraries + Bridge + Cleaner Makefiles Feb 24, 2022
./tput/teeThroughputX.sh -inlonly -flt -makej -makeclean -eemumu -ggtt -ggttgg
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -rmbhst
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -bridge
./tput/teeThroughputX.sh -eemumu -curhst
./tput/teeThroughputX.sh -eemumu -common

Check all logs are updated:
grep DATE tput/logs_*/*txt | sort -k2
@valassi
Member Author

valassi commented Feb 24, 2022

Hi @roiser @oliviermattelaer this is now complete, I am about to merge it - sorry for the delay

@valassi valassi mentioned this pull request Feb 24, 2022
@valassi
Member Author

valassi commented Feb 24, 2022

All checks have passed. Self merging.

@valassi valassi merged commit 5b8c952 into madgraph5:master Feb 24, 2022