Improving support for systems using Omni-Path interconnect #4560

Open
jfgrimm opened this issue Jun 12, 2024 · 0 comments

Currently, the default way we build OpenMPI, MPICH etc. works well for InfiniBand systems, but shows very poor performance on Omni-Path (we saw between 2x and 10x worse bandwidth and latency in benchmarks on our system).
It would be good to figure out a way to improve Omni-Path support in EasyBuild (perhaps through a configuration option?); at a minimum, we should improve documentation.

relevant PRs to date:

  • [#20501] PSM2 dependency added to recent libfabric easyconfigs
  • [#20585] PSM2 dependency made conditional on having x86_64
  • [#20794] previous changes effectively undone: the PSM2 dependency was commented out again because PSM2 requires CUDA as a build dependency

further info/ideas:

  • Omni-Path systems should use either PSM2 or opx
    • PSM2 can be either stand-alone, or via libfabric
    • opx is a libfabric provider; drop-in replacement for PSM2
    • Cornelis' plan is to move away from PSM2 (the upcoming 400G adapters will only support opx)
    • no benefit (only additional overhead) from using UCX with Omni-Path
  • Cornelis' documentation currently recommends using PSM2:
    • For best performance, Cornelis recommends that you use the PSM2, the high performance
      interface to the OPX Fabric. This is accomplished using the Open Fabrics Interface (OFI) MPI
      fabric setting -genv I_MPI_FABRICS=ofi and ensure that FI_PROVIDER=psm2.

    • source: Cornelis_OPX_Performance_Tuning_UG_H93143_v25_0.pdf (March 2024)
  • [#20794 comment] suggestion by @bartoldeman to patch PSM2 in order to drop the CUDA build dependency
    • matches current approach for OpenMPI, by including some minimal CUDA prototypes (since PSM2 will dlopen('libcuda.so.1') at runtime)
  • [#20794 comment] suggestion by @boegel to perhaps implement "opt-in hooks that are part of EasyBuild framework, which can be enabled selectively?"
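
To make the provider settings above concrete, here is a minimal sketch of how a job launcher or site wrapper could select the libfabric provider on an Omni-Path system. The helper name `omnipath_env` and the `OMNIPATH_PROVIDERS` tuple are hypothetical (not part of EasyBuild or libfabric); the only settings used are the `FI_PROVIDER` and `I_MPI_FABRICS=ofi` values quoted from the Cornelis guide, and `I_MPI_FABRICS` only has an effect with Intel MPI.

```python
import os

# Providers named in this issue for Omni-Path; anything else is rejected.
# (Hypothetical helper, not an EasyBuild or libfabric API.)
OMNIPATH_PROVIDERS = ("psm2", "opx")


def omnipath_env(provider="psm2", base_env=None):
    """Return a copy of base_env (default: os.environ) with OFI settings applied."""
    if provider not in OMNIPATH_PROVIDERS:
        raise ValueError(f"unsupported Omni-Path provider: {provider!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["FI_PROVIDER"] = provider  # libfabric provider selection
    env["I_MPI_FABRICS"] = "ofi"   # Intel MPI: use the OFI (libfabric) fabric
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a benchmark, which makes it easy to compare `psm2` and `opx` on the same nodes without touching the installed modules.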