I'm getting the following error when using MPI_Comm_connect/MPI_Comm_accept on a single node, between processes spawned by separate mpirun commands.
[d14.descartes:04196] [[23420,0],0] ORTE_ERROR_LOG: Not supported in file orted/pmix/pmix_server_dyn.c at line 702
[d14.descartes:04202] [[23420,1],0] ORTE_ERROR_LOG: Not supported in file dpm/dpm.c at line 403
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.
MPI function: MPI_Comm_accept
Reason: Underlying runtime environment does not support accept/connect functionality
--------------------------------------------------------------------------
[d14:04202] *** An error occurred in MPI_Comm_accept
[d14:04202] *** reported by process [1534853121,0]
[d14:04202] *** on communicator MPI_COMM_SELF
[d14:04202] *** MPI_ERR_INTERN: internal error
[d14:04202] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04202] *** and potentially your MPI job)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[23414,1],0]) is on host: d14
Process 2 ([[23420,1],0]) is on host: unknown!
BTLs attempted: self tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[d14.descartes:04211] [[23414,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 495
[d14:04211] *** An error occurred in MPI_Comm_connect
[d14:04211] *** reported by process [1534459905,0]
[d14:04211] *** on communicator MPI_COMM_SELF
[d14:04211] *** MPI_ERR_INTERN: internal error
[d14:04211] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04211] *** and potentially your MPI job)
If I instead spawn a new process within the same mpirun, there is no error. I'm attaching a reproducer: reproducer.zip
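For reference, the connect/accept pattern that triggers this looks roughly like the sketch below. This is a hypothetical reconstruction, not the attached reproducer: one process (run under its own mpirun) opens a port and accepts, and a second process (run under a different mpirun) connects to the port name passed on its command line.

```c
/* Hypothetical sketch of the failing pattern, not the attached reproducer.
 * Server: mpirun -n 1 ./a.out          (prints a port name)
 * Client: mpirun -n 1 ./a.out "<port>" (quote the port name; it contains
 *                                       characters the shell would split)
 * Requires an MPI installation to build and run: mpicc sketch.c
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    if (argc > 1) {
        /* Client side: connect to the port name given on the command line. */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    } else {
        /* Server side: open a port and wait for the client. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);   /* copy this to the client's argv */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Close_port(port);
    }

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

With both processes under one mpirun this works; under two separate mpiruns it fails as in the logs above, because the two runtimes have no rendezvous point.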
You have a couple of options. The problem is that the two mpiruns need a rendezvous server. One option is to start ompi-server before executing the mpirun commands, and then point each mpirun at that process for the rendezvous.
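The ompi-server workflow can be sketched as follows (file names and program names are illustrative; the key pieces are ompi-server's --report-uri option and mpirun's --ompi-server option):

```shell
# Start the rendezvous server once; it writes its contact URI to a file.
ompi-server --report-uri /tmp/ompi-server.uri &

# Point every mpirun at that URI so connect/accept can rendezvous.
mpirun --ompi-server file:/tmp/ompi-server.uri -n 1 ./server &
mpirun --ompi-server file:/tmp/ompi-server.uri -n 1 ./client "<port>"
```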
The other option is to use PRRTE (https://github.com/pmix/prrte) as a "shim" environment: get a Slurm allocation, start PRRTE, and run your apps with PRRTE's "prun" command (whose interface matches mpirun's). PRRTE knows how to provide the rendezvous service. If you want to use PRRTE, the build/use instructions are available at the bottom of https://pmix.org/support/how-to/
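A rough sketch of the PRRTE shim workflow inside a Slurm allocation (command names follow the PRRTE how-to; program names and node counts are illustrative):

```shell
# Inside a Slurm allocation (e.g. from salloc), start the PRRTE DVM
# across the allocated nodes, then run both apps under it so they share
# one runtime and its rendezvous service.
prte --daemonize

prun -n 1 ./server &        # prun accepts mpirun-style arguments
prun -n 1 ./client "<port>"

pterm                       # shut the DVM down when finished
```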
Environment
Open MPI master, commit 69bd945
./configure --prefix=${HOME}/ompi/Build --enable-orterun-prefix-by-default --with-platform=optimized
Linux cluster, resource manager Slurm, nodes:
Dell T7500 chassis
2× Gainestown (Xeon E5520) @ 2.27 GHz: 8 cores, 16 hardware threads
~12 GB RAM (actual amount varies between 8 GB and 16 GB)
InfiniBand DDR 20 Gb/s (MT25208 cards), Ethernet