Error using MPI_Comm_connect/MPI_Comm_accept #6916

Open
nuriallv opened this issue Aug 20, 2019 · 1 comment
Comments

@nuriallv
Contributor

Details of the problem

I'm getting the following error when using MPI_Comm_connect/MPI_Comm_accept on a single node, between processes launched by separate mpirun commands.

[d14.descartes:04196] [[23420,0],0] ORTE_ERROR_LOG: Not supported in file orted/pmix/pmix_server_dyn.c at line 702
[d14.descartes:04202] [[23420,1],0] ORTE_ERROR_LOG: Not supported in file dpm/dpm.c at line 403
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_accept
  Reason:       Underlying runtime environment does not support accept/connect functionality
--------------------------------------------------------------------------
[d14:04202] *** An error occurred in MPI_Comm_accept
[d14:04202] *** reported by process [1534853121,0]
[d14:04202] *** on communicator MPI_COMM_SELF
[d14:04202] *** MPI_ERR_INTERN: internal error
[d14:04202] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04202] ***    and potentially your MPI job)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[23414,1],0]) is on host: d14
  Process 2 ([[23420,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[d14.descartes:04211] [[23414,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 495
[d14:04211] *** An error occurred in MPI_Comm_connect
[d14:04211] *** reported by process [1534459905,0]
[d14:04211] *** on communicator MPI_COMM_SELF
[d14:04211] *** MPI_ERR_INTERN: internal error
[d14:04211] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04211] ***    and potentially your MPI job)

If instead I spawn a new process within the same mpirun, there is no error. I'm attaching a reproducer.
reproducer.zip
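
For context, the failing scenario looks roughly like this (hypothetical binary names; the actual code is in the attached reproducer). The accept side opens a port with MPI_Open_port and blocks in MPI_Comm_accept; the connect side, started by a second mpirun on the same node, calls MPI_Comm_connect on that port:

  # terminal 1: accept side
  mpirun -n 1 ./accept_side          # prints the port name returned by MPI_Open_port
  # terminal 2: connect side, under a separate mpirun
  mpirun -n 1 ./connect_side "<port-name>"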

Environment

Open MPI master, commit 69bd945
./configure --prefix=${HOME}/ompi/Build --enable-orterun-prefix-by-default --with-platform=optimized

Linux cluster, resource manager Slurm, nodes:
Dell T7500 chassis
2x Xeon E5520 (Gainestown) @ 2.27 GHz, 8 cores / 16 hardware threads
~12G RAM (actual amount varies between 8G and 16G)
InfiniBand DDR 20G (MT25208 cards), Ethernet

@rhc54
Contributor

rhc54 commented Sep 3, 2019

You have a couple of options. The problem is that the two mpirun instances need a rendezvous server. One way to solve it is to start ompi-server before executing the mpirun commands and then point each mpirun at that process for the rendezvous.
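
A minimal sketch of that setup (paths and application names are placeholders):

  # start the rendezvous server once, writing its contact URI to a file
  ompi-server --report-uri /path/to/ompi-server.uri
  # point every mpirun at that server
  mpirun -n 1 --ompi-server file:/path/to/ompi-server.uri ./accept_side
  mpirun -n 1 --ompi-server file:/path/to/ompi-server.uri ./connect_side "<port-name>"

With the server in place, MPI_Publish_name/MPI_Lookup_name (or an out-of-band exchange of the port name) lets the two jobs find each other.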

The other option is to use PRRTE (https://github.com/pmix/prrte) as a "shim" environment. You would get a Slurm allocation, then start PRRTE and run your apps using PRRTE's "prun" command (which is identical to mpirun). PRRTE knows how to provide the rendezvous service. If you want to use PRRTE, then the build/use instructions are available at the bottom of https://pmix.org/support/how-to/
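
A rough outline of the PRRTE route (command names and flags may differ between PRRTE versions; the linked how-to has the authoritative steps):

  # inside a Slurm allocation (salloc/sbatch)
  prte --daemonize                   # start the persistent DVM across the allocation
  prun -n 1 ./accept_side &
  prun -n 1 ./connect_side "<port-name>"
  pterm                              # tear the DVM down when finished

Because both prun jobs run inside the same DVM, the runtime can broker the connect/accept rendezvous itself.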
