Comm_connect/accept fails #398

Closed
rhc54 opened this issue Feb 21, 2020 · 3 comments
Labels
3rd Pri Third priority for release

Comments

@rhc54
Contributor

rhc54 commented Feb 21, 2020

Hello,

I have started developing a distributed application and am trying to use MPI dynamic process management for data exchange. The problem is that whenever I run the program, I receive the same error.

I compiled version 4.0.2 from source with GCC 7. I first tried a machine running CentOS 7, then one running Ubuntu 18.04, and finally another running Ubuntu 19.04, all with the same result.

I have seen similar errors in other issues, but they were about different problems; for example, open-mpi/ompi#6916.

The source code

Server:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv ) { 
    MPI_Comm client; 
    char port_name[MPI_MAX_PORT_NAME]; 
    int size; 
    MPI_Info info;
 
    MPI_Init( &argc, &argv ); 
    MPI_Comm_size(MPI_COMM_WORLD, &size); 

    
    MPI_Open_port(MPI_INFO_NULL, port_name); 
    printf("Server available at %s\n", port_name); 
    
    MPI_Info_create(&info);

    MPI_Publish_name("name", info, port_name);
        
    printf("Wait for client connection\n"); 
    MPI_Comm_accept( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,  &client ); 
    printf("Client connected\n"); 
    
    MPI_Unpublish_name("name", MPI_INFO_NULL, port_name);
    MPI_Comm_free( &client ); 
    MPI_Close_port(port_name); 
    MPI_Finalize(); 
    return 0;
} 

Client:

#include <mpi.h> 
#include <stdio.h>


int main(int argc, char **argv ) { 
    MPI_Comm server; 
    char port_name[MPI_MAX_PORT_NAME]; 

    MPI_Init( &argc, &argv ); 
            
    printf("Looking for server\n");
    MPI_Lookup_name( "name", MPI_INFO_NULL, port_name); 
    printf("server found at %s\n", port_name);
    
    printf("Wait for server connection\n");
    MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,  &server ); 
    printf("Server connected\n"); 

    MPI_Comm_disconnect( &server ); 
    MPI_Finalize(); 
    return 0; 
} 
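
For reference, both programs can be built with the Open MPI compiler wrapper mpicc (assuming the two listings above are saved as server.c and client.c):

mpicc -o server server.c
mpicc -o client client.c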

Execution

Once the source code has been compiled, I open three terminals on the same machine and execute the following commands in order:

Terminal 1
ompi-server -r /tmp/ompi-server.txt --no-daemonize
Terminal 2
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./server
Terminal 3
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./client

Output

Terminal 1
continues to run without any errors.

Terminal 2

Server available at 2953052161.0:2641729496
Wait for client connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45060,1],0]) is on host: Node1
  Process 2 ([[45085,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00207] [[45060,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00207] *** An error occurred in MPI_Comm_accept
[CESAR-NITROV:00207] *** reported by process [2953052161,0]
[CESAR-NITROV:00207] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00207] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00207] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00207] ***    and potentially your MPI job)

Terminal 3

Looking for server
server found at 2953052161.0:2641729496
Wait for server connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45085,1],0]) is on host: Node1
  Process 2 ([[45060,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00214] [[45085,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00214] *** An error occurred in MPI_Comm_connect
[CESAR-NITROV:00214] *** reported by process [2954690561,0]
[CESAR-NITROV:00214] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00214] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00214] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00214] ***    and potentially your MPI job)

I have tried all the solutions I could find and have run out of ideas. I don't know whether this is a bug or whether I simply forgot to set some parameter or environment variable.

Is there anyone who can help me?

Thank you in advance.

Cross-reference: open-mpi/ompi#7094

@rhc54 rhc54 added the 3rd Pri Third priority for release label Mar 30, 2020
@rhc54
Contributor Author

rhc54 commented Mar 30, 2020

We need to devise a "standard" solution to this problem that we, as a community, want to put forward. One simple solution is to declare that the method outlined by the user is "not supported" and instead direct them to launch a PRRTE DVM and then use prun to start the individual jobs. Or we can restore the old ompi-server method.

Just need to decide on the architecture and move forward.

@fwyzard

fwyzard commented Dec 28, 2020

I've just run into the same problem, detailed here and in open-mpi/ompi#6916.

Or we can restore the old ompi-server method.

Does that mean that using ompi-server is no longer working / supported?

Edit: after some digging and testing, it looks like the ompi-server approach works in Open MPI 4.1.0 (built from source on CentOS 7), while it was still broken in Open MPI 4.0.3 (Ubuntu 20.04).

@rhc54
Contributor Author

rhc54 commented Mar 10, 2021

After pondering this and working on the PRRTE integration into OMPI, I think the correct answer here is to use the DVM. In truth, the old ompi-server really acted like a DVM, in that it was a rendezvous server for the various mpirun invocations. Starting with OMPI v5.0, mpirun is nothing more than a symlink to prte, so it is effectively starting a DVM each time.

The correct solution therefore is to:

  • start the DVM with prte
  • use prun to launch your individual applications
  • use pterm to terminate the DVM when everything is done

This gives you the same result as before, only in a cleaner (and faster) package. @jjhursey We need to add this to the OMPI and PRRTE man pages (in some appropriate place).
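
A minimal sketch of that workflow, assuming the PRRTE tools prte, prun, and pterm are on the PATH (exact option names may differ between PRRTE releases):

# Terminal 1: start the persistent DVM and leave it running
prte
# Terminal 2: launch the server job against the running DVM
prun -n 1 ./server
# Terminal 3: launch the client job against the same DVM
prun -n 1 ./client
# Any terminal: shut the DVM down once both jobs have finished
pterm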
