
v.4.0.2 MPI_Comm_connect/MPI_Comm_accept fails with “is on host: unknown!” #7094

Closed

@cesarpomar

Description

Hello,

I have started developing a distributed application and I am trying to use MPI dynamic process management for data exchange. The problem is that whenever I run the program, I always receive the same error.

I compiled version 4.0.2 from source with GCC 7. I first tried a machine running CentOS 7, then one running Ubuntu 18.04, and finally another running Ubuntu 19.04, all with the same result.

I have seen similar errors in other issues, but they turned out to be different problems. For example: https://github.com/open-mpi/ompi/issues/6916

The source code

Server:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Comm client;
    char port_name[MPI_MAX_PORT_NAME];
    int size;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open a port and publish it under the well-known name "name". */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("Server available at %s\n", port_name);

    MPI_Info_create(&info);
    MPI_Publish_name("name", info, port_name);

    /* Block until a client connects to the published port. */
    printf("Wait for client connection\n");
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
    printf("Client connected\n");

    /* Clean up: unpublish the name and release all resources. */
    MPI_Unpublish_name("name", MPI_INFO_NULL, port_name);
    MPI_Comm_free(&client);
    MPI_Close_port(port_name);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

Client:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Comm server;
    char port_name[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    /* Resolve the port the server published under "name". */
    printf("Looking for server\n");
    MPI_Lookup_name("name", MPI_INFO_NULL, port_name);
    printf("server found at %s\n", port_name);

    /* Connect to the server's open port. */
    printf("Wait for server connection\n");
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    printf("Server connected\n");

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}

Execution

Once the source code has been compiled, I open three terminals on the same machine and execute the following commands, in order:

Terminal 1
ompi-server -r /tmp/ompi-server.txt --no-daemonize
Terminal 2
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./server
Terminal 3
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./client
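
For reference, both programs were built with the Open MPI wrapper compiler (a minimal sketch, assuming the sources are saved as server.c and client.c):

mpicc server.c -o server
mpicc client.c -o client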

Output

Terminal 1
Continues to run without any errors.

Terminal 2

Server available at 2953052161.0:2641729496
Wait for client connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45060,1],0]) is on host: Node1
  Process 2 ([[45085,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00207] [[45060,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00207] *** An error occurred in MPI_Comm_accept
[CESAR-NITROV:00207] *** reported by process [2953052161,0]
[CESAR-NITROV:00207] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00207] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00207] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00207] ***    and potentially your MPI job)

Terminal 3

Looking for server
server found at 2953052161.0:2641729496
Wait for server connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45085,1],0]) is on host: Node1
  Process 2 ([[45060,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00214] [[45085,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00214] *** An error occurred in MPI_Comm_connect
[CESAR-NITROV:00214] *** reported by process [2954690561,0]
[CESAR-NITROV:00214] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00214] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00214] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00214] ***    and potentially your MPI job)

I have tried all the solutions I could find and I have run out of ideas. I don't know whether this is a bug or whether I forgot to set some parameter or environment variable.
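
Among other things, I tried explicitly selecting the BTLs named in the error message, along these lines (a sketch of the invocation, repeated likewise for the client), with the same error:

mpiexec --mca btl self,tcp --ompi-server file:/tmp/ompi-server.txt -np 1 ./server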

Is there anyone who can help me?

Thank you in advance.

Labels

RTE (Issue likely is in RTE or PMIx areas)