
v.4.0.2 MPI_Comm_connect/MPI_Comm_accept fails with “is on host: unknown!” #7094

Closed
cesarpomar opened this issue Oct 16, 2019 · 3 comments
Labels
RTE Issue likely is in RTE or PMIx areas

Comments

@cesarpomar

Hello,

I have started to develop a distributed application and I am trying to use MPI dynamic process management for data exchange. The problem is that whenever I run the program, I always receive the same error.

I have compiled version 4.0.2 from source with GCC 7. First I tried a machine running CentOS 7, then one running Ubuntu 18.04, and finally another running Ubuntu 19.04, all with the same result.

I have seen similar errors reported in other issues, but they were caused by different problems. For example, https://github.com/open-mpi/ompi/issues/6916

The source code

Server:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Comm client;
    char port_name[MPI_MAX_PORT_NAME];
    int size;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open a port and publish it under a well-known service name
       so the client can look it up */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("Server available at %s\n", port_name);

    MPI_Info_create(&info);
    MPI_Publish_name("name", info, port_name);

    /* Block until the client connects to the published port */
    printf("Wait for client connection\n");
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
    printf("Client connected\n");

    MPI_Unpublish_name("name", MPI_INFO_NULL, port_name);
    MPI_Comm_free(&client);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}

Client:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Comm server;
    char port_name[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    /* Resolve the port published by the server under the service name "name" */
    printf("Looking for server\n");
    MPI_Lookup_name("name", MPI_INFO_NULL, port_name);
    printf("server found at %s\n", port_name);

    /* Connect to the server's port */
    printf("Wait for server connection\n");
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    printf("Server connected\n");

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
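
For reference, I built both programs with the MPI C compiler wrapper (mpicc); assuming the sources are saved as server.c and client.c (the file names here are only illustrative), the commands were something like:

mpicc -o server server.c
mpicc -o client client.c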

Execution

Once the source code has been compiled, I open three terminals on the same machine and execute the following commands in order:

Terminal 1
ompi-server -r /tmp/ompi-server.txt --no-daemonize
Terminal 2
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./server
Terminal 3
mpiexec --ompi-server file:/tmp/ompi-server.txt -np 1 ./client

Output

Terminal 1
continues to run without any errors.

Terminal 2

Server available at 2953052161.0:2641729496
Wait for client connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45060,1],0]) is on host: Node1
  Process 2 ([[45085,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00207] [[45060,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00207] *** An error occurred in MPI_Comm_accept
[CESAR-NITROV:00207] *** reported by process [2953052161,0]
[CESAR-NITROV:00207] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00207] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00207] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00207] ***    and potentially your MPI job)

Terminal 3

Looking for server
server found at 2953052161.0:2641729496
Wait for server connection
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45085,1],0]) is on host: Node1
  Process 2 ([[45060,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[CESAR-NITROV:00214] [[45085,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[CESAR-NITROV:00214] *** An error occurred in MPI_Comm_connect
[CESAR-NITROV:00214] *** reported by process [2954690561,0]
[CESAR-NITROV:00214] *** on communicator MPI_COMM_WORLD
[CESAR-NITROV:00214] *** MPI_ERR_INTERN: internal error
[CESAR-NITROV:00214] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[CESAR-NITROV:00214] ***    and potentially your MPI job)

I have tried all the solutions that I found and I have run out of ideas. I don't know whether this is a bug or whether I have forgotten to set some parameter or environment variable.

Is there anyone who can help me?

Thank you in advance.

@rhc54
Contributor

rhc54 commented Mar 30, 2020

As you can tell, I haven't had time to get around to this problem. Sadly, support for OMPI's runtime environment has declined a great deal over recent years as I've moved on to other things and am getting ready to retire - we just haven't been able to get other folks to pick it up the way anyone would like.

I'd suggest trying MPICH as an alternative - I don't know if they can handle your use-case or not, but it is worth a try. If they can't, then your best bet is to downgrade your OMPI installation until you find one that works - you might try the v3 series, or even v2.

This might eventually get addressed, but it probably won't happen in a very timely fashion.

@jjhursey added the "RTE Issue likely is in RTE or PMIx areas" label on Apr 1, 2020
@artpol84
Contributor

@cesarpomar thanks for the exhaustive issue description.
I've recently come across this myself and would like to note that v4.0.1 works fine (with #6446).
So it's something introduced in v4.0.2.

@rhc54
Contributor

rhc54 commented Apr 23, 2021

This works fine on master, and no backport is planned.
