HNP topology not found in hetero node scenario #803

Closed
rhc54 opened this issue Mar 5, 2021 · 6 comments

@rhc54
Contributor

rhc54 commented Mar 5, 2021

Assuming I just built the right thing, we're back to the error without :NOLOCAL:

prterun -n 2 hostname
batch2
--------------------------------------------------------------------------
A ppr pattern was specified, but the topology information
for the following node is missing:

  Node:  batch2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A ppr pattern was specified, but the topology information
for the following node is missing:

  Node:  batch2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  SPAWN: /.../openmpi-5.0.0_pre20210305/3rd-party/prrte/src/prted/pmix/pmix_server_dyn.c:631

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

With -n 1, the first two errors are printed but there is no timeout; the job exits immediately:

prterun -n 1 hostname
batch2
--------------------------------------------------------------------------
A ppr pattern was specified, but the topology information
for the following node is missing:

  Node:  batch2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A ppr pattern was specified, but the topology information
for the following node is missing:

  Node:  batch2
--------------------------------------------------------------------------

All good with :NOLOCAL:

prterun -n 2 --map-by :NOLOCAL hostname
h22n01
h22n01

prterun -n 2 --map-by ppr:1:node:NOLOCAL hostname
h22n01
h36n14
@rhc54
Contributor Author

rhc54 commented Mar 5, 2021

In case it helps diagnose these new issues with topology info: two "elements" are unpacked, one for the "batch" node (which has 1 slot on it) and one for a worker compute node. Their sig fields are:

2N:2S:16L3:16L2:32L1:32C:128H:0-127::ppc64le:le
2N:2S:22L3:22L2:44L1:44C:176H:0-83,88-171:0-175:ppc64le:le
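
For readers comparing the two signatures above, here is a minimal Python sketch that splits each string on ":" and reports which counted fields differ. It is not PRRTE code. The helper names parse_topo_signature and differing_fields are hypothetical, and the reading of the letters (N for NUMA nodes, S for sockets, L3/L2/L1 for caches, C for cores, H for hardware threads) is an assumption; the sketch treats the trailing fields (CPU lists, architecture, endianness) as opaque strings.

```python
import re

# Leading fields look like "<count><LABEL>", e.g. "16L3" or "128H".
_COUNT_FIELD = re.compile(r"(\d+)([A-Z][A-Z0-9]*)")


def parse_topo_signature(sig: str) -> dict:
    """Split a signature into labelled counts and the remaining raw fields.

    Assumption: the signature is colon-separated, with counted fields
    first and free-form fields (CPU lists, arch, endianness) after them.
    """
    counts = {}
    rest = []
    for part in sig.split(":"):
        match = _COUNT_FIELD.fullmatch(part)
        # Once a non-count field is seen, everything after it is kept raw.
        if match and not rest:
            counts[match.group(2)] = int(match.group(1))
        else:
            rest.append(part)
    return {"counts": counts, "rest": rest}


def differing_fields(sig_a: str, sig_b: str) -> list:
    """Return the sorted labels whose counts differ between two signatures."""
    a = parse_topo_signature(sig_a)["counts"]
    b = parse_topo_signature(sig_b)["counts"]
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# The two signatures quoted in this comment.
BATCH_SIG = "2N:2S:16L3:16L2:32L1:32C:128H:0-127::ppc64le:le"
WORKER_SIG = "2N:2S:22L3:22L2:44L1:44C:176H:0-83,88-171:0-175:ppc64le:le"

print(differing_fields(BATCH_SIG, WORKER_SIG))
print(parse_topo_signature(BATCH_SIG)["rest"])
```

Run on the two strings above, the sketch shows that the N and S counts match while the cache, core, and hardware-thread counts do not, which is what makes this a hetero node scenario.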

@rhc54
Contributor Author

rhc54 commented Mar 5, 2021

@acolinisi @jjhursey I think my question over whether batch2 should have been included in the allocation took us off on a tangent. The question that still needs resolution is: why is the topology from batch2 not being found?

I think I have an answer, but will investigate when I wake up fully in a bit. Just didn't want to lose this track.

@jjhursey
Member

jjhursey commented Mar 5, 2021

I saw that you posted this issue after posting this to the other thread. I agree that there is something off here that we need to fix - completely separate from the LSF+CSM issue (Issue #804).

@acolinisi
Contributor

acolinisi commented Mar 5, 2021

All I can offer is that this is definitely a regression introduced after 2020-12-02, because that older version shows no errors; I just rebuilt it and checked:

acolin@batch2 $ prterun -n 1 hostname
batch2
acolin@batch2 $ prterun -n 2 hostname
batch2
a24n06
acolin@batch2 $ prterun -n 2 --map-by ppr:1:node hostname
batch2
a24n06
acolin@batch2 $ prterun -n 2 --map-by :NOLOCAL hostname
a24n06
a24n06
acolin@batch2 $ prterun -n 2 --map-by ppr:1:node:NOLOCAL hostname
a24n07
a24n06

acolin@batch2 $ prterun hostname | sort  | uniq -c
     42 a24n06
     42 a24n07
      1 batch2

Working versions are:
OMPI 47fb05f82a
PRRTE 080c116
PMIX 93b02abe

@rhc54
Contributor Author

rhc54 commented Mar 5, 2021

I'm unable to replicate the cited error message, even when I force the topologies to be hetero. I've made an attempt to do a better job of matching topos with nodes in #808, but will have to wait and see whether you find that it helps.

@rhc54
Contributor Author

rhc54 commented Mar 5, 2021

Fixed by #808

@rhc54 rhc54 closed this as completed Mar 5, 2021