Fix Singletons and Singleton Spawn #10688

Merged: 1 commit merged into open-mpi:main on Aug 23, 2022

Conversation

jjhursey
Member

  • Fixes Singleton MPI initialization and spawn #10590
  • Singletons will not have a PMIx value for PMIX_LOCAL_PEERS
    so make that optional instead of required.
  • & is being treated as an application argument by prte
    instead of as the shell's background operator
    • Replace it with --daemonize, which is probably better anyway
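
The `&` problem in the last bullet can be illustrated with a minimal sketch. It assumes the launcher command line is assembled as an argument vector with no shell in between, which is one plausible reading of the description rather than something the PR states. `show_args` is a hypothetical helper, not part of prte or Open MPI.

```shell
# show_args is a hypothetical helper: it prints each argument it receives,
# one per line, the way a launched program would see its argv.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Passed as a literal string (no shell interpreting the line), "&" is just
# one more argument, so the launcher's parser sees it as part of the command.
show_args prte '&'

# Asking the launcher itself to detach avoids relying on shell syntax.
# --daemonize is the option named in the PR description; any other options
# a real invocation would need are omitted here.
show_args prte --daemonize
```

Only a shell turns an unquoted `&` into "run in the background"; a program handed `&` in its argv has no way to know that was the intent.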

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey jjhursey marked this pull request as ready for review August 18, 2022 21:17
@jjhursey
Member Author

@awlauria We will need to sync the prrte submodule pointer to pick up openpmix/prrte#1443

@jjhursey jjhursey requested a review from awlauria August 18, 2022 21:17
@rhc54
Contributor

rhc54 commented Aug 19, 2022

FWIW: I have fixed the & confusion in the PMIx command line parser (along with other things) in openpmix/openpmix#2694. You need to update the PMIx submodule pointer once that has been committed, and add #10695, to fix #10691.

What a tangled web we weave!

@jsquyres
Member

@jjhursey When singletons are fully fixed, please merge open-mpi/ompi-scripts#62 so that singleton tests are added to the OMPI Jenkins CI.

@jjhursey
Member Author

Hold this PR for combined testing. I'm trying to align the 3 repos to get a current view of the state of this work.

I'm testing with Open MPI main:

shell$ git submodule status
+8cb6f58fe074efde0239aa4567854742223ab1a9 3rd-party/openpmix (v4.2.0-3-g8cb6f58f)
+b29abde61c618de28f6a4c181e8c28f68e332969 3rd-party/prrte (v3.0.0rc1-18-gb29abde6)

With these changes, things seem to be failing again :( I'm investigating.
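
For readers unfamiliar with the `git submodule status` output above: the first character of each line is a status flag, and a leading `+` means the checked-out submodule commit differs from the one recorded in the containing repository. The following is a minimal sketch of reading those lines mechanically; `describe_submodule_line` is a hypothetical helper, and it assumes submodule paths contain no spaces.

```shell
# describe_submodule_line is a hypothetical helper that explains one line of
# `git submodule status` output based on its leading status character.
describe_submodule_line() {
    line=$1
    flag=$(printf '%s' "$line" | cut -c1)
    rest=$(printf '%s' "$line" | cut -c2-)
    # Remaining fields: <sha> <path> (<git describe output>)
    sha=${rest%% *}
    path=${rest#* }
    path=${path%% *}
    case $flag in
        '+') state='differs from recorded commit' ;;
        '-') state='not initialized' ;;
        'U') state='merge conflict' ;;
        *)   state='matches recorded commit' ;;
    esac
    # Print the path, an abbreviated SHA, and the interpreted state.
    printf '%s @ %.8s: %s\n' "$path" "$sha" "$state"
}

describe_submodule_line '+8cb6f58fe074efde0239aa4567854742223ab1a9 3rd-party/openpmix (v4.2.0-3-g8cb6f58f)'
```

Both submodules in the listing carry the `+` flag, which fits a working tree where they were deliberately moved to other commits for testing.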

@jjhursey
Member Author

Tested with Open MPI main

shell$ git submodule status
+e3b925f82d2a59f58c60c7d7a7b5a71eda7d41ae ../../3rd-party/openpmix (v1.1.3-3598-ge3b925f8)
+12bb6c7dd6df522a38ed611c1fa4cf2dc9ea1761 ../../3rd-party/prrte (psrvr-v2.0.0rc1-4410-g12bb6c7d)

So something must be missing from the OpenPMIx and/or PRRTE release branches.

For Open MPI, since it uses the master branch of those two projects, this PR is fine to merge.

@jjhursey
Member Author

FYI: I pushed my set of tests in open-mpi/ompi-tests-public#20.

@jjhursey
Member Author

jjhursey commented Aug 22, 2022

  • ✅ Open MPI main with OpenPMIx master and PRRTE master works.
  • 💥 Open MPI main with OpenPMIx v4.2 and PRRTE v3.0 does not work.
  • 💥 Open MPI main with OpenPMIx v4.2 and PRRTE master does not work.
  • ✅ Open MPI main with OpenPMIx master and PRRTE v3.0 works.

So it looks like an OpenPMIx issue - probably a missing commit from master. 👀
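
The reasoning behind that conclusion can be sketched as a small check over the four results listed above: a component is a suspect if its branch alone determines pass or fail in every row. `predictive_component` is a hypothetical helper, and with only four data points the result is a hint, not proof.

```shell
# predictive_component is a hypothetical helper. It reads rows of
# "<openpmix-branch> <prrte-branch> <result>" on stdin and prints the name of
# each component whose branch alone determines the result in every row.
predictive_component() {
    awk '
        {
            for (col = 1; col <= 2; col++) {
                key = col ":" $col
                # Same branch seen earlier with a different result:
                # this column cannot explain the failures on its own.
                if ((key in seen) && seen[key] != $3) mixed[col] = 1
                seen[key] = $3
            }
        }
        END {
            if (!(1 in mixed)) print "OpenPMIx"
            if (!(2 in mixed)) print "PRRTE"
        }
    '
}

# The four combinations reported above, all with Open MPI main.
predictive_component <<'EOF'
master master pass
v4.2   v3.0   fail
v4.2   master fail
master v3.0   pass
EOF
```

Here PRRTE master appears in both a passing and a failing row, as does PRRTE v3.0, so only the OpenPMIx branch lines up with the outcome.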

Here is the error message (tests are here):

shell$ ./simple_spawn ./simple_spawn
[f5n18:3003503] PMIX ERROR: PROC-ENTRY-NOT-FOUND in file server/pmix_server.c at line 3588
[f5n18:3003493] pml_ucx.c:191  Error: Failed to receive UCX worker address: Take next option (-46)
[f5n18:3003493] OPAL ERROR: Error in file dpm/dpm.c at line 480
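
When re-running this test across several submodule combinations, it can help to detect the failure mechanically. The following is a minimal sketch that only looks for the `PMIX ERROR` and `OPAL ERROR` markers visible in the output above; `has_runtime_error` is a hypothetical helper and is not part of the test suite.

```shell
# has_runtime_error is a hypothetical helper: it succeeds (exit status 0) when
# the text on stdin contains a PMIx or OPAL error marker like the ones above.
has_runtime_error() {
    # Discard the matching lines; only the exit status matters here.
    grep -E 'PMIX ERROR|OPAL ERROR' >/dev/null
}

# Hypothetical usage with the test from this thread (not executed here):
#   ./simple_spawn ./simple_spawn 2>&1 | has_runtime_error && echo "spawn failed"
```

The UCX message on the second line has no such marker, so this check relies on the surrounding PMIx and OPAL lines being present.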

@jjhursey
Member Author

Ok, we resolved the issue with OpenPMIx v4.2 as recommended in openpmix/openpmix#2705 (comment). I posted #10700 with the change.

Once Open MPI main has the following merged then spawn should work correctly.

Once verified, we will PR these back to Open MPI v5.x.

@jjhursey jjhursey merged commit 2659da6 into open-mpi:main Aug 23, 2022
@jjhursey jjhursey deleted the fix-singleton-spawn branch August 23, 2022 15:50
Development

Successfully merging this pull request may close these issues.

Singleton MPI initialization and spawn