openmpi-5.0.5 won't spawn #12916
Comments
Would it be possible to test using the open MPI main branch?
They seem to get confused trying to handle the intercomm_merge call in the test case in #11749 .
? |
FWIW: it works fine for me if you disable those components. Thanks @hppritcha ! |
Should have added - it also works fine if you run it using |
@sukanka can you confirm you are testing the
|
FWIW: quick check indicates that it is hanging in the |
@ggouaillardet You only see the problem in singleton mode. |
@rhc54 I only see the problem in singleton mode (and with fwiw, in singleton mode, the workaround (credit is yours) is
|
Actually, @hppritcha came up with that workaround. You might want to check that the singleton is finding the modex info for the child procs - I'm guessing that it doesn't and hangs in the modex_recv call. But that is just a (somewhat educated) guess. Would, however, explain why it all works when run under |
Ohhhh...you know what? That singleton "pushes" its connection info during Thus, it is likely that the singleton "knows" how to connect to the child job - but the child job has no connection info for the singleton parent. Might be something worth checking. |
Yeah, I'm testing the C program and running in singleton mode. (
And the workaround |
Well, I stand corrected - there is a call to |
That fix makes little sense. I'm not saying it does not work, just that it might look like it addresses the bug when that's not what it does. Put simply, hcoll can only work on special hardware setups, and both of these collective components disable themselves for intercomms (check the |
Well, George, I recommended that because in the gdb traceback I saw a bunch of ranks blocked in some kind of hcoll calls, probably while doing an allreduce for a CID as part of the merge operation. It's likely the user doesn't even have that installed (which is probably a good thing). |
I know nothing about the coll system any more, but FWIW everything runs fine if I simply remove the I saw no problems getting thru the intercomm merge operation. |
I'll take a look into this. |
gdb says a lot (can reproduce with a single child process)
|
I rebuilt OMPI main without hcoll support and verified that disabling han alone is enough to avoid this "confusion about which MCA coll component to use" problem, which seems to occur when a singleton launch is combined with certain collectives on communicators that involve both parent and child processes. |
I see what's going on. First, according to the code I found in #11749 the Except ... HAN, and I assume HCOLL, are disabled in the singleton, because there is a single process so we don't need any fancy collectives. On the children they are enabled because by that point there is more than one process, so HAN and/or hcoll make sense. That's why disabling them fixes the hang: it forces the children to use tuned, matching the algorithm selected on the parent. Let me fiddle a little with HAN initialization to find a way to address this. |
My assumption above was correct; however, the root cause was not. Basically, the two groups have different knowledge about each other: the original group correctly identified the spawned processes as local and therefore disabled HAN. The spawned processes, however, seem to have no knowledge about the location of the original processes and assume they are not local, so HAN makes sense. At the first collective communication their selection logic diverges, one group uses As a result, all the solutions proposed in this thread are incorrect; disabling some collective components is a band-aid, not a real solution. The real solution is to make sure the knowledge about process locations is symmetric between the parent and the spawned group. |
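To make the asymmetry described above concrete, here is a toy model in Python. It is not Open MPI code: the function names are hypothetical, and the selection rule (skip HAN when every peer is believed to be local, fall back to tuned) is only an assumption paraphrased from this thread.

```python
# Toy model only: this is NOT how Open MPI selects coll components internally.
# All names here are hypothetical and exist purely for illustration.

def pick_coll_component(believes_all_peers_local):
    """Return the component a process would pick in this toy model."""
    # Assumption from the thread: HAN is skipped when every peer is believed
    # to be on the local node, which leaves "tuned" as the selected component.
    return "tuned" if believes_all_peers_local else "han"


def selections(parent_sees_children_local, children_see_parent_local):
    """Collect the choice made by each side of the merged communicator."""
    return {
        "parent": pick_coll_component(parent_sees_children_local),
        "child": pick_coll_component(children_see_parent_local),
    }


def is_consistent(selected):
    """A collective can only match up if both sides picked the same component."""
    return len(set(selected.values())) == 1


# Asymmetric knowledge: the parent knows the children are local, but the
# children have no locality information for the parent.
asymmetric = selections(True, False)

# Symmetric knowledge: both sides agree that everyone is local.
symmetric = selections(True, True)

print(asymmetric, is_consistent(asymmetric))
print(symmetric, is_consistent(symmetric))
```

In this model, disabling HAN everywhere also makes both sides agree, which is why it looks like a fix while the locality information is still asymmetric.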
Would appreciate a little help understanding the problem. Are you saying that the core issue is that the spawned procs are getting an incorrect response to this request: OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCAL_PEERS,
&wildcard_rank, &val, PMIX_STRING); when asking about the parent procs? |
I don't know, I did not look into that particular aspect. What I noticed is that when a child process looks up the location of the parent process, which at that point is part of the proc_t struct, it does not get the right answer. Basically, |
Okay, I can take a peek at some point. If it is in PMIx or PRRTE, the fix won't be available until late this year or early next year as I just did the final release in the current series for those projects. Probably means any eventual fix in OMPI won't be available until OMPI v6 appears. |
We should probably go ahead and close this as "will not fix" since it won't be fixed in the OMPI v5 series. Just to clarify George's comment, this problem only exists if the parent process is a singleton. It notably does not exist if you start the parent process with So @sukanka, if you cannot wait until sometime next year for OMPI v6, your best bet is to simply use one of the known good methods for starting the parent. Alternatively, the fix will eventually appear in either or both of PMIx and PRRTE (assuming it isn't ultimately a problem in OMPI itself), and you can then build against an external updated version of them - but that won't happen until late this year or early next year. |
Thank you all for the answers. |
Just start your parent process with |
Did some further digging into this and found a solution. Good and bad news: the changes are relatively minor, but they unfortunately require changes in all three projects - OMPI, PMIx, and PRRTE. I have filed PRs accordingly: openpmix/openpmix#3445 No idea when those changes might appear in releases, but I would guess not for a while. I am working on a slightly more aesthetically pleasing alternative fix (it will coexist with the above, as there is no harm in having both methods), but that won't appear for another week or two (the additional changes should be confined to just PMIx and PRRTE). |
Thanks a lot! These patches work. I just rebuilt I will file a bug report at Archlinux packages once the final fix is ready, so I don't have to wait for
BTW, how can I achieve this with mpi4py (the example script in #11749 (comment))? In the YADE project, we just use mpi4py. |
Just do |
Background information
There may be a regression in the openmpi-5.0 series
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5, but in fact this regression has been there since 5.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From archlinux repo
Please describe the system on which you are running
Operating system/version: Arch Linux x86_64 Linux 6.11.6-zen1-1-zen
Computer hardware: Laptop with AMD Ryzen 7 8845H w/ Radeon 780M Graphics (16) @ 5.10 GHz and NVIDIA GeForce RTX 4070 Max-Q / Mobile
Network type: Wired
Details of the problem
The MWE provided in #11749 (comment) does not work with openmpi 5.0, but it does work with 4.1.6.
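For readers who do not have #11749 open, the pattern under discussion (a parent spawns children, then both sides merge the intercommunicator and run a collective on the result) can be sketched in mpi4py roughly as follows. This is not the MWE from that issue; `CHILD_FLAG`, `run_parent`, `run_child`, and `main` are hypothetical names chosen for illustration, and mpi4py is imported lazily so the file can be loaded without it.

```python
import sys

# Hypothetical marker used to tell a spawned child apart from the parent.
CHILD_FLAG = "--spawned-child"


def run_parent(script_path, nchildren=1):
    """Spawn children running this same script, merge, and run one collective."""
    from mpi4py import MPI  # lazy import: only needed when actually running

    intercomm = MPI.COMM_SELF.Spawn(
        sys.executable, args=[script_path, CHILD_FLAG], maxprocs=nchildren
    )
    merged = intercomm.Merge(high=False)      # parent ranks ordered first
    total = merged.allreduce(1, op=MPI.SUM)   # collective over parent + children
    merged.Free()
    intercomm.Disconnect()
    return total


def run_child():
    """Child side: merge with the parent and join the same collective."""
    from mpi4py import MPI

    intercomm = MPI.Comm.Get_parent()
    merged = intercomm.Merge(high=True)       # child ranks ordered after parent
    total = merged.allreduce(1, op=MPI.SUM)
    merged.Free()
    intercomm.Disconnect()
    return total


def main(argv):
    """Dispatch to the child or parent role based on the command line."""
    if CHILD_FLAG in argv[1:]:
        return run_child()
    return run_parent(argv[0])
```

To try it, call `main(sys.argv)` at the bottom of the file. Running the file directly with `python`, without a launcher, starts the parent as a singleton, which is the case this issue is about.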