Fix singleton tmp files cleanup #13261

Open · wants to merge 1 commit into main from fix-singletons-cleanup

Conversation

xbw22109
Copy link
Contributor

In singleton mode, directory cleaning needs to be done by the ompi library.
However, there are issues in the current implementation that prevent complete cleanup of the directory.

Singleton Mode Cleanup Behavior

When the program terminates normally

1. The session dir is only partially cleaned. (Fixed)

The directory /tmp/ompi.<username>.<uid> will still be left behind, along with some subdirectories.
I modified the directory structure and enabled recursive deletion.

2. Segment files created by btl/sm are not properly removed. (Fixed)

In ompi/opal/mca/btl/sm/btl_sm_module.c, function sm_finalize, the unlink is never actually performed, singleton or not. However, in non-singleton mode the file is cleaned up by PMIx after opal_pmix_register_cleanup succeeds.

    // shmem has been closed. These two calls will return directly.
    opal_shmem_unlink(&mca_btl_sm_component.seg_ds);
    opal_shmem_segment_detach(&mca_btl_sm_component.seg_ds);

I registered the segment cleanup with OPAL, which is similar to mca_btl_smcuda_component_init.
Question: Although not directly related to this PR, in opal/mca/btl/sm/btl_sm_component.c around line 421, is this correct? Does it need to be removed?

    /* set flag indicating btl not inited */
    mca_btl_sm.btl_inited = false;

When the program terminates by ctrl-C (see issue #13002)

Question: No file cleanup is possible in this case. Should we explicitly document this behavior or add a signal handling mechanism? Adding a signal handling mechanism in btl/sm is not easy.

@jsquyres
Copy link
Member

The btl cleanup looks good; looks like that was a miss that we never noticed because pmix cleaned it up for us.

I'm a little dubious of the session directory cleanup:

  1. Isn't this order backwards? Shouldn't we remove the top session directory last (after the proc and session dirs)?
  2. It doesn't feel like a bug that the recursive flags were set to "false" in the other opal_os_dirpath_destroy() calls -- it's been that way for (literally) years. I'm curious as to what the reason was (is?) for it to be false.
  3. How can we know if we can destroy the top session dir? I see destroy_top_session_dir set to true, below, but is that really true? What if there are other processes using that top session directory -- how can this process know if it is safe to remove the top session directory?

@rhc54
Copy link
Contributor

rhc54 commented May 23, 2025

1. Isn't this order backwards?  Shouldn't we remove the top session directory _last_ (after the proc and session dirs)?

Yes - you remove proc, then job, then session (i.e., work your way upwards)

2. It doesn't feel like a bug that the recursive flags were set to "false" in the other `opal_os_dirpath_destroy()` calls -- it's been that way for (literally) years.  I'm curious as to what the reason was (is?) for it to be `false`.

Can't say - however, PMIx and PRRTE always set that flag to "true". I suppose that in this one special case (a singleton), you could technically just call dirpath_destroy on the top-level session directory with recursive set to "true" since no other job can share the tree. 🤷‍♂️

3. How can we know if we can destroy the top session dir?  I see `destroy_top_session_dir` set to `true`, below, but is that really true?  What if there are other processes using that top session directory -- how can this process know if it is safe to remove the top session directory?

The top session directory's name includes the pid of the singleton process - this is necessary to ensure that some other process doesn't attempt to remove it from underneath you. Accordingly, no other process can use it, so it is safe to remove it.

@jsquyres
Copy link
Member

The top session directory's name includes the pid of the singleton process - this is necessary to ensure that some other process doesn't attempt to remove it from underneath you. Accordingly, no other process can use it, so it is safe to remove it.

Is this code specific to the singleton case? If so, I missed that.

@rhc54
Copy link
Contributor

rhc54 commented May 24, 2025

Is this code specific to the singleton case?

Yeah, it's a little convoluted, but it is still restricted to the singleton case. You only set the "destroy_xxx" flag if we are a singleton and therefore created the session directory ourselves. So you only execute the destruct code if that flag is set, which means you have to be a singleton.

You might want to double-check that the logic behind that didn't get changed and is correct. Just glancing, it looked like it was still okay.

@xbw22109
Copy link
Contributor Author

Thank you @rhc54 for the detailed explanation — I completely agree with your points and appreciate the context you provided.

Is this code specific to the singleton case?

Yes. And here is some additional information.

When PRRTE is present, it is responsible for both the creation and cleanup of the session directory.
OMPI retrieves the session directory *path* from PMIx after the directory has been created.

In ompi/runtime/ompi_rte.c:797, function ompi_rte_init:

    OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_TMPDIR, &pname, &val, PMIX_STRING);
    if (OPAL_SUCCESS == rc && NULL != val) {
        opal_process_info.top_session_dir = val;
        val = NULL;  // protect the string
    } else {
        /* we need to create something */
        rc = _setup_top_session_dir(&opal_process_info.top_session_dir);
        if (OPAL_SUCCESS != rc) {
            error = "top session directory";
            goto error;
        }
    }

If PRRTE is running correctly, OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_TMPDIR, &pname, &val, PMIX_STRING) is guaranteed to succeed, because PRRTE will always create and register the session directory under all circumstances.

In the else branch, we manually create the session directory. Entering this branch indicates that no PRRTE is available, which I believe, in the Open MPI 5.0.x implementation, only occurs in singleton mode.

In this case, Open MPI should create the session directory itself, handle its cleanup, and use the PID to distinguish the top-level session directory.

These responsibilities were not fully handled in the original implementation. In the original implementation, the top session directory is placed under the system’s temporary directory, and the job session dir is constructed as *top_session_dir*/ompi.*nodename*.*uid*/jf.0/*opal_jobid*.

This structure requires additional directory parsing to perform a full cleanup, which the original implementation does not handle. As a result, the directory top_session_dir/ompi.nodename.uid/jf.0 (e.g., /tmp/ompi.x.1000/jf.0 on my machine) is never deleted after a singleton run.

@xbw22109
Copy link
Contributor Author

Question: No file cleanup is possible in this case. Should we explicitly document this behavior or add a signal handling mechanism? Adding a signal handling mechanism in btl/sm is not easy.

In addition, based on my observation, starting from Open MPI 5.0.x, the singleton no longer launches a thread that runs pmix server. This is why singleton processes can no longer clean up their paths upon abnormal termination. In contrast, for non-singleton programs, PRRTE or OpenPMIx takes care of path cleanup after an abnormal termination.

So, what should we do?

Adding a signal handler for singleton programs may not be a good idea, since it would be registered into the "user program" during MPI_Init. On the other hand, we would need an additional "registration system" to track and manage files created by sm.

Personally, I lean toward documenting this behavior—indicating that when a singleton process terminates abnormally, we cannot guarantee the cleanup of temporary files.

@xbw22109
Copy link
Contributor Author

Question: Although not directly related to this PR, in opal/mca/btl/sm/btl_sm_component.c around line 421, is this correct? Does it need to be removed?

I’ve taken a closer look and have now understood it myself — it turns out to be just a variable initialization. There’s no need to follow up on this further.

@rhc54
Copy link
Contributor

rhc54 commented May 24, 2025

the singleton no longer launches a thread that runs pmix server

Errr...that's not true. You initialized PMIx, and so the PMIx library is indeed running its progress thread. Technically, you could follow the same code as in prte.c and setup events (using the PMIx event base) to capture SIGTERM and friends, and then cleanly terminate when fired.

Bit of work, so up to you guys - suspect users will complain if it doesn't clean up, but... 🤷‍♂️

@xbw22109
Copy link
Contributor Author

Errr...that's not true. You initialized PMIx, and so the PMIx library is indeed running its progress thread. Technically, you could follow the same code as in prte.c and setup events (using the PMIx event base) to capture SIGTERM and friends, and then cleanly terminate when fired.

I hadn't noticed that before — I'll go take a look.

My earlier question was based on the assumption that the PMIx library is not running at all in singleton mode in 5.0.x. If what you said is correct, then following the same code as in prte.c to handle file cleanup — rather than just documenting the behavior — would be the best solution.

@rhc54
Copy link
Contributor

rhc54 commented May 30, 2025

Just a thought: why is the sm BTL component creating a backing file when operating as a singleton? There's nobody you can communicate with over that transport - so maybe one step would be to just not create the backing file if the "singleton" flag is set?

Edited: or I guess just have the btl/sm component disqualify itself in singleton mode?

@xbw22109
Copy link
Contributor Author

xbw22109 commented Jun 2, 2025

Let me summarize what I’ve observed so far to clarify the current behavior and my remaining questions:

1. Singleton mode in OMPI 5.0.x

Starting from the user's call to MPI_Init, the code eventually reaches ompi_rte_init() at ompi/runtime/ompi_rte.c:590, where PMIx_Init() is invoked.
In singleton mode, this call returns PMIX_ERR_UNREACH; however, the PMIx internal process thread still starts and runs.
No subprocess is launched during this process, and prte does not run.

2. Non-singleton mode

In non-singleton mode, the application is launched via mpirun, which calls exec(prterun) (prte.c).
The cleanup logic is implemented in prte.c — it installs signal handlers (e.g., for SIGINT), which set flags through a pipe mechanism, and the actual handling is done in prte_finalize().
The cleanup of the session directory is unrelated to PMIx Server — it is handled entirely within prte/prterun.

3. A special case: MPI_Comm_spawn

When the user calls MPI_Comm_spawn in singleton mode, the function start_dvm() in ompi/dpm/dpm.c is executed.
This function forks a prte subprocess, re-invokes PMIx_Init to establish a connection, and sets the singleton flag to false.
I will later verify whether the session directory cleanup is handled correctly in this scenario.

Summary

In summary, singleton mode runs without prte support by default.
However, if the application calls functions like MPI_Comm_spawn, the runtime will switch from singleton mode to non-singleton mode.

My question / confusion

What I’m still unclear about is:
How should we implement proper “abnormal termination detection” in singleton mode?
From my observations, prte handles signals like SIGINT.
We should avoid installing signal handlers during MPI_Init, since doing so may conflict with signal handling logic in the user application.

@rhc54
Copy link
Contributor

rhc54 commented Jun 2, 2025

Just a couple of comments:

runtime will switch from singleton mode to non-singleton mode

Not exactly, or at least not in a way that impacts this discussion. The prte that was spun off to support the singleton for comm_spawn has no idea that the original process created a backing file in /dev (or some other location), nor does it have knowledge of or access to the session directory created by the singleton at startup. So it cannot cleanup those things. I suppose you could modify the dpm code to have it "register" the areas for cleanup - might be a reasonable solution. Pretty simple PMIx call.

How should we implement proper “abnormal termination detection” in singleton mode?

I agree that having the MPI library trap the signal is probably not a good idea. I think you could argue that the session directory is (in this circumstance) the responsibility of the user. Greater concern is the backing file when placed in a non-obvious location like /dev.

What I'd suggest is modifying the btl/sm component to disqualify itself when in a singleton so the backing file never gets created in the first place. After all, there is nobody to communicate with over that channel. This is true even if they call comm_spawn. So there really isn't any point in creating this backing file, and no obvious need for the sm component itself in this case.

@xbw22109 xbw22109 force-pushed the fix-singletons-cleanup branch from 4bfc636 to 40ee5d2 Compare June 8, 2025 11:30
@xbw22109
Copy link
Contributor Author

xbw22109 commented Jun 8, 2025

What I'd suggest is modifying the btl/sm component to disqualify itself when in a singleton so the backing file never gets created in the first place. After all, there is nobody to communicate with over that channel. This is true even if they call comm_spawn. So there really isn't any point in creating this backing file, and no obvious need for the sm component itself in this case.

Thanks @rhc54, that's a very good point.

I spent some time reading the code and was pleasantly surprised to find that btl/sm has similar logic.
ompi/opal/mca/btl/sm/btl_sm_component.c : mca_btl_sm_component_init around line 313.

    /* disable if there are no local peers */
    if (0 == MCA_BTL_SM_NUM_LOCAL_PEERS) {
        BTL_VERBOSE(("No peers to communicate with. Disabling sm."));
        return NULL;
    }

However, in ompi/ompi/runtime/ompi_rte.c around line 888, opal_process_info.num_local_peers was set to 1, not 0, for singleton.
After the change, not only does btl/sm get automatically closed in singleton mode, but btl/smcuda will also be automatically closed. This is perfectly reasonable.

btl/sm segment file in /dev/shm will never be created in singleton mode now.

@xbw22109 xbw22109 force-pushed the fix-singletons-cleanup branch from 40ee5d2 to 7e18deb Compare June 8, 2025 11:46
@xbw22109
Copy link
Contributor Author

xbw22109 commented Jun 8, 2025

Let me restate the purpose of this PR:

In singleton mode, directory cleaning needs to be done by the ompi library.
However, there are issues in the current implementation that prevent complete cleanup of the directory.

This commit fixes some of these issues.

  1. btl/sm does not unlink its segment file. We never noticed this in non-singleton mode because pmix cleaned it up for us.

After fixing this, we can clean up the segment file created by sm in /dev/shm (when a singleton terminates normally).

  2. Modified the singleton session directory structure and enabled recursive deletion.

After this, we can clean up the session dir (when a singleton terminates normally).

  3. Fixed a bug: the local peer number of a singleton should be 0, not 1.

After this, the btl/sm and btl/smcuda components will return NULL during their init process and will be automatically closed (same as mpirun -n 1 ./a.out).
The btl/sm segment file in /dev/shm will never be created in singleton mode now!

Currently, only the session directory from a singleton process that terminates abnormally fails to be cleaned up.

@rhc54
Copy link
Contributor

rhc54 commented Jun 8, 2025

Just to be clear:

#define PMIX_LOCAL_SIZE   "pmix.local.size"       // (uint32_t) #procs in this job on this node

The value returned in ompi_rte.c includes the proc itself, and thus would be 1 for a singleton. Only reason it matters is when a non-singleton proc is alone on a node - in which case, the value would be 1 and it still makes no sense to include the sm components.

All depends on how OMPI wants to interpret its internal "num_local_peers" variable. 🤷‍♂️

@xbw22109 xbw22109 force-pushed the fix-singletons-cleanup branch from 7e18deb to ae35d75 Compare June 8, 2025 12:09
@xbw22109
Copy link
Contributor Author

xbw22109 commented Jun 8, 2025

I forgot the sign-off-by. Nothing else changed.

@xbw22109
Copy link
Contributor Author

xbw22109 commented Jun 8, 2025

The value returned in ompi_rte.c includes the proc itself, and thus would be 1 for a singleton. Only reason it matters is when a non-singleton proc is alone on a node - in which case, the value would be 1 and it still makes no sense to include the sm components.
All depends on how OMPI wants to interpret its internal "num_local_peers" variable. 🤷‍♂️

    /* get the number of local peers - required for wireup of
     * shared memory BTL, defaults to local node */
    OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCAL_SIZE,
                                   &pname, &u32ptr, PMIX_UINT32);
    if (PMIX_SUCCESS == rc) {
        opal_process_info.num_local_peers = u32 - 1;  // want number besides ourselves
    }

It appears that OMPI consistently stores the value decremented by one, which is why singleton mode is represented as 0.

@xbw22109 xbw22109 marked this pull request as ready for review June 8, 2025 12:19
@rhc54
Copy link
Contributor

rhc54 commented Jun 11, 2025

One other little refinement you could consider: if the singleton does a comm_spawn, it will start a "prte" to shepherd the launch of the child job. Down towards the bottom of ompi/dpm/dpm.c, you could use PMIx_Job_control to register the singleton's session directory for cleanup by the "prte". Then, if the singleton abnormally terminates, the "prte" will clean it up on its way out.

Only helps for the case where a comm_spawn was done - but that might prove to cover a majority of singleton executions.

@xbw22109
Copy link
Contributor Author

I agree with your opinion.

Additionally, I am wondering who might be using the session dir in singleton mode? Is it possible to postpone session directory creation to MPI_Comm_spawn?

Perhaps some work needs to be done to ensure that components do not generate files in singleton mode. I am not very familiar with how session dir works in PMIX and PRRTE. Is it possible to not use session dir at all during singleton execution (before spawn)?

@rhc54
Copy link
Contributor

rhc54 commented Jun 11, 2025

The only things going into that directory tree (prior to spawn) would be files OMPI puts into it. I can think of just two things that would be possibilities: (a) shared memory backing files, and (b) opal_output that had been directed to go to files. You're taking care of the first. IIRC, there is an MCA param that controls the second, so maybe that could trigger the session directory creation as well as spawn?

PMIx only creates session directories if the host is a server, which the MPI app is definitely not - so you won't see anything from PMIx here. PRRTE isn't involved in the MPI app, so nothing to be concerned about there - daemons will take care of themselves.

Only in singleton mode does directory cleanup need to be done
by the program itself.
There are problems in these parts of the code that cause
the directory not to be cleaned.
This commit fixes *some of* these issues.

1. btl/sm will not unlink its segment file. We never noticed this
in non-singleton mode because pmix cleaned it up for us.

After this, we can clean up the segment file created by sm in
/dev/shm (when a singleton terminates normally).

2. Modified the singleton session directory structure and enabled
recursive deletion.

After this, we can clean up the session dir (when a singleton
terminates normally).

3. Fix a bug - local peer number of a singleton should be 0, not 1.

After this, the btl/sm and btl/smcuda components will return NULL
during their init process and will be automatically closed.
btl/sm segment file in /dev/shm will never be created in singleton mode
now.

4. If the singleton does a comm_spawn, register the singleton's session
directory for cleanup by the "prte".

Signed-off-by: xbw <78337767+xbw22109@users.noreply.github.com>
@xbw22109 xbw22109 force-pushed the fix-singletons-cleanup branch from ae35d75 to 0aab3e9 Compare June 13, 2025 07:57
@xbw22109
Copy link
Contributor Author

One other little refinement you could consider: if the singleton does a comm_spawn, it will start a "prte" to shepherd the launch of the child job. Down towards the bottom of ompi/dpm/dpm.c, you could use PMIx_Job_control to register the singleton's session directory for cleanup by the "prte". Then, if the singleton abnormally terminates, the "prte" will clean it up on its way out.
Only helps for the case where a comm_spawn was done - but that might prove to cover a majority of singleton executions.

Registered the session directory for cleanup after PRTE startup.

@xbw22109
Copy link
Contributor Author

Just leaving some notes:

I checked OMPI’s usage of the session directory — please note that I might have missed some cases.

The following components may use the job session directory:

  • btl/sm: will not open in singleton mode (disqualifies itself).
  • btl/smcuda: will not open in singleton mode (disqualifies itself).
  • sharedfp/sm: I think it will open in singleton mode.
  • coll/xhc: I think it will not open in singleton mode (disqualifies itself).
  • btl/usnic: I'm not sure; when the corresponding device exists, I'm not certain whether it will create files in singleton mode.

The following components may use the proc session directory:

  • osc/rdma, osc/sm, osc/ucx: it looks like in singleton mode these will go through osc/rdma → btl/self, so no files should be created — but I’m not fully certain.
  • vprotocol: not sure.

Using singleton mode can be helpful for checking errors that are unrelated to MPI. I think abnormal termination in singleton mode is also quite common.

Despite the many challenges, I think it would be better to avoid creating top_session_dir and job_session_dir in singleton mode, and to output necessary files (such as opal_output) to a specified location — in my opinion, even cwd would be preferable to /tmp.

Once we decide not to create top_session_dir and job_session_dir during OMPI startup in singleton mode, it implies the following:

  • we must ensure that the components mentioned above do not attempt to create files in singleton mode, as this could cause the application to crash.
  • In addition, any components added in the future must be clearly aware that if they intend to generate files, they must handle the singleton mode case explicitly.

These checks and adjustments would involve a fairly wide range of code, and this is not something I intend to implement in this PR. The points above are just ideas for discussion.
