Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Use process group signal handling to reach past the shell #1126

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

twhitehead
Copy link

Spawn processes is their own process group and then signal the process group intstead of just the process we spawned to ensure the signals also reach any subprocesses started by the process we spawned (which is always the case as we are wrapping all our processes spawning with "/bin/sh -c ...").

This make sending SIGINT work again, so revert back to that instead of SIGKILL. The reason SIGINT wasn't working was the shell logic is that this is expected to be sent to the process group, so it doesn't need to pass it along or anything. See the bash signals documentation page for more details

https://www.gnu.org/software/bash/manual/html_node/Signals.html

Description
Target release

It doesn't matter to me.

Type of contribution
  • changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
  • changes to a module not authored by you
  • new module contribution or edit of a module authored by you
Copyright
  • I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.

  • the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.

Tests
  • I added a new regtest or modified an existing regtest to validate my changes.
  • I verified that all regtests are passed successfully on GitHub Actions.

Spawn processes is their own process group and then signal the
process group intstead of just the process we spawned to ensure
the signals also reach any subprocesses started by the process we
spawned (which is always the case as we are wrapping all our
processes spawning with "/bin/sh -c ...").

This make sending SIGINT work again, so revert back to that instead
of SIGKILL. The reason SIGINT wasn't working was the shell logic
is that this is expected to be sent to the process group, so it
doesn't need to pass it along or anything. See the bash signals
documentation page for more details

https://www.gnu.org/software/bash/manual/html_node/Signals.html
@GiovanniBussi
Copy link
Member

@twhitehead thanks for your contribution!

Can you please also write which is the problem that was solved in this way? I wrote the code originally but I do not remember problems with SIGKILL. Actually, the inline doc says that:

    // the destructor implies we do not need the subprocess anymore, so SIGKILL
    // is the fastest exit.
    // if we want to gracefully kill the process with a delay, it would be cleaner
    // to have another member function

Is there a reason why SIGINT is better?

Then a related note. There is a functionality that is currently off by default, namely the stop/cont option (you need to set an env var to enable it). This would allow in principle to suspend a process and resume it later, and I recall was causing problem with MPI in some settings. Is this PR addressing perhaps that issue as well?

Thanks

@Iximiel
Copy link
Member

Iximiel commented Oct 10, 2024

Hi @twhitehead
I need to reword your PR message to check if I understood it:

  • you use setpgid(0,0) to set the child process group id (pgid) the same as its pid
  • then on the various kill(-pid,signal) calls you use the "negative pid" so that kill will pass the signal to all of the processes that have the same gpid of the child process?

In this way every subprocess spawned by the child gets the same signal.

Is that right?

@twhitehead
Copy link
Author

twhitehead commented Oct 10, 2024

@Iximiel that is exactly correct.

The issue with the way it currently is is only the /bin/sh wrapper command gets the signals.

@twhitehead
Copy link
Author

@GiovanniBussi I don't actually know if SIGINT is better than SIGKILL or not for you application.

SIGINT is what the terminal sends a program when you press CTRL+C. Technically this isn't telling the program to terminate though. Rather it saying to stop what it is currently doing. In a REPL, for example, SIGINT does not generally terminates the program, but rather aborts the current evaluate loop and returns to waiting for more input. Python actually throws a KeyboardInterrupt exception upon receiving a SIGINT, to make it easy to implement a limited scope abort action. MDAnalysis does not appear to catch, and it does terminate upon receiving SIGINT.

The main thing about SIGINT is it doesn't immediately kill the process. Rather the the process can catch it and handle it gracefully. By comparison SIGKILL is instance death with no chance at a cleanup. I used SIGINT as that is what the pior version used. Probably what you actually want is SIGTERM. This means quit, but, unlike SIGKILL, it can be captured so the program can quite gracefully. An ideal implementation would send a SIGTERM wait a bit, and then follow it up with a SIGKILL in the event the program does not exit gracefully on its own within a reasonable timeframe (maybe 10-30s).

@twhitehead
Copy link
Author

To be clear, I don't actually use PLUMMED myself. Rather I am a technical person who supports Canadian researchers wanting to running various application on the Canadian super computers. This patch came out of looking into issues a Canadian research was having running the prior version of PLUMMED as a LAMMPS plugin on one of our clusters.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants