Cleanup created containers on receptor work cancel
#439
Comments
Replaced instances of the word "pod" with "container" to avoid confusion. |
I am not sure about passing the pid directory via a command line argument like this. It seems like it would require some sort of templating / variable substitution in the receptor config. It may be less intrusive to inject an environment variable like |
That would have to be unique for each work unit, right? That's the weird thing to me. Otherwise, receptor may have multiple work units, some running, and some not. It needs to know which pid file came from which work unit, because it only uses this pidfile when it gets the cancel command for some specific work unit. |
We can't reasonably start on a change to ansible-runner right now, because we would be throwing around speculation about what receptor would provide.
receptor
With the above proposal, we want receptor to move first, establishing (in a PoC) the environment variable that will be set, and instructions about how to write a pid file in a way that will map the pid file to the work unit id. Might the code need to create the directory? Will receptor clean up old directories?
runner
Then we can develop a PoC inside of runner that will read that environment variable and pass an option to
We may need to confirm these options work the same in other podman versions.
awx
Or could we just trash both of those ideas and pass the new pidfile option(s) in
Then AWX would have to do some extra cleanup (when?). |
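As a rough illustration of the runner side of this idea, the sketch below reads a per-work-unit directory from an environment variable and turns it into the `--conmon-pidfile` option mentioned in the proposal. The variable name `RECEPTOR_PIDFILE_DIR` and the helper `podman_pidfile_args` are hypothetical; the thread does not settle on a name, and this is not ansible-runner's actual code.

```python
import os
from pathlib import Path

# Hypothetical variable name: the discussion does not settle on one.
PIDFILE_DIR_ENV = "RECEPTOR_PIDFILE_DIR"


def podman_pidfile_args(environ=os.environ):
    """Return extra `podman run` arguments, or [] when the variable is unset."""
    pid_dir = environ.get(PIDFILE_DIR_ENV)
    if not pid_dir:
        return []
    path = Path(pid_dir)
    # Assumption: the runner creates the directory if receptor has not.
    path.mkdir(parents=True, exist_ok=True)
    return [f"--conmon-pidfile={path / 'conmon.pid'}"]
```

Because receptor would set the variable separately for each work unit, the pid file lands in a directory that already identifies the unit, which addresses the uniqueness concern raised above.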
I don't think this is a problem anymore. @AlanCoding says we have tests for this. |
Background
Stopping the underlying container when `receptor work cancel` is run is proving harder than we thought because the underlying container runs in a separate process tree.
Let us look at the two process trees and reason about how, and if, we can reliably kill the disconnected podman tree from the receptor tree.
Below we have `sleep.yml` running on an execution node.
Note that we effectively have 2 process trees here. Let's give the two process trees a name.
POSIX Processes Notes
SIGHUP to all processes in TPGID
Receptor Tree
Podman Tree
Playing Around
Let's try sending signals and see what happens.
Process Signals
SIGINT
ansible-runner-worker
podman run parent
podman run child
SIGTERM
ansible-runner-worker
podman run parent
SIGTERM
podman run child
Alright, we found a case where the `podman run` process will clean up the podman tree. However, `podman run child` isn't really known to receptor nor to `ansible-runner worker`. Can we send a signal higher up in the process tree, to a process that receptor or ansible-runner knows about?
`podman run parent` and `podman run child` are in the same process group. `ansible-runner` may not know about `podman run child`, but it knows about `podman run parent`, so maybe it could send a signal to the process group.
Note: Seeing the below error sometimes. This may be a podman bug.
SIGTERM
-podman run parent
Nope, that didn't work. Hmm, maybe because of the error discovered above?
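For reference, the process-group experiment above can be sketched in Python. This is only an illustration of the mechanism: the child process here is a stand-in for `podman run parent`, and the helper name `terminate_process_group` is made up for this example.

```python
import os
import signal
import subprocess
import sys


def terminate_process_group(pid, sig=signal.SIGTERM):
    """Send `sig` to every process in the group that `pid` belongs to."""
    pgid = os.getpgid(pid)
    os.killpg(pgid, sig)  # same effect as kill(-pgid, sig)
    return pgid


# Stand-in for `podman run parent`, started in its own session and group
# so that signalling the group does not hit this script itself.
parent = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
terminate_process_group(parent.pid)
exit_code = parent.wait(timeout=10)  # negative value: terminated by a signal
```

A group signal only reaches processes that are still members of that group, so it cannot by itself reach a tree that has been detached from it, which is the situation described in this issue.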
Next theory,
Thoughts
Even if we could get the right signal to be sent to the right process in the receptor tree, there is the edge case where all or part of the receptor tree could be killed with `SIGKILL` for a number of reasons (OOM being the most common). There are no kernel-level facilities that I can think of that we can lean on to ensure that when a process exits in the receptor tree, the podman process tree exits also. Therefore, the only option that I see is to rely on Receptor to maintain some sort of state about the podman process tree and to clean up the podman process tree from Receptor.
Proposal
Receptor will pass the receptor work directory to `ansible-runner worker`.
`ansible-runner worker --container-pid-file-dir=/tmp/receptor/awx_1/KcipiXHj/pidfiles/`
Ansible runner worker will then forward to podman the location to write the pid file via `podman run --conmon-pidfile=/tmp/receptor/awx_1/KcipiXHj/pidfiles/conmon.pid`. Receptor will be responsible for sending `SIGKILL`, when `receptor work cancel KcipiXHj` is issued, to any pid in the `/tmp/receptor/awx_1/KcipiXHj/pidfiles/` dir.
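The cancel-time cleanup could look roughly like the sketch below. It is written in Python purely for illustration and is not Receptor's implementation; the function name `kill_pids_in_dir` and the choice to skip malformed files are assumptions.

```python
import os
import signal
from pathlib import Path


def kill_pids_in_dir(pid_dir, sig=signal.SIGKILL):
    """Send `sig` to every pid recorded in a file under `pid_dir`.

    Returns the list of pids that were actually signalled.
    """
    signalled = []
    for pid_file in sorted(Path(pid_dir).iterdir()):
        if not pid_file.is_file():
            continue
        try:
            pid = int(pid_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed pid file (assumed: skip it)
        if pid <= 1:
            continue  # never signal init, pid 0, or a negative (group) id
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # process already exited
        signalled.append(pid)
    return signalled
```

On cancel of work unit `KcipiXHj`, this would be called with `/tmp/receptor/awx_1/KcipiXHj/pidfiles/`. One caveat of any pid-file scheme is that a stale file can name a pid the kernel has since reused, so removing the directory once the work unit finishes matters.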