Exploring how non-init processes run and terminate (or not) inside a Docker container when running as PID 1.
When the Linux kernel boots it starts one process in userland called init that will always get PID 1. The job of this init process is to:
- start other processes
- be the ancestor (direct or indirect) of all processes
- adopt orphaned processes
- terminate all processes on shutdown
PID 1 is protected in Linux:
- It will never receive a signal for which it has not explicitly installed a handler. For example, if you send `SIGTERM` to a PID 1 process that has not installed a handler for this signal, the kernel will not deliver it, so the default action (terminating the process) never happens.
- It will never receive `SIGKILL` or `SIGSTOP` unless the signal comes from an ancestor namespace.
So this means that you can't terminate or kill a process with PID 1 the way you would any other process.
At the time of writing, the latest stable version of Linux is 6.1, so we will examine the source code of this version.
Our focus is on understanding how the init process obtains the `SIGNAL_UNKILLABLE` flag, which determines which processes can ignore signals.
The following steps describe how PID 1 in any namespace gets the `SIGNAL_UNKILLABLE` flag:
- Linux starts the init process in `/init/main.c` in the `rest_init` function.
- We are interested in line 694, which passes the `kernel_init` function (the function that starts the init process) as an argument to the `user_mode_thread` function from `/kernel/fork.c`, which starts a process in user mode.
- `user_mode_thread` in line 2747 invokes the `kernel_clone` function.
- `kernel_clone` in line 2671 invokes the `copy_process` function.
- `copy_process` is a big function, but in line 2443 it sets the `SIGNAL_UNKILLABLE` flag for the process that passes the `is_child_reaper` check.
- `is_child_reaper` is a function in `include/linux/pid.h` that checks if the process has PID 1 in its own namespace.
flowchart LR
subgraph g1 ["/init/main.c"]
id1(rest_init)
end
subgraph g2 ["/kernel/fork.c"]
direction TB
id2(user_mode_thread) --> id3(kernel_clone) --> id4(copy_process) --> |is_child_reaper| id5("set flag SIGNAL_UNKILLABLE")
end
id1 --> |kernel_init| id2
style id5 fill:#e47c7c
style g1 fill:none
style g2 fill:none
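The kernel makes this decision internally, but the result can be observed from user space. The sketch below is only an approximation of the `is_child_reaper` idea, not the kernel's implementation: it assumes a Linux kernel that exposes the `NSpid` field in `/proc/<pid>/status` (4.1 or newer), and the helper names `parse_nspid`, `is_pid1_in_own_namespace` and `read_own_status` are made up for illustration.

```python
from pathlib import Path


def parse_nspid(status_text):
    """Return the PIDs listed on the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field not present (assumption: kernel older than 4.1)


def is_pid1_in_own_namespace(status_text):
    """True if the innermost PID namespace sees the process as PID 1."""
    pids = parse_nspid(status_text)
    return bool(pids) and pids[-1] == 1


def read_own_status():
    """Linux only: return the status file of the calling process."""
    return Path("/proc/self/status").read_text()
```

On a Linux host, `is_pid1_in_own_namespace(read_own_status())` should be false for an ordinary shell process and true for the first process of a container.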
Now that PID 1 has the `SIGNAL_UNKILLABLE` flag, let's see how signals behave when this flag is set. All functions are contained in one file, `/kernel/signal.c`.
- We start with the `send_signal_locked` function. It sets the bool `force` variable. Note that in lines 1247-1251 it sets `force` to true if the signal comes from an ancestor namespace. Then it calls the `__send_signal_locked` function.
- The `__send_signal_locked` function in line 1090 will not send the signal to the process if the `prepare_signal` function returns false.
- The `prepare_signal` function in line 970 returns false if the `sig_ignored` function returns true.
- `sig_ignored` returns true if the `sig_task_ignored` function returns true.
- The `sig_task_ignored` function in lines 89-91 returns true if three conditions are met:
  - `unlikely(t->signal->flags & SIGNAL_UNKILLABLE)` - the process has the `SIGNAL_UNKILLABLE` flag set.
  - `handler == SIG_DFL` - the handler for the signal is the default handler, meaning the process didn't register a handler.
  - `!(force && sig_kernel_only(sig))` - the inner expression `force && sig_kernel_only(sig)` is false, which is the case when either:
    - the `force` variable is set to false, or
    - `sig_kernel_only(sig)` is false, meaning the signal is not `SIGKILL` or `SIGSTOP`.
- So there are multiple outcomes:
  - The signal is not `SIGKILL` or `SIGSTOP`, so the `sig_kernel_only(sig)` check returns false and the `force` variable is irrelevant: the inner expression is false and the signal is ignored.
  - The signal is `SIGKILL` or `SIGSTOP`, so `force` is relevant. If it is true, meaning that the signal comes from an ancestor namespace, the inner expression is true and the signal is sent. If it is false, meaning that the signal comes from the current namespace, the inner expression is false and the signal is ignored.
To summarize, a signal sent to PID 1 will be ignored if all of the following hold:
- The process has the `SIGNAL_UNKILLABLE` flag set, which is the case for any process with PID 1.
- The process did not create a handler for the signal.
- The signal is something other than `SIGKILL` or `SIGSTOP`, or it is sent from the process's own namespace.
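The three conditions can be written as a small truth function. This is a Python model of the single `if` in `sig_task_ignored` discussed above, not kernel code: the parameter names are illustrative, `has_handler=False` stands for `handler == SIG_DFL`, and the other checks in that function are deliberately left out.

```python
import signal

# Signals that can never be caught or ignored; mirrors sig_kernel_only().
KERNEL_ONLY = {signal.SIGKILL, signal.SIGSTOP}


def sig_task_ignored(unkillable, has_handler, sig, force):
    """Model of the SIGNAL_UNKILLABLE check: True means the signal is dropped.

    unkillable  - the target has the SIGNAL_UNKILLABLE flag (PID 1 in its namespace)
    has_handler - the target registered its own handler for `sig`
    force       - the signal comes from an ancestor namespace
    """
    kernel_only = sig in KERNEL_ONLY
    return unkillable and not has_handler and not (force and kernel_only)
```

For example, `sig_task_ignored(True, False, signal.SIGTERM, True)` is true (a `SIGTERM` without a handler is dropped even when it comes from the host), while `sig_task_ignored(True, False, signal.SIGKILL, True)` is false (a `SIGKILL` from an ancestor namespace gets through).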
Docker runs processes in namespaces, and the first process created gets PID 1 in its own namespace, so it is subject to the special treatment described above. When the command `docker stop` is run, it sends `SIGTERM` to PID 1 inside the container. If that process doesn't have a handler for `SIGTERM`, it will not receive the signal. Docker waits for 10 seconds (the default timeout) and then sends `SIGKILL`. Since that signal comes from an ancestor namespace, it is delivered to PID 1 inside the container, and the process is "violently" killed.
Even if the process handles `SIGTERM`, it may have spawned children, and a parent process should propagate `SIGTERM` to its children. So the process with PID 1 inside the container has to take on the role of the init process on top of its core functions.
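A minimal sketch of such propagation in Python, assuming the parent keeps its children as `subprocess.Popen` objects. The helper names `terminate_children` and `install_sigterm_forwarder` are hypothetical and are not taken from the scripts in this repository.

```python
import signal
import subprocess
import sys


def terminate_children(children, sig=signal.SIGTERM, timeout=5.0):
    """Forward `sig` to every running child, then wait for all of them.

    Returns the exit codes; a child killed by a signal reports the
    negative signal number.
    """
    for child in children:
        if child.poll() is None:  # still running
            child.send_signal(sig)
    return [child.wait(timeout=timeout) for child in children]


def install_sigterm_forwarder(children):
    """Register a SIGTERM handler that forwards the signal, then exits."""
    def handler(signum, frame):
        terminate_children(children, signum)
        sys.exit(0)

    signal.signal(signal.SIGTERM, handler)
```

A process meant to run as PID 1 would call `install_sigterm_forwarder(children)` once after starting its children. Because a handler is now registered, `SIGTERM` is no longer dropped by the kernel.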
Not every container needs an init process. If your process handles `SIGTERM` (or you don't mind it being killed with `SIGKILL`) and doesn't spawn any children, you can do without one. In most other cases you probably want some kind of init process. For example, you can use tini with the `-g` argument, which makes it send signals such as `SIGTERM` to the child's whole process group, so the processes spawned inside the container receive them as well.
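Signalling a whole process group, which is the idea behind the `-g` option, can be sketched in Python with `os.killpg`. This only illustrates the mechanism and says nothing about how tini is implemented; `signal_group` is a made-up helper name.

```python
import os
import signal


def signal_group(proc, sig=signal.SIGTERM):
    """Send `sig` to every process in the process group of `proc`.

    `proc` is expected to be a subprocess.Popen object whose process
    leads its own group, e.g. one started with start_new_session=True.
    """
    os.killpg(os.getpgid(proc.pid), sig)
```

A child started with `subprocess.Popen(argv, start_new_session=True)` becomes the leader of a new process group, and the processes it spawns stay in that group unless they move themselves out of it, so a single `signal_group(proc)` call reaches all of them.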
There are some scripts that demonstrate different scenarios of running a process in Docker. Execute `run.sh` with one of these arguments:
- `python_noterm_bare` - Runs a Python script that doesn't handle `SIGTERM` outside Docker, then terminates it. It should terminate without problems.
- `python_term_bare` - Runs a Python script that handles `SIGTERM` outside Docker, then terminates it. It should terminate without problems and log that it received the `SIGTERM` signal.
- `python_noterm_docker_noinit` - Runs a Python script that doesn't handle `SIGTERM` inside Docker without an init process, then terminates it. It should fail to terminate and will be killed after 10 seconds.
- `python_term_docker_noinit` - Runs a Python script that handles `SIGTERM` inside Docker without an init process, then terminates it. It should terminate without problems and log that it received the `SIGTERM` signal.
- `python_noterm_docker_init` - Runs a Python script that doesn't handle `SIGTERM` inside Docker with an init process, then terminates it. It should terminate without problems.
- `python_term_docker_init` - Runs a Python script that handles `SIGTERM` inside Docker with an init process, then terminates it. It should terminate without problems and log that it received the `SIGTERM` signal.
- `multi_python_docker_noinit` - Runs multiple Python scripts (processes) that handle `SIGTERM` inside Docker without an init process, then terminates the container. It should fail to terminate and will be killed after 10 seconds.
- `multi_python_docker_init` - Runs multiple Python scripts (processes) that handle `SIGTERM` inside Docker with an init process, then terminates the container. It should terminate without problems, and each Python process should log that it received the `SIGTERM` signal.