Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Stuck at "Cleaning... X files" for long periods of time #2444

Open
user062 opened this issue May 7, 2024 · 5 comments · May be fixed by #2550
Open

Stuck at "Cleaning... X files" for long periods of time #2444

user062 opened this issue May 7, 2024 · 5 comments · May be fixed by #2550

Comments

@user062
Copy link

user062 commented May 7, 2024

Ninja version:1.12.0
OS: Archlinux

Whenever I run ninja to build a project, it gets stuck at Cleaning... X files for a long period of time (sometimes up to 5 minutes) even when there are no files to clean.
While it is stuck, there is 0 CPU usage, so it's not doing anything.

htop shows that a spawned process is in a zombie state waiting to be reaped:
ninja_meson_problem

Here is a related meson report: mesonbuild/meson#10645

I appreciate any help debugging this further.

Thanks a lot.

@scivision
Copy link
Contributor

Could you give an example project where this happens e.g. CMake or Meson etc.

@user062
Copy link
Author

user062 commented May 17, 2024

@scivision, This happens when building GIMP.

I investigated a bit. I found that it's waiting on this ppoll call:
subprocess-posix.cc:267

@ntrrgc
Copy link

ntrrgc commented Jan 8, 2025

This is occasionally also hitting GStreamer (also using meson).

Attaching a debugger I could confirm ninja was waiting inside ppoll() for a events on a pipe (the only pipe open in that process, arguably the one given to meson, fd 4, listening to POLLIN | POLLPRI).

The weird thing is that doing ps -n -P |grep <pipe inode> only returned the ninja process, meaning that the other end of the pipe had already been closed. meson was a zombie (<defunt>) process at that point.

When I detached the debugger ninja unstuck itself.

There is still the possiblity that meson shared that pipe with another process I didn't catch and that process terminated somewhere between the time I attached the debugger and inspected for processes that had that same pipe open. Otherwise this would hint to a kernel bug, because you would expect ppoll() to return and notify POLLHUP if the other end of the pipe was closed and all data has been read (and if not all data has been read, ppoll() should still return with a POLLIN event while data is available).

@ntrrgc
Copy link

ntrrgc commented Jan 8, 2025

I managed to reproduce the bug a second time, and this time I caught who was keeping the write-side of the pipe open.

$ ls /proc/801351/fd
total 0
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 0 -> /dev/pts/1
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 1 -> /dev/pts/1
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 2 -> /dev/pts/1
lr-x------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 3 -> 'pipe:[2292603]'
$ lsof -n -P |grep 2292603
ninja     801351                           ntrrgc    3r     FIFO               0,15       0t0   2292603 pipe
sccache   801474                           ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801475 sccache            ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801476 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801477 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801478 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
...

sccache is a compiler wrapper for cached compilation artifacts and remote compilation. It replaces your compiler by a wrapper, which is likely how it got executed by meson, and that wrapper spawns a sccache server if none is running. Quoting its README:

The sccache command will spawn a server process if one is not already running

Also:

You can run sccache --stop-server to terminate the server. It will also terminate after (by default) 10 minutes of inactivity.

Which explains how after a rather long time ninja got unstuck. The sccache server exited due to inactivity.

What I haven't figured out yet is how sccache ends up holding up to the pipe. Generally when you daemonize a subprocess you want to close all file descriptors you don't need, and you can see in the utils::daemonize() function of sccache that it does try to at least close stdin, stdout and stderr and replace them with /dev/null. However, that by itself wouldn't take care of higher file descriptors, and we can see in the previous output that the pipe existed on fd=4.

@ntrrgc
Copy link

ntrrgc commented Jan 8, 2025

I had a look at how ninja creates subprocesses in POSIX systems.

It has two different modes, depending on whether the console is shared to the subprocess or not.

If the console is not to be shared to the subprocess, a pipe is created and given to the subprocess as both stdout and stderr. Care is also taken to close the original file descriptors of both ends of the pipe. This all makes sense.

However, if the console is to be shared to the subprocess, the pipe is still created and intentionally left open for the subprocess, with the intention of this being used to detect when the subprocess exited (!).

// In the console case, output_pipe is still inherited by the child and
// closed when the subprocess finishes, which then notifies ninja.

Here is a table of the file descriptors, of the parent process before the posix_spawn() and of the subprocesses in the two different modes:

fd parent process (ninja) subprocess (no-console) subprocess (shared console)
0 stdin /dev/null stdin
1 stdout write-side of pipe stdout
2 stderr write-side of pipe stderr
3 read-side of pipe (with CLOEXEC) closed closed
4 write-side of pipe (closed after spawn) closed write-side of pipe

Relying on EOF from a pipe given to a subprocess in a non-standard fd without its knowledge is not a very orthodox way of detecting when a subprocess has exited and can lead to strange situations like the one here with sccache.

The same problem can occur if any subprocess (or subprocess of the subprocess etc.) tries to spawn any daemon and its code doesn't take the extra care to use something like close_range() to close all file descriptors it doesn't know of. And if a foreground process did that for whatever reason, it would be wrongly signalled to ninja as if the process had finished.

Normally, instead of this you would rely on something like the SIGCHLD signal (which would still make ppoll() return) -- or pidfd if restricted to recent Linux.

ntrrgc added a commit to ntrrgc/sccache that referenced this issue Jan 8, 2025
Otherwise, when the compiler wrapper spawns the sccache server, the
server may end up with unintended file descriptors, which can lead to
unexpected problems.

This is particularly problematic if any of those file descriptors that
accidentally end up in the server process is a pipe, as the pipe will
only be closed when all the processes with that file descriptor close it
or exit.

This was causing sccache to hang ninja, as ninja uses the EOF of a pipe
given to the subprocess to detect when that subprocess has exited:
ninja-build/ninja#2444 (comment)

This patch adds a dependency on the
[close_fds](https://crates.io/crates/close_fds) crate, which
automatically chooses an appropriate mechanism to close a range of file
descriptors. On Linux 5.9+ that mechanism will be libc::close_range().

Fixes mozilla#2313
ntrrgc added a commit to ntrrgc/sccache that referenced this issue Jan 9, 2025
Otherwise, when the compiler wrapper spawns the sccache server, the
server may end up with unintended file descriptors, which can lead to
unexpected problems.

This is particularly problematic if any of those file descriptors that
accidentally end up in the server process is a pipe, as the pipe will
only be closed when all the processes with that file descriptor close it
or exit.

This was causing sccache to hang ninja, as ninja uses the EOF of a pipe
given to the subprocess to detect when that subprocess has exited:
ninja-build/ninja#2444 (comment)

This patch adds a dependency on the
[close_fds](https://crates.io/crates/close_fds) crate, which
automatically chooses an appropriate mechanism to close a range of file
descriptors. On Linux 5.9+ that mechanism will be libc::close_range().

Fixes mozilla#2313
ntrrgc added a commit to ntrrgc/ninja that referenced this issue Jan 10, 2025
For background, see
ninja-build#2444 (comment).

In short, when running subprocesses that share the terminal, ninja
intentionally leaves a pipe open before exec() so that it can use EOF
from that pipe to detect when the subprocess has exited.

That mechanism is problematic: If the subprocess ends up spawning
background processes (e.g. sccache), those would also inherit the pipe
by default. In that case, ninja may not detect process termination until
all background processes have quitted.

This patch makes it so that, instead of propagating the pipe file
descriptor to the subprocess without its knowledge, ninja keeps both
ends of the pipe to itself, and uses a SIGCHLD handler to close the
write end of the pipe when the subprocess has truly exited.

During testing I found Subprocess::Finish() lacked EINTR retrying, which
made ninja crash prematurely. This patch also fixes that.

Fixes ninja-build#2444
@ntrrgc ntrrgc linked a pull request Jan 10, 2025 that will close this issue
ntrrgc added a commit to ntrrgc/ninja that referenced this issue Jan 10, 2025
For background, see
ninja-build#2444 (comment).

In short, when running subprocesses that share the terminal, ninja
intentionally leaves a pipe open before exec() so that it can use EOF
from that pipe to detect when the subprocess has exited.

That mechanism is problematic: If the subprocess ends up spawning
background processes (e.g. sccache), those would also inherit the pipe
by default. In that case, ninja may not detect process termination until
all background processes have quitted.

This patch makes it so that, instead of propagating the pipe file
descriptor to the subprocess without its knowledge, ninja keeps both
ends of the pipe to itself, and uses a SIGCHLD handler to close the
write end of the pipe when the subprocess has truly exited.

During testing I found Subprocess::Finish() lacked EINTR retrying, which
made ninja crash prematurely. This patch also fixes that.

Fixes ninja-build#2444
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants