The proc filesystem (procfs) is a virtual filesystem in the Linux kernel that provides information about processes. It is one of the interfaces which follows the “Everything is a file” paradigm. The procfs file system was designed a long time ago, when an average system ran no more than a few hundred processes. At that time it was not a problem to open one or several proc files per each process to get some information. Nowadays, a system can have hundreds of thousands of processes or even more. In this context, the idea to perform per-process open doesn’t look so brilliant. In this article, we are going to explore all aspects of procfs, and find out which ones can be optimized.
The idea to optimize the proc file system came when we found that CRIU, the software to checkpoint/restore processes, spends a significant amount of time reading procfs files. We have seen how a similar problem had been solved for sockets, so we decided to implement something similar to sock-diag but for processes.
We understood how hard it is to change a very mature interface in the kernel, but the biggest surprise was how many kernel developers support the idea of a new interface. They don’t know how a new interface should look like, but they are aware that the existing solution doesn’t work well in critical situations.
We have seen this many times: a server becomes laggy, vmstat shows lots of swap used,
ps ax
needs 10+ seconds to show something, and top
never comes up with any
output.
This article doesn’t propose any specific interface; instead it tries to give more details about the problem, and suggest ways to fix it.
In procfs, each running process is represented by a directory /proc/{PID}. Each such directory contains dozens of files with information about a process, divided into groups. Let’s have a quick tour (note that $$ is a shell built-in variable, expanding to the current process’ PID):
$ ls -F /proc/$$
attr/ exe@ mounts projid_map status
autogroup fd/ mountstats root@ syscall
auxv fdinfo/ net/ sched task/
cgroup gid_map ns/ schedstat timers
clear_refs io numa_maps sessionid timerslack_ns
cmdline limits oom_adj setgroups uid_map
comm loginuid oom_score smaps wchan
coredump_filter map_files/ oom_score_adj smaps_rollup
cpuset maps pagemap stack
cwd@ mem patch_state stat
environ mountinfo personality statm
All these files have different and unique formats. Most are ASCII text, easily readable by a human. Except when it doesn't.
$ cat /proc/$$/stat
24293 (bash) S 21811 24293 24293 34854 24876 4210688 6325 19702 0 10 15 7 33 35 20 0 1 0 47892016 135487488 3388 18446744073709551615 94447405350912 94447406416132 140729719486816 0 0 0 65536 3670020 1266777851 1 0 0 17 2 0 0 0 0 0 94447408516528 94447408563556 94447429677056 140729719494655 140729719494660 140729719494660 140729719496686 0
To understand this set of numbers, one needs to read proc(5) man page or kernel documentation. For example, the field number 2 is the file name of the executable, in parentheses, and the field number 19 is a process’ nice value.
One might think that while this file is not very human-readable, parsing it by a program is fast and trivial, and that is true, unless the executable name contains spaces or, even worse, parentheses, none of which are escaped by the kernel:
$ cp /bin/sleep 'ha ha ))) '
$ ./ha\ ha\ \)\)\)\ 1h
[1] 7397
$ cut -c-50 /proc/7397/stat
7397 (ha ha ))) ) S 3859 7397 3859 34816 7412 1077
Some of /proc/{PID} files have more human readable formats:
$ cat /proc/$$/status | head -n 5
Name: bash
Umask: 0002
State: S (sleeping)
Tgid: 24293
Ngid: 0
The following questions come to mind:
- How often do users look at these files directly?
- How much time it takes for the kernel to encode its internal binary data into text?
- What is the overhead of such pseudo file system?
- How good is this interface for developers of monitoring tools?
- How much time do these tool need to parse text data?
- How often do people have problems caused by bad performance of this interface in critical situations?
It would not be incorrect to say that the majority of users tend to use
tools like ps
or top
rather than inspect raw files from procfs.
To answer the other questions, we need to do a few experiments first. Let’s find out where the kernel spends the time to generate these files.
We need to understand what has to be done to get information about all processes. For that, we have to read contents of the /proc/ directory and find all subdirectories which names are decimal numbers. For each such subdirectory, we need to open a file, read its content and close it.
This results in three system calls that have to be made, one of them creating a file
descriptor, an operation that requires the kernel to allocate some internal
objects. The open()
and close()
system calls do not provide us with any
information, so we can say that the time spent calling those is the overhead
from this interface.
Here is the first experiment: call open()
and close()
for each process, without
reading file contents. In here we open and close every /proc/*/stat
file:
$ time ./task_proc_all --noread stat
tasks: 50290
real 0m0.177s
user 0m0.012s
sys 0m0.162s
Same for /proc/*/#uid
:
$ time ./task_proc_all --noread loginuid
tasks: 50289
real 0m0.176s
user 0m0.026s
sys 0m0.145
It doesn’t matter which file is opened, as its content is generated from the
read()
system call, which we don’t use yet.
Now, let’s take look at the corresponding perf output, to profile kernel functions.
- 92.18% 0.00% task_proc_all [unknown]
- 0x8000
- 64.01% __GI___libc_open
- 50.71% entry_SYSCALL_64_fastpath
- do_sys_open
- 48.63% do_filp_open
- path_openat
- 19.60% link_path_walk
- 14.23% walk_component
- 13.87% lookup_fast
- 7.55% pid_revalidate
4.13% get_pid_task
+ 1.58% security_task_to_inode
1.10% task_dump_owner
3.63% __d_lookup_rcu
+ 3.42% security_inode_permission
+ 14.76% proc_pident_lookup
+ 4.39% d_alloc_parallel
+ 2.93% get_empty_filp
+ 2.43% lookup_fast
+ 0.98% do_dentry_open
2.07% syscall_return_via_sysret
1.60% 0xfffffe000008a01b
0.97% kmem_cache_alloc
0.61% 0xfffffe000008a01e
- 16.45% __getdents64
- 15.11% entry_SYSCALL_64_fastpath
sys_getdents
iterate_dir
- proc_pid_readdir
- 7.18% proc_fill_cache
+ 3.53% d_lookup
1.59% filldir
+ 6.82% next_tgid
+ 0.61% snprintf
- 9.89% __close
+ 4.03% entry_SYSCALL_64_fastpath
0.98% syscall_return_via_sysret
0.85% 0xfffffe000008a01b
0.61% 0xfffffe000008a01e
1.10% syscall_return_via_sysret
Here we can see that the kernel spent roughly 75% of the time to create and destroy file descriptors, and about 16% to list processes.
Now we know the time it takes to open()
and close()
one file per each process,
but we can’t yet say how significant the overhead is in the whole process,
as we lack data for comparison. Let’s try to read those files, choosing the
“most popular” ones. It’s easy to find out which ones -- run ps
or top
under
strace
, see what it does. It appears that both these tools read at least
/proc/{PID}/stat
and /proc/{PID}/status
for each process.
Let’s perform open/read/close on /proc/*/status
. It is one of the bigger files
having a fixed number of fields.
$ time ./task_proc_all status
tasks: 50283
real 0m0.455s
user 0m0.033s
sys 0m0.417s
Perf output:
- 93.84% 0.00% task_proc_all [unknown] [k] 0x0000000000008000
- 0x8000
- 61.20% read
- 53.06% entry_SYSCALL_64_fastpath
- sys_read
- 52.80% vfs_read
- 52.22% __vfs_read
- seq_read
- 50.43% proc_single_show
- 50.38% proc_pid_status
- 11.34% task_mem
+ seq_printf
+ 6.99% seq_printf
- 5.77% seq_put_decimal_ull
1.94% strlen
+ 1.42% num_to_str
- 5.73% cpuset_task_status_allowed
+ seq_printf
- 5.37% render_cap_t
+ 5.31% seq_printf
- 5.25% render_sigset_t
0.84% seq_putc
0.73% __task_pid_nr_ns
+ 0.63% __lock_task_sighand
0.53% hugetlb_report_usage
+ 0.68% _copy_to_user
1.10% number
1.05% seq_put_decimal_ull
0.84% vsnprintf
0.79% format_decode
0.73% syscall_return_via_sysret
0.52% 0xfffffe000003201b
+ 20.95% __GI___libc_open
+ 6.44% __getdents64
+ 4.10% __close
We can see that only about 60% of the time was spent in read()
syscalls. If we
take a look at the detailed profile, we can see that about 45% of the time was spent in
functions like seq_printf
, seq_put_decimal_ull
-- the ones that perform binary to text
conversion. Apparently, encoding binary data into text is not that cheap.
Do we really need a human-readable interface to retrieve this data from
the kernel? How often do we want to see the raw /proc
contents? Why do tools
like top
and ps
have to decode text data back into a binary format?
Let's see how fast is an interface that
- uses a binary format (rather than text);
- does not require a minimum of three system calls and one file descriptor per process.
To much surprise, the first such interface appeared in the Linux kernel back in 2004: [0/2][ANNOUNCE] nproc: netlink access to /proc information
This was an attempt to address the current problems with /proc. In short, it exposed the same information via netlink (implemented for a small subset).
Unfortunately, the community didn’t show much interest in this work. The most recent attempt to introduce something like this has been made only two years ago: [PATCH 0/15] task_diag: add a new interface to get information about processes. Let's see what it's all about.
The task-diag interface is based on the following principles:
- Transactional: write a request, read back a response
- Netlink message format (binary and extendable; same as used by
sock_diag
) - Ability to specify a set of processes to get info about
- Optimal grouping of attributes (any attribute in a group can't affect a response time)
This interface was presented at a few conferences. It was (experimentally)
integrated into pstools
and CRIU, and (to some degree) into the perf tool. All
three experiments shown that performance is always better.
The kernel community expressed some interest in this work. The primary debate was about what transport should be used to transfer data between kernel and userspace. The initial idea to use netlink sockets was declined. Partly it was due to known unresolved issues in netlink code base, and partly due to a belief that the netlink interface was designed only for a network subsystem. It was suggested instead to use a transactional file in procfs, meaning that a user would open a file, write a request into a file descriptor, and then reads a response back from it. As usual, there were people who didn’t like this approach either. Alas, the solution accceptable by everyone has not been found yet.
Now let’s compare the proposed task_diag
with traditional procfs.
The task-diag code has a test tool, which can be used for our experiments.
In the first experiment, we will get process PIDs and credentials. Here is an example of data that a test program reads for each process:
$ ./task_diag_all one -c -p $$
pid 2305 tgid 2305 ppid 2299 sid 2305 pgid 2305 comm bash
uid: 1000 1000 1000 1000
gid: 1000 1000 1000 1000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
Now let’s run it for the same set of processes what we used earlier with procfs.
$ time ./task_diag_all all -c
real 0m0.048s
user 0m0.001s
sys 0m0.046s
It only takes 0.05 seconds to get data enough to show a process tree. In case of procfs, we need 0.177 seconds to merely open and close one file per process, without even reading any data from it!
Here is a perf output:
- 82.24% 0.00% task_diag_all [kernel.vmlinux] [k] entry_SYSCALL_64_fastpath
- entry_SYSCALL_64_fastpath
- 81.84% sys_read
vfs_read
__vfs_read
proc_reg_read
task_diag_read
- taskdiag_dumpit
+ 33.84% next_tgid
13.06% __task_pid_nr_ns
+ 6.63% ptrace_may_access
+ 5.68% from_kuid_munged
- 4.19% __get_task_comm
2.90% strncpy
1.29% _raw_spin_lock
3.03% __nla_reserve
1.73% nla_reserve
+ 1.30% skb_copy_datagram_iter
+ 1.21% from_kgid_munged
1.12% strncpy
Nothing stands out, except for the fact that there are no visible time-consuming functions to be optimized.
Now let’s see how many system calls are needed to get information about all processes:
$ perf trace -s ./task_diag_all all -c -q
Summary of events:
task_diag_all (54326), 185 events, 95.4%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
read 49 40.209 0.002 0.821 4.126 9.50%
mmap 11 0.051 0.003 0.005 0.007 9.94%
mprotect 8 0.047 0.003 0.006 0.009 10.42%
openat 5 0.042 0.005 0.008 0.020 34.86%
munmap 1 0.014 0.014 0.014 0.014 0.00%
fstat 4 0.006 0.001 0.002 0.002 10.47%
access 1 0.006 0.006 0.006 0.006 0.00%
close 4 0.004 0.001 0.001 0.001 2.11%
write 1 0.003 0.003 0.003 0.003 0.00%
rt_sigaction 2 0.003 0.001 0.001 0.002 15.43%
brk 1 0.002 0.002 0.002 0.002 0.00%
prlimit64 1 0.001 0.001 0.001 0.001 0.00%
arch_prctl 1 0.001 0.001 0.001 0.001 0.00%
rt_sigprocmask 1 0.001 0.001 0.001 0.001 0.00%
set_robust_list 1 0.001 0.001 0.001 0.001 0.00%
set_tid_address 1 0.001 0.001 0.001 0.001 0.00%
In case of procfs, we need more than 150000 system calls to get this information, while task_diag requires a bit more than 50.
Let’s take a look at the real workload. For example, we want to show a process tree with command lines for each process. For that, we need to know the command line, the pid, and the parent pid for each process.
In case of task_diag, the test program sends a request to get general information plus command lines for all processes and then reads the requested data back:
$ time ./task_diag_all all --cmdline -q
real 0m0.096s
user 0m0.006s
sys 0m0.090s
In case of procfs, we need to read two files per each process: /proc/pid/status
and /proc/pid/cmdline
.
$ time ./task_proc_all status
tasks: 50278
real 0m0.463s
user 0m0.030s
sys 0m0.427s
$ time ./task_proc_all cmdline
tasks: 50281
real 0m0.270s
user 0m0.028s
sys 0m0.237s
Here we can see that task_diag is about 7 times faster than procfs (0.096 vs 0.27 + 0.46). Usually, the increase of performance even by a few per cent is a good result. Here the effect is much more significant.
Another thing to mention is a number of kernel allocations. This factor is critical when a system is under memory pressure. Let’s compare how many kernel allocations happen for procfs and task-diag.
$ perf trace --event 'kmem:*alloc*' ./task_proc_all status 2>&1 | grep kmem | wc -l
58184
$ perf trace --event 'kmem:*alloc*' ./task_diag_all all -q 2>&1 | grep kmem | wc -l
188
Let’s see how many allocations are required to run a trivial process:
$ perf trace --event 'kmem:*alloc*' true 2>&1 | grep kmem | wc -l
94
Procfs requires 600 times more in-kernel memory allocations than the task diag interface. It is just another point why procfs works so slow in critical situations, and there is a room for optimization.
We hope that this article will attract more developers interested in improving this part of the kernel.
Many thanks to David Ahern, Andy Lutomirski, Stephen Hemminger, Oleg Nesterov, W. Trevor King, Arnd Bergmann, Eric W. Biederman, and other people who helped to develop and improve the task diag interface.