-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Swap out nsenter
with Go-native code
#264
base: main
Are you sure you want to change the base?
Conversation
Augh, so sorry for the lack of response here! @nicholascioli, can you fix the conflicts and DCO? I'll see if I can get a maintainer looking at this in the meantime. 🙂 |
This commit swaps out the previous implementation of using the command `nsenter` to switch networking namespaces. By using the Go impl of network namespace switching, read-only filesystems and OSs without the `nsenter` command (such as Talos) should still work. Signed-off-by: Nicholas Cioli <nicholascioli@users.noreply.github.com>
No worries! I've rebased and added the DCO, but I am not a go expert and had to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicholascioli thanks a lot for the change! I'll be honest, this change is kinda scary on paper. Normally we require only one approval for proxy-init changes but I'll want someone else to review this too.
I spent a bit of time going through write-ups on this topic to better understand the problem. Leaving this as a summary for other folks interested in reviewing this change.
For its goroutine model, Go ships with a userspace scheduler. The userspace scheduler is responsible for distributing N routines on M threads (N:M multiplexing model). To achieve this, each OS thread will receive a scheduling context. The scheduler is greedy, it may steal work from other threads. In particular when a thread will block (e.g. syscall) its work will be stolen by another kernel thread. If no thread exists (in the cache or actively doing work) then a new one will be created.
As the folks from Weave have noticed this doesn't go well (no pun intended) with network namespace switching logic. Most of the network namespace or cgroup operations require system calls. If a program repeatedly executes in arbitrary namespaces, it is possible that during syscall execution the runtime will create a new thread. Each thread operates in its own network namespace. This increases the chances that a thread may operate in a network namespace it is not supposed to (yuck). To make matters worse, once a goroutine is unblocked, it goes back into a global queue which means any thread may pick it back up. A go routine may switch the network namespace it is operating in by virtue of trapping into the kernel (double yuck).
There are some options here, such as using runtime.LockOSThread()
which effectively pins a goroutine to an OS thread. This doesn't mean the operation is not volatile, each bit of code needs to ensure it locks the thread and doesn't unlock it midpoint in the execution.
I'm split about this change. On the one hand, I see the use of bringing this functionality into the CNI plugin to unblock operating systems that do not ship with nsenter
. On the other, I'm reticent to rubber stamp a change that can cause UB for users running this in production on operating systems that support nsenter
. To me the problem doesn't seem to be entirely fixed, we just have more determinism than before.
This looks fine, but perhaps before we go further we should answer some questions:
- Should this be feature flagged? Can this break anyone's system? If it does, how do we go back to behaviour that doesn't break?
- How can we improve logging here? If this happens in a system, will we spend weeks debugging this? Can we rely on people to run
strace
on their hosts? Should we include more logs? - Are there any automated tests we can come up with to increase our confidence in shipping this?
I know that's a lot to put on you @nicholascioli, I think some of these decisions can be offloaded onto the team (especially the first) but your input is highly valued as the person submitting the changeset. Lastly, for tests, I'm happy to shepherd or to give more concrete ideas. It is also something we might be able to do as a follow-up.
log.Errorf("could not switch to target network namespace \"%s\": %s", firewallConfiguration.NetNs, err.Error()) | ||
return nil, err | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instrument the logger with the current network namespace we're executing in? I'd err on the side of caution here since ub in this space will be incredibly painful to debug without good logging.
// | ||
// Note: Result needs to be defined here since `netNs.Do` only returns | ||
// an error. | ||
result := make([]byte, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do var result []byte
? Even if it's a zero length array we still allocate a pointer that is then set to whatever cmd.CombinedOutput
returns, so it might be redundant?
|
||
return result, err | ||
} else { | ||
return doCommand() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log at the debug level that we're executing in the root network namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nicholascioli for tackling this and @mateiidavid for the detailed writeup.
It seems this follows the recommended instructions from the ns
package docs. The Weave article is indeed scary, in particular about the OS thread lock workaround:
One could argue that locking a goroutine with runtime.LockOSThread could help, but a) the goroutine might spawn a new goroutine which would run in a wrong namespace b) locking does not prevent the runtime from creating a new OS thread for scheduling.
But as pointed out in the ns
docs and in golang/go#20676, new threads won't be spawned from locked Ms after go 1.10. And IIUC, the linkerd-cni binary is executed from scratch for each event (ADD pod, which is what we only care about), so it's not a long-lived process where concurrent runtime.LockOSThread
could conflict.
I second @mateiidavid's recommendations to fully log whatever we can here. I would also love to have linkerd/linkerd2#8892 resuscitated to actually have some logs to look at! 😉
I think putting this behind a flag disabled by default would make sense while we build more confidence about it. And I also need to address linkerd/linkerd2#11567 so we can actually run this in CI 😣
As described in this issue, this commit removes the need for the
nsenter
binary on the host for linkerd to function in CNI mode. Read-only filesystems and OSs without thensenter
binary (such as Talos OS) cannot currently run linkerd without this change.I tried running the integration tests, but couldn't get the complete integration test to work even with no changes. That could be because I was running docker on Windows and passing the socket into WSL which then passed it into the dev container, but I don't see why that wouldn't work.
Also, if you prefer that I use the code mentioned here instead, please let me know. I'm not sure how to test this on Talos itself, but am definitely willing to try!