Kernel panic with vxlan in openvswitch (via openshift) #2382
Comments
The same issue also affects Weave on CoreOS beta: https://gist.github.com/SpComb/bc439fcda4ff9d54105c28bfd4a44916 This is 100% reproducible on Vagrant with Weave 1.9.3, and packet.net with Weave 1.9.3 and 2.2.1. Both machines kernel panic as soon as weave establishes an active vxlan connection between two nodes... a single node configured with an unreachable peer will not panic. I fear that promoting CoreOS 1688.3.0 to stable would cause serious damage to weave users... the resulting kernel panic also shows up as corrupted Docker images for me (files truncated to zero bytes). |
CoreOS alpha 1702.1.0 on Linux 4.15.7-coreos seems fine, weave launches and works without any kernel warnings. Seems to be something specific to the CoreOS beta 1688.3.0 Linux 4.14.24-coreos kernel. The CoreOS stable 1632.3.0 Linux 4.14.19-coreos kernel is working fine. |
Repro steps with weave:
This is probably specific to openvswitch + vxlan, so I don't think it will happen on e.g. flannel? BTW: be prepared for filesystem corruption on
|
This is probably the commit in the v4.14.24-coreos branch introducing the panic: coreos/linux@4699beb#diff-4f541554c5f8f378effc907c8f0c9115 This upstream commit pretty clearly references this kernel panic: torvalds/linux@f15ca72#diff-4f541554c5f8f378effc907c8f0c9115
|
Good find. |
I can confirm that reverting the relevant changes is simple and does fix the issue |
I should note, weave is used as the cluster networking layer. |
BTW: the patch in the linked forum thread is not a backport of the fix from 4.15, it's a revert of the problematic commit in 4.14. |
@squeed @SpComb Has the fix been requested for backport to the 4.14 branch? I didn't see it from a quick scroll through patchwork and the netdev archives yet. I could send the request to start the process, unless one of you still plans to do it. |
It looks like the problematic CoreOS beta 1688.3.0 got promoted to CoreOS stable 1688.4.0 with the Linux 4.14.30 kernel still containing the buggy version of the vxlan driver: https://github.com/coreos/linux/commits/v4.14.30/drivers/net/vxlan.c Still need to confirm this, but initial signs show that CoreOS stable nodes running weave are now kernel panicking after an update. Ping @dm0- to escalate this - no, I don't think anyone here has requested a 4.14 kernel backport for the fix. |
CoreOS stable updates for 1688.4.0 are currently paused for an unrelated reason (#2284), which also protects CoreOS stable nodes running weave from this issue for now: https://groups.google.com/forum/#!topic/coreos-user/5ihE2cKuYck Confirmed that newly provisioned CoreOS stable nodes using the 1688.4.0 image with Linux 4.14.30 are kernel panicking once weave establishes vxlan connections:
|
@SpComb We need to issue a new stable release anyway, and will make sure it includes a fix. Thanks for your persistence. |
Wow, so a cluster-crashing kernel bug reported on the CoreOS beta channel 2 weeks ago still makes it into a stable-channel release, and only happens not to get rolled out because of some other (even more serious?) bug? |
This should be fixed in beta 1722.2.0 and stable 1688.5.0, due shortly. |
We're carrying coreos/linux@f5f2102 to fix this and have requested a backport. |
Will the beta channel 1722.2.0 and stable channel 1688.5.0 releases happen simultaneously, or will the fixed 4.14.30 kernel be available in the beta channel for testing before getting released as a stable update? I have not verified that the commit in 4.15 fixes the crash with weave in 4.14, although I'm hopeful it will... unfortunately I don't know how to test the 4.14.30-coreos kernel branch locally. Were you able to repro the kernel panic on 4.14.24-30 with weave, and verify that it was fixed in the new 4.14.30-r1 kernel? |
I'm having similar issues with Rancher in a cluster with 1000+ containers. When they are created and removed rapidly, CoreOS 1632.3.0 panics with Fatal exception in interrupt. I couldn't get the full core dump yet. Kernel 4.14.19. |
@SpComb The releases will likely be simultaneous. I was able to repro with 4.14.30 and verified that the repro failed on 4.14.30-r1. (Your instructions made that process trivial; thanks!) @fabiorauber The broken patch was introduced in 4.14.24, so you may be seeing a different problem. Please file a new bug if the upcoming releases don't fix your issue. |
It turns out that 1688.5.0 was broken and not releasable, so this issue remains unresolved in stable. Beta 1722.2.0 will be rolling out shortly. |
This issue should be fixed in stable 1688.5.3, which is rolling out now. |
Confirmed that CoreOS beta 1722.2.0 with Linux 4.14.30-coreos-r1 and CoreOS stable 1688.5.3 with Linux 4.14.32-coreos both fix this issue, weave now works and is no longer panicking the kernel. Thanks for the backport. I hope it eventually finds its way via netdev into the 4.14 stable tree as well, or this issue may start showing up in other distros too. |
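For anyone trying to work out whether a given node is likely exposed, the kernel versions reported in this thread can be collected into a small lookup. This is only a sketch under stated assumptions: the function name and status strings are hypothetical, the `-coreos` suffix on the 4.14.30 entry is assumed (the thread only says "Linux 4.14.30"), and the fallback rule relies solely on the statement above that the broken patch was introduced in 4.14.24. It does not detect whether a particular distro kernel carries the fix.

```python
import platform
import re

# Observations reported in this thread, keyed by kernel release string.
REPORTED = {
    "4.14.19-coreos": "ok",       # stable 1632.3.0
    "4.14.24-coreos": "panics",   # beta 1688.3.0
    "4.14.30-coreos": "panics",   # stable 1688.4.0 (suffix is an assumption)
    "4.14.30-coreos-r1": "ok",    # beta 1722.2.0
    "4.14.32-coreos": "ok",       # stable 1688.5.3
    "4.15.7-coreos": "ok",        # alpha 1702.1.0
}


def parse_version(release):
    """Return (major, minor, patch) from a release string, or None."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    return tuple(int(part) for part in match.groups()) if match else None


def reported_status(release):
    """Hypothetical helper: classify a kernel release using the thread's reports."""
    if release in REPORTED:
        return REPORTED[release]
    version = parse_version(release)
    if version is None:
        return "unknown"
    if version[:2] == (4, 14):
        # The broken patch was reported as introduced in 4.14.24.
        if version < (4, 14, 24):
            return "predates broken patch"
        return "check whether the distro carries the fix"
    return "unknown"


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", reported_status(release))
```

Run on a node, it prints the local kernel release and the matching status; anything outside the table falls back to the coarse 4.14.24 cutoff, so treat the result as a hint and still test vxlan traffic before trusting an update.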
Damn. Amazon released 4.14.33-51.34.amzn1 yesterday or so as part of their "Amazon Linux version 2018.03" which definitely appears to have this issue. Hoping they also backport or patch.... otherwise weave panics the OS causing a continuous "groundhog day" reboot cycle. Had to terminate my failed instances and re-create from images of ECS instances that I hadn't yet broken by patching. |
Issue Report
The OpenShift openvswitch-based network triggers a kernel panic as soon as a pod receives a packet from an external node.
Bug
Container Linux Version
1688.3.0
Environment
libvirt+qemu
Reproduction Steps
This is a bit complicated. I have a hybrid OpenShift cluster on qemu, where some workers are CentOS and some are Container Linux (don't judge)
I've got a script and some bootstrapping instructions here: https://github.com/squeed/os-on-cl
Once you have a cluster running:
kubectl run --rm -ri --image alpine test /bin/sh
ip addr
Other information
The traceback: