hostgw backend fails to replace old route table entries #801

Closed
julia-stripe opened this issue Aug 29, 2017 · 1 comment · Fixed by #803

julia-stripe commented Aug 29, 2017

In our cluster, flannel is failing to replace existing routes in the route table with new routes. Here's a log of the failure:

{"log":"I0829 17:00:21.967998       1 network.go:83] Subnet added: 10.32.10.0/24 via 10.68.29.72\n","stream":"stderr","time":"2017-08-29T17:00:21.968055987Z"}
{"log":"W0829 17:00:21.968144       1 network.go:106] Replacing existing route to 10.32.10.0/24 via 10.68.26.131 with 10.32.10.0/24 via 10.68.29.72.\n","stream":"stderr","time":"2017-08-29T17:00:21.968211104Z"}
{"log":"E0829 17:00:21.968207       1 network.go:108] Error deleting route to 10.32.10.0/24: no such process\n","stream":"stderr","time":"2017-08-29T17:00:21.96826321Z"}

Basically, when flannel sends a message to the netlink socket asking the kernel to delete the old route, the kernel returns a "no such process" error, and I don't know why. I can delete the routes manually with sudo ip route delete, and once I do, flannel creates the new routes correctly.
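As a side note on the error text: "no such process" is the standard message for the ESRCH errno, which is also what `ip route del` reports when no route matches the request. A minimal sketch using only the Python standard library to confirm the mapping (the exact message wording depends on the platform's C library):

```python
import errno
import os

# ESRCH is the errno whose standard message is "No such process".
# In a route deletion context it generally means that no route
# matching the request was found, not that a process is missing.
code = errno.ESRCH
message = os.strerror(code)
print(code, message)
```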

As a result, the route table ends up badly misconfigured, and no packets can be sent through the cni0 bridge (all network connections from a container fail with "no route to host").
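To find which subnets are affected, the log lines quoted above can be scanned for a "Replacing existing route" warning followed by an "Error deleting route" error for the same subnet. The following is a rough sketch, assuming Docker's JSON log format and the message wording shown in this report; the function name `failed_replacements` and the regular expressions are my own and not part of flannel.

```python
import json
import re

# Log lines copied from the report above (Docker JSON log format).
RAW_LOGS = [
    r'{"log":"W0829 17:00:21.968144       1 network.go:106] Replacing existing route to 10.32.10.0/24 via 10.68.26.131 with 10.32.10.0/24 via 10.68.29.72.\n","stream":"stderr","time":"2017-08-29T17:00:21.968211104Z"}',
    r'{"log":"E0829 17:00:21.968207       1 network.go:108] Error deleting route to 10.32.10.0/24: no such process\n","stream":"stderr","time":"2017-08-29T17:00:21.96826321Z"}',
]

# Patterns are assumptions based on the two messages quoted in this issue.
REPLACE_RE = re.compile(
    r"Replacing existing route to (\S+) via (\S+) with \S+ via (\S+?)\.$"
)
DELETE_ERR_RE = re.compile(r"Error deleting route to (\S+): (.+)$")


def failed_replacements(raw_lines):
    """Map each subnet whose route deletion failed to its old/new gateway."""
    pending = {}
    failed = {}
    for raw in raw_lines:
        msg = json.loads(raw)["log"].strip()
        match = REPLACE_RE.search(msg)
        if match:
            subnet, old_gw, new_gw = match.groups()
            pending[subnet] = {"old_gw": old_gw, "new_gw": new_gw}
            continue
        match = DELETE_ERR_RE.search(msg)
        if match and match.group(1) in pending:
            subnet = match.group(1)
            failed[subnet] = dict(pending.pop(subnet), error=match.group(2))
    return failed


result = failed_replacements(RAW_LOGS)
print(result)
```

Each subnet in the result still has its stale route in place, so those are the routes that currently need a manual `sudo ip route delete`.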

Expected Behavior

The route table should get updated when the subnet -> IP address mapping changes.

Current Behavior

New routes are added, but any update that requires a deletion fails.

Steps to Reproduce (for bugs)

Terminate nodes in our Kubernetes cluster and bring up new nodes (with new IP addresses).

Context

Your Environment

  • Flannel version: 0.7.1 (none of the hostgw code has changed since then, though)
  • Backend used: hostgw
  • Kubernetes version (if used): 1.7.3
  • Operating System and version: Ubuntu 16.04
  • Link to your project (optional):
@julia-stripe

Found some more information!

I straced flannel, and these are the messages it's sending to the netlink socket:

https://gist.github.com/julia-stripe/c2a4aafbccf3533d738be1e665a79eb8

I parsed them all using pyroute2 (http://docs.pyroute2.org/debug.html) and got this result for the failed message:


{'attrs': [('RTA_DST', '10.32.5.0'),
           ('RTA_GATEWAY', '10.68.28.131'),
           ('RTA_OIF', 0)],
 'dst_len': 24,
 'family': 2,
 'flags': 0,
 'header': {'flags': 5,
            'length': 52,
            'pid': 0,
            'sequence_number': 4,
            'type': 25},
 'proto': 0,
 'scope': 0,
 'src_len': 0,
 'table': 254,
 'tos': 0,
 'type': 0}

So basically, flannel sets RTA_OIF (the index of the outgoing network interface) to 0 when it should be 2 on our machines. This value comes from the linkIndex struct member, which appears to be unset. So it seems like linkIndex being 0, instead of the right interface index, is the culprit.
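For readers less familiar with the numeric fields, here is a small sketch that labels the decoded message above. The constants are the standard Linux rtnetlink/netlink values (type 25 is RTM_DELROUTE; header flags 5 is NLM_F_REQUEST | NLM_F_ACK). The `describe` helper and the `oif_unset` field are hypothetical names for illustration only, and the check merely flags the value this comment suspects; it does not prove that it is the cause.

```python
# Message copied from the pyroute2 output above.
FAILED_MSG = {
    "attrs": [("RTA_DST", "10.32.5.0"),
              ("RTA_GATEWAY", "10.68.28.131"),
              ("RTA_OIF", 0)],
    "dst_len": 24,
    "family": 2,
    "flags": 0,
    "header": {"flags": 5, "length": 52, "pid": 0,
               "sequence_number": 4, "type": 25},
    "proto": 0, "scope": 0, "src_len": 0,
    "table": 254, "tos": 0, "type": 0,
}

# Values from the Linux netlink/rtnetlink headers.
RTM_TYPES = {24: "RTM_NEWROUTE", 25: "RTM_DELROUTE", 26: "RTM_GETROUTE"}
NLM_FLAGS = {0x1: "NLM_F_REQUEST", 0x2: "NLM_F_MULTI",
             0x4: "NLM_F_ACK", 0x8: "NLM_F_ECHO"}


def describe(msg):
    """Return a human-readable summary of a decoded route message."""
    header = msg["header"]
    attrs = dict(msg["attrs"])
    flags = [name for bit, name in sorted(NLM_FLAGS.items())
             if header["flags"] & bit]
    return {
        "operation": RTM_TYPES.get(header["type"], "unknown"),
        "flags": flags,
        "destination": f'{attrs["RTA_DST"]}/{msg["dst_len"]}',
        "gateway": attrs.get("RTA_GATEWAY"),
        "oif": attrs.get("RTA_OIF"),
        # Interface index 0 is the value suspected in this comment.
        "oif_unset": attrs.get("RTA_OIF") == 0,
    }


summary = describe(FAILED_MSG)
print(summary)
```

The summary confirms that the failing request is a route deletion in the main table (254) for 10.32.5.0/24, sent with an interface index of 0.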
