
InternalError when accessing kubernetes cluster with kubernetes service #5031

Closed
gclawes opened this issue Dec 2, 2020 · 13 comments

@gclawes

gclawes commented Dec 2, 2020

Description

What happened:

On the initial install of the 5.0.0 kubernetes_service in a cluster, kubectl commands work without issue. After the Teleport certificate expires and tsh login is re-run, kubectl receives an InternalError.

LivewareProblem :: ~ % tsh --insecure --proxy=teleport.lan:3080 login
WARNING: You are using insecure connection to SSH proxy https://teleport.lan:3080
Enter password for Teleport user gclawes:
WARNING: You are using insecure connection to SSH proxy https://teleport.lan:3080
Please press the button on your U2F key
> Profile URL:        https://teleport.lan:3080
  Logged in as:       gclawes
  Cluster:            lan
  Roles:              admin*
  Logins:             gclawes
  Kubernetes:         enabled
  Kubernetes cluster: "pico"
  Kubernetes users:   gclawes
  Kubernetes groups:  system:masters
  Valid until:        2020-12-02 08:57:37 -0500 EST [valid for 12h0m0s]
  Extensions:         permit-agent-forwarding, permit-port-forwarding, permit-pty
* RBAC is only available in Teleport Enterprise
  https://gravitational.com/teleport/docs/enterprise
LivewareProblem :: ~ % kubectl get nodes
Error from server (InternalError): an error on the server ("Internal Server Error") has prevented the request from succeeding

The following errors appear in the proxy component:

Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58564] regular/proxy.go:268
Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58565] regular/proxy.go:268
Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58566] regular/proxy.go:268
Dec 01 20:57:36 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58569] regular/proxy.go:268
Dec 01 20:57:36 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58569] regular/proxy.go:268
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:05 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180

Running tsh kube login <cluster name> does not fix the issue:

LivewareProblem :: ~ % tsh kube login pico
Logged into kubernetes cluster "pico"
LivewareProblem :: ~ % kubectl get nodes
Error from server (InternalError): an error on the server ("Internal Server Error") has prevented the request from succeeding

Restarting the proxy component does resolve the issue.

Teleport is open-source Teleport 5.0.0 configured with separate auth and proxy components.
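Roughly how the pieces are split, as a minimal sketch; the addresses, nodenames, and token below are placeholders rather than my exact config:

# sketch only; values are placeholders
# auth host (10.10.0.51), /etc/teleport.yaml
teleport:
  nodename: auth
auth_service:
  enabled: yes
  cluster_name: lan
proxy_service:
  enabled: no
ssh_service:
  enabled: no

# proxy host (teleport.lan / 10.10.0.62), /etc/teleport.yaml
teleport:
  auth_servers: ["10.10.0.51:3025"]
auth_service:
  enabled: no
proxy_service:
  enabled: yes
  listen_addr: 0.0.0.0:3023
  web_listen_addr: 0.0.0.0:3080
ssh_service:
  enabled: no

# kubernetes_service agent inside the k3s cluster, joining through the proxy
teleport:
  auth_token: <kube join token>
  auth_servers: ["teleport.lan:3080"]
kubernetes_service:
  enabled: yes
  kube_cluster_name: pico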

What you expected to happen: kubectl works as intended

How to reproduce it (as minimally and precisely as possible):

  1. Configure cluster as above.
  2. Log into cluster with tsh --proxy=<proxy> login.
  3. Configure kubeconfig with tsh kube login <cluster name>
  4. Run kubectl commands successfully
  5. Allow tsh client certificate to expire*
  6. Log into cluster again with tsh login
  7. kubectl commands fail with above InternalError message
  8. Check proxy logs
  9. Restart proxy
  10. kubectl commands succeed

*NOTE: I have not tested this with a certificate ttl shorter than the default of 12hr.

Environment

  • Teleport version (use teleport version): 5.0.0

  • Tsh version (use tsh version): 5.0.0

  • OS (e.g. from /etc/os-release): 5.0.0

  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated hardware (see above)

@stevenGravy
Contributor

@gclawes how did you configure the kube_service? Are you using a static or dynamic token?

@gclawes
Author

gclawes commented Dec 3, 2020

Created a long-lived token with tctl tokens add --type=kube --ttl=8760h

# tctl tokens ls
Token                            Type Labels Expiry Time (UTC)
-------------------------------- ---- ------ ---------------------------------
<token redacted>                 Kube        26 Nov 21 17:10 UTC (8607h14m49s)
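For context, that token then gets handed to the agent running in the cluster; roughly like the following, assuming the teleport-kube-agent chart from the examples directory (values here are illustrative placeholders, not my exact command):

# illustrative only; chart values assumed from examples/chart/teleport-kube-agent
helm install teleport-kube-agent ./examples/chart/teleport-kube-agent \
  --create-namespace --namespace teleport \
  --set proxyAddr=teleport.lan:3080 \
  --set authToken=<token redacted> \
  --set kubeClusterName=pico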

@awly
Contributor

awly commented Dec 3, 2020

@gclawes did you restart/recreate the kubernetes_service in this setup by chance?
#5038 fixes a caching bug in the proxy, which is triggered when you recreate or upgrade https://github.com/gravitational/teleport/tree/master/examples/chart/teleport-kube-agent

@gclawes
Author

gclawes commented Dec 3, 2020

I did not restart the kubernetes_service, only the proxy.

@awly
Contributor

awly commented Dec 3, 2020

Thanks. Can you show the output of tctl get kube_service on the auth server?
And how many kubernetes_services are you connecting to this cluster?

It may be related to #5008: a kubernetes_service that was created and deleted earlier can still linger in the teleport backend.

@pcallewaert

I think I'm encountering the same problem. Currently 3 kubernetes clusters are connected to the main teleport server, but 1 cluster now always returns Forbidden when trying to connect. I'm sure it worked initially but at some point it stopped working. Restarting the kube-agent or the main teleport server did nothing.
I didn't file an issue yet because I didn't know how I got into that state.

Running your command, I now see that the "bad" cluster is registered multiple times:

# tctl get kube_service | grep "\- name"
  - name: aks-ndt-sales-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: nephroflow
  - name: aks-ndt-internal-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production

The other ones are only registered once, and are also working perfectly.
Could I make things worse by removing all the aks-ndt-shared-westeurope-production entries and restarting the kube-agent? I'm hoping that will temporarily fix the problem.

@gclawes
Author

gclawes commented Dec 3, 2020

I show the same:

root@eschatologist:~# tctl get kube_service | grep "\- name"
  - name: pico
  - name: pico
  - name: pico

I set this k3s cluster up a week ago, and I don't fully recall whether I re-installed the kubernetes_service multiple times when I set it up.

To clarify my response above: I did not restart the kubernetes_service after encountering the issue, only the proxy, but I may have reinstalled the kubernetes_service in the k3s cluster during the initial cluster setup.

I'll try to find time to re-do the whole setup and replicate the issue, but it may take a day or two.

@awly
Contributor

awly commented Dec 3, 2020

@gclawes @pcallewaert yep, it all lines up now.

When a kube-agent restarts for any reason (upgrade or automatically rescheduled by k8s), it registers as a new kube_service. The old version lingers in teleport forever (fixed by #5008).
When a proxy makes the first connection to any k8s cluster, it picks any registered kube_service to connect through and caches that (fixed by #5038).

You can temporarily fix this by removing all kube_service entries and restarting the kube-agent (or waiting ~10 minutes for it to re-register itself automatically).
Note: you have to remove kube_service by UUID, not by k8s cluster name.
For example:

kind: kube_service
metadata:
  id: 1607014852039357769
  name: 39b063dd-f222-4739-a1c9-5f81f4113edf
spec:
  addr: 127.0.0.1:3027
  hostname: ""
  kube_clusters:
  - name: gke
  rotation:
    current_id: ""
    last_rotated: "0001-01-01T00:00:00Z"
    schedule:
      standby: "0001-01-01T00:00:00Z"
      update_clients: "0001-01-01T00:00:00Z"
      update_servers: "0001-01-01T00:00:00Z"
    started: "0001-01-01T00:00:00Z"
  version: 5.0.0-dev
version: v2

You have to run tctl rm kube_service/39b063dd-f222-4739-a1c9-5f81f4113edf and not tctl rm kube_service/gke.

Here's a command to remove them all:

$ for uuid in $(tctl get kube_service | grep '  name:' | awk '{print $2}'); do tctl rm kube_service/$uuid; done
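Once the agent re-registers (either after a restart or on its own a few minutes later), the same grep used above should show a single entry per agent; the UUID here is only a placeholder:

$ tctl get kube_service | grep '  name:'
  name: <remaining agent uuid>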

@awly awly self-assigned this Dec 3, 2020
@awly awly added this to the 5.0.1 milestone Dec 3, 2020
@gclawes
Author

gclawes commented Dec 3, 2020

Cool, I'll give that a shot.

Regarding the token for setting up the kubernetes service:

  1. How does that relate to the backend registration?
  2. Was using tctl tokens add --type=kube correct for that? The main documentation page indicates that it's a "dynamic token", which doesn't work according to the teleport-kube-agent chart, but the release notes for 5.0.0 show it as an example command.

@awly
Contributor

awly commented Dec 3, 2020

TL;DR: you can use dynamic tokens with kubernetes_service but you shouldn't use them with teleport-kube-agent.

How does that relate to the backend registration?

Tokens are used by teleport instances (any instance, not just kubernetes_service) to bootstrap credentials at startup. After instance credentials are bootstrapped and stored on disk, the instance uses them to register and communicate with the auth service.

Was using tctl tokens add --type=kube correct for that? The main documentation page indicates that it's a "dynamic token", which doesn't work according to the teleport-kube-agent chart, but the release notes for 5.0.0 show it as an example command.

In general, you can use both static and dynamic tokens for any teleport service. The difference is that dynamic tokens expire (after 30 minutes by default).

The caveat with the teleport-kube-agent is that it doesn't have persistence - the credentials it bootstraps with the token are lost on restart. So it re-bootstraps using a token on each start.
If you give it a dynamic token, it will work on first start, but will fail to re-bootstrap when the token expires and kube-agent restarts.
A static token never expires, so kube-agent can re-bootstrap indefinitely.
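For reference, a static token is one defined directly in the auth server config instead of via tctl; a minimal sketch, with the token value as a placeholder:

# /etc/teleport.yaml on the auth server (token value is a placeholder)
auth_service:
  enabled: yes
  tokens:
    - "kube:some-long-random-string"

The kube-agent is then started with that same string as its join token, and because it never expires, re-bootstrapping after a restart keeps working.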

kube-agent doesn't have persistence because I couldn't find a portable way to do it in Kubernetes; all the persistent disk options are cloud-provider specific.

@gclawes
Author

gclawes commented Dec 3, 2020

Ah, ok, so what I did with --ttl=8760h basically makes the dynamic token last a year, which I figured was a good time horizon for this cluster.

Could persistence be done by creating a Kubernetes Secret object? We ran across a similar issue in the MetalLB project recently; one of the contributors came up with a way for the MetalLB controller to create a Secret needed for quorum maintenance: metallb/metallb#747

If teleport creates a Secret, it can be created with an ownerReference so the secret gets cleaned up when the deployment goes away.
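Something along these lines, purely as an illustration (the names and UID are hypothetical, and data would be whatever identity material teleport needs to persist):

# hypothetical Secret owned by the kube-agent Deployment
apiVersion: v1
kind: Secret
metadata:
  name: teleport-kube-agent-identity
  namespace: teleport
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: teleport-kube-agent
      uid: 00000000-0000-0000-0000-000000000000
type: Opaque
data:
  identity: <base64-encoded credentials>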

@awly
Contributor

awly commented Dec 3, 2020

Yeah, we could definitely teach teleport to use the k8s API for storage.
It currently only supports a local directory for credential storage, and I don't think you can mount a k8s Secret as a writable volume.
Here's a tracking issue for tighter k8s integration: #4832

@awly
Contributor

awly commented Dec 16, 2020

The fixes mentioned above are now in 5.0.1.
Please upgrade and reopen if you still experience the same issues.
