
InternalError when accessing kubernetes cluster with kubernetes service #5031

Closed
gclawes opened this issue Dec 2, 2020 · 13 comments

@gclawes

gclawes commented Dec 2, 2020

Description

What happened:

On the initial install of the 5.0.0 kubernetes_service in a cluster, kubectl commands work without issue. After the Teleport certificate expires and tsh login is re-run, kubectl receives an InternalError.

LivewareProblem :: ~ % tsh --insecure --proxy=teleport.lan:3080 login
WARNING: You are using insecure connection to SSH proxy https://teleport.lan:3080
Enter password for Teleport user gclawes:
WARNING: You are using insecure connection to SSH proxy https://teleport.lan:3080
Please press the button on your U2F key
> Profile URL:        https://teleport.lan:3080
  Logged in as:       gclawes
  Cluster:            lan
  Roles:              admin*
  Logins:             gclawes
  Kubernetes:         enabled
  Kubernetes cluster: "pico"
  Kubernetes users:   gclawes
  Kubernetes groups:  system:masters
  Valid until:        2020-12-02 08:57:37 -0500 EST [valid for 12h0m0s]
  Extensions:         permit-agent-forwarding, permit-port-forwarding, permit-pty
* RBAC is only available in Teleport Enterprise
  https://gravitational.com/teleport/docs/enterprise
LivewareProblem :: ~ % kubectl get nodes
Error from server (InternalError): an error on the server ("Internal Server Error") has prevented the request from succeeding

The following errors appear in the proxy component:

Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58564] regular/proxy.go:268
Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58565] regular/proxy.go:268
Dec 01 20:57:24 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58566] regular/proxy.go:268
Dec 01 20:57:36 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58569] regular/proxy.go:268
Dec 01 20:57:36 teleport teleport[2668]: INFO [SUBSYSTEM] Connected to auth server: 10.10.0.51:3025 trace.fields:map[dst:10.10.0.62:3023 src:10.10.0.104:58569] regular/proxy.go:268
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:57:46 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:05 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180
Dec 01 20:58:06 teleport teleport[2668]: ERRO [PROXY:PRO] Error forwarding to https://remote.kube.proxy.teleport.cluster.local/api?timeout=32s, err: dialing through a tunnel: no tunnel connection found: no kube reverse tunnel for 081b8ec8-ce9e-4d39-9512-bc2581dad9c9.lan found, dialing directly: dial tcp: address remote.kube.proxy.teleport.cluster.local: missing port in address forward/fwd.go:180

Running tsh kube login <cluster name> does not fix the issue:

LivewareProblem :: ~ % tsh kube login pico
Logged into kubernetes cluster "pico"
LivewareProblem :: ~ % kubectl get nodes
Error from server (InternalError): an error on the server ("Internal Server Error") has prevented the request from succeeding

Restarting the proxy component does resolve the issue.

Teleport is open-source Teleport 5.0.0 configured with separate auth and proxy components.
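Roughly how the pieces are split, as a minimal sketch; the addresses, nodenames, and token below are placeholders rather than my exact config:

# sketch only; values are placeholders
# auth host (10.10.0.51), /etc/teleport.yaml
teleport:
  nodename: auth
auth_service:
  enabled: yes
  cluster_name: lan
proxy_service:
  enabled: no
ssh_service:
  enabled: no

# proxy host (teleport.lan / 10.10.0.62), /etc/teleport.yaml
teleport:
  auth_servers: ["10.10.0.51:3025"]
auth_service:
  enabled: no
proxy_service:
  enabled: yes
  listen_addr: 0.0.0.0:3023
  web_listen_addr: 0.0.0.0:3080
ssh_service:
  enabled: no

# kubernetes_service agent inside the k3s cluster, joining through the proxy
teleport:
  auth_token: <kube join token>
  auth_servers: ["teleport.lan:3080"]
kubernetes_service:
  enabled: yes
  kube_cluster_name: pico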

What you expected to happen: kubectl works as intended

How to reproduce it (as minimally and precisely as possible):

  1. Configure cluster as above.
  2. Log into cluster with tsh --proxy=<proxy> login.
  3. Configure kubeconfig with tsh kube login <cluster name>
  4. Run kubectl commands successfully
  5. Allow tsh client certificate to expire*
  6. Log into cluster again with tsh login
  7. kubectl commands fail with above InternalError message
  8. Check proxy logs
  9. Restart proxy
  10. kubectl commands succeed

*NOTE: I have not tested this with a certificate ttl shorter than the default of 12hr.

Environment

  • Teleport version (use teleport version): 5.0.0

  • Tsh version (use tsh version): 5.0.0

  • OS (e.g. from /etc/os-release): 5.0.0

  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated hardware (see above)

@stevenGravy
Contributor

@gclawes how did you configure the kube_service? Are you using a static or dynamic token?

@gclawes
Author

gclawes commented Dec 3, 2020

Created a long-lived token with tctl tokens add --type=kube --ttl=8760h

# tctl tokens ls
Token                            Type Labels Expiry Time (UTC)
-------------------------------- ---- ------ ---------------------------------
<token redacted>                 Kube        26 Nov 21 17:10 UTC (8607h14m49s)
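For context, that token then gets handed to the agent running in the cluster; roughly like the following, assuming the teleport-kube-agent chart from the examples directory (values here are illustrative placeholders, not my exact command):

# illustrative only; chart values assumed from examples/chart/teleport-kube-agent
helm install teleport-kube-agent ./examples/chart/teleport-kube-agent \
  --create-namespace --namespace teleport \
  --set proxyAddr=teleport.lan:3080 \
  --set authToken=<token redacted> \
  --set kubeClusterName=pico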

@awly
Contributor

awly commented Dec 3, 2020

@gclawes did you restart/recreate the kubernetes_service in this setup by chance?
#5038 fixes a caching bug in the proxy, which is triggered when you recreate or upgrade https://github.com/gravitational/teleport/tree/master/examples/chart/teleport-kube-agent

@gclawes
Author

gclawes commented Dec 3, 2020

I did not restart the kubernetes_service, only the proxy.

@awly
Contributor

awly commented Dec 3, 2020

Thanks. Can you show the output of tctl get kube_service on the auth server?
And how many kubernetes_services are you connecting to this cluster?

It may be related to #5008: a kubernetes_service that was created and deleted earlier can still linger in the teleport backend.

@pcallewaert

I think I'm encountering the same problem. Currently 3 kubernetes clusters are connected to the main teleport server, but 1 cluster now always returns Forbidden when trying to connect. I'm sure it worked initially but at some point it stopped working. Restarting the kube-agent or the main teleport server did nothing.
I didn't file an issue yet because I didn't know how I got into that state.

Running your command, I now see that the "bad" cluster is registered multiple times:

# tctl get kube_service | grep "\- name"
  - name: aks-ndt-sales-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: nephroflow
  - name: aks-ndt-internal-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production
  - name: aks-ndt-shared-westeurope-production

The other ones are only registered once, and are also working perfectly.
Could I make things worse by removing all the aks-ndt-shared-westeurope-production entries and restarting the kube-agent? I'm hoping that will temporarily fix the problem.

@gclawes
Author

gclawes commented Dec 3, 2020

I show the same:

root@eschatologist:~# tctl get kube_service | grep "\- name"
  - name: pico
  - name: pico
  - name: pico

I set this k3s cluster up a week ago, and I don't fully recall whether I re-installed the kubernetes_service multiple times when I set it up.

To clarify my response above: I did not restart the kubernetes_service after encountering the issue, only the proxy, but I may have reinstalled the kubernetes_service in the k3s cluster during the initial cluster setup.

I'll try to find time to re-do the whole setup and replicate the issue, but it may take a day or two.

@awly
Contributor

awly commented Dec 3, 2020

@gclawes @pcallewaert yep, it all lines up now.

When a kube-agent restarts for any reason (upgrade or automatically rescheduled by k8s), it registers as a new kube_service. The old version lingers in teleport forever (fixed by #5008).
When a proxy makes the first connection to any k8s cluster, it picks any registered kube_service to connect through and caches that (fixed by #5038).

You can temporarily fix this by removing all kube_service entries and restarting the kube-agent (or waiting ~10 minutes for it to re-register itself automatically).
Note: you have to remove kube_service by UUID, not by k8s cluster name.
For example:

kind: kube_service
metadata:
  id: 1607014852039357769
  name: 39b063dd-f222-4739-a1c9-5f81f4113edf
spec:
  addr: 127.0.0.1:3027
  hostname: ""
  kube_clusters:
  - name: gke
  rotation:
    current_id: ""
    last_rotated: "0001-01-01T00:00:00Z"
    schedule:
      standby: "0001-01-01T00:00:00Z"
      update_clients: "0001-01-01T00:00:00Z"
      update_servers: "0001-01-01T00:00:00Z"
    started: "0001-01-01T00:00:00Z"
  version: 5.0.0-dev
version: v2

You have to run tctl rm kube_service/39b063dd-f222-4739-a1c9-5f81f4113edf and not tctl rm kube_service/gke.

Here's a command to remove them all:

$ for uuid in $(tctl get kube_service | grep '  name:' | awk '{print $2}'); do tctl rm kube_service/$uuid; done
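Once the agent re-registers (either after a restart or on its own a few minutes later), the same grep used above should show a single entry per agent; the UUID here is only a placeholder:

$ tctl get kube_service | grep '  name:'
  name: <remaining agent uuid>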

@awly awly self-assigned this Dec 3, 2020
@awly awly added this to the 5.0.1 milestone Dec 3, 2020
@gclawes
Author

gclawes commented Dec 3, 2020

Cool, I'll give that a shot.

Regarding the token for setting up the kubernetes service:

  1. How does that relate to the backend registration?
  2. Was using tctl tokens add --type=kube correct for that? The main documentation page indicates that it's a "dynamic token", which doesn't work according to the teleport-kube-agent chart, but the release notes for 5.0.0 show it as an example command.

@awly
Contributor

awly commented Dec 3, 2020

TL;DR: you can use dynamic tokens with kubernetes_service but you shouldn't use them with teleport-kube-agent.

How does that relate to the backend registration?

Tokens are used by teleport instances (any instance, not just kubernetes_service) to bootstrap credentials at startup. After instance credentials are bootstrapped and stored on disk, the instance uses them to register and communicate with the auth service.

Was using tctl tokens add --type=kube correct for that? The main documentation page indicates that it's a "dynamic token", which doesn't work according to the teleport-kube-agent chart, but the release notes for 5.0.0 show it as an example command.

In general, you can use both static and dynamic tokens for any teleport service. The difference is that dynamic tokens expire (after 30 minutes by default).

The caveat with the teleport-kube-agent is that it doesn't have persistence - the credentials it bootstraps with the token are lost on restart. So it re-bootstraps using a token on each start.
If you give it a dynamic token, it will work on first start, but will fail to re-bootstrap when the token expires and kube-agent restarts.
A static token never expires, so kube-agent can re-bootstrap indefinitely.
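For reference, a static token is one defined directly in the auth server config instead of via tctl; a minimal sketch, with the token value as a placeholder:

# /etc/teleport.yaml on the auth server (token value is a placeholder)
auth_service:
  enabled: yes
  tokens:
    - "kube:some-long-random-string"

The kube-agent is then started with that same string as its join token, and because it never expires, re-bootstrapping after a restart keeps working.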

kube-agent doesn't have persistence because I couldn't find a portable way to do it in Kubernetes; all the persistent disk options are cloud-provider specific.

@gclawes
Author

gclawes commented Dec 3, 2020

Ah, ok, so what I did with --ttl=8760h basically makes the dynamic token last a year, which I figured was a good time horizon for this cluster.

Could persistence be done by creating a Kubernetes Secret object? We ran across a similar issue in the MetalLB project recently; one of the contributors came up with a way for the MetalLB controller to create a Secret needed for quorum maintenance: metallb/metallb#747

If teleport creates a Secret, it can be created with an ownerReference so the secret gets cleaned up when the deployment goes away.
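Something along these lines, purely as an illustration (the names and UID are hypothetical, and data would be whatever identity material teleport needs to persist):

# hypothetical Secret owned by the kube-agent Deployment
apiVersion: v1
kind: Secret
metadata:
  name: teleport-kube-agent-identity
  namespace: teleport
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: teleport-kube-agent
      uid: 00000000-0000-0000-0000-000000000000
type: Opaque
data:
  identity: <base64-encoded credentials>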

@awly
Contributor

awly commented Dec 3, 2020

Yeah, we could definitely teach teleport to use the k8s API for storage.
It currently only supports a local directory for credential storage, and I don't think you can mount a k8s Secret as a writable volume.
Here's a tracking issue for tighter k8s integration: #4832

@awly
Contributor

awly commented Dec 16, 2020

The fixes mentioned above are now in 5.0.1.
Please upgrade and reopen if you still experience the same issues.
