
Remove the need to rm /var/lib/teleport #2838

Open

kontsevoy opened this issue Jul 5, 2019 · 13 comments
Labels
bug · enhancement · c-nl, c-pa, c-pg, c-sn, c-va (Internal Customer Reference) · quickstart (Related to Quickstart)
Comments

@kontsevoy (Contributor) commented Jul 5, 2019

Problem

When users are learning Teleport, it's common for them to dramatically re-configure the auth server (change the cluster name, etc). When they do this, they frequently get stuck because the existing nodes fail to re-join the cluster (or they lose trusted clusters).

This happens because nodes (and trusted clusters) have their old cluster credentials stored in /var/lib/teleport, and those credentials are no longer valid. The "access denied" error message in the logs isn't very helpful, even though it's truthful.

Proposal

When a node re-connects to the cluster, it should do something like this:

  1. If there are no node credentials in /var/lib/teleport -> use the join token
  2. If there are existing credentials in /var/lib/teleport -> use the credentials
  3. If an "access denied" error happens because the credentials are not valid, try to use the join token.
  4. If the token was accepted and the node received new credentials, overwrite the existing data in /var/lib/teleport with the new creds.

The same logic should apply to trusted clusters: if they suddenly start getting "access denied", they should try to use the join token. A sketch of this fallback order follows.
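
For illustration, here is a minimal Go sketch of that fallback order. The helpers (readCredentials, connectWithCredentials, joinWithToken, writeCredentials) are hypothetical stand-ins, not Teleport's actual registration code:

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errAccessDenied = errors.New("access denied")

// Credentials stands in for the host certificates a node keeps on disk.
type Credentials struct{ data []byte }

// Hypothetical helpers, stubbed out for the sketch.
func readCredentials(dataDir string) (*Credentials, error) {
	b, err := os.ReadFile(filepath.Join(dataDir, "creds"))
	if err != nil {
		return nil, err
	}
	return &Credentials{data: b}, nil
}

func writeCredentials(dataDir string, c *Credentials) error {
	return os.WriteFile(filepath.Join(dataDir, "creds"), c.data, 0o600)
}

func connectWithCredentials(c *Credentials) error { return errAccessDenied } // stub
func joinWithToken(token string) (*Credentials, error) {
	return &Credentials{data: []byte(token)}, nil // stub
}

func rejoin(dataDir, token string) (*Credentials, error) {
	// Steps 1-2: prefer credentials already on disk.
	if creds, err := readCredentials(dataDir); err == nil {
		connErr := connectWithCredentials(creds)
		if connErr == nil {
			return creds, nil
		}
		if !errors.Is(connErr, errAccessDenied) {
			// Network failure etc.: keep the existing creds and retry later.
			return nil, connErr
		}
		// Step 3: stale credentials; fall through to the token path.
	}
	// The token path runs at most once per attempt, so we never loop
	// token -> creds -> access denied -> token.
	newCreds, err := joinWithToken(token)
	if err != nil {
		return nil, fmt.Errorf("join token rejected: %w", err)
	}
	// Step 4: overwrite state only after the token was accepted, so valid
	// credentials are never blown away.
	if err := writeCredentials(dataDir, newCreds); err != nil {
		return nil, err
	}
	return newCreds, nil
}

func main() {
	if _, err := rejoin("/var/lib/teleport", "join-token"); err != nil {
		fmt.Println("rejoin failed:", err)
	}
}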

Re-join intervals

Customers are also complaining that it takes up to 15 minutes for a trusted cluster to re-appear if there's a brief network disconnect. Let's do the following:

  • The reconnect logic for a node, a trusted cluster, or an IoT node should be the same (back-off intervals, etc.).
  • Once a connection is restored, let's make sure that it takes no more than a minute for nodes/clusters to reconnect after that point. This is especially important for IoT applications, where waiting for 15 minutes at scale causes issues.

Question: can we safely be even more aggressive? 30 seconds? 15 seconds?
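
As a point of reference, a capped, jittered back-off makes "no more than a minute" easy to enforce; the base and cap below are illustrative values, not Teleport's actual tuning:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextBackoff grows exponentially but is capped at one minute, so a restored
// network never leaves a node waiting 15 minutes. Full jitter spreads out a
// fleet of IoT nodes so they don't all reconnect in lockstep.
func nextBackoff(attempt int) time.Duration {
	const (
		base    = time.Second
		maxWait = time.Minute
	)
	d := base << uint(attempt)
	if d <= 0 || d > maxWait { // d <= 0 guards against shift overflow
		d = maxWait
	}
	return time.Duration(rand.Int63n(int64(d)) + 1)
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextBackoff(attempt))
	}
}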

Warnings

  • Watch out for endless loops! e.g. use token -> get creds -> access denied -> use token, etc.
  • Make sure NOT to blow up valid credentials.

CC @klizhentas to review.


@kontsevoy kontsevoy added this to the 4.1 "Seattle" milestone Jul 5, 2019
@klizhentas (Contributor) commented

The design looks good to me; very helpful feature.

@kontsevoy (Contributor, Author) commented Jul 5, 2019

On cluster/node disconnects: I suspect we don't have code in place that detects a dead idle socket on the node side. When a node or trusted cluster holds a live connection to a cluster, it probably only detects failure when it tries to ping, not when the network cable is unplugged. That's why our reconnect logic is so relaxed: it's probably tied to the ping frequency, which must not be increased because that would severely over-utilize the network.
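
If that's the case, one transport-level option is TCP keepalives, which let the kernel probe an idle connection instead of raising the application ping frequency. A minimal Go sketch (the address and 30s period are made up for the example):

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// With a keepalive period set, the kernel probes the idle connection
	// and a dead peer eventually surfaces as a read error, with no extra
	// application-level pings on a healthy link.
	d := net.Dialer{KeepAlive: 30 * time.Second}
	conn, err := d.Dial("tcp", "auth.example.com:3025")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		// An unplugged cable shows up here once keepalive probes fail,
		// rather than going unnoticed until the next scheduled ping.
		log.Printf("connection lost: %v", err)
	}
}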

@klizhentas (Contributor) commented

I've investigated this problem a bit more today and came to the conclusion that this fix is harder to accomplish than I thought, for a couple of reasons:

  • It is hard to reliably detect out-of-sync errors; there are quite a few of them and they don't always manifest in the same way.
  • The system that fails to connect needs a full reboot, from proxies to nodes, and this will complicate the existing state machine even more. After trying several recovery approaches I ended up in a dead end with a completely broken cluster.

So I'm cautiously removing this from the milestone until I get some better ideas on how to fix this.

@klizhentas klizhentas removed this from the 4.1 "Seattle" milestone Jul 25, 2019
@webvictim (Contributor) commented Oct 1, 2019

This issue (or at least a very similar one) caught out a user on evaluation recently, and also stumped me for quite a while as I tried to figure out why they couldn't join a node to the cluster.

The issue here went as follows:

  • The user ran tctl nodes add on their auth server and tried to run the resulting command (to join a node with a token) on a separate node machine:
teleport start \
   --roles=node \
   --token=dd739637c606a4cf1434031ff0ecfebb \
   --ca-pin=sha256:2e491d37c8ce1d3509fe31bfc46018728a70725cbc09697fbf71c4fb776d0657 \
   --auth-server=172.17.0.3:3025
  • Because they were using an IP address and not a hostname (for evaluation), they got an error: ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: cannot validate certificate for 172.17.0.3 because it doesn't contain any IP SANs. service/connect.go:65
  • To fix this, I advised them to set public_addr under auth_service in their config to 172.17.0.3:3025 and restart Teleport, so the auth server would regenerate the certificate to include the IP SAN
  • They did this, but rather than fixing the error, it changed to ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: certificate signed by unknown authority. service/connect.go:65
  • On my advice they tried removing /var/lib/teleport on the auth server and restarting, but still got the same error on the node

Eventually we figured out that the node had cached the old cert from the auth service, and as such the join was failing. After removing /var/lib/teleport on the node and running the join command again, it worked the first time.

I think this may be a subtly different bug from the original issue; if so, it would be nice to fix it too.

I would suggest one of two fixes:

  1. warn the user that they might try deleting /var/lib/teleport to fix their issue
  2. (preferably) automatically delete the old auth server cert from the database, fetch a new cert, and try joining again in the event of that error (x509: certificate signed by unknown authority)

Given that we're providing the CA pin hash on the command line, it seems like it'd be fairly simple to detect that the existing cert we're caching doesn't match that CA pin, then delete the cache and refetch a fresh version. A sketch of that check follows.
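
For illustration, a Go sketch of such a check. It assumes the pin is a SHA-256 over the CA certificate's SubjectPublicKeyInfo, which the sha256: prefix suggests but which may not match Teleport's exact pinning input:

package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/hex"
	"encoding/pem"
	"errors"
	"strings"
)

// matchesCAPin reports whether a cached, PEM-encoded CA certificate matches
// a --ca-pin value of the form "sha256:<hex>". On a mismatch the node could
// drop its cache and refetch, instead of failing the join with
// "x509: certificate signed by unknown authority".
func matchesCAPin(caPEM []byte, pin string) (bool, error) {
	block, _ := pem.Decode(caPEM)
	if block == nil {
		return false, errors.New("no PEM block in cached CA")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	// Assumption: hash the SubjectPublicKeyInfo, as the sha256: prefix hints.
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return strings.EqualFold("sha256:"+hex.EncodeToString(sum[:]), pin), nil
}

func main() {
	// Example: matchesCAPin(cachedPEM, "sha256:2e49...0657")
}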

@one000mph (Contributor) commented

I encountered a similar error message due to a cached cert mismatch and was able to resolve it by removing the offending node (tctl rm nodes/<node-uuid>) and re-registering it. I agree that it would be better to avoid the problem in the first place and/or document a better fix (working on this!).

@webvictim (Contributor) commented Dec 2, 2019

Another customer got hit with a very similar issue recently. It wasn't at all clear to them that they needed to delete /var/lib/teleport, but after they did, everything just worked as expected.

This is a big cause of failures and frustration that we should improve on.

@zmb3 (Collaborator) commented May 5, 2020

+1 - Would love to see an easier workflow here. As we get our first Teleport cluster up and running, we're constantly making changes that require us to reconfigure or redeploy the auth server, and getting each node to rejoin the cluster is tedious, even for a small number of nodes.

@ghost commented May 5, 2020

This definitely needs a less manual process for rejoining nodes to a cluster. We're targeting automated infrastructure, so in a worst-case scenario where we lose Teleport, it would be much friendlier if we didn't need to go through every registered node to trigger a re-registration.

@stevenGravy (Contributor) commented May 8, 2020

When a node attempts to connect to a cluster that has been reset with a different cluster name, the log can get overwhelmed with messages like the one below. In some cases you do get the address of the node attempting access, even with the right token; in others you don't, which makes it harder to troubleshoot. The fix is to reset the node's /var/lib/teleport dir and reconnect.

ERRO [AUTH:1] "Failed to retrieve client pool. Client cluster <oldname>, target cluster <newname>, error:
ERROR REPORT:
Original Error: *trace.NotFoundError "/authorities/host/<old name>" is not found
Stack Trace:
    /gopath/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:816 github.com/gravitational/teleport/lib/backend/dynamo.(*DynamoDBBackend).getKey
    /gopath/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:395 github.com/gravitational/teleport/lib/backend/dynamo.(*DynamoDBBackend).Get
    /gopath/src/github.com/gravitational/teleport/lib/backend/sanitize.go:103 github.com/gravitational/teleport/lib/backend.(*Sanitizer).Get
    /gopath/src/github.com/gravitational/teleport/lib/backend/report.go:130 github.com/gravitational/teleport/lib/backend.(*Reporter).Get
    /gopath/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority
    /gopath/src/github.com/gravitational/teleport/lib/cache/cache.go:540 github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority
    /gopath/src/github.com/gravitational/teleport/lib/auth/middleware.go:336 github.com/gravitational/teleport/lib/auth.ClientCertPool
    /gopath/src/github.com/gravitational/teleport/lib/auth/middleware.go:152 github.com/gravitational/teleport/lib/auth.(*TLSServer).GetConfigForClient
    /opt/go/src/crypto/tls/handshake_server.go:147 crypto/tls.(*Conn).readClientHello
    /opt/go/src/crypto/tls/handshake_server.go:43 crypto/tls.(*Conn).serverHandshake
    /opt/go/src/crypto/tls/conn.go:1364 crypto/tls.(*Conn).Handshake
    /opt/go/src/net/http/server.go:1783 net/http.(*conn).serve
    /opt/go/src/runtime/asm_amd64.s:1358 runtime.goexit
User Message: "/authorities/host/<old name>" is not found." auth/middleware.go:160

@alanlubold commented

I have an issue that starts out the same way as what is discussed earlier in this thread. I certainly fall into the category of someone who is learning Teleport.

I try to make a node join a freshly installed single-node cluster, using the IP, and get the error:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: cannot validate certificate for 172.17.0.3 because it doesn't contain any IP SANs. service/connect.go:65

Then I try adding public_addr under auth_service in my config, set to 172.17.0.3:3025 (up until now I did not have a teleport.yaml file), restarting, etc., and now I get:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: certificate signed by unknown authority. service/connect.go:65

At this point I am advised to delete /var/lib/teleport to remove the old cert, but upon restarting the teleport service and trying to generate a new cert with tctl nodes add, I get an error that I don't even have an auth server:

error: tctl must be either used on the auth server or provided with the identity file via --identity flag

And I find upon looking at /var/lib/teleport that it is indeed almost empty.

Is there something about how I am cleaning out the /var/lib/teleport folder that I am missing? I am just deleting it entirely, like so: sudo rm -r /var/lib/teleport

Thanks

@russjones (Contributor) commented

Related #2122.

@webvictim (Contributor) commented Jan 23, 2021

#5355 is a prime example of the need for this.

@russjones russjones modified the milestones: 6.0 "San Diego", 6.1 Jan 26, 2021
@russjones (Contributor) commented

We've improved the error message the user gets when this happens in #10108, #11550, and #11754.

We're also moving most of our instructions to rely more heavily on teleport configure, which will prompt the user if the data directory already contains state (#11867, #11457). A rough sketch of that kind of guard is below.
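
For illustration only, a minimal Go sketch of such a guard; dataDirHasState is a hypothetical helper, not the actual teleport configure implementation:

package main

import (
	"fmt"
	"os"
)

// dataDirHasState reports whether the data directory already contains
// anything, so the tool can prompt before overwriting existing state.
func dataDirHasState(dir string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if os.IsNotExist(err) {
		return false, nil // fresh install: nothing to clobber
	}
	if err != nil {
		return false, err
	}
	return len(entries) > 0, nil
}

func main() {
	hasState, err := dataDirHasState("/var/lib/teleport")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if hasState {
		fmt.Println("warning: /var/lib/teleport already contains state; continue? (y/N)")
	}
}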

@pschisa pschisa added the c-nl Internal Customer Reference label May 5, 2022
@nivasomu nivasomu added the c-va Internal Customer Reference label Feb 14, 2023
camscale added a commit that referenced this issue Dec 8, 2023
Update e ref to bring in:
* Display last updated time on billing summary (#2733)
* User monitor, Okta UAC, Okta access requests respect user locks. (#2815)
* [v13] ci: Remove 'v' from OCI version labels (#2820)
* [v13] chore: Bump Go to v1.20.12 (#2838)
* [v13] ci: Build single-arch images for legacy/operator OCI (#2857)

Changelog: Update Go to 1.20.12
github-merge-queue bot pushed a commit that referenced this issue Dec 8, 2023
* dronegen: enumerate linux tag build pipelines

Unroll the loops over architectures, fips and package type and just list
all the pipelines explicitly. This makes it a bit easier to understand
the combinations and to replace them with GitHub actions workflows.

Running `make dronegen` results in no changes to `.drone.yml`.

* dronegen: Convert some linux tag pipelines to GitHub Actions

Convert some of the linux-based tag build pipelines to run on GitHub
Actions. The following pipelines have been converted:

    build-linux-amd64
    build-linux-amd64-centos7
    build-linux-amd64-centos7-fips
    build-linux-386
    build-linux-arm
    build-linux-amd64-deb
    build-linux-amd64-deb-fips
    build-linux-amd64-centos7-rpm
    build-linux-amd64-centos7-fips-rpm
    build-linux-386-deb
    build-linux-386-rpm
    build-linux-arm-deb
    build-linux-arm-rpm

The GHA workflow builds tarballs as well as deb/rpm packages in a single
workflow, so the `-deb` and `-rpm` pipelines will need to be manually
removed from `.drone.yml`. The amd64 centos7 and non-centos7 pipelines
have been combined, as they were calling the same `make` target and
duplicating work. The amd64 build is always done on centos7. As a
result, we no longer name the pipeline with -centos7, but we do still
specify it, as the build.assets `Makefile` still has a centos7 target
which is called, and we still release an asset named with "centos7".

Still remaining of the linux-based tag build pipelines are the arm64
pipelines which are already converted using a different workflow and the
non-native windows build.

* dronegen: Convert linux-arm64 pipeline to common pipeline

Convert the build-linux-arm64 and push-build-linux-arm64 pipelines to
use the common `release-linux` workflow instead of the arm64-specific
`release-linux-arm64` workflow. This aligns it with the other linux
build pipelines and allows us to get rid of the specific workflow.

The pipelines for building the arm64 rpm and deb packages have not been
generated by dronegen for some time now - since the arm64 build was
converted to GitHub Actions. The OS packages were still built as
dronegen does not remove pipelines, so the existing pipelines from
before the GHA migration remained in `.drone.yml` and continued to run.

These os packaging pipelines will be manually removed in a subsequent
commit.

* dronegen: Add pipeline for building legacy AMIs

Generate a pipeline for calling a GitHub Actions workflow to generate
the legacy AMIs. There were two existing manually added pipelines -
`build-oss-amis` and `build-ent-amis` - that are replaced by this.

The new pipeline needs to be manually added and the old ones manually
removed.

* dronegen: Replace oci build with GitHub Actions workflow

Replace the `teleport-container-images-branch-tag` workflow that builds
the legacy and operator OCI images with a call to the GitHub Actions
workflow that does the same on GitHub Actions.

This requires the manual addition of the `build-oci` pipeline and manual
removal of the `teleport-container-images-branch-tag` pipeline, followed
by running `make dronegen` to flesh out `build-oci` and sign .drone.yml.

* dronegen: Remove now-unused functions and vars

Remove the now-unused functions and vars after converting pipelines to
calling GitHub Actions instead of running stuff on Drone.

* drone: Update .drone.yml for GHA workflows

Update .drone.yml by running `make dronegen` to update the following
pipelines to call GitHub Actions to build instead of building on Drone:

build-linux-amd64
build-linux-amd64-fips
build-linux-386
build-linux-arm64
build-linux-arm

Add two new pipelines for building AMIs and OCIs on GHA:

build-legacy-amis
build-oci

Remove the following pipelines as the build of deb/rpm packages are done
within the above pipelines on GitHub Actions now and the ami/oci
pipelines have been replaced:

build-linux-amd64-deb
build-linux-amd64-fips-deb
build-linux-amd64-centos7-rpm
build-linux-amd64-centos7-fips-rpm
build-linux-386-deb
build-linux-386-rpm
build-linux-arm64-deb
build-linux-arm64-rpm
build-linux-arm-deb
build-linux-arm-rpm
build-oss-amis
build-ent-amis
teleport-container-images-branch-tag

Remove the following pipelines as AMD64 builds are always centos7 builds, but
we were just doing it twice. No need for these any more, as the GHA workflow
will build the release artifacts for these with the centos7 targets:

build-linux-amd64-centos7
build-linux-amd64-centos7-fips

The pipelines were added/removed using the following script, followed by
`make dronegen`:

AWK_SCRIPT='
/^---$/ { printf "%s", accumulator; accumulator = "" }
/^---$/ || accumulator { accumulator = accumulator $0 "\n" }
/^name: / {
	drop = $2 == to_remove
	if ($2 == before && to_add) {
		printf "---\nname: %s\n", to_add
	}
	if (!drop) { printf "%s", accumulator }
	accumulator = ""
	next
}
!drop && !accumulator { print }
ENDFILE { printf "%s", accumulator }'

toremove=(
	build-linux-amd64-{centos7,centos7-fips}
	build-linux-amd64-{deb,fips-deb,centos7-rpm,centos7-fips-rpm}
	build-linux-386-{deb,rpm}
	build-linux-arm64-{deb,rpm}
	build-linux-arm-{deb,rpm}
	build-{oss,ent}-amis
	teleport-container-images-branch-tag
)
add_before=build-buildboxes
toadd=(
	build-legacy-amis
	build-oci
)

for pipeline in "${toremove[@]}"; do
	gawk -i inplace -v to_remove=$pipeline "$AWK_SCRIPT" .drone.yml
done
for pipeline in "${toadd[@]}"; do
	gawk -i inplace -v to_add=$pipeline -v before=$add_before "$AWK_SCRIPT" .drone.yml
done

* Update e ref for updated workflows, Go update, et al.

Update e ref to bring in:
* Display last updated time on billing summary (#2733)
* User monitor, Okta UAC, Okta access requests respect user locks. (#2815)
* [v13] ci: Remove 'v' from OCI version labels (#2820)
* [v13] chore: Bump Go to v1.20.12 (#2838)
* [v13] ci: Build single-arch images for legacy/operator OCI (#2857)

Changelog: Update Go to 1.20.12
@zmb3 zmb3 removed the R2 label Jan 18, 2024