
Remove the need to rm /var/lib/teleport #2838

Open

kontsevoy opened this issue Jul 5, 2019 · 13 comments
Labels
bug · enhancement · c-nl, c-pa, c-pg, c-sn, c-va (Internal Customer Reference) · quickstart (Related to Quickstart)
Comments

@kontsevoy (Contributor) commented Jul 5, 2019

Problem

When users are learning Teleport, it's common for them to dramatically re-configure the auth server (change the cluster name, etc). When they do this, they frequently get stuck because the existing nodes fail to re-join the cluster (or they lose trusted clusters).

This happens because nodes (and trusted clusters) have their old cluster credentials stored in /var/lib/teleport, and those credentials are no longer valid. The "access denied" error message in the logs isn't very helpful, even though it's truthful.

Proposal

When a node re-connects to the cluster, it should do something like this:

  1. If there are no node credentials in /var/lib/teleport -> use the join token
  2. If there are existing credentials in /var/lib/teleport -> use the credentials
  3. If an "access denied" error happens because the credentials are not valid, try to use the join token.
  4. If the token was accepted and the node received new credentials, overwrite the existing data in /var/lib/teleport with the new creds.

The same logic should apply to trusted clusters: if they suddenly start getting "access denied", they should try to use the join token. A sketch of this fallback order follows.
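
For illustration, here is a minimal Go sketch of that fallback order. The helpers (readCredentials, connectWithCredentials, joinWithToken, writeCredentials) are hypothetical stand-ins, not Teleport's actual registration code:

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errAccessDenied = errors.New("access denied")

// Credentials stands in for the host certificates a node keeps on disk.
type Credentials struct{ data []byte }

// Hypothetical helpers, stubbed out for the sketch.
func readCredentials(dataDir string) (*Credentials, error) {
	b, err := os.ReadFile(filepath.Join(dataDir, "creds"))
	if err != nil {
		return nil, err
	}
	return &Credentials{data: b}, nil
}

func writeCredentials(dataDir string, c *Credentials) error {
	return os.WriteFile(filepath.Join(dataDir, "creds"), c.data, 0o600)
}

func connectWithCredentials(c *Credentials) error { return errAccessDenied } // stub
func joinWithToken(token string) (*Credentials, error) {
	return &Credentials{data: []byte(token)}, nil // stub
}

func rejoin(dataDir, token string) (*Credentials, error) {
	// Steps 1-2: prefer credentials already on disk.
	if creds, err := readCredentials(dataDir); err == nil {
		connErr := connectWithCredentials(creds)
		if connErr == nil {
			return creds, nil
		}
		if !errors.Is(connErr, errAccessDenied) {
			// Network failure etc.: keep the existing creds and retry later.
			return nil, connErr
		}
		// Step 3: stale credentials; fall through to the token path.
	}
	// The token path runs at most once per attempt, so we never loop
	// token -> creds -> access denied -> token.
	newCreds, err := joinWithToken(token)
	if err != nil {
		return nil, fmt.Errorf("join token rejected: %w", err)
	}
	// Step 4: overwrite state only after the token was accepted, so valid
	// credentials are never blown away.
	if err := writeCredentials(dataDir, newCreds); err != nil {
		return nil, err
	}
	return newCreds, nil
}

func main() {
	if _, err := rejoin("/var/lib/teleport", "join-token"); err != nil {
		fmt.Println("rejoin failed:", err)
	}
}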

Re-join intervals

Customers are also complaining that it takes up to 15 minutes for a trusted cluster to re-appear if there's a brief network disconnect. Let's do the following:

  • The reconnect logic for a node, a trusted cluster, or an IoT node should be the same (back-off intervals, etc.).
  • Once a connection is restored, let's make sure that it takes no more than a minute for nodes/clusters to reconnect after that point. This is especially important for IoT applications, where waiting for 15 minutes at scale causes issues.

Question: can we safely be even more aggressive? 30 seconds? 15 seconds?
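
As a point of reference, a capped, jittered back-off makes "no more than a minute" easy to enforce; the base and cap below are illustrative values, not Teleport's actual tuning:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextBackoff grows exponentially but is capped at one minute, so a restored
// network never leaves a node waiting 15 minutes. Full jitter spreads out a
// fleet of IoT nodes so they don't all reconnect in lockstep.
func nextBackoff(attempt int) time.Duration {
	const (
		base    = time.Second
		maxWait = time.Minute
	)
	d := base << uint(attempt)
	if d <= 0 || d > maxWait { // d <= 0 guards against shift overflow
		d = maxWait
	}
	return time.Duration(rand.Int63n(int64(d)) + 1)
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextBackoff(attempt))
	}
}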

Warnings

  • Watch out for endless loops! e.g. use token -> get creds -> access denied -> use token, etc.
  • Make sure NOT to blow up valid credentials.

CC @klizhentas to review.


@kontsevoy kontsevoy added this to the 4.1 "Seattle" milestone Jul 5, 2019
@klizhentas (Contributor) commented

The design looks good to me; very helpful feature.

@kontsevoy (Contributor, Author) commented Jul 5, 2019

On cluster/node disconnects: I suspect we don't have code in place that detects a dead idle socket on the node side. When a node or trusted cluster holds a live connection to a cluster, it probably only detects failure when it tries to ping, not when the network cable is unplugged. That's why our reconnect logic is so relaxed: it's probably tied to the ping frequency, which must not be increased because that would severely over-utilize the network.
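
If that's the case, one transport-level option is TCP keepalives, which let the kernel probe an idle connection instead of raising the application ping frequency. A minimal Go sketch (the address and 30s period are made up for the example):

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// With a keepalive period set, the kernel probes the idle connection
	// and a dead peer eventually surfaces as a read error, with no extra
	// application-level pings on a healthy link.
	d := net.Dialer{KeepAlive: 30 * time.Second}
	conn, err := d.Dial("tcp", "auth.example.com:3025")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		// An unplugged cable shows up here once keepalive probes fail,
		// rather than going unnoticed until the next scheduled ping.
		log.Printf("connection lost: %v", err)
	}
}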

@klizhentas (Contributor) commented

I've investigated this problem a bit more today and came to the conclusion that this fix is harder to accomplish than I thought, for a couple of reasons:

  • It is hard to reliably detect out-of-sync errors; there are quite a few of them and they don't always manifest in the same way.
  • The system that fails to connect needs a full reboot, from proxies to nodes, and this will complicate the existing state machine even more. After trying several recovery approaches I ended up in a dead end with a completely broken cluster.

So I'm cautiously removing this from the milestone until I get some better ideas on how to fix this.

@klizhentas klizhentas removed this from the 4.1 "Seattle" milestone Jul 25, 2019
@webvictim (Contributor) commented Oct 1, 2019

This issue (or at least a very similar one) caught out a user on evaluation recently, and also stumped me for quite a while as I tried to figure out why they couldn't join a node to the cluster.

The issue here went as follows:

  • The user ran tctl nodes add on their auth server and tried to run the resulting command (to join a node with a token) on a separate node machine:
teleport start \
   --roles=node \
   --token=dd739637c606a4cf1434031ff0ecfebb \
   --ca-pin=sha256:2e491d37c8ce1d3509fe31bfc46018728a70725cbc09697fbf71c4fb776d0657 \
   --auth-server=172.17.0.3:3025
  • Because they were using an IP address and not a hostname (for evaluation), they got an error: ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: cannot validate certificate for 172.17.0.3 because it doesn't contain any IP SANs. service/connect.go:65
  • To fix this, I advised them to set public_addr under auth_service in their config to 172.17.0.3:3025 and restart Teleport, so the auth server would regenerate the certificate to include the IP SAN
  • They did this, but rather than fixing the error, it changed to ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: certificate signed by unknown authority. service/connect.go:65
  • On my advice they tried removing /var/lib/teleport on the auth server and restarting, but still got the same error on the node

Eventually we figured out that the node had cached the old cert from the auth service, and as such the join was failing. After removing /var/lib/teleport on the node and running the join command again, it worked the first time.

I think this may be a subtly different bug from the original issue; if so, it would be nice to fix it too.

I would suggest one of two fixes:

  1. warn the user that they might try deleting /var/lib/teleport to fix their issue
  2. (preferably) automatically delete the old auth server cert from the database, fetch a new cert, and try joining again in the event of that error (x509: certificate signed by unknown authority)

Given that we're providing the CA pin hash on the command line, it seems like it'd be fairly simple to detect that the existing cert we're caching doesn't match that CA pin, then delete the cache and refetch a fresh version. A sketch of that check follows.
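
For illustration, a Go sketch of such a check. It assumes the pin is a SHA-256 over the CA certificate's SubjectPublicKeyInfo, which the sha256: prefix suggests but which may not match Teleport's exact pinning input:

package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/hex"
	"encoding/pem"
	"errors"
	"strings"
)

// matchesCAPin reports whether a cached, PEM-encoded CA certificate matches
// a --ca-pin value of the form "sha256:<hex>". On a mismatch the node could
// drop its cache and refetch, instead of failing the join with
// "x509: certificate signed by unknown authority".
func matchesCAPin(caPEM []byte, pin string) (bool, error) {
	block, _ := pem.Decode(caPEM)
	if block == nil {
		return false, errors.New("no PEM block in cached CA")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	// Assumption: hash the SubjectPublicKeyInfo, as the sha256: prefix hints.
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return strings.EqualFold("sha256:"+hex.EncodeToString(sum[:]), pin), nil
}

func main() {
	// Example: matchesCAPin(cachedPEM, "sha256:2e49...0657")
}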

@one000mph (Contributor) commented

I encountered a similar error message due to a cached cert mismatch and was able to resolve it by removing the offending node (tctl rm nodes/<node-uuid>) and re-registering it. I agree that it would be better to avoid the problem in the first place and/or document a better fix (working on this!).

@webvictim (Contributor) commented Dec 2, 2019

Another customer got hit with a very similar issue recently. It wasn't at all clear to them that they needed to delete /var/lib/teleport, but after they did, everything just worked as expected.

This is a big cause of failures and frustration that we should improve on.

@zmb3 (Collaborator) commented May 5, 2020

+1 - Would love to see an easier workflow here. As we get our first Teleport cluster up and running, we're constantly making changes that require us to reconfigure or redeploy the auth server, and getting each node to rejoin the cluster is tedious, even for a small number of nodes.

@ghost commented May 5, 2020

This definitely needs a less manual process for rejoining nodes to a cluster. We're targeting automated infrastructure, so in a worst-case scenario where we lose Teleport, it would be much friendlier if we didn't need to go through every registered node to trigger a re-registration.

@stevenGravy (Contributor) commented May 8, 2020

When a node attempts to connect to a cluster that has been reset with a different cluster name, the log can get overwhelmed with messages like the one below. In some cases you do get the address of the node attempting access, even with the right token; in others you don't, which makes it harder to troubleshoot. The fix is to reset the node's /var/lib/teleport dir and reconnect.

ERRO [AUTH:1] "Failed to retrieve client pool. Client cluster <oldname>, target cluster <newname>, error:
ERROR REPORT:
Original Error: *trace.NotFoundError "/authorities/host/<old name>" is not found
Stack Trace:
    /gopath/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:816 github.com/gravitational/teleport/lib/backend/dynamo.(*DynamoDBBackend).getKey
    /gopath/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:395 github.com/gravitational/teleport/lib/backend/dynamo.(*DynamoDBBackend).Get
    /gopath/src/github.com/gravitational/teleport/lib/backend/sanitize.go:103 github.com/gravitational/teleport/lib/backend.(*Sanitizer).Get
    /gopath/src/github.com/gravitational/teleport/lib/backend/report.go:130 github.com/gravitational/teleport/lib/backend.(*Reporter).Get
    /gopath/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority
    /gopath/src/github.com/gravitational/teleport/lib/cache/cache.go:540 github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority
    /gopath/src/github.com/gravitational/teleport/lib/auth/middleware.go:336 github.com/gravitational/teleport/lib/auth.ClientCertPool
    /gopath/src/github.com/gravitational/teleport/lib/auth/middleware.go:152 github.com/gravitational/teleport/lib/auth.(*TLSServer).GetConfigForClient
    /opt/go/src/crypto/tls/handshake_server.go:147 crypto/tls.(*Conn).readClientHello
    /opt/go/src/crypto/tls/handshake_server.go:43 crypto/tls.(*Conn).serverHandshake
    /opt/go/src/crypto/tls/conn.go:1364 crypto/tls.(*Conn).Handshake
    /opt/go/src/net/http/server.go:1783 net/http.(*conn).serve
    /opt/go/src/runtime/asm_amd64.s:1358 runtime.goexit
User Message: "/authorities/host/<old name>" is not found." auth/middleware.go:160

@alanlubold commented

I have an issue that starts out the same way as what is discussed earlier in this thread. I certainly fall into the category of someone who is learning Teleport.

I try to make a node join a freshly installed single-node cluster, using the IP, and get the error:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: cannot validate certificate for 172.17.0.3 because it doesn't contain any IP SANs. service/connect.go:65

Then I try adding public_addr under auth_service in my config, set to 172.17.0.3:3025 (up until now I did not have a teleport.yaml file), restarting, etc., and now I get:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: certificate signed by unknown authority. service/connect.go:65

At this point I am advised to delete /var/lib/teleport to remove the old cert, but upon restarting the teleport service and trying to generate a new cert with tctl nodes add, I get an error that I don't even have an auth server:

error: tctl must be either used on the auth server or provided with the identity file via --identity flag

And I find upon looking at /var/lib/teleport that it is indeed almost empty.

Is there something about how I am cleaning out the /var/lib/teleport folder that I am missing? I am just deleting it entirely, like so: sudo rm -r /var/lib/teleport

Thanks

@russjones (Contributor) commented

Related #2122.

@webvictim (Contributor) commented Jan 23, 2021

#5355 is a prime example of the need for this.

@russjones russjones modified the milestones: 6.0 "San Diego", 6.1 Jan 26, 2021
@russjones (Contributor) commented

We've improved the error message the user gets when this happens in #10108, #11550, and #11754.

We're also moving most of our instructions to rely more heavily on teleport configure, which will prompt the user if the data directory already contains state (#11867, #11457). A rough sketch of that kind of guard is below.
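
For illustration only, a minimal Go sketch of such a guard; dataDirHasState is a hypothetical helper, not the actual teleport configure implementation:

package main

import (
	"fmt"
	"os"
)

// dataDirHasState reports whether the data directory already contains
// anything, so the tool can prompt before overwriting existing state.
func dataDirHasState(dir string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if os.IsNotExist(err) {
		return false, nil // fresh install: nothing to clobber
	}
	if err != nil {
		return false, err
	}
	return len(entries) > 0, nil
}

func main() {
	hasState, err := dataDirHasState("/var/lib/teleport")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if hasState {
		fmt.Println("warning: /var/lib/teleport already contains state; continue? (y/N)")
	}
}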

@pschisa pschisa added the c-nl Internal Customer Reference label May 5, 2022
@nivasomu nivasomu added the c-va Internal Customer Reference label Feb 14, 2023
camscale added a commit that referenced this issue Dec 8, 2023
Update e ref to bring in:
* Display last updated time on billing summary (#2733)
* User monitor, Okta UAC, Okta access requests respect user locks. (#2815)
* [v13] ci: Remove 'v' from OCI version labels (#2820)
* [v13] chore: Bump Go to v1.20.12 (#2838)
* [v13] ci: Build single-arch images for legacy/operator OCI (#2857)

Changelog: Update Go to 1.20.12
github-merge-queue bot pushed a commit that referenced this issue Dec 8, 2023
* dronegen: enumerate linux tag build pipelines

Unroll the loops over architectures, fips and package type and just list
all the pipelines explicitly. This makes it a bit easier to understand
the combinations and to replace them with GitHub actions workflows.

Running `make dronegen` results in no changes to `.drone.yml`.

* dronegen: Convert some linux tag pipelines to GitHub Actions

Convert some of the linux-based tag build pipelines to run on GitHub
Actions. The following pipelines have been converted:

    build-linux-amd64
    build-linux-amd64-centos7
    build-linux-amd64-centos7-fips
    build-linux-386
    build-linux-arm
    build-linux-amd64-deb
    build-linux-amd64-deb-fips
    build-linux-amd64-centos7-rpm
    build-linux-amd64-centos7-fips-rpm
    build-linux-386-deb
    build-linux-386-rpm
    build-linux-arm-deb
    build-linux-arm-rpm

The GHA workflow builds tarballs as well as deb/rpm packages in a single
workflow, so the `-deb` and `-rpm` pipelines will need to be manually
removed from `.drone.yml`. The amd64 centos7 and non-centos7 pipelines
have been combined, as they were calling the same `make` target and
duplicating work. The amd64 build is always done on centos7. As a
result, we no longer name the pipeline with -centos7, but we do still
specify it, as the build.assets `Makefile` still has a centos7 target
which is called, and we still release an asset named with "centos7".

Still remaining of the linux-based tag build pipelines are the arm64
pipelines which are already converted using a different workflow and the
non-native windows build.

* dronegen: Convert linux-arm64 pipeline to common pipeline

Convert the build-linux-arm64 and push-build-linux-arm64 pipelines to
use the common `release-linux` workflow instead of the arm64-specific
`release-linux-arm64` workflow. This aligns it with the other linux
build pipelines and allows us to get rid of the specific workflow.

The pipelines for building the arm64 rpm and deb packages have not been
generated by dronegen for some time now - since the arm64 build was
converted to GitHub Actions. The OS packages were still built as
dronegen does not remove pipelines, so the existing pipelines from
before the GHA migration remained in `.drone.yml` and continued to run.

These os packaging pipelines will be manually removed in a subsequent
commit.

* dronegen: Add pipeline for building legacy AMIs

Generate a pipeline for calling a GitHub Actions workflow to generate
the legacy AMIs. There were two existing manually added pipelines -
`build-oss-amis` and `build-ent-amis` - that are replaced by this.

The new pipeline needs to be manually added and the old ones manually
removed.

* dronegen: Replace oci build with GitHub Actions workflow

Replace the `teleport-container-images-branch-tag` workflow that builds
the legacy and operator OCI images with a call to the GitHub Actions
workflow that does the same on GitHub Actions.

This requires the manual addition of the `build-oci` pipeline and manual
removal of the `teleport-container-images-branch-tag` pipeline, followed
by running `make dronegen` to flesh out `build-oci` and sign .drone.yml.

* dronegen: Remove now-unused functions and vars

Remove the now-unused functions and vars after converting pipelines to
calling GitHub Actions instead of running stuff on Drone.

* drone: Update .drone.yml for GHA workflows

Update .drone.yml by running `make dronegen` to update the following
pipelines to call GitHub Actions to build instead of building on Drone:

build-linux-amd64
build-linux-amd64-fips
build-linux-386
build-linux-arm64
build-linux-arm

Add two new pipelines for building AMIs and OCIs on GHA:

build-legacy-amis
build-oci

Remove the following pipelines as the build of deb/rpm packages are done
within the above pipelines on GitHub Actions now and the ami/oci
pipelines have been replaced:

build-linux-amd64-deb
build-linux-amd64-fips-deb
build-linux-amd64-centos7-rpm
build-linux-amd64-centos7-fips-rpm
build-linux-386-deb
build-linux-386-rpm
build-linux-arm64-deb
build-linux-arm64-rpm
build-linux-arm-deb
build-linux-arm-rpm
build-oss-amis
build-ent-amis
teleport-container-images-branch-tag

Remove the following pipelines as AMD64 builds are always centos7 builds, but
we were just doing it twice. No need for these any more, as the GHA workflow
will build the release artifacts for these with the centos7 targets:

build-linux-amd64-centos7
build-linux-amd64-centos7-fips

The pipelines were added/removed using the following script, followed by
`make dronegen`:

AWK_SCRIPT='
/^---$/ { printf "%s", accumulator; accumulator = "" }
/^---$/ || accumulator { accumulator = accumulator $0 "\n" }
/^name: / {
	drop = $2 == to_remove
	if ($2 == before && to_add) {
		printf "---\nname: %s\n", to_add
	}
	if (!drop) { printf "%s", accumulator }
	accumulator = ""
	next
}
!drop && !accumulator { print }
ENDFILE { printf "%s", accumulator }'

toremove=(
	build-linux-amd64-{centos7,centos7-fips}
	build-linux-amd64-{deb,fips-deb,centos7-rpm,centos7-fips-rpm}
	build-linux-386-{deb,rpm}
	build-linux-arm64-{deb,rpm}
	build-linux-arm-{deb,rpm}
	build-{oss,ent}-amis
	teleport-container-images-branch-tag
)
add_before=build-buildboxes
toadd=(
	build-legacy-amis
	build-oci
)

for pipeline in "${toremove[@]}"; do
	gawk -i inplace -v to_remove=$pipeline "$AWK_SCRIPT" .drone.yml
done
for pipeline in "${toadd[@]}"; do
	gawk -i inplace -v to_add=$pipeline -v before=$add_before "$AWK_SCRIPT" .drone.yml
done

* Update e ref for updated workflows, Go update, et al.

Update e ref to bring in:
* Display last updated time on billing summary (#2733)
* User monitor, Okta UAC, Okta access requests respect user locks. (#2815)
* [v13] ci: Remove 'v' from OCI version labels (#2820)
* [v13] chore: Bump Go to v1.20.12 (#2838)
* [v13] ci: Build single-arch images for legacy/operator OCI (#2857)

Changelog: Update Go to 1.20.12
@zmb3 zmb3 removed the R2 label Jan 18, 2024