Remove the need to rm /var/lib/teleport
#2838
Comments
The design looks good to me; very helpful feature.
On cluster/node disconnects: I suspect we don't have code in place that detects a dead idle socket on the node side. When a node/cluster holds a live connection to a cluster, the node (probably) only detects failure when it tries to ping it, NOT when the network cable is unplugged. That's why our reconnect logic is so relaxed: it's probably tied to the ping frequency, which must NOT be increased because that would lead to severe network over-utilization.
I've investigated this problem a bit more today and came to the conclusion that this fix is harder to accomplish than I thought, for a couple of reasons:
So I'm cautiously removing this from this milestone until I get some better ideas on how to fix it.
This (or at least a very similar) issue caught out a user on evaluation recently, and also stumped me for quite a while while I tried to figure out why they were having trouble joining a node to a cluster. The issue went as follows:
Eventually we figured out that the node had cached the old cert from the auth service, and as such the join was failing. After removing /var/lib/teleport, the join worked. I think this may be a subtly different bug to the original issue, and if so, it would be nice to fix it. I would suggest one of two fixes:
Given that we're providing the CA pin hash on the command line, it seems like it'd be fairly simple to detect that the existing cert we're caching doesn't match that CA pin, and in that case delete the cache and refetch a fresh version.
I encountered a similar error message due to a cached cert mismatch, and was able to resolve it by removing the stale data directory on the offending node.
Another customer got hit with a very similar issue recently. It wasn't at all clear to them that they needed to delete /var/lib/teleport. This is a big cause of failures and frustration that we should improve on.
+1 - Would love to see an easier workflow here. As we get our first Teleport cluster up and running, we're constantly making changes that require us to reconfigure or redeploy the auth server, and getting each node to rejoin the cluster is tedious, even for a small number of nodes.
Definitely needs a less manual process to rejoin nodes to a cluster. We're targeting automated infrastructure, so in a worst case scenario where we lose teleport, it would be much friendlier if we didn't need to go through every node that was registered to trigger a re-registration. |
In cases where a node attempts to connect to a cluster that has been reset with a different cluster name, the log can get overwhelmed with these messages. In some cases you do get the address of the node attempting access, even with the right token; in others you don't, which makes it harder to troubleshoot. The fix is to reset the node's /var/lib/teleport directory.
I have an issue that starts out the same way as what is discussed earlier in this thread. I certainly fall into the category of someone who is learning Teleport. I try to make a node join a freshly installed single-node cluster, using the IP, and get the error:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: cannot validate certificate for 172.17.0.3 because it doesn't contain any IP SANs. service/connect.go:65

Then I try adding public_addr under auth_service in my config, set to 172.17.0.3:3025 (up until now I did not have a teleport.yaml file), restart, and now I get:

ERRO [PROC:1] Node failed to establish connection to cluster: Get https://172.17.0.3:3025/v1/webapi/find: x509: certificate signed by unknown authority. service/connect.go:65

At this point I am advised to delete /var/lib/teleport to remove the old cert, but upon restarting the teleport service and trying to generate a new cert with tctl nodes add, I get an error that I don't even have an auth server:

error: tctl must be either used on the auth server or provided with the identity file via --identity flag

Looking at /var/lib/teleport, I find that it is indeed almost empty. Is there something about how I am cleaning out the /var/lib/teleport folder that I am missing? I am deleting it entirely, like so: sudo rm -r /var/lib/teleport

Thanks
Related #2122. |
#5355 is a prime example of the need for this. |
Problem
When users are learning Teleport, it's common for them to dramatically re-configure the auth server (change the cluster name, etc). When they do this, they frequently get stuck because the existing nodes fail to re-join the cluster (or they lose trusted clusters).
This happens because nodes (and trusted clusters) have their old cluster credentials stored in /var/lib/teleport, and those credentials are no longer valid. The "access denied" error message in the logs isn't very helpful, even though it's truthful.

Proposal
When a node re-connects to the cluster, it should do something like this:

- If /var/lib/teleport is empty -> use the join token
- If /var/lib/teleport contains credentials -> use the credentials
- If the credentials are rejected with "access denied" -> fall back to the join token and overwrite /var/lib/teleport with the new creds.

The same logic should apply to trusted clusters. If they suddenly start getting "access denied", they should try to use the join token.
Re-join intervals
Customers are also complaining that it takes up to 15 minutes for a trusted cluster to re-appear if there's a brief network disconnect. Let's do the following:
Question: can we safely be even more aggressive? 30 seconds? 15 seconds?
Warnings
CC @klizhentas to review.