teleport.cluster.local errors obfuscate the actual problem with a Teleport cluster #11289

Closed
pschisa opened this issue Mar 21, 2022 · 4 comments · Fixed by #15996
Labels
bug c-bi Internal Customer Reference c-nl Internal Customer Reference c-sk Internal Customer Reference error-msg Improving customer facing error messages.

Comments

@pschisa
Contributor

pschisa commented Mar 21, 2022

Description

When something goes wrong with Teleport connections (as is commonly the case during onboarding), the following error message crops up frequently:

2022-03-18T15:40:12-06:00 ERRO [PROC:1]    Proxy failed to establish connection to cluster: Post "https://teleport.cluster.local/v2/tokens/register": x509: certificate is valid for xxxxx, not teleport.cluster.local, object not found. service/connect.go:73
2022-03-18T15:40:12-06:00 INFO [AUTH]      Attempting registration via proxy server. auth/register.go:167
2022-03-18T15:40:12-06:00 ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://teleport.cluster.local/v2/tokens/register": x509: certificate is valid for xxxxxx, not teleport.cluster.local, object not found. service/connect.go:73
^C2022-03-18T15:40:14-06:00 INFO [PROC:1]    Got signal "interrupt", exiting immediately. service/signals.go:86
2022-03-18T15:40:14-06:00 INFO [PROC:1]    Node stopping connection attempts, teleport is shutting down. service/connect.go:88

My understanding is that this comes from a fallback mechanism in Teleport failing (the call to teleport.cluster.local). However, in the logs this ends up being the final error message an end user sees, which creates a red herring: many users come to believe they have a problem with their TLS certificates. I have not yet seen this error message help a customer resolve a problem. It only obscures the real issue reported earlier in the logs.

The request is to suppress logging of this error message. It most often acts as a red herring that hides the real problem, and it makes deployments take longer because end users cannot self-correct issues.

Other open GitHub issues with similar problems:
#5631
#6080
#6405
#7783
#8919
#10108

@webvictim
Contributor

All these x509 errors fall into the same bucket. There's no problem with the certs here - it's actually running in insecure mode - but this error stops people reading on in the logs to see the real issue (incorrect SSH principal):

2022-05-05T13:06:31Z ERRO [PROC:1]    "Node failed to establish connection to cluster: Failed to connect to Auth Server directly or over tunnel, no methods remaining.\n\tGet \"https://teleport.cluster.local/v2/domain\": x509: certificate is valid for localhost, localhost, localhost.local, not 6c6f63616c686f7374.teleport.cluster.local, Get \"https://teleport.cluster.local/v2/domain\": ssh: handshake failed: ssh: principal \"teleport\" not in the set of valid principals for given certificate: [\"1812de42-7ff2-43b2-ad7c-d6c1c4c9b6c5.localhost\" \"1812de42-7ff2-43b2-ad7c-d6c1c4c9b6c5\" \"localhost.localhost\" \"localhost\" \"127.0.0.1\" \"::1\" \"remote.kube.proxy.teleport.cluster.local\"]." service/connect.go:86

@pschisa pschisa added the c-sk Internal Customer Reference label May 5, 2022
@zmb3 zmb3 added the error-msg Improving customer facing error messages. label May 5, 2022
@pschisa pschisa added the c-bi Internal Customer Reference label Jul 8, 2022
@pschisa
Contributor Author

pschisa commented Jul 8, 2022

Another example of where the teleport.cluster.local logging is obfuscating the real problem, even in debugging:

INFO [CLIENT]    Successful auth with proxy <redacted-actual-proxy>. client/api.go:2721
DEBU [KEYSTORE]  Returning Teleport TLS certificate "<redacted>" valid until "<redacted>". client/keystore.go:307

ERROR REPORT:
Original Error: *trace.ConnectionProblemError Get "https://teleport.cluster.local/v2/authorities/host?load_keys=false": EOF
Stack Trace:
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/httplib/httplib.go:146 github.com/gravitational/teleport/lib/httplib.ConvertResponse
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/auth/clt.go:293 github.com/gravitational/teleport/lib/auth.(*Client).Get
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/auth/clt.go:474 github.com/gravitational/teleport/lib/auth.(*Client).GetCertAuthorities
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/client/api.go:3173 github.com/gravitational/teleport/lib/client.(*TeleportClient).GetTrustedCA
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/client/api.go:3183 github.com/gravitational/teleport/lib/client.(*TeleportClient).UpdateTrustedCA
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/lib/client/api.go:3034 github.com/gravitational/teleport/lib/client.(*TeleportClient).ActivateKey
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/tool/tsh/tsh.go:1137 main.onLogin
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/tool/tsh/tsh.go:776 main.Run
	/private/tmp/teleport-20220702-63537-gtsmq0/teleport-9.3.9/tool/tsh/tsh.go:368 main.main
	/usr/local/Cellar/go/1.18.3/libexec/src/runtime/proc.go:250 runtime.main
	/usr/local/Cellar/go/1.18.3/libexec/src/runtime/asm_amd64.s:1571 runtime.goexit
User Message: Get "https://teleport.cluster.local/v2/authorities/host?load_keys=false": EOF

@strideynet
Contributor

strideynet commented Aug 2, 2022

Some useful details/history on teleport.cluster.local: #7922

What is teleport.cluster.local

teleport.cluster.local is an arbitrary value we use for two distinct purposes within Teleport.

Firstly, we use it as the host part of URLs when making HTTP requests to the auth service. In lib/auth.Client, we replace the client's default http.Transport with one that has a configured DialContext. As a result, teleport.cluster.local is ignored when opening a net.Conn to the server, and some other connection method is used instead (e.g. TLS ALPN via the proxy, SSH via the proxy). This value therefore has little effect beyond appearing in log messages and errors when requests fail (e.g. Get "https://teleport.cluster.local/v2/authorities/host?load_keys=false": EOF, where the real issue is the EOF and is unrelated to teleport.cluster.local).
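To illustrate the first use, here is a minimal sketch of an HTTP client whose transport ignores the URL host when dialing. It is not the lib/auth.Client code: newFixedDialClient and fetchViaPlaceholder are hypothetical names, and a local httptest server stands in for the auth service.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

// placeholderHost mirrors the arbitrary host that appears in request URLs.
const placeholderHost = "teleport.cluster.local"

// newFixedDialClient (hypothetical helper) returns an HTTP client whose
// transport ignores the address derived from the request URL and always
// dials realAddr instead.
func newFixedDialClient(realAddr string) *http.Client {
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, network, realAddr)
		},
	}
	return &http.Client{Transport: transport}
}

// fetchViaPlaceholder starts a throwaway local server (a stand-in for the
// auth service), sends a request addressed to the placeholder host, and
// returns the Host header that the server received.
func fetchViaPlaceholder() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()

	client := newFixedDialClient(srv.Listener.Addr().String())
	resp, err := client.Get("http://" + placeholderHost + "/v2/domain")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	host, err := fetchViaPlaceholder()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("server saw Host header:", host)
}
```

The request never resolves teleport.cluster.local through DNS, yet any error returned by client.Get would still quote that URL, which is why the placeholder shows up in failure messages.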

We also hard-code this arbitrary value into the DNS SANs of certificates issued to Teleport hosts (e.g. proxy, auth server) and use this value in the client when validating the Teleport host credentials. This means that if you have a copy of the Host CA, you can connect to one of these services and trust it without that service being made aware of all of its possible "addresses". This is more complex than it first appears once you consider the various tunneling/routing options we have available: in many cases, clients connect to the auth server with no awareness of its real address and instead rely on the Proxy to forward traffic to it. This value therefore shows up when some part of the TLS handshake fails.
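The second use can be sketched with the standard library alone. The example below is an assumption-laden illustration, not Teleport's issuance or validation code: newSelfSignedCert and verifyAs are hypothetical helpers, and a self-signed certificate stands in for a host certificate signed by the Host CA.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// placeholderSAN is the arbitrary name the client validates against.
const placeholderSAN = "teleport.cluster.local"

// newSelfSignedCert (hypothetical helper) creates a self-signed certificate
// carrying the given DNS SANs. It stands in for a CA-issued host certificate.
func newSelfSignedCert(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-host"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
		DNSNames:              dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAs (hypothetical helper) trusts only the certificate itself and
// checks that it is valid for serverName, regardless of the address that
// was actually dialed.
func verifyAs(cert *x509.Certificate, serverName string) error {
	roots := x509.NewCertPool()
	roots.AddCert(cert)
	_, err := cert.Verify(x509.VerifyOptions{Roots: roots, DNSName: serverName})
	return err
}

func main() {
	withSAN, err := newSelfSignedCert(placeholderSAN)
	if err != nil {
		fmt.Println("cert generation failed:", err)
		return
	}
	fmt.Println("cert with placeholder SAN:", verifyAs(withSAN, placeholderSAN))

	// A certificate without the placeholder SAN fails hostname verification
	// with an x509 error that names teleport.cluster.local.
	withoutSAN, err := newSelfSignedCert("localhost")
	if err != nil {
		fmt.Println("cert generation failed:", err)
		return
	}
	fmt.Println("cert without placeholder SAN:", verifyAs(withoutSAN, placeholderSAN))
}
```

The second check fails with the same kind of x509 hostname error quoted in the logs earlier in this thread, even though nothing is wrong with the network address itself.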

How can we improve this?

Realistically, no single fix will make teleport.cluster.local go away. This is a broad issue. The only potential "solution" would be to come up with a magic string that is clearer, but I do not personally believe such a thing exists.

For the first category (where teleport.cluster.local is appearing in failed HTTP requests), I believe we can handle user confusion by introducing proper documentation of the significance of this value. As the transport could be connected to any number of addresses, it would be difficult for us to modify this in such a way that the real upstream address used shows in errors emitted by the HTTP client.

For the second category (where teleport.cluster.local shows up due to an issue during the TLS handshake), we'd be better off targeting the individual conditions where this occurs and providing more useful error messages. These errors are generally quite opaque, and in many cases we could detect what has gone wrong and return a more helpful message (e.g. "TLS Handshake failed due to XYZ: It's possible that your server is running in --insecure and this flag also needs to be passed to tsh. Please see https://docs.teleport.com for more").
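One possible shape for such targeted messages, as a rough sketch: classify the underlying error with errors.As and attach a hint per condition. Everything here is hypothetical (failureKind, classifyTLSError, hintFor, sampleHostnameError, and the hint wording); it only shows the pattern and is not how Teleport handles these errors.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
)

// failureKind is a hypothetical label for the condition behind a TLS error.
type failureKind string

const (
	kindHostnameMismatch failureKind = "hostname-mismatch"
	kindUnknownAuthority failureKind = "unknown-authority"
	kindOther            failureKind = "other"
)

// classifyTLSError unwraps err and reports which known condition caused it.
func classifyTLSError(err error) failureKind {
	var hostErr x509.HostnameError
	var authErr x509.UnknownAuthorityError
	switch {
	case errors.As(err, &hostErr):
		return kindHostnameMismatch
	case errors.As(err, &authErr):
		return kindUnknownAuthority
	default:
		return kindOther
	}
}

// hintFor returns an illustrative, user-facing hint for err. The wording is
// an assumption made for this sketch.
func hintFor(err error) string {
	switch classifyTLSError(err) {
	case kindHostnameMismatch:
		return "TLS handshake failed: the server did not present a Teleport host certificate. " +
			"The server may be running with --insecure; if so, pass --insecure to the client as well."
	case kindUnknownAuthority:
		return "TLS handshake failed: the server certificate is not signed by a CA this client trusts."
	default:
		return "Connection failed: " + err.Error()
	}
}

// sampleHostnameError builds a wrapped error similar to the ones quoted in
// this thread, for demonstration only.
func sampleHostnameError() error {
	cert := &x509.Certificate{DNSNames: []string{"localhost"}}
	inner := x509.HostnameError{Certificate: cert, Host: "teleport.cluster.local"}
	return fmt.Errorf("Get %q: %w", "https://teleport.cluster.local/v2/domain", inner)
}

func main() {
	err := sampleHostnameError()
	fmt.Println("raw error:", err)
	fmt.Println("hint:     ", hintFor(err))
}
```

Matching on error types rather than on the teleport.cluster.local string keeps each hint tied to a specific failure condition.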

I would argue that teleport.cluster.local is not the confusing part of these errors. The errors themselves are opaque, and we are misattributing that opacity to the teleport.cluster.local string. In many cases, treating these errors as similar because they share this string would be a mistake. I expect a large number of these errors to disappear as we introduce the connection hints feature. For those that remain, we could return useful messages that give users a path to resolution (e.g. enable --insecure).

Therefore, my suggested route for this issue would be to:

  • Complete "Document teleport.cluster.local" (#10103), and consider this issue closed as well.
  • Encourage people who come across this issue to raise specific issues for the errors that are occurring, so more useful messages can be added.

@dromadaire54

It is impossible to override this value
