Worker cannot connect to Controller while upgrade to 0.8.0 #2063

Closed
incubator4 opened this issue May 8, 2022 · 5 comments

@incubator4

Previously
My company has been using Boundary as a secure connection tool for a long time, since version 0.6.2.
We use AWS KMS and PostgreSQL as dependencies, and deploy the controller and a few workers in different regions as Kubernetes Deployments in EKS clusters.
It has worked well so far. Because there is a new release, I decided to upgrade our infrastructure to the latest version, 0.8.0.

Describe the bug
I followed the Upgrade and Database Migration documentation.
Back up the database -> scale the controller Deployment to replicas: 0 -> run the migration Job -> upgrade the controller pod image from 0.6.2 to 0.8.0
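
For readers who want the sequence spelled out, here is a rough sketch of the four steps as a dry-run plan in Python. It only builds and prints command lines and runs nothing. Every name in it (namespace, Deployment, container, Job manifest, backup file, database URL, image tag) is a placeholder I made up for illustration, not something taken from this setup.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, List[str]]


def upgrade_plan(
    namespace: str = "boundary",                     # hypothetical namespace
    deployment: str = "boundary-controller",         # hypothetical Deployment name
    container: str = "boundary",                     # hypothetical container name
    new_image: str = "hashicorp/boundary:0.8.0",     # assumed image reference
    job_manifest: str = "boundary-migrate-job.yaml", # hypothetical Job manifest
    backup_file: str = "boundary-backup.sql",        # hypothetical backup path
    db_url: str = "postgresql://localhost/boundary", # hypothetical database URL
) -> List[Step]:
    """Return the upgrade steps, in order, as (description, argv) pairs."""
    kubectl = ["kubectl", "--namespace", namespace]
    return [
        ("back up the database",
         ["pg_dump", "--dbname", db_url, "--file", backup_file]),
        ("scale the controller to zero replicas",
         kubectl + ["scale", f"deployment/{deployment}", "--replicas=0"]),
        ("run the migration Job",
         kubectl + ["apply", "-f", job_manifest]),
        ("switch the controller image",
         kubectl + ["set", "image", f"deployment/{deployment}",
                    f"{container}={new_image}"]),
    ]


if __name__ == "__main__":
    # Print the plan only; nothing is executed.
    for description, argv in upgrade_plan():
        print(f"# {description}")
        print(shlex.join(argv))
```

The report does not say when the controller was scaled back up, so the sketch stops at the image change.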

At first I had allocated too few resources to the controller, and the plugin load failed again and again.
Then I found a similar problem in #1813; after I allocated more resources to the controller, the problem was solved.
Maybe the error message could be friendlier in the future.

Everything seemed to work up to this point, but the workers cannot connect to the controller and emit the following logs.

Try to connect

{
  "id": "hQu6GKChpn",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "system",
  "data": {
    "version": "v0.1",
    "op": "worker.(Worker).createClientConn",
    "data": {
      "address": "<hidden-controller-address>:9201",
      "msg": "connected to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:15.386411333Z"
}

Failure log

It keeps reporting errors.

{
  "id": "wYzqTUlkSy",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing unable to dial to controller: dial tcp: lookup <available address> on 172.20.0.10:53: no such host\"",
    "error_fields": {},
    "id": "e_7X4lj1saof",
    "version": "v0.1",
    "op": "worker.(Worker).sendWorkerStatus",
    "info": {
      "msg": "error making status request to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:18.889134769Z"
}
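
When many of these events pile up, it can help to pull out which name failed to resolve and which resolver was asked. Below is a minimal sketch that does this for one event, assuming the error string keeps the "lookup HOST on RESOLVER: no such host" shape seen in the log above. The function name and the regular expression are my own and are not part of Boundary.

```python
import json
import re
from typing import Dict, Optional

# Matches the resolver failure text embedded in the event's "error" field.
LOOKUP_FAILURE = re.compile(
    r"lookup (?P<host>.+?) on (?P<resolver>\S+): no such host"
)


def lookup_failure(event_json: str) -> Optional[Dict[str, str]]:
    """Return the failed host and resolver from an error event, else None."""
    event = json.loads(event_json)
    if event.get("type") != "error":
        return None
    message = event.get("data", {}).get("error", "")
    match = LOOKUP_FAILURE.search(message)
    return match.groupdict() if match else None
```

Applied to the failure event above, this should report the redacted host placeholder and the resolver 172.20.0.10:53, which is the address the worker pod sent its DNS query to.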

Workers on both 0.6.2 and 0.8.0 could not connect to the controller, so I rolled back the upgrade and restored the database.

To Reproduce
Steps to reproduce the behavior:

  1. Upgrade controller from 0.6.2 to 0.8.0
  2. See error
@justenwalker
Contributor

FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.

That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.

This could potentially be related to #2072, which could happen if you are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.

@malnick
Collaborator

malnick commented May 11, 2022

Thanks for raising this @justenwalker - I won't rule out #2072 being related here, but just to be sure, since this does look DNS related, can you exec into the worker container and run an nslookup or telnet against the address it's unable to connect to?
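
If nslookup or telnet are not available in the worker image, roughly the same two checks can be sketched in Python, assuming an interpreter is available there. The function names are illustrative only. The connect check opens a TCP connection and closes it without sending any bytes.

```python
import socket
import sys
from typing import List


def resolve(host: str, port: int) -> List[str]:
    """nslookup-like check: return the resolved addresses, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """telnet-like check: True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check.py <controller-host> <port>
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    print("resolved:", resolve(target_host, target_port))
    print("tcp connect:", can_connect(target_host, target_port))
```

An empty list from resolve would match the "no such host" error in the worker log. A resolved address with a failed connect would point away from DNS and toward the listener or the load balancer.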

malnick added the triage label May 11, 2022
@justenwalker
Contributor

justenwalker commented May 11, 2022

@incubator4 raised the issue, so they'd have to try this. Just added comment to the other issue since I encountered this problem because of loadbalancer health checks; so it seems plausibly related.

@incubator4
Author

> FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.
>
> That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.
>
> This could potentially be related to #2072, which could happen if you are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.

In fact, it is an external AWS ELB with a public address (just like boundary-controller01.mycompany.net).
I use liveness/readiness TCP checks against the API port, 9200, so it might be a DNS error.
I've seen issue #2072; it is similar to my other issue, #2062. I think both might be the same problem (0.7.6 works but 0.8.0 does not).

@jefferai
Member

jefferai commented Sep 6, 2022

Hi there -- has this been addressed in later releases for you?
