Worker cannot connect to Controller while upgrade to 0.8.0 #2063

incubator4 · 2022-05-08T14:35:20Z

Previously
My company has been using Boundary as a secure connection tool for a long time, since version 0.6.2.
We use AWS KMS and PostgreSQL as dependencies, and deploy the controller and a few workers in different regions as Kubernetes Deployments on EKS.
It has worked well so far. Because there is a new release, I decided to upgrade our infrastructure to the latest version, 0.8.0.

Describe the bug
I followed the documentation for Upgrade and Database Migration.
Back up the database -> scale the controller Deployment to replicas: 0 -> run the migration Job -> upgrade the controller pod image from 0.6.2 to 0.8.0

~~Actually, I used to allocate fewer resources to the controller, and the plugin load failed again and again.~~
~~Then I found a similar problem in #1813, allocated more resources to the controller, and the problem was solved.~~
Maybe the error message could be friendlier in the future.

It seemed to work well up to this point, but the workers cannot connect to the controller and produce the following logs.

Try to connect

{
  "id": "hQu6GKChpn",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "system",
  "data": {
    "version": "v0.1",
    "op": "worker.(Worker).createClientConn",
    "data": {
      "address": "<hidden-controller-address>:9201",
      "msg": "connected to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:15.386411333Z"
}

Failure log

It keeps reporting errors.

{
  "id": "wYzqTUlkSy",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing unable to dial to controller: dial tcp: lookup <available address> on 172.20.0.10:53: no such host\"",
    "error_fields": {},
    "id": "e_7X4lj1saof",
    "version": "v0.1",
    "op": "worker.(Worker).sendWorkerStatus",
    "info": {
      "msg": "error making status request to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:18.889134769Z"
}

Workers on both 0.6.2 and 0.8.0 could not connect to the controller, so I rolled back the upgrade and restored the database.

To Reproduce
Steps to reproduce the behavior:

Upgrade controller from 0.6.2 to 0.8.0
See error


justenwalker · 2022-05-11T20:09:47Z

FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.
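To make that reading of the error concrete, here is a minimal sketch that pulls the failing hostname and the DNS server out of a worker error event like the one posted above. The function name `failed_lookup` and the regular expression are my own, not part of Boundary; the pattern assumes the Go resolver wording seen in this thread (`lookup <host> on <server>:<port>: no such host`).

```python
import json
import re

# Assumed message shape, taken from the log in this thread:
#   "... dial tcp: lookup <host> on <server>:<port>: no such host"
LOOKUP_RE = re.compile(
    r"lookup (?P<host>.+?) on (?P<server>\S+):(?P<port>\d+): no such host"
)


def failed_lookup(event_json):
    """Return (host, dns_server) from an error event, or None if no match.

    `event_json` is one event as a JSON string, with the message stored
    under data.error as in the failure log above.
    """
    event = json.loads(event_json)
    message = event.get("data", {}).get("error", "")
    match = LOOKUP_RE.search(message)
    if match is None:
        return None
    return match.group("host"), match.group("server")
```

For the failure log above this would return the redacted controller address together with `172.20.0.10`, which is the resolver the worker pod is actually asking. That resolver is the one to test against.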

That may not be the root issue though; if you are using a DNS-based load balancer in front of your controllers and your health checks are failing, it may show up this way because there are no healthy hosts.

This could potentially be related to #2072, which could happen if you are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body, causing it to crash/stop listening.

malnick · 2022-05-11T21:27:52Z

Thanks for raising this @justenwalker - I won't rule out #2072 being related here, but just to be sure, since this does look DNS related, can you exec into the worker container and run an nslookup or telnet against the address it's unable to connect to?
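If nslookup or telnet are not available in the image, a rough equivalent is sketched below: resolve the name first, then try a TCP connect to each resolved address. This assumes a Python interpreter is available in the container (or in a debug pod in the same cluster and namespace, so it uses the same DNS settings); `check_endpoint` is a hypothetical helper name, not a Boundary tool.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve `host`, then attempt a TCP connect to each resolved address.

    Returns a dict with:
      resolved  - list of addresses the name resolved to
      dns_error - resolver error text, or None if resolution worked
      connect   - per-address result, "ok" or "failed: <reason>"
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # This branch corresponds to the "no such host" case in the log.
        return {"resolved": [], "dns_error": str(exc), "connect": {}}

    results = {}
    for family, socktype, proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        if address in results:
            continue  # skip duplicate entries for the same address
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[address] = "ok"
        except OSError as exc:
            results[address] = f"failed: {exc}"
    return {"resolved": list(results), "dns_error": None, "connect": results}
```

Calling `check_endpoint("<controller address>", 9201)` from the worker pod separates the two failure modes: a non-empty `dns_error` points at name resolution, while resolved addresses with failed connects point at the load balancer or the controller listener.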

justenwalker · 2022-05-11T21:36:13Z

@incubator4 raised the issue, so they'd have to try this. I just added a comment to the other issue, since I encountered this problem because of load balancer health checks, so it seems plausibly related.

incubator4 · 2022-05-12T08:15:12Z

> FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.
>
> That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.
>
> This could potentially be related to #2072 which could happen if your are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.

In fact, it is an external AWS ELB with a public address (just like boundary-controller01.mycompany.net).
I use liveness/readiness TCP checks against the API port, 9200. It might be some DNS error.
I've seen issue #2072; it is similar to my other issue, #2062. I think both might be the same problem (0.7.6 works but 0.8.0 does not).

jefferai · 2022-09-06T13:06:02Z

Hi there -- has this been addressed in later releases for you?

malnick added the triage label May 11, 2022

jefferai mentioned this issue May 20, 2022

Failed to WebSocket dial use Boundary Cli with version 0.8.0 #2062

Closed

covetocove closed this as completed Dec 1, 2022
