Exact 10m proxy connection timeout #3133

Closed

macmiranda opened this issue Mar 30, 2023 · 10 comments
@macmiranda commented Mar 30, 2023

Hi,

Just noticed this behavior today that's kind of boggling me. Clients connecting via Boundary to a Postgres database, whether by using

  • boundary connect (and inputting creds) or
  • boundary connect postgres (automatically authenticates) or
  • Boundary Desktop + external program (e.g. pgAdmin4)

all get the same error at exactly 10m00s of running a query.

As an illustration of that, see the following output:

❯ boundary connect postgres -target-id ttcp_naGseB1PAs -dbname data_warehouse

psql (15.2, server 14.6)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES128-GCM-SHA256, compression: off)
You are now connected to database "data_warehouse" as user "v-token-to-data_war-PFXW7wY0o4V4emQh9KCn-1680173682".

data_warehouse=> select pg_sleep(601);

SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Succeeded.
psql (15.2, server 14.6)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES128-GCM-SHA256, compression: off)

Postgres itself logs to postgresql.log:

2023-03-30 09:43:18 UTC:10.14.37.204(51926):*****@*****:[3435]:LOG: could not send data to client: Connection timed out
2023-03-30 09:43:18 UTC:10.14.37.204(51926):*****@*****:[3435]:FATAL: connection to client lost

It happens with at least Boundary versions 0.10.5, 0.11.2 and 0.12.0.

Below are the relevant Boundary controller and worker logs:

{"id":"vkhYtQVHQO","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_CTEMrPiFJN","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.520607375Z"}
2023-03-30T08:23:28.520Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: unable to write error: session.(ConnectionRepository).closeOrphanedConnections: db.Query: unknown, unknown: error #0: context deadline exceeded
2023-03-30T08:23:28.521Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: unable to write error: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back
2023-03-30T08:23:28.521Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.521Z [ERROR] event.WriteError: unable to write error: session.WorkerStatusReport: Error closing orphaned connections for worker w_1qqKtUnH5H: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back: unknown: error #500
{"id":"ndOl83VsdJ","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.(ConnectionRepository).closeOrphanedConnections: db.Query: unknown, unknown: error #0: context deadline exceeded","error_fields":{"Code":0,"Msg":"","Op":"session.(ConnectionRepository).closeOrphanedConnections","Wrapped":{"Code":0,"Msg":"","Op":"db.Query","Wrapped":{}}},"id":"e_iLgC1isb68","version":"v0.1","op":"session.(ConnectionRepository).closeOrphanedConnections","request_info":{"id":"gtraceid_oRh3Tvp3SzY2Uqyt0vof","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.520890808Z"}
{"id":"QHV7YGVDyX","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back","error_fields":{"Code":0,"Msg":"error comparing state","Op":"session.(ConnectionRepository).closeOrphanedConnections","Wrapped":{"Code":0,"Msg":"","Op":"db.DoTx","Wrapped":{}}},"id":"e_iLgC1isb68","version":"v0.1","op":"session.(ConnectionRepository).closeOrphanedConnections","request_info":{"id":"gtraceid_oRh3Tvp3SzY2Uqyt0vof","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.521249236Z"}
{"id":"zpmF4fuP1Q","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.WorkerStatusReport: Error closing orphaned connections for worker w_1qqKtUnH5H: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back: unknown: error #500","error_fields":{"Code":500,"Msg":"Error closing orphaned connections for worker w_1qqKtUnH5H: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back","Op":"session.WorkerStatusReport","Wrapped":null},"id":"e_iLgC1isb68","version":"v0.1","op":"session.WorkerStatusReport","request_info":{"id":"gtraceid_oRh3Tvp3SzY2Uqyt0vof","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.521356357Z"}
{"id":"ZBjEoaUwLD","source":"https://hashicorp.com/boundary/boundary-worker-0/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_18tR27tOYI","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.523464217Z"}
2023-03-30T08:23:28.523Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.523Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.523Z [ERROR] event.WriteError: unable to write error: session.(ConnectionRepository).closeOrphanedConnections: db.Query: unknown, unknown: error #0: context deadline exceeded
2023-03-30T08:23:28.523Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.523Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.523Z [ERROR] event.WriteError: unable to write error: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back
{"id":"0jisII16lE","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.(ConnectionRepository).closeOrphanedConnections: db.Query: unknown, unknown: error #0: context deadline exceeded","error_fields":{"Code":0,"Msg":"","Op":"session.(ConnectionRepository).closeOrphanedConnections","Wrapped":{"Code":0,"Msg":"","Op":"db.Query","Wrapped":{}}},"id":"e_TCbtGabie1","version":"v0.1","op":"session.(ConnectionRepository).closeOrphanedConnections","request_info":{"id":"gtraceid_nzMgg1hruTbaIjxbniic","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.523659533Z"}
{"id":"42hw5WFTzC","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back","error_fields":{"Code":0,"Msg":"error comparing state","Op":"session.(ConnectionRepository).closeOrphanedConnections","Wrapped":{"Code":0,"Msg":"","Op":"db.DoTx","Wrapped":{}}},"id":"e_TCbtGabie1","version":"v0.1","op":"session.(ConnectionRepository).closeOrphanedConnections","request_info":{"id":"gtraceid_nzMgg1hruTbaIjxbniic","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.523770068Z"}
2023-03-30T08:23:28.524Z [ERROR] encountered an error sending an error event:
  error:=
  | event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.524Z [ERROR] event.WriteError: event.(Eventer).writeError: event.(Eventer).retrySend: failed to send event: 2 errors occurred:
2023-03-30T08:23:28.524Z [ERROR] event.WriteError: unable to write error: session.WorkerStatusReport: Error closing orphaned connections for worker w_nkXuxs6JmW: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back: unknown: error #500
{"id":"HFQBHddmmB","source":"https://hashicorp.com/boundary/boundary-controller-0/controller","specversion":"1.0","type":"error","data":{"error":"session.WorkerStatusReport: Error closing orphaned connections for worker w_nkXuxs6JmW: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back: unknown: error #500","error_fields":{"Code":500,"Msg":"Error closing orphaned connections for worker w_nkXuxs6JmW: session.(ConnectionRepository).closeOrphanedConnections: error comparing state: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back","Op":"session.WorkerStatusReport","Wrapped":null},"id":"e_TCbtGabie1","version":"v0.1","op":"session.WorkerStatusReport","request_info":{"id":"gtraceid_nzMgg1hruTbaIjxbniic","method":"/controller.servers.services.v1.ServerCoordinationService/Status"}},"datacontentype":"application/cloudevents","time":"2023-03-30T08:23:28.523988484Z"}

Executing the same query without Boundary works as expected.

@macmiranda macmiranda changed the title Exactly 10m postgres proxy connection timeout Exact 10m postgres proxy connection timeout Mar 30, 2023
@macmiranda (Author)

Similar symptoms to #1813

@macmiranda macmiranda changed the title Exact 10m postgres proxy connection timeout [BUG] Exact 10m postgres proxy connection timeout Mar 31, 2023
@macmiranda (Author)

@irenarindos @jefferai any thoughts?

@irenarindos (Collaborator)

@macmiranda thanks for reporting this! I'll look into this and try to reproduce it.

Could you provide your worker and controller configs (with any sensitive information redacted)? I'm also curious if you could provide more details about your deployment; are there any load balancers or proxies in front of the workers?

Are you able to execute the same query from the machine where a worker is running, without Boundary?
And just to rule it out, what is the session max seconds value when you perform a read on the target?

@macmiranda (Author)

Hi @irenarindos,

Thanks for looking into this; it's causing real problems for our data engineers (their queries are often very long-running).

We have several different Boundary setups, but I'll describe the simplest one (the issue happens on every cluster, no matter what):

  • We have 3 controllers and 3 workers deployed on a single Kubernetes cluster. They are exposed via Ingress (ultimately a Load Balancer), and have hostnames like boundary.domain.com (which people use for Boundary Desktop and Boundary CLI) and boundary-worker.domain.com, respectively.
  • We are using AWS KMS worker auth.

The configs are pretty standard. Here's the controller one:

disable_mlock = true
log_format    = "standard"

controller {
  name        = "env://HOSTNAME"
  description = "Boundary controller"
  database {
    url = "env://DATABASE_URL"
    max_open_connections = 5
  }
  public_cluster_addr = "boundary-controller.boundary"
}

listener "tcp" {
  purpose     = "api"
  address     = "0.0.0.0"
  tls_disable = true
}

listener "tcp" {
  purpose     = "cluster"
  address     = "0.0.0.0"
}

listener "tcp" {
  purpose     = "ops"
  address     = "0.0.0.0"
  tls_disable = true
}

kms "awskms" {
  purpose    = "root"
  key_id     = "global_root"
  kms_key_id = "*********"
  region     = "eu-central-1"
}

kms "awskms" {
  purpose    = "worker-auth"
  key_id     = "global_worker_auth"
  kms_key_id = "*******"
  region     = "eu-central-1"
}

kms "awskms" {
  purpose    = "recovery"
  key_id     = "global_recovery"
  kms_key_id = "*******"
  region     = "eu-central-1"
}

Here's the worker one:

disable_mlock = true
enable_worker_auth_debugging = true
log_format    = "standard"

worker {
  name        = "env://HOSTNAME"
  description = "Boundary worker"
  public_addr = "boundary-worker.domain.com"
  initial_upstreams = [
    "boundary-controller-0.boundary-internal.boundary",
    "boundary-controller-1.boundary-internal.boundary",
    "boundary-controller-2.boundary-internal.boundary",
  ]
}

listener "tcp" {
  purpose     = "proxy"
  address     = "0.0.0.0"
  tls_disable = true
}

listener "tcp" {
  purpose     = "ops"
  address     = "0.0.0.0"
  tls_disable = true
}

kms "awskms" {
  purpose    = "worker-auth"
  key_id     = "global_worker_auth"
  kms_key_id = "********"
  region     = "eu-central-1"
}

I've enabled enable_worker_auth_debugging recently in an attempt to discover something new about this issue. It is false otherwise.

Communication between worker and controller uses a Kubernetes service of type ClusterIP. The controller cluster address is also a Kubernetes service of type ClusterIP. All Pods are running in the same namespace.

As I mentioned, we have a different setup where the controllers and workers are in different clusters, so they use Ingress for the internal communication as well, but I don't want to overcomplicate things at this point.

Regarding your questions:

Are you able to execute the same query from the machine where a worker is running, without Boundary?

It's difficult to say because the workers run in Pods whose container image (HashiCorp's official one) doesn't have psql installed, and the Pods have no root filesystem write access. From any other Pod in the same namespace, the query definitely works.
In fact, our workaround at the moment is to create Pods running socat and port-forward to the Boundary targets through them.
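
A minimal sketch of that workaround, with a placeholder pod name, relay image, and database hostname (none of these are our actual values):

# Run a throwaway relay pod that forwards TCP traffic to the database
# (alpine/socat is one image that ships socat; substitute your own).
kubectl run pg-relay --image=alpine/socat --restart=Never -- \
  tcp-listen:5432,fork,reuseaddr tcp-connect:postgres.internal.example.com:5432

# Forward a local port through the relay pod and connect with psql as usual.
kubectl port-forward pod/pg-relay 15432:5432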

And just to rule it out, what is the session max seconds value when you perform a read on the target?

This is one of the targets, the one with the lowest session max seconds (8h):

❯ boundary targets read -id ttcp_eJ9M8IXX8c

Target information:
  Created Time:               Sun, 05 Mar 2023 19:26:51 CET
  ID:                         ttcp_eJ9M8IXX8c
  Name:                       Boundary
  Session Connection Limit:   -1
  Session Max Seconds:        28800
  Type:                       tcp
  Updated Time:               Sun, 05 Mar 2023 19:28:04 CET
  Version:                    2

In fact, some of the targets we were testing with had a 5-day session duration limit.

Just to wrap up: I did some more testing today, and even though the problem persists (with the same symptoms), I didn't get the log errors again. So I'm not sure the errors above are actually relevant (though I initially thought they were). It's really strange: I get exactly the same behavior as yesterday, but absolutely nothing in the logs indicates a problem, apart from the fact that every single connection is closed after 10m.

@irenarindos (Collaborator)

@macmiranda thanks so much for all the detail!

I have one more question: is it possible your load balancer is causing the timeout? Could you see if it's possible to configure it with a larger idle timeout value?

@macmiranda (Author)

We use AWS NLBs for our Ingresses.

From their documentation:

We set the idle timeout value for TCP flows to 350 seconds. You can't modify this value. Clients or targets can use TCP keepalive packets to reset the idle timeout. Keepalive packets sent to maintain TLS connections can't contain data or payload.

Source: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

It would be strange if the first keepalive packet were sent before 350s but the second weren't sent before 600s, but I do agree that's something we need to check. I'll try to capture packets on both ends of the connection (though the worker Pod is going to be a bit harder).

In the meantime, if you could try that in your lab, that'd be great.
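
As a side note on the keepalive point: libpq lets the client send its own TCP keepalives, which would be one rough way to test whether keepalives make a difference through the proxy. The local proxy port below is just a placeholder for whatever boundary connect prints:

# Connect through the local Boundary proxy with client-side TCP keepalives
# every 60s (keepalives_* are standard libpq connection parameters).
psql "host=127.0.0.1 port=55001 dbname=data_warehouse keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=6"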

@macmiranda (Author)

On a side note, we are planning to publicly release the Boundary Helm chart we use for our deployments, so maybe that could be helpful.

@macmiranda macmiranda changed the title [BUG] Exact 10m postgres proxy connection timeout Exact 10m postgres proxy connection timeout Apr 3, 2023
@macmiranda macmiranda changed the title Exact 10m postgres proxy connection timeout Exact 10m proxy connection timeout Apr 3, 2023
@macmiranda (Author)

It turns out the timeout wasn't set by Boundary or the load balancer. It was Ingress NGINX's proxy-stream-timeout setting (which defaults to 10m).
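
For anyone who hits the same thing: the value can be raised through the ingress-nginx ConfigMap. A minimal sketch, assuming the controller's ConfigMap is ingress-nginx-controller in the ingress-nginx namespace (adjust for your install):

# Raise the TCP/UDP stream proxy timeout from its 600s default to 1h.
kubectl -n ingress-nginx patch configmap ingress-nginx-controller \
  --type merge -p '{"data":{"proxy-stream-timeout":"3600s"}}'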

Thanks for all the help and sorry to bother you with this issue.

@jefferai (Member) commented Apr 3, 2023

Glad it's solved!

@omkensey (Contributor) commented Apr 3, 2023

Just following up here to note that workers should never be load-balanced. There needs to be a 1:1 correspondence between a particular IP address and port and the worker advertising that IP and port.

In a Kubernetes context, where you pretty much have to "load balance" in front of the worker, you can either create a LoadBalancer with a unique hostname per worker, or use a NodePort service with a fixed port that's unique per worker and just pick a node name to be the advertised hostname of the worker.
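
A minimal sketch of the first option, assuming a worker Pod named boundary-worker-0 listening on the default proxy port 9202 (both names are examples, not a prescription):

# Give this one worker its own LoadBalancer Service so the advertised
# address and port map 1:1 to this worker.
kubectl expose pod boundary-worker-0 --type=LoadBalancer \
  --name=boundary-worker-0 --port=9202 --target-port=9202

The worker's public_addr then has to match whatever hostname or IP that Service ends up with.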
