Exact 10m proxy connection timeout #3133
Comments
Similar symptoms as #1813.
@irenarindos @jefferai any thoughts?
@macmiranda thanks for reporting this! I'll look into this and try to reproduce it. Could you provide your worker and controller configs (with any sensitive information redacted)? I'm also curious if you could provide more details about your deployment: are there any load balancers or proxies in front of the workers? Are you able to execute the same query from the machine where a worker is running, without Boundary?
Hi @irenarindos, thanks for looking into this; it's creating real problems for our data engineers (their queries often run for a very long time). We have several different Boundary setups, but I'll describe the simplest one, since the problem happens on every cluster regardless:
The configs are pretty standard. Here's the controller one:
Here's the worker one:
I've enabled … Communication between worker and controller uses a Kubernetes service of type … (see the sketch after this comment). As I mentioned, we have a different setup where the controllers and workers are in different clusters, so they use an Ingress for the internal communication as well, but I don't want to overcomplicate things at this point. Regarding your questions:
It's difficult to say, because the workers are running in Pods whose container image (HashiCorp's official one) does not have …
This is just one of the targets with the lowest max session limit (8h):
In fact, some of the targets we were testing with had a 5-day session duration limit. Just to wrap up, I did some more testing today, and even though the problem persists (with the same symptoms), I didn't get the log errors again. So I'm not quite sure the errors above are that relevant (though I initially thought they were). It is really strange: I get the exact same behavior as yesterday but absolutely nothing that indicates a problem in the logs (except for the fact that every single connection is closed after 10m).
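For illustration, the worker-to-controller path described above could be fronted by something like the following Service. This is only a sketch with assumed names and Boundary's default cluster port (9201), not the actual manifest from this deployment:

```yaml
# Hypothetical ClusterIP Service in front of the controllers' cluster
# listener. Workers inside the same cluster would dial this Service name
# as their upstream. All names are illustrative; 9201 is Boundary's
# default cluster port.
apiVersion: v1
kind: Service
metadata:
  name: boundary-controller-cluster
spec:
  type: ClusterIP
  selector:
    app: boundary-controller
  ports:
    - name: cluster
      port: 9201
      targetPort: 9201
```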
@macmiranda thanks so much for all the detail! I have one more question: is it possible your load balancer is causing the timeout? Could you see if it's possible to configure it with a larger idle timeout value?
We use AWS NLBs for our Ingresses. According to their documentation, the idle timeout for TCP flows is 350 seconds and cannot be changed, though clients or targets can use TCP keepalive packets to reset it.
It would be strange if the first keep-alive packet were sent before 350s but the second weren't sent before 600s, but I do agree that's something we need to check. I'll try to capture packets on both ends of the connection (though the worker Pod is going to be a bit harder). In the meantime, if you could try that in your lab, that'd be great.
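If the NLB idle timer does turn out to depend on keep-alives reaching it, one knob worth knowing about (an assumption on my part, not something established in this thread) is the pod-level kernel TCP keep-alive interval, which defaults to 7200s on Linux and therefore never fires before the NLB's 350s timeout. A rough sketch of lowering it on a worker pod:

```yaml
# Hypothetical pod spec fragment: lower the kernel TCP keep-alive timers
# below the NLB's fixed 350-second idle timeout so idle-but-open sessions
# keep the flow alive. Names are illustrative; depending on the Kubernetes
# version and kubelet configuration these sysctls may need to be explicitly
# allowed (--allowed-unsafe-sysctls).
apiVersion: v1
kind: Pod
metadata:
  name: boundary-worker-0
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.tcp_keepalive_time    # idle seconds before the first probe (default 7200)
        value: "300"
      - name: net.ipv4.tcp_keepalive_intvl   # seconds between probes (default 75)
        value: "60"
  containers:
    - name: boundary
      image: hashicorp/boundary
      args: ["server", "-config", "/boundary/config.hcl"]
```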
On a side note, we are planning to publicly release the Boundary Helm chart we use for our deployments, so maybe that could be helpful.
It turns out the timeout wasn't set by Boundary or the load balancer; it was an Ingress NGINX setting. Thanks for all the help, and sorry to bother you with this issue.
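The comment above doesn't name the exact directive, but one plausible candidate, offered here purely as an assumption, is ingress-nginx's stream proxy timeout, which defaults to 600s (exactly 10 minutes) for TCP services proxied through the controller. A minimal sketch of raising it via the controller's ConfigMap, assuming a stock ingress-nginx installation:

```yaml
# Hypothetical ingress-nginx controller ConfigMap change: raise the
# stream-level (TCP) proxy timeout above its 600s default so long-running
# proxied connections such as Boundary sessions are not cut off after
# exactly 10 minutes. Namespace and name assume a default installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  proxy-stream-timeout: "8h"   # rendered as nginx's stream proxy_timeout
```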
Glad it's solved!
Just following up here to note that workers should never be load-balanced. They need 1:1 correspondence between a particular IP address and port and a worker advertising that IP and port. In a Kubernetes context, where you pretty much have to "load balance" in front of the worker, you can either create a LoadBalancer with a unique hostname per worker, or use a NodePort service with a fixed port that's unique per worker and just pick a node name to be the advertised hostname of the worker.
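To make the per-worker option concrete, here is a minimal sketch of a dedicated LoadBalancer Service per worker pod, so each worker gets its own stable address to advertise. All names, labels, and the port are assumptions rather than details from this thread:

```yaml
# Hypothetical one-Service-per-worker setup: each worker pod gets its own
# LoadBalancer hostname, which that worker then advertises (e.g. via
# public_addr in its config). The selector assumes the workers run as a
# StatefulSet named boundary-worker; 9202 is Boundary's default proxy port.
apiVersion: v1
kind: Service
metadata:
  name: boundary-worker-0
spec:
  type: LoadBalancer
  selector:
    app: boundary-worker
    statefulset.kubernetes.io/pod-name: boundary-worker-0
  ports:
    - name: proxy
      port: 9202
      targetPort: 9202
```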
Hi,
Just noticed this behavior today that's kinda boggling me. Clients connecting via Boundary to a Postgres database, either by using `boundary connect` (and inputting creds), `boundary connect postgres` (which authenticates automatically), or …, all get the same error at exactly 10m00s into running a query. As an illustration of that, see the following output:
Postgres itself logs to postgresql.log:
It happens with at least Boundary versions `0.10.5`, `0.11.2`, and `0.12.0`. Below are the relevant Boundary controller and worker logs:
Executing the same query without Boundary works as expected.