Unstable Connections and constant resets #1813

Closed

giulio1979 opened this issue Jan 5, 2022 · 10 comments


giulio1979 commented Jan 5, 2022

I am doing some long-running testing on Boundary, and I noticed that after a few hours my connections became highly unstable.

For example:

WARNING: Upload failed: /20220105Jan1641368655?partNumber=5&uploadId=09d93ce5-6faf-46c6-82ed-3cf89b86262f ([Errno 104] Connection reset by peer)
WARNING: Waiting 3 sec...
upload: '/tmp/1' -> 's3://test/20220105Jan1641368655' (104857600 bytes in 14.6 seconds, 6.83 MB/s) [1 of 1]
WARNING: Retrying failed request: /?delimiter=%2F (Remote end closed connection without response)
WARNING: Waiting 3 sec...
delete: 's3://test/20220105Jan1641368655'
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
ERROR: Cannot retrieve any response status before encountering an EPIPE or ECONNRESET exception
WARNING: Upload failed: /20220105Jan1641368675?partNumber=4&uploadId=1c0ac6bd-5fae-459a-8b9b-966287daa702 ([Errno 104] Connection reset by peer)
WARNING: Waiting 3 sec...
upload: '/tmp/1' -> 's3://test/20220105Jan1641368675' (104857600 bytes in 9.3 seconds, 10.73 MB/s) [1 of 1]
delete: 's3://test/20220105Jan1641368680'
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
upload: '/tmp/1' -> 's3://test/20220105Jan1641368687' (104857600 bytes in 3.2 seconds, 30.81 MB/s) [1 of 1]
delete: 's3://test/20220105Jan1641368684'
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
ERROR: Cannot retrieve any response status before encountering an EPIPE or ECONNRESET exception
WARNING: Upload failed: /20220105Jan1641368692?partNumber=2&uploadId=07ada446-cc1e-46cb-8941-800368f12797 ([Errno 32] Broken pipe)
WARNING: Waiting 3 sec...
upload: '/tmp/1' -> 's3://test/20220105Jan1641368692' (104857600 bytes in 9.3 seconds, 10.78 MB/s) [1 of 1]
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
upload: '/tmp/1' -> 's3://test/20220105Jan1641368703' (104857600 bytes in 3.6 seconds, 27.58 MB/s) [1 of 1]
delete: 's3://test/20220105Jan1641368700'
delete: 's3://test/20220105Jan1641368703'
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
ERROR: Cannot retrieve any response status before encountering an EPIPE or ECONNRESET exception
WARNING: Upload failed: /20220105Jan1641368708?partNumber=2&uploadId=f9b807d3-f41f-4941-ad89-94c85193c484 ([Errno 32] Broken pipe)
WARNING: Waiting 3 sec...
upload: '/tmp/1' -> 's3://test/20220105Jan1641368708' (104857600 bytes in 8.6 seconds, 11.57 MB/s) [1 of 1]
WARNING: Retrying failed request: /20220105Jan1641368708 (Remote end closed connection without response)
WARNING: Waiting 3 sec...

The Boundary controller log is filled with:

2022-01-05T07:46:02.752Z [ERROR] event.WriteError: unable to write error: session.(Repository).CloseDeadConnectionsForWorker: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back
2022-01-05T07:46:10.455Z [ERROR] error event: op=session.(Repository).ConnectConnection request_info:id=gtraceid_jjg7wzTAXddMoW460QX5 request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.(Repository).ConnectConnection: db.Create: create failed: unknown: error #0: timeout: context canceled" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg:create failed Op:db.Create Wrapped:timeout: context canceled]" error_fields:Code=unknown error_fields:Msg="" id=e_Vm5QSmXa67 version=v0.1
2022-01-05T07:46:10.455Z [ERROR] error event: error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:sql: transaction has already been committed or rolled back]" id=e_Vm5QSmXa67 version=v0.1 op=session.(Repository).ConnectConnection request_info:id=gtraceid_jjg7wzTAXddMoW460QX5 request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection
2022-01-05T07:46:10.455Z [ERROR] error event: error="db.DoTx: unknown, unknown: error #0: timeout: context canceled; sql: transaction has already been committed or rolled back" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=db.DoTx error_fields:Wrapped="timeout: context canceled; sql: transaction has already been committed or rolled back" id=e_nQITvmnLHb version=v0.1 op=db.DoTx request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection request_info:id=gtraceid_QiHMuDXvyxH2lMdBMzOR
2022-01-05T07:46:10.455Z [ERROR] error event: request_info:id=gtraceid_QiHMuDXvyxH2lMdBMzOR request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: timeout: context canceled; sql: transaction has already been committed or rolled back" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:timeout: context canceled; sql: transaction has already been committed or rolled back]" id=e_nQITvmnLHb version=v0.1 op=session.(Repository).ConnectConnection
2022-01-05T07:46:10.461Z [ERROR] error event: op=db.DoTx request_info:id=gtraceid_OhuoX7GEqN0sVjyC84Lj request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="db.DoTx: unknown, unknown: error #0: timeout: context canceled; sql: transaction has already been committed or rolled back" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=db.DoTx error_fields:Wrapped="timeout: context canceled; sql: transaction has already been committed or rolled back" id=e_ATlZjUOqoP version=v0.1
2022-01-05T07:46:10.461Z [ERROR] error event: error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: timeout: context canceled; sql: transaction has already been committed or rolled back" error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:timeout: context canceled; sql: transaction has already been committed or rolled back]" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection id=e_ATlZjUOqoP version=v0.1 op=session.(Repository).ConnectConnection request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection request_info:id=gtraceid_OhuoX7GEqN0sVjyC84Lj
2022-01-05T07:46:17.666Z [ERROR] error event: id=e_UL1Tjy2BDb version=v0.1 op=session.fetchConnectionStates request_info:id=gtraceid_6TJkH2PfRZUFcuavJ8aM request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.fetchConnectionStates: db.SearchWhere: unknown, unknown: error #0: context canceled" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.fetchConnectionStates error_fields:Wrapped="map[Code:unknown Msg: Op:db.SearchWhere Wrapped:context canceled]"
2022-01-05T07:46:17.666Z [ERROR] error event: request_info:id=gtraceid_sgltIGrlE95mcS3s7Bu5 request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.(Repository).ConnectConnection: db.Update: unknown, unknown: error #0: timeout: context canceled" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.Update Wrapped:timeout: context canceled]" id=e_cerA6Lzl4c version=v0.1 op=session.(Repository).ConnectConnection
2022-01-05T07:46:17.666Z [ERROR] error event: error="session.(Repository).ConnectConnection: session.fetchConnectionStates: db.SearchWhere: unknown, unknown: error #0: context canceled" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:session.fetchConnectionStates Wrapped:map[Code:unknown Msg: Op:db.SearchWhere Wrapped:context canceled]]" id=e_UL1Tjy2BDb version=v0.1 op=session.(Repository).ConnectConnection request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection request_info:id=gtraceid_6TJkH2PfRZUFcuavJ8aM
2022-01-05T07:46:17.666Z [ERROR] error event: error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back" error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:sql: transaction has already been committed or rolled back]" id=e_UL1Tjy2BDb version=v0.1 op=session.(Repository).ConnectConnection request_info:id=gtraceid_6TJkH2PfRZUFcuavJ8aM request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection
2022-01-05T07:46:17.666Z [ERROR] error event: error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:sql: transaction has already been committed or rolled back]" id=e_cerA6Lzl4c version=v0.1 op=session.(Repository).ConnectConnection request_info:id=gtraceid_sgltIGrlE95mcS3s7Bu5 request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back"
2022-01-05T07:46:32.368Z [ERROR] encountered an error sending an error event:
error:=
| 5 errors occurred:
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: reached max of 3: too many retries
|
2022-01-05T07:46:32.369Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: reached max of 3: too many retries
2022-01-05T07:46:32.369Z [ERROR] event.WriteError: unable to write error: session.(Repository).CloseDeadConnectionsForWorker: db.Exec: unknown, unknown: error #0: timeout: context deadline exceeded
2022-01-05T07:46:32.523Z [ERROR] encountered an error sending an error event:
error:=
| 5 errors occurred:
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: reached max of 3: too many retries
|
2022-01-05T07:46:32.523Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: reached max of 3: too many retries
2022-01-05T07:46:32.524Z [ERROR] event.WriteError: unable to write error: session.(Repository).CloseDeadConnectionsForWorker: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back
2022-01-05T07:46:32.543Z [ERROR] error event: error_fields:Code=unknown error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.Update Wrapped:map[Code:unknown Msg: Op:db.lookupAfterWrite Wrapped:map[Code:unknown Msg: Op:db.LookupById Wrapped:timeout: context canceled]]]" id=e_iY5khR2Iub version=v0.1 op=session.(Repository).ConnectConnection request_info:id=gtraceid_8z1hlE6J3SbVc15yUiTN request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection error="session.(Repository).ConnectConnection: db.Update: db.lookupAfterWrite: db.LookupById: unknown, unknown: error #0: timeout: context canceled"
2022-01-05T07:46:32.544Z [ERROR] error event: error="session.(Repository).ConnectConnection: db.DoTx: unknown, unknown: error #0: sql: transaction has already been committed or rolled back" error_fields:Msg="" error_fields:Op=session.(Repository).ConnectConnection error_fields:Wrapped="map[Code:unknown Msg: Op:db.DoTx Wrapped:sql: transaction has already been committed or rolled back]" error_fields:Code=unknown id=e_iY5khR2Iub version=v0.1 op=session.(Repository).ConnectConnection request_info:method=/controller.servers.services.v1.SessionService/ConnectConnection request_info:id=gtraceid_8z1hlE6J3SbVc15yUiTN
2022-01-05T07:46:39.874Z [ERROR] encountered an error sending an error event:
error:=
.....

On the worker side I have similar logs:

2022-01-05T07:57:50.391Z [ERROR] event.WriteError: unable to write error: status error grace period has expired, canceling all sessions on worker
2022-01-05T07:57:50.391Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_cSDF88rUIR data:connection_id=sc_XT6JbMGssl data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:50.391Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_Y8R1f0BbHC data:connection_id=sc_0qtJjeQAEs data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:50.391Z [ERROR] error event: error="error marking connection as connected: rpc error: code = Canceled desc = context canceled" id=e_swQCtfM1fZ version=v0.1 op=worker.(Worker).handleProxy info:session_id=s_Y8R1f0BbHC info:endpoint=tcp://minio:9000 info:msg="error handling proxy"
2022-01-05T07:57:50.391Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_Y8R1f0BbHC data:connection_id=sc_tgtzYwGXWH data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:50.391Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_Y8R1f0BbHC data:connection_id=sc_8hpuyvaBuv data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:50.391Z [INFO] system event: op=worker.(Worker).cleanupConnections data:session_id=s_apiHjEz0MZ data:connection_id=sc_vdX9KjJedu data:msg="terminated connection due to cancellation or expiration" version=v0.1
2022-01-05T07:57:50.391Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:msg="terminated connection due to cancellation or expiration" data:session_id=s_apiHjEz0MZ data:connection_id=sc_ICKWDQcssi
2022-01-05T07:57:50.392Z [ERROR] error event: id=e_SAzzTGQS1t version=v0.1 op=worker.(Worker).handleProxy info:session_id=s_apiHjEz0MZ info:endpoint=tcp://minio:9000 info:msg="error handling proxy" error="error marking connection as connected: rpc error: code = Canceled desc = context canceled"
2022-01-05T07:57:50.541Z [ERROR] error event: error="could not find session ID "s_r30xHLEI78" in local state after closing connections" id=e_iMCCpP1Wnq version=v0.1 op=session.CloseConnections info:msg="error marking connection closed in state"
2022-01-05T07:57:50.605Z [ERROR] error event: version=v0.1 op=session.CloseConnections info:msg="error marking connection closed in state" error="could not find session ID "s_r30xHLEI78" in local state after closing connections" id=e_W9JKYjtQf8
2022-01-05T07:57:50.657Z [ERROR] error event: id=e_Ke2oamVehD version=v0.1 op=session.CloseConnections info:msg="error marking connection closed in state" error="could not find session ID "s_r30xHLEI78" in local state after closing connections"
2022-01-05T07:57:57.753Z [ERROR] error event: id=e_nzIWC8Nplj version=v0.1 op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
2022-01-05T07:57:57.765Z [ERROR] error event: error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" id=e_nzIWC8Nplj version=v0.1 op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller"
2022-01-05T07:57:57.789Z [ERROR] error event: id=e_nzIWC8Nplj version=v0.1 op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
2022-01-05T07:57:57.828Z [ERROR] error event: op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" id=e_nzIWC8Nplj version=v0.1
2022-01-05T07:57:57.880Z [ERROR] encountered an error sending an error event:
error:=
| 5 errors occurred:
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: reached max of 3: too many retries
|
2022-01-05T07:57:57.881Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: reached max of 3: too many retries
2022-01-05T07:57:57.881Z [ERROR] event.WriteError: unable to write error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022-01-05T07:57:57.881Z [ERROR] error event: error="status error grace period has expired, canceling all sessions on worker" id=e_6tLK728ACa version=v0.1 op=worker.(Worker).sendWorkerStatus info:last_status_time="2022-01-05 07:42:14.841128169 +0000 UTC m=+0.035583208" info:grace_period=15s
2022-01-05T07:57:57.890Z [ERROR] error event: error="status error grace period has expired, canceling all sessions on worker" id=e_6tLK728ACa version=v0.1 op=worker.(Worker).sendWorkerStatus info:last_status_time="2022-01-05 07:42:14.841128169 +0000 UTC m=+0.035583208" info:grace_period=15s
2022-01-05T07:57:57.906Z [ERROR] error event: id=e_6tLK728ACa version=v0.1 op=worker.(Worker).sendWorkerStatus info:grace_period=15s info:last_status_time="2022-01-05 07:42:14.841128169 +0000 UTC m=+0.035583208" error="status error grace period has expired, canceling all sessions on worker"
2022-01-05T07:57:57.951Z [ERROR] error event: info:last_status_time="2022-01-05 07:42:14.841128169 +0000 UTC m=+0.035583208" info:grace_period=15s error="status error grace period has expired, canceling all sessions on worker" id=e_6tLK728ACa version=v0.1 op=worker.(Worker).sendWorkerStatus
2022-01-05T07:57:58.046Z [ERROR] encountered an error sending an error event:
error:=
| 5 errors occurred:
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: event not written to enough sinks
| * event.(Eventer).retrySend: reached max of 3: too many retries
|
2022-01-05T07:57:58.046Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: event not written to enough sinks
* event.(Eventer).retrySend: reached max of 3: too many retries
2022-01-05T07:57:58.046Z [ERROR] event.WriteError: unable to write error: status error grace period has expired, canceling all sessions on worker
2022-01-05T07:57:58.046Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_cSDF88rUIR data:connection_id=sc_3Huj82F6Fy data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:58.047Z [ERROR] error event: id=e_pF0m45MQUa version=v0.1 op=worker.(Worker).handleProxy info:session_id=s_cSDF88rUIR info:endpoint=tcp://minio:9000 info:msg="error handling proxy" error="error marking connection as connected: rpc error: code = Canceled desc = context canceled"
2022-01-05T07:57:58.047Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:msg="terminated connection due to cancellation or expiration" data:session_id=s_cSDF88rUIR data:connection_id=sc_C4hhrI4Ozv
2022-01-05T07:57:58.047Z [INFO] system event: data:session_id=s_Ma0N94Y7PK data:connection_id=sc_XAXwPPCUqb data:msg="terminated connection due to cancellation or expiration" version=v0.1 op=worker.(Worker).cleanupConnections
2022-01-05T07:57:58.047Z [ERROR] error event: error="error marking connection as connected: rpc error: code = Canceled desc = context canceled" id=e_moRNR6GVCK version=v0.1 op=worker.(Worker).handleProxy info:msg="error handling proxy" info:session_id=s_Ma0N94Y7PK info:endpoint=tcp://minio:9000
2022-01-05T07:57:58.047Z [ERROR] error event: version=v0.1 op=worker.(Worker).handleProxy info:session_id=s_Ma0N94Y7PK info:endpoint=tcp://minio:9000 info:msg="error handling proxy" error="error marking connection as connected: rpc error: code = Canceled desc = context canceled" id=e_qAHZNQG1Jc
2022-01-05T07:57:58.047Z [INFO] system event: op=worker.(Worker).cleanupConnections data:session_id=s_Ma0N94Y7PK data:connection_id=sc_9l0nPWfulY data:msg="terminated connection due to cancellation or expiration" version=v0.1
2022-01-05T07:57:58.047Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_Ma0N94Y7PK data:connection_id=sc_aUude9EOyz data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:58.047Z [INFO] system event: version=v0.1 op=worker.(Worker).cleanupConnections data:session_id=s_r30xHLEI78 data:connection_id=sc_9VceKSkPBf data:msg="terminated connection due to cancellation or expiration"
2022-01-05T07:57:58.214Z [ERROR] error event: id=e_1cduNMyIff version=v0.1 op=session.CloseConnections info:msg="error marking connection closed in state" error="could not find session ID "s_Y8R1f0BbHC" in local state after closing connections"
2022-01-05T07:57:58.215Z [ERROR] error event: error="could not find session ID "s_Y8R1f0BbHC" in local state after closing connections" id=e_l8HRS0JgnX version=v0.1 op=session.CloseConnections info:msg="error marking connection closed in state"
2022-01-05T07:57:58.241Z [ERROR] error event: info:msg="error marking connection closed in state" error="could not find connection ID "sc_8uJi4FBqar" for session ID "s_apiHjEz0MZ" in local state after closing connections" id=e_8XTtbHlsR8 version=v0.1 op=session.CloseConnections
2022-01-05T07:58:05.350Z [ERROR] error event: error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" id=e_YL8putv1hy version=v0.1 op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller"
2022-01-05T07:58:05.357Z [ERROR] error event: info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" id=e_YL8putv1hy version=v0.1 op=worker.(Worker).sendWorkerStatus
2022-01-05T07:58:05.381Z [ERROR] error event: info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" id=e_YL8putv1hy version=v0.1 op=worker.(Worker).sendWorkerStatus
2022-01-05T07:58:05.421Z [ERROR] error event: id=e_YL8putv1hy version=v0.1 op=worker.(Worker).sendWorkerStatus info:msg="error making status request to controller" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
2022-01-05T07:58:05.521Z [ERROR] encountered an error sending an error event:

Steps to reproduce the behavior:

  1. Set up a testing framework that automatically performs boundary authenticate and boundary connect and runs continuous data transfers (a sketch of such a loop is shown below).
  2. Run multiple (over 10) scenarios concurrently.
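
For reference, a minimal sketch of the kind of per-scenario loop this describes. The auth method ID, target ID, listen port, and s3cmd configuration here are placeholders, not the actual test framework:

# Authenticate once (prompts for the password), then hammer the target.
boundary authenticate password \
  -auth-method-id ampw_1234567890 \
  -login-name testuser

for i in $(seq 1 1000); do
  # Open a fresh session through the worker for every transfer.
  boundary connect -target-id ttcp_1234567890 -listen-port 9999 &
  proxy_pid=$!
  sleep 2
  # s3cmd is assumed to be configured against the local listener on port 9999.
  s3cmd put /tmp/1 "s3://test/run-${i}"   # 100 MB test file, proxied to MinIO
  s3cmd del "s3://test/run-${i}"
  kill "$proxy_pid"
done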

Expected behavior
Connections can be established and data transfers remain stable over time.

Additional context
Boundary version: 0.7.3
Postgres version: 12
Boundary server: 4 CPU cores / 8 GB RAM
Boundary worker: 4 CPU cores / 8 GB RAM
Postgres server: 8 CPU cores / 8 GB RAM, high-IOPS disks

jefferai (Member) commented Jan 5, 2022

The various context deadlines being exceeded suggest that the machine is out of resources (for instance, possibly swapping). Do you have information about CPU/RAM usage during this time? How much data was being transferred over how many concurrent connections, and how often were these being created?

giulio1979 (Author) commented:

Hi, on which component should I check?

In my scenario I used one MinIO instance as the server, and each test scenario transfers a file to this server via the Boundary worker.

The file is 100 MB and the transfer takes 3-4 seconds, then the S3 client is shut down and restarted, so a connection is established via Boundary roughly every 4-5 seconds.

Sometimes I can run 4-5 of these sessions concurrently, sometimes I can reach 16 before the whole controller collapses.

jefferai (Member) commented Jan 5, 2022

Ideally, check both the controller and the worker.

Also, what is your target definition? Specifically, what are your connection count and expiration settings?
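
For reference, the connection count and expiration settings live on the target itself and can be read back with the CLI; the target ID below is a placeholder:

boundary targets read -id ttcp_1234567890
# The output includes the target's session max seconds and session
# connection limit settings, among other attributes.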

giulio1979 (Author) commented:

I have nothing special on the target:

[screenshot: target configuration]

Controller snapshot:
[screenshot: controller resource usage]

Worker snapshot:
[screenshot: worker resource usage]

To me this looks a bit fishy: the controller is doing the heavy lifting, not the worker. When it reached one full CPU core of usage, it started throwing the deadline exceeded exceptions.
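
One way to capture comparable snapshots in-cluster is kubectl top (requires metrics-server); the namespace and label below are assumptions, not taken from the deployment shown later:

kubectl top pod -n boundary -l app=boundary --containers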

jefferai (Member) commented Jan 5, 2022

Do you have any way to figure out the environment that the controller is running in? I suspect that GOMAXPROCS is being set in some way that limits the process to a single CPU core. I could see a scenario where, if the controller is limited to a single core, its ability to process requests is blocked by (relatively speaking) long calls to the database, whereas normally you'd expect the database connections and requests to be spread out across cores. Along similar lines, have you configured limits on the number of database connections allowed in the Boundary controller's config file?

If you're running via something like systemd, you could try running boundary dev from your shell (first do an env | grep GOMAXPROCS) and see whether you get a different result. Or explicitly set it, e.g. GOMAXPROCS=8 boundary dev, to start a Postgres Docker container and run that way, although you'll need to adjust the built-in dev settings for your needs.
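
As a concrete sketch of the check and override described above (the value 8 is illustrative):

env | grep GOMAXPROCS      # check whether anything in the environment pins it
GOMAXPROCS=8 boundary dev  # or set it explicitly for a dev run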

Also, do you have any idea of how many sessions get established overall before you start seeing issues?

giulio1979 (Author) commented:

In my case I am running the standard containers in a Kubernetes cluster. I do not have any variables set other than the DB connection string.

It looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: boundary
  name: boundary
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: boundary
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: boundary
        service: boundary
    spec:
      automountServiceAccountToken: true
      containers:
        - args:
            - boundary server -config /boundary/boundary.hcl --log-level=trace
          command:
            - /bin/sh
            - -c
          env: {{ include "postgres.env" . | indent 18 }}
          image: hashicorp/boundary:0.7.3
          imagePullPolicy: IfNotPresent
          name: boundary
          ports:
            - containerPort: 9200
              protocol: TCP
            - containerPort: 9201
              protocol: TCP
            - containerPort: 9202
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /boundary
              mountPropagation: None
              name: boundary-config
              readOnly: true
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
            requests:
              cpu: 2000m
              memory: 4Gi
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      initContainers:
        - args:
            - boundary database init -config /boundary/boundary.hcl || boundary database migrate -config /boundary/boundary.hcl || true
          command:
            - /bin/sh
            - -c
          env: {{ include "postgres.env" . | indent 18 }}
          image: hashicorp/boundary:0.7.3
          imagePullPolicy: IfNotPresent
          name: boundary-init
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /boundary
              mountPropagation: None
              name: boundary-config
              readOnly: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 420
            name: boundary-config
            optional: false
          name: boundary-config

The postgres.env template referenced there looks like:

{{/*
Create env list.
*/}}
{{- define "postgres.env" }}
- name: POSTGRES_USER
  valueFrom:
    secretKeyRef:
      key: username
      name: postgres.boundary-pg-cluster.credentials
- name: POSTGRES_PASSWORD
  valueFrom:
    secretKeyRef:
      key: password
      name: postgres.boundary-pg-cluster.credentials
- name: HOSTNAME
  value: boundary
- name: BOUNDARY_STATUS_GRACE_PERIOD
  value: "90"
- { name: BOUNDARY_PG_URL, value: "{{ printf "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@boundary-pg-cluster:5432/boundary?sslmode=disable" }}" }
{{- end -}}

I have tried adding BOUNDARY_STATUS_GRACE_PERIOD and changing max_open_connections to 2, 5, and 10 without observing any improvement.

The only thing I noticed is that the Postgres database server is very heavily used: already at 10 concurrent sessions it is using 8 CPU cores and sometimes experiencing deadlocks in the function public.wh_update_session_connection().
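
For reference, one way to confirm the deadlock count from the database side is the pg_stat_database view; the connection string below is a placeholder modeled on the deployment above, not the actual credentials:

psql "postgresql://postgres@boundary-pg-cluster:5432/boundary" \
  -c "SELECT datname, deadlocks FROM pg_stat_database WHERE datname = 'boundary';"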

jefferai (Member) commented Jan 5, 2022

Oh, you're the same reporter as #1812. This all makes sense then. I think that's the first thing to tackle, because fixing the database deadlocks will likely have a big impact here too.

micchickenburger commented:

We are also seeing these issues.

On a controller:

Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.586Z [ERROR] encountered an error sending an error event:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   error:=
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   | 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.587Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.587Z [ERROR] event.WriteError: unable to write error: db.DoTx: unknown, unknown: error #0: dbw.Rollback: timeout: context deadline exceeded; sql: transaction has already been committed or rolled back
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.728Z [ERROR] encountered an error sending an error event:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   error:=
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   | 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   |
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:   
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.728Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]:         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-142 boundary[2891]: 2022-03-12T11:57:44.728Z [ERROR] event.WriteError: unable to write error: db.DoTx: unknown, unknown: error #0: dbw.Rollback: timeout: context deadline exceeded; sql: transaction has already been committed or rolled back

On a worker at the same moment:

Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: {"id":"h7Wz5Xvbrt","source":"https://hashicorp.com/boundary/us-east-2-worker-o8cbse15od","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_4Tsz0GY6Qp","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2022-03-12T11:57:44.47979807Z"}
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: {"id":"vcfo5t8bmT","source":"https://hashicorp.com/boundary/us-east-2-worker-o8cbse15od","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_4Tsz0GY6Qp","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2022-03-12T11:57:44.490043988Z"}
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: {"id":"G6rW5mDDsu","source":"https://hashicorp.com/boundary/us-east-2-worker-o8cbse15od","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_4Tsz0GY6Qp","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2022-03-12T11:57:44.519191601Z"}
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: {"id":"E89B6honwX","source":"https://hashicorp.com/boundary/us-east-2-worker-o8cbse15od","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_4Tsz0GY6Qp","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2022-03-12T11:57:44.573399Z"}
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: 2022-03-12T11:57:44.654Z [ERROR] encountered an error sending an error event:
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   error:=
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   | 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   |
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:   
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: 2022-03-12T11:57:44.654Z [ERROR] event.WriteError: event.(Eventer).writeError: 5 errors occurred:
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:         * event.(Eventer).retrySend: event not written to enough sinks
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]:         * event.(Eventer).retrySend: reached max of 3: too many retries
Mar 12 11:57:44 ip-10-15-0-96 boundary[1049]: 2022-03-12T11:57:44.654Z [ERROR] event.WriteError: unable to write error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The controller service did eventually get killed by the OOM reaper:

$ sudo journalctl -u boundary --no-pager | grep 9/KILL
Mar 14 15:49:40 ip-10-15-0-142 systemd[1]: boundary.service: Main process exited, code=killed, status=9/KILL

$ sudo dmesg | grep boundary
[229196.961106] [   2891]   113  2891   350263   190888  1581056        0             0 boundary
[229196.961109] [   2899]   113  2899   181883      814   126976        0             0 boundary-plugin
[229196.961111] [   2906]   113  2906   178386      466    98304        0             0 boundary-plugin
[229196.961113] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/boundary.service,task=boundary,pid=2891,uid=113
[229196.961143] Out of memory: Killed process 2891 (boundary) total-vm:1401052kB, anon-rss:695128kB, file-rss:68424kB, shmem-rss:0kB, UID:113 pgtables:1544kB oom_score_adj:0
[229196.963277] oom_reaper: reaped process 2891 (boundary), now anon-rss:695224kB, file-rss:69052kB, shmem-rss:0kB

While it is routine for the Boundary controllers to get killed by the oom_reaper, the workers never are. Indeed, just looking at the current memory usage, we can see the controllers require much more memory than the workers do:

Controllers

$ sudo systemctl status boundary
● boundary.service - Boundary
     Loaded: loaded (/etc/systemd/system/boundary.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-03-14 15:49:41 UTC; 52min ago
   Main PID: 9302 (boundary)
      Tasks: 23 (limit: 1081)
     Memory: 583.6M
     CGroup: /system.slice/boundary.service
             ├─9302 /usr/local/bin/boundary server -config /etc/boundary.hcl
             ├─9310 /tmp/656155661/boundary-plugin-host-aws
             └─9318 /tmp/3661262235/boundary-plugin-host-azure
$ sudo systemctl status boundary
● boundary.service - Boundary
     Loaded: loaded (/etc/systemd/system/boundary.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-12 00:15:47 UTC; 2 days ago
   Main PID: 3035 (boundary)
      Tasks: 23 (limit: 1081)
     Memory: 615.4M
     CGroup: /system.slice/boundary.service
             ├─3035 /usr/local/bin/boundary server -config /etc/boundary.hcl
             ├─3042 /tmp/2645974992/boundary-plugin-host-aws
             └─3049 /tmp/1177565111/boundary-plugin-host-azure

Workers

$ sudo systemctl status boundary
● boundary.service - Boundary
     Loaded: loaded (/etc/systemd/system/boundary.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-12 00:19:37 UTC; 2 days ago
   Main PID: 1049 (boundary)
      Tasks: 8 (limit: 1081)
     Memory: 56.0M
     CGroup: /system.slice/boundary.service
             └─1049 /usr/local/bin/boundary server -config /etc/boundary.hcl
$ sudo systemctl status boundary
● boundary.service - Boundary
     Loaded: loaded (/etc/systemd/system/boundary.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-12 00:18:30 UTC; 2 days ago
   Main PID: 1042 (boundary)
      Tasks: 8 (limit: 1081)
     Memory: 60.2M
     CGroup: /system.slice/boundary.service
             └─1042 /usr/local/bin/boundary server -config /etc/boundary.hcl

Also, our controllers constantly run at 50-100% CPU, even under no load. The workers use essentially no CPU.
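
A minimal way to sample that difference on a given host, using the boundary unit name from the status output above (the sampling approach is an illustration, not taken from the thread):

pid="$(systemctl show boundary -p MainPID --value)"
ps -o pid,%cpu,%mem,rss,cmd -p "$pid"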

irenarindos (Collaborator) commented:

@giulio1979 @micchickenburger We have performance improvements that will be delivered soon in our 0.8.0 release; please let me know if you still see these issues after an upgrade.

jefferai (Member) commented:

I believe this has been addressed by changes in 0.8.x and 0.9.0. I'm going to close this for the moment; if you still have issues please reopen with details!
