Websocket connect hangs if Pebble timeout elapses first #1246

Closed
benhoyt opened this issue Jun 5, 2024 · 0 comments · Fixed by #1247

benhoyt commented Jun 5, 2024

If the Python code takes too long between the POST /v1/exec that starts a Pebble exec and connecting to all of the websockets (2 or 3, depending on combine_stderr), it will hang indefinitely and never time out. This is because no socket timeout is set during the websocket connect phase, and Pebble waits rather than rejecting the connection (even though its waitIOConnected timeout has already elapsed).

You can test this by adding time.sleep(5.1) between the POST and the websocket connections, here:

diff --git a/ops/pebble.py b/ops/pebble.py
index 831c778..ad2fb3e 100644
--- a/ops/pebble.py
+++ b/ops/pebble.py
@@ -2751,6 +2751,7 @@ class Client:
 
         stderr_ws: Optional[_WebSocket] = None
         try:
+            time.sleep(5.1)
             control_ws = self._connect_websocket(task_id, 'control')
             stdio_ws = self._connect_websocket(task_id, 'stdio')
             if not combine_stderr:

Then fire up pebble run in one terminal, and run a Pebble exec in another:

$ .tox/unit/bin/python -m test.pebble_cli exec -- echo foo
# after 5s, the Pebble logs will show "timeout waiting for websocket connections",
# but it will hang here

We should almost certainly have a (relatively short) timeout on the socket during connect. We have to unset the timeout once it's connected, though, as the websockets for control and stdio are essentially long-polling and will wait an arbitrary amount of time until input arrives. With this fix, we get this (after 10s = 5s Pebble timeout + 5s connect timeout):

$ .tox/unit/bin/python -m test.pebble_cli --socket=/var/lib/pebble/default/.pebble.socket exec -- echo foo
ChangeError: cannot perform the following tasks:
- Execute command "echo" (exec 31: timeout waiting for websocket connections: context deadline exceeded)
----- Logs from task 0 -----
2024-06-05T16:08:18+12:00 ERROR exec 31: timeout waiting for websocket connections: context deadline exceeded
-----

We can probably also improve Pebble's handling of this, as it should know that the waitIOConnected bit has already timed out, but fixing it in Ops will be a great start! I'll push up a PR soon.
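
For illustration, the "timeout during connect, then unset it" idea can be sketched with plain sockets. This is not the actual Ops change: `open_stream`, `CONNECT_TIMEOUT`, and the handshake bytes are hypothetical names and values, and a real websocket handshake is more involved than a single send and receive.

```python
import socket

# Hypothetical value; the issue only says "relatively short".
CONNECT_TIMEOUT = 5.0


def open_stream(sock: socket.socket, handshake: bytes, reply_len: int,
                connect_timeout: float = CONNECT_TIMEOUT) -> bytes:
    """Send a handshake and wait for the reply under a timeout.

    Once the reply arrives, the timeout is cleared so that later reads
    can block for as long as needed (the long-polling case).
    """
    # Bound the connect/handshake phase so a silent peer cannot hang us.
    sock.settimeout(connect_timeout)
    try:
        sock.sendall(handshake)
        reply = sock.recv(reply_len)
    except socket.timeout:
        sock.close()
        raise TimeoutError('timed out waiting for handshake reply')
    # Connected: go back to blocking mode with no timeout.
    sock.settimeout(None)
    return reply
```

If the peer never answers, the call raises `TimeoutError` after `connect_timeout` seconds instead of blocking forever; if it does answer, the socket is left in blocking mode for the rest of its life.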

@benhoyt benhoyt self-assigned this Jun 5, 2024
@benhoyt benhoyt closed this as completed in d8c9807 Jun 6, 2024
tonyandrewmeyer added a commit to tonyandrewmeyer/operator that referenced this issue Jun 26, 2024
…cal#1247)

This is the fix for the issue described at
canonical#1246. Essentially, if the
Pebble timeout has already elapsed, Pebble will happily wait
indefinitely for the connect to go through, and the Python side will
hang. Add a timeout during the connect phase to cut this short.

Fixes canonical#1246.

---------

Co-authored-by: Tony Meyer <tony.meyer@gmail.com>