Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

http: Revert the default timeout to 60 seconds #74

Merged
merged 1 commit into from
May 25, 2022

Conversation

nirs
Copy link
Member

@nirs nirs commented May 24, 2022

In commit d5e9c75 (http: Configurable inactivity timeout) we
changed the default timeout from 60 seconds to 15. The reasoning was
that clients have no reason to connect and keep the connection idle for
long time. Once a client connects, it is expected to start sending
requests. On the first request, the socket timeout is replaced by the
ticket inactivity timeout, set by the user creating the transfer.

Turns out that there is a valid use case for idle clients, and the
shorter timeout breaks downloads of big images (reproduced with 8 TiB
image. The failure flow is:

  1. Client connects and send an EXTENTS request.
  2. While EXTENTS request is collecting data, client connects multiple
    downloads connections.
  3. The connected download threads wait on a queue for work, but since
    EXTENTS request did not finish, the connections are idle.
  4. After 15 seconds the server close the idle connections.
  5. When the EXTENTS request finish, the client fails to send request to
    the server.

Here is example failure on the client side from ovirt-stress backup run:

  ...
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 288, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 207, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 432, in _get
    self._con.request("GET", self.url.path, headers=headers)
  File "/usr/lib64/python3.6/http/client.py", line 1273, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1319, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 1004, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

Looking in the server logs, we can see that EXTENTS request took about
24 seconds:

2022-05-24 16:22:57,548 INFO    (Thread-75) [extents] [local]
EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=zero
...
2022-05-24 16:23:44,907 INFO    (Thread-75) [http] CLOSE
connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch
2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]

Downloading 8 TiB disk is an edge case, but this can happen with smaller
images on very fragmented file system, or if there is another reason
that cause EXTENTS request to be slow.

Revert the timeout back to the previous value used in ovirt 4.4.

We may shorten the timeout once we support partial extents:
https://bugzilla.redhat.com/1924940

Fixes #71

Signed-off-by: Nir Soffer nsoffer@redhat.com

@nirs nirs added the bug Issue is a bug or fix for a bug label May 24, 2022
@nirs nirs requested review from vjuranek, bennyz and aesteve-rh May 24, 2022 17:25
@nirs nirs self-assigned this May 24, 2022
@nirs nirs added this to the ovirt-4.5.1 milestone May 24, 2022
bennyz
bennyz previously approved these changes May 24, 2022
vjuranek
vjuranek previously approved these changes May 25, 2022
In commit d5e9c75 (http: Configurable inactivity timeout) we
changed the default timeout from 60 seconds to 15. The reasoning was
that clients have no reason to connect and keep the connection idle for
long time. Once a client connects, it is expected to start sending
requests. On the first request, the socket timeout is replaced by the
ticket inactivity timeout, set by the user creating the transfer.

Turns out that there is a valid use case for idle clients, and the
shorter timeout breaks downloads of big images (reproduced with 8 TiB
image. The failure flow is:

1. Client connects and send an EXTENTS request.
2. While EXTENTS request is collecting data, client connects multiple
   downloads connections.
3. The connected download threads wait on a queue for work, but since
   EXTENTS request did not finish, the connections are idle.
4. After 15 seconds the server close the idle connections.
5. When the EXTENTS request finish, the client fails to send request to
   the server.

Here is example failure on the client side from ovirt-stress backup run:

      ...
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 288, in copy
        self._src.write_to(self._dst, req.length, self._buf)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 207, in write_to
        res = self._get(length)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 432, in _get
        self._con.request("GET", self.url.path, headers=headers)
      File "/usr/lib64/python3.6/http/client.py", line 1273, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1319, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
        self.send(msg)
      File "/usr/lib64/python3.6/http/client.py", line 1004, in send
        self.sock.sendall(data)
    BrokenPipeError: [Errno 32] Broken pipe

Looking in the server logs, we can see that EXTENTS request took about
24 seconds:

    2022-05-24 16:22:57,548 INFO    (Thread-75) [extents] [local]
    EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=zero
    ...
    2022-05-24 16:23:44,907 INFO    (Thread-75) [http] CLOSE
    connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch
    2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]

Downloading 8 TiB disk is an edge case, but this can happen with smaller
images on very fragmented file system, or if there is another reason
that cause EXTENTS request to be slow.

Revert the timeout back to the previous value used in ovirt 4.4.

We may shorten the timeout once we support partial extents:
https://bugzilla.redhat.com/1924940

Fixes: oVirt#71
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs nirs dismissed stale reviews from vjuranek and bennyz via 98db427 May 25, 2022 09:33
@nirs nirs force-pushed the socket-timeout branch from f86933b to 98db427 Compare May 25, 2022 09:33
@nirs nirs merged commit 0568314 into oVirt:master May 25, 2022
@nirs nirs deleted the socket-timeout branch May 25, 2022 09:43
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Issue is a bug or fix for a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backup of huge disk breaks because of inactivity_timeout
4 participants