Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

listdir() sometimes blocks program #117

Open
MichaelKorn opened this issue Jul 15, 2021 · 5 comments
Open

listdir() sometimes blocks program #117

MichaelKorn opened this issue Jul 15, 2021 · 5 comments

Comments

@MichaelKorn
Copy link
Contributor

The use of listdir() sometimes yields to a blocked program.

Expected Behavior

  1. There should be a reasonable timeout for a listdir() call.
  2. TimeoutError: [Errno 110] Connection timed out should be handled.

Current Behavior

The use case:

  • The application checks smb share folder (with listdir()) more than 3 times per second.
  • The connection to the smb share is slow and maybe not completely stable (over VPN).
    • It is possible that a part of the communication between the server and client get lost.
  • This application runs within a docker container (python:3.9-slim) with smbprotocol==1.5.1.

Sometimes the application hangs during the call of smbclient.listdir(). Until that point, the application usually runs for many hours, sometimes a few days.
During such failure state I retrieved debug information with faulthandler.dump_traceback(all_threads=True) (I removed the output for some Threads (DB connection and debugging)):

Thread 0x00007f00eeffd700 (most recent call first):
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 140 in _recv
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 119 in recv
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/connection.py", line 1119 in _process_message_thread
  File "/usr/local/lib/python3.9/threading.py", line 892 in run
  File "/usr/local/lib/python3.9/threading.py", line 954 in _bootstrap_inner
  File "/usr/local/lib/python3.9/threading.py", line 912 in _bootstrap

Thread 0x00007f01008fb740 (most recent call first):
  File "/usr/local/lib/python3.9/threading.py", line 312 in wait
  File "/usr/local/lib/python3.9/threading.py", line 574 in wait
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/connection.py", line 852 in receive
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/open.py", line 1219 in _create_response
  File "/usr/local/lib/python3.9/site-packages/smbclient/_io.py", line 314 in commit
  File "/usr/local/lib/python3.9/site-packages/smbclient/_io.py", line 465 in open
  File "/usr/local/lib/python3.9/site-packages/smbclient/_io.py", line 410 in __enter__
  File "/usr/local/lib/python3.9/site-packages/smbclient/_os.py", line 242 in listdir
  File ..... my call of smbclient.listdir()

For 10 Minutes the Threads are just waiting. Then the timeout of 10min applies:

Socket recv(%s) timed out
Sending SMB2 Echo to keep connection alive
Socket recv(4) (total 4)

After that both Threads are waiting like before.
After about 16 minutes (from calling listdir()) an Exception occurs:

Exception in thread msg_worker-MY-HOSTNAME:445:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/connection.py", line 1119, in _process_message_thread
    b_msg = self.transport.recv(600)
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 119, in recv
    b_packet_size, timeout = self._recv(4, timeout)
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 147, in _recv
    b_data = self._sock.recv(read_len)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/connection.py", line 1199, in _process_message_thread
    self.disconnect(False)
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/connection.py", line 799, in disconnect
    self.transport.close()
  File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 89, in close
    self._sock.shutdown(socket.SHUT_RDWR)
OSError: [Errno 107] Transport endpoint is not connected

After this exception the smbprotocol module ends in an infinite loop of OSError: [Errno 107] Transport endpoint is not connected from self._sock.shutdown(socket.SHUT_RDWR). This seems to be fixed with the newest PR (I just mention the last part of the issue for completeness and to endorse the release of the latest version).

@jborean93
Copy link
Owner

This seems to be fixed with the newest PR (I just mention the last part of the issue for completeness and to endorse the release of the latest version).

The infinite loop is fixed but the underlying block is still present or does that PR fix this whole issue for you?

I'm hoping to go through the PR queue and issues sometime soon so I can push a new release. It's been a hectic few months for me unfortunately so I haven't had the time to really work on this lately.

@MichaelKorn
Copy link
Contributor Author

Thanks @jborean93, for reply.
I applied the PR to the application 5 days ago.
But it is too early to assess the effects, as the problem only occurs every few days (I think 10 days was the longest interval without any occurrence). However, with regard to my logs, debugging and code change in PR, I suspect:

  1. Only the infinite loop of OSError: [Errno 107] Transport endpoint is not connected should be fixed.
    1. The subject could also be more complicated and I don't want to mix this ticket with a likely independent error state.
      For example, I also observed twice how 64 credits were used up one after the other. Then 0 Credits were left and the smbclient was trapped in some other infinite loop. I would also create a ticket for this as soon as I have useful explorations.
  2. The application will still be blocked sporadically for 10 minutes.

@MichaelKorn
Copy link
Contributor Author

@jborean93 The issue just occurred again and the behavior was as expected:

  1. The application was blocked for 16 minutes.
  2. The application returned with
File "/usr/local/lib/python3.9/site-packages/smbprotocol/transport.py", line 147, in _recv
    b_data = self._sock.recv(read_len)
TimeoutError: [Errno 110] Connection timed out`
  1. Thanks to the update from Ignore ENOTCONN socket errors #110 the application recovered and no restart was needed.

@jborean93
Copy link
Owner

This one sounds like it's going to be tricky to solve. Based on your description it sounds like the response for the listdir call isn't being returned due to some connection issue so that thread is infinitely waiting for a response. Over time the connection is eventually dropped waking up that thread as it was notified there was a failure. Unfortunately it's that timeout that's taking a while to receive so I'm not sure how to try and speed that up.

Another route is to have a timeout while waiting for the response and have that raise an exception if the response wasn't received in time. This might be the way forward but I'm unsure what a sane value should be here for this. Should we just wait 30 seconds, a minute, something else. How would that apply to things that are expected to wait for ages like a change notification request or just a busy server unable to confirm the data was written to the underlying storage.

@MichaelKorn
Copy link
Contributor Author

Timeouts usually cause further problems, but with network communication involved it is unfortunately not possible without it. From a user perspective I would expect one of the two concepts:

  1. global timeout for requests
    1. Default value for global timeout (eg. 60 s).
    2. Default value can be globally overridden be user (Like in my situation with a VPN in between I maybe need 100 s).
    3. API (requests) calls like listdir() should still offer a timeout option to the user.
  2. only local timeouts for requests
    1. Default values for every request (via API) (eg. 40 s for basic operations and 80s for operations that are likely to take longer).
    2. API (requests) calls like listdir() should offer a timeout option to the user.

Change notification requests are special and I do not have enough insides into the smb protocol to really comment on this. The Server has to response to a subscription creation, so a subscription could be treated as a normal request (with client side timeout). Later, notifications are send by the server, so a server side timeout should be used.
In addition the healthiness of the connection between client and Server muss be periodically checked.

Checking the healthiness of the connection is maybe partially implemented:

except _TimeoutError:
log.debug("Sending SMB2 Echo to keep connection alive")
for sid in self.session_table.keys():
req = self.send(SMB2Echo(), sid=sid)
# Set this reserved field to 1 as we use that internally to check whether the outstanding
# requests queue should be cleared in this thread or not.
req.message['reserved'] = 1
continue

The echo command is used here to keep the connect alive, but the server sends also a response. If there is no response (again with timeout) the connection is already dead. If you use that and maybe also increase the echo frequency (current interval is 10 min), then in my case the blockage could be less than 16 min.
An API-based timeout could also be built in here: Always if a request is added to or removed from self.outstanding_requests the next timeout time could be estimated and used for the self.transport.recv call. The implementation in detail also depends on whether we assume that it is possible that only individual requests go wrong and the connection itself remains intact or it is more more likely that the entire connection breaks off and everything is lost from then on.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants