Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Reuse connectionpool in HTTPFileSystem? #240

Closed
rabernat opened this issue Feb 25, 2020 · 2 comments
Closed

Reuse connectionpool in HTTPFileSystem? #240

rabernat opened this issue Feb 25, 2020 · 2 comments

Comments

@rabernat
Copy link
Contributor

This issue originally came up in zarr-developers/zarr-python#536.

HTTPFilesystem is slow to fetch many files, because it does not reuse a connectionpool. Example:

from fsspec.implementations.http import HTTPFileSystem
fs = HTTPFileSystem()

url_base = 'https://mur-sst.s3.us-west-2.amazonaws.com/zarr/time'
def get_chunk_fsspec(n):
    return fs.cat(url_base + f'/{n}')

%prun all_data = [get_chunk_fsspec(i) for i in range(20)]

As you can see, SSL_do_handshake is called 20 times and takes most of the time.

         205967 function calls (205947 primitive calls) in 7.177 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    2.726    0.136    2.730    0.137 {built-in method _openssl.SSL_do_handshake}
       20    1.871    0.094    1.871    0.094 {method 'connect' of '_socket.socket' objects}
       40    1.853    0.046    1.853    0.046 {built-in method _openssl.SSL_read}
       20    0.270    0.014    0.270    0.014 {built-in method _openssl.SSL_CTX_load_verify_locations}
...

If we do basically the same thing with requests

import requests

s = requests.Session()
def get_chunk_http(n):
    r = s.get(url_base + f'/{n}')
    r.raise_for_status()
    return r.content

%prun all_data = [get_chunk_http(i) for i in range(20)] 

In this case, because we reused the session, it goes much faster:

 
         91599 function calls (91579 primitive calls) in 2.114 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       40    1.699    0.042    1.699    0.042 {built-in method _openssl.SSL_read}
        1    0.142    0.142    0.142    0.142 {built-in method _openssl.SSL_do_handshake}
        1    0.069    0.069    0.069    0.069 {method 'connect' of '_socket.socket' objects}
       20    0.025    0.001    0.025    0.001 {built-in method _socket.gethostbyname}
        1    0.016    0.016    0.016    0.016 {built-in method _openssl.SSL_CTX_load_verify_locations}
       20    0.007    0.000    0.007    0.000 {built-in method _scproxy._get_proxy_settings}
...

Could HTTPFileSystem be configured to reuse a session in a similar way?

@martindurant
Copy link
Member

This is a bug in the implementation, and is supposed to work as you suggest. The following change fixes it.

--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -107,7 +107,7 @@ class HTTPFileSystem(AbstractFileSystem):
             return list(sorted(out))

     def cat(self, url):
-        r = requests.get(url, **self.kwargs)
+        r = self.session.get(url, **self.kwargs)
         r.raise_for_status()
         return r.content

Note that reusing sessions does present a small thread safety issue: it is possible for the pool to run out of connections when called from multiple threads simultaneously, resulting in connections being evicted and closed.

@martindurant
Copy link
Member

Sneaked it in here: d80034e

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants