
Loading file list of large directory is too slow #4706

Open · esevan opened this issue Jun 21, 2019 · 8 comments
@esevan (Contributor) commented Jun 21, 2019:

This is a duplicate of #3114, but that thread never got an answer. Has there been any progress on it, such as pagination?

@kevin-bates (Member) commented:
I think this issue is related to #4575. Did you see how these issues fare in Lab?

@esevan (Contributor, Author) commented Jun 22, 2019:

@kevin-bates Thanks for the response. I'm actually using Lab. In more detail: I opened a directory containing 10,000 images. In that case everything hangs, even terminal input stops working, and I eventually see a request timeout.

I'll add my environment details soon. Sorry for the lack of information.

@kevin-bates (Member) commented:

I think you might get better traction opening this issue in https://github.com/jupyterlab/jupyterlab since that's where the front-end focus is these days. I suspect "files" are treated differently than directories (which is where I saw the difference in Lab) with respect to rendering.

@esevan (Contributor, Author) commented Jun 26, 2019:

@kevin-bates Well, I decided to post here because I could confirm this happens in the classic notebook (/tree) as well as in JupyterLab (/lab).

The following shows it takes 7.40 seconds to get a response from the notebook server for a request for the 25,089 directory entries under trainB.

$ ls datasets/horse2zebra/trainB | wc -c
25089

[screenshot: the /api/contents request takes 7.40 s]

EDIT:
I'm guessing the server gets stuck in the following code.

if content:
    model['content'] = contents = []
    os_dir = self._get_os_path(path)
    for name in os.listdir(os_dir):
        try:
            os_path = os.path.join(os_dir, name)
        except UnicodeDecodeError as e:
            self.log.warning(
                "failed to decode filename '%s': %s", name, e)
            continue
        try:
            st = os.lstat(os_path)
        except OSError as e:
            # skip over broken symlinks in listing
            if e.errno == errno.ENOENT:
                self.log.warning("%s doesn't exist", os_path)
            else:
                self.log.warning("Error stat-ing %s: %s", os_path, e)
            continue
        if (not stat.S_ISLNK(st.st_mode)
                and not stat.S_ISREG(st.st_mode)
                and not stat.S_ISDIR(st.st_mode)):
            self.log.debug("%s not a regular file", os_path)
            continue
        if self.should_list(name):
            if self.allow_hidden or not is_file_hidden(os_path, stat_res=st):
                contents.append(
                    self.get(path='%s/%s' % (path, name), content=False)
                )
    model['format'] = 'json'

The sample data used in this description is from https://github.com/junyanz/CycleGAN
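To make the cost concrete, here is a minimal standalone sketch (not the notebook server's actual code) that reproduces the per-entry work of the loop above, one os.lstat() per directory entry, and times it over a synthetic directory. It illustrates why the wall time grows linearly with the number of entries.

```python
import os
import stat
import tempfile
import time

def list_dir_like_contents_manager(os_dir):
    """Approximate the per-entry work of the loop above: one os.lstat()
    per directory entry, keeping only regular files, dirs, and symlinks."""
    contents = []
    for name in os.listdir(os_dir):
        os_path = os.path.join(os_dir, name)
        try:
            st = os.lstat(os_path)
        except OSError:
            # skip over broken symlinks in listing
            continue
        if (stat.S_ISLNK(st.st_mode) or stat.S_ISREG(st.st_mode)
                or stat.S_ISDIR(st.st_mode)):
            contents.append(name)
    return contents

# Demo: N entries -> N lstat() calls, so wall time scales with N.
with tempfile.TemporaryDirectory() as d:
    for i in range(10_000):
        open(os.path.join(d, "zzz_%d" % i), "w").close()
    t0 = time.perf_counter()
    entries = list_dir_like_contents_manager(d)
    print("%d entries listed in %.3fs" % (len(entries), time.perf_counter() - t0))
```

Note the real server additionally calls self.get(..., content=False) per entry, which does more work than a bare lstat, so the actual per-entry cost is higher than this sketch shows.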

@kevin-bates (Member) commented:

Thanks for the information. So it sounds like you see roughly the same behavior between classic and Lab with files (unlike what I saw with directories). I figured the delay was in the client-side rendering, but you're showing essentially server-side code, which implies thousands of directories should have produced the same behavior - i.e., a delay in both Notebook and Lab (contrary to what I found).

(That repo you link is interesting. The sample for the failure case is particularly entertaining.)

@kevin-bates (Member) commented:

Hmm, I still see the same behaviors with files. I touched 10,000 files in my notebook directory (for i in {1..10000}; do touch zzz_${i}; done). Then ran notebook (with debug enabled).

With Notebook "classic", I see the contents API complete in just over 1 second, but the rendering (not sure if that's the appropriate use of the term here; I'm not a front-end dev) takes on the order of 48 seconds as I attempt to scroll. This scrolling is also accompanied by "Page Unresponsive" dialogs (using Chrome).

[D 08:55:48.309 NotebookApp] 200 GET /api/sessions?_=1561563455783 (::1) 1.09ms
[D 08:55:48.312 NotebookApp] 200 GET /api/terminals?_=1561563455784 (::1) 1.11ms
[D 08:55:49.608 NotebookApp] 200 GET /api/contents?type=directory&_=1561563455785 (::1) 1115.29ms
[D 08:56:36.954 NotebookApp] 200 GET /api/sessions?_=1561563455786 (::1) 0.90ms
[D 08:56:36.955 NotebookApp] 200 GET /api/terminals?_=1561563455787 (::1) 0.71ms
[D 08:56:38.371 NotebookApp] 200 GET /api/contents?type=directory&_=1561563455788 (::1) 1126.69ms
[D 08:57:30.000 NotebookApp] 200 GET /api/sessions?_=1561563455789 (::1) 0.94ms
[D 08:57:30.003 NotebookApp] 200 GET /api/terminals?_=1561563455790 (::1) 1.16ms
[D 08:57:31.326 NotebookApp] 200 GET /api/contents?type=directory&_=1561563455791 (::1) 1130.55ms

Switching the url to Lab, I see the same contents api taking just over 1 second, but the scrolling appears to be fine, with gaps between contents calls taking on the order of 8 seconds. However, I see no delay in the UI, so I suspect this "retrieval & rendering work" is happening in the background.

[D 08:36:54.697 NotebookApp] 200 GET /api/sessions?1561563414694 (::1) 1.13ms
[D 08:36:54.698 NotebookApp] 200 GET /api/terminals?1561563414695 (::1) 0.85ms
[D 08:36:56.337 NotebookApp] 200 GET /api/contents/?content=1&1561563415174 (::1) 1160.75ms
[D 08:37:04.696 NotebookApp] 200 GET /api/sessions?1561563424693 (::1) 1.07ms
[D 08:37:04.698 NotebookApp] 200 GET /api/terminals?1561563424694 (::1) 0.90ms
[D 08:37:06.371 NotebookApp] 200 GET /api/contents/?content=1&1561563425175 (::1) 1193.78ms
[D 08:37:14.696 NotebookApp] 200 GET /api/sessions?1561563434693 (::1) 1.08ms
[D 08:37:14.698 NotebookApp] 200 GET /api/terminals?1561563434694 (::1) 0.81ms
[D 08:37:16.374 NotebookApp] 200 GET /api/contents/?content=1&1561563435179 (::1) 1192.88ms

Not sure why the contents API call is occurring during scrolling, given the contents service doesn't appear to have paging. This might just be how the front end is written in order to deal with updates. I suspect there's a general assumption that notebook directories are sparsely populated - which is reasonable IMO.

@esevan (Contributor, Author) commented Jun 27, 2019:

Not sure why the contents API call is occurring during scrolling, given the contents service doesn't appear to have paging.

I suspect this is due to JupyterLab's periodic refresh of the directory listing, since the contents API does not return the paths under a directory incrementally (referring to the code I attached above).

As for the test result, my environment magnifies the problem since the server is remote and its allocated resources are quite small (2 CPUs, 4 GiB memory).

So I tested locally, and the result is similar to what @kevin-bates reported.
After touching 10,000 files with for i in {1..10000}; do touch zzz_${i}; done,
I can see the RTT of the contents request is quite large, and I suspect it increases linearly with the file count.

[D 12:57:27.835 LabApp] 200 GET /api/contents/contents_test/10000?content=1&1561607589692 (10.113.66.26) 1233.79ms

So I increased the number of files tenfold with for i in {1..100000}; do touch zzz_${i}; done:

[D 13:05:59.067 LabApp] 200 GET /api/contents/contents_test/100000?content=1&1561608090756 (10.113.66.26) 11399.31ms

I can confirm the contents API takes almost 10x longer to respond.

Here are the problems as I see them:

  1. Listing a directory on the server side takes time proportional to the number of files in the directory.
    -> This can be a big problem. I think the notebook contents API needs to support incremental listing, and a responsive UI or pagination should be developed on the front-end side.

  2. Jupyter Lab hangs while the server handles the request, so other requests to the server cannot be served.
    -> I'm not sure whether this is due to the server-side logic alone; I suspect both sides (front and back).
    -> I could confirm the browser cannot send subsequent requests while the server handles the request: maybe a head-of-line (HOL) blocking issue.
    -> The coroutine appears blocked because the following code is not async.

    if content:
        model['content'] = contents = []
        os_dir = self._get_os_path(path)
        for name in os.listdir(os_dir):
            try:
                os_path = os.path.join(os_dir, name)
            except UnicodeDecodeError as e:
                self.log.warning(
                    "failed to decode filename '%s': %s", name, e)
                continue
            try:
                st = os.lstat(os_path)
            except OSError as e:
                # skip over broken symlinks in listing
                if e.errno == errno.ENOENT:
                    self.log.warning("%s doesn't exist", os_path)
                else:
                    self.log.warning("Error stat-ing %s: %s", os_path, e)
                continue
            if (not stat.S_ISLNK(st.st_mode)
                    and not stat.S_ISREG(st.st_mode)
                    and not stat.S_ISDIR(st.st_mode)):
                self.log.debug("%s not a regular file", os_path)
                continue
            if self.should_list(name):
                if self.allow_hidden or not is_file_hidden(os_path, stat_res=st):
                    contents.append(
                        self.get(path='%s/%s' % (path, name), content=False)
                    )
        model['format'] = 'json'

  3. The browser cannot render 100,000 files, even though the server managed to respond.
    -> I believe this is largely a front-end problem. The front end should request incrementally and provide a better UX.
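On point 2, here is a hedged sketch of one way the blocking could be avoided (this is not what the notebook server does today, and list_directory / get_directory_model are illustrative names, not its API): hand the blocking os.listdir walk to a thread-pool executor so the event loop keeps serving other requests while a large directory is being listed.

```python
import asyncio
import os

def list_directory(os_dir):
    # The blocking part: reading all directory entries at once.
    return sorted(os.listdir(os_dir))

async def get_directory_model(os_dir):
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking walk to a worker thread and
    # yields control back to the event loop, so a slow listing no longer
    # stalls unrelated requests (the head-of-line blocking noted above).
    entries = await loop.run_in_executor(None, list_directory, os_dir)
    return {'content': entries, 'format': 'json'}
```

With this shape, other handlers (e.g. /api/terminals) could be answered while the listing runs, though the listing itself still takes O(n) time; pagination would be needed to reduce the total cost.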

@miraculixx commented Jun 17, 2020:

@esevan Great analysis. I am experiencing the same issue (in particular, JupyterLab hangs intermittently). Here are a few more insights from my POV.

I believe this is largely a front-end problem.

In my case the key problem is that JupyterLab seems to issue a new call to the /api/contents API before and after (?) each cell execution, with the ?content=1 flag set. This in turn calls FileContentsManager.get(..., content=True), requesting the actual file contents. Note we have subclassed FileContentsManager to support storing notebooks in a database, which aggravates the problem -- already with 100 or so notebooks this can slow things down to the point where the API call takes 5-10 seconds to complete. In conclusion, it's not really a UI issue, though it is caused by the way the UI requests the contents listing.

-> I'm not sure whether this is due to the server-side logic alone; I suspect both sides (front and back).

The server-side logic seems OK (except that it is blocking; I'm not sure the ContentsManager API supports async). However, it is not clear to me why JupyterLab requests the file contents when all it really does is refresh the directory listing. In particular, JupyterLab - like the previous Jupyter file listing, i.e. /tree - issues a specific GET request for the actual contents once a file/notebook is opened.

I see several possible approaches to improve the situation:

  1. Return dummy contents on directory requests, which would speed up the process.
  2. Change JupyterLab to request directory listings with ?content=0.
  3. Cache the actual contents for some time on the server.

Not sure if option 1 interferes with the JupyterLab UI logic (it may use the actual contents to display icons or other information).

From my perspective, option 2 would be the best. We should avoid option 3, as it is bound to introduce consistency issues.
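For option 2, the Contents REST API accepts a content=0 query parameter, which asks the server to skip filling the expensive content field. A small sketch of how a client would build such a request; BASE_URL and TOKEN are placeholders for a running notebook server, and contents_request is an illustrative helper, not part of any Jupyter client library.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholders for a running notebook server.
BASE_URL = 'http://localhost:8888'
TOKEN = '...'

def contents_request(path, content):
    """Build (but do not send) a Contents API request.
    content=0 asks the server to skip the per-entry listing work."""
    query = urlencode({'content': int(content)})
    return Request(
        '%s/api/contents/%s?%s' % (BASE_URL, path, query),
        headers={'Authorization': 'token %s' % TOKEN},
    )

req = contents_request('contents_test/10000', content=False)
print(req.full_url)
# http://localhost:8888/api/contents/contents_test/10000?content=0
```

The trade-off is that the response's content field comes back empty, so the front end would need another way to obtain the entries it actually displays (e.g. incrementally).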

Currently I don't have the capacity to dig further or open an issue in the JupyterLab tracker, any support would be appreciated.
