offset out of range for 65536-byte buffer #14
Thanks for the report, @sergeyvilov. I tried to get a URL to this dataset using my own Kaggle account, but it seems like you either have to use their library/API, or save cookies from the browser session and use wget/curl. How did you manage to get a simple URL to use with unzip-http?
Hi @saulpw, and thanks for the fast reply. I just clicked the Download All button under the Data Explorer on the Data tab, then cancelled the download and chose Copy download link by clicking on the cancelled download in the Firefox downloads window. The link starts with https://storage.googleapis.com/kaggle-... and ends with .zip, and it works then. The problem with kaggle-api is that it can't download directories, and when one tries to download files one by one, one gets a 'Too many requests' error after some time (this dataset contains very many small .dcm files, which triggers that error). I thought that unzip-http could help with this.
Thanks, that allowed me to get the URL too. When I use this code it works fine for me:
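A minimal sketch of that kind of usage, assuming unzip_http's RemoteZipFile interface (the URL and member name below are placeholders):

```python
# Sketch only: assumes unzip_http's RemoteZipFile, which reads individual
# members of a remote .zip via HTTP Range requests. URL/member are placeholders.
import unzip_http

url = "https://storage.googleapis.com/kaggle-.../archive.zip"  # placeholder
rzf = unzip_http.RemoteZipFile(url)

fp = rzf.open("some/member.dcm")   # placeholder member name
data = fp.read()                   # only this member's bytes are fetched
print(len(data))
```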
So I'm not sure why you're seeing that error. We did fix an issue like this a while ago for 64-bit .zip files. Are you using the most recent version of unzip_http? You're right, though, that unzip-http makes a separate HTTP request per file. It wouldn't be impossible to make unzip_http able to download multiple contiguous files with one request, as you suggest, but it would be somewhat complicated and I unfortunately don't have the time at the moment to pull it together. If you're interested in making it happen, I'd certainly review a PR for it.
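For anyone picking this up, the core of that optimization would be coalescing the byte ranges of adjacent members before issuing requests. A rough sketch of that bookkeeping (a hypothetical helper, not part of unzip_http):

```python
# Hypothetical helper, not part of unzip_http: coalesce per-file byte ranges
# so that contiguous members could be fetched with a single HTTP Range request.
from typing import Iterable, List, Tuple

def merge_ranges(ranges: Iterable[Tuple[int, int]], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Merge (offset, length) pairs whose regions touch, overlap,
    or lie within max_gap bytes of each other into larger spans."""
    merged: List[Tuple[int, int]] = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    # Return as (offset, length) pairs again.
    return [(s, e - s) for s, e in merged]

# Example: two members stored back-to-back collapse into one request.
print(merge_ranges([(0, 100), (100, 50), (200, 10)]))  # [(0, 150), (200, 10)]
```

Each merged span could then be fetched with one Range header and split back into the individual members locally.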
Thank you very much for testing. Looks very strange. I'm using a clean conda environment with unzip-http 0.4 and Python 3.11 (also tried 3.8) and execute the same code as you; the issue is observed on macOS Big Sur (my home laptop), Ubuntu 18.04.6, and Rocky Linux 8.8 (remote servers). Concerning the number of requests, I think it might be useful to merge individual requests for contiguous regions in future releases. I'm pretty sure the Kaggle server is not the only one that limits the request rate, so many users may run into this issue when trying to download many files from an archive.
I get the same error. I am trying to download parts of the DocLayNet_extra.zip from here:
Same here. Would it be enough to change the 65536 to a larger number when needed (I have no idea if that makes sense)?
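For context on where the 65536 comes from: ZIP readers typically locate the end-of-central-directory (EOCD) record by scanning the last 64 KiB of the file, since the EOCD's trailing comment can be up to 65535 bytes long. Over HTTP that is usually one suffix Range request, roughly like this (an illustration only, not unzip_http's actual code; the URL is a placeholder):

```python
# Illustration only: fetch the tail of a remote .zip and locate the EOCD
# signature (PK\x05\x06). Not unzip_http's actual implementation.
import urllib.request

url = "https://example.com/archive.zip"  # placeholder

req = urllib.request.Request(url, headers={"Range": "bytes=-65536"})
tail = urllib.request.urlopen(req).read()

eocd = tail.rfind(b"PK\x05\x06")
print("EOCD signature at tail offset:", eocd)
```

So enlarging the buffer mostly matters for archives with unusually long comments; for very large archives the relevant offsets live in the zip64 EOCD record, which is presumably what the 64-bit fix mentioned above addressed.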
So, I tried all these cases myself from the CLI and they seem to work fine with v0.5.1. For example:
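The commands would have been along these lines, with placeholder URL and member name (assuming the CLI's -l flag for listing):

```bash
# Placeholder URL/member; -l lists the remote archive without downloading it.
unzip-http -l 'https://storage.googleapis.com/kaggle-.../archive.zip'

# Extract just the named member(s) via ranged requests.
unzip-http 'https://storage.googleapis.com/kaggle-.../archive.zip' 'path/inside/archive.dcm'
```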
I think the problem is that we never pushed v0.5.1 to PyPI, so these problems should be fixed once we do that (hopefully tonight). If any errors like this still happen with v0.5.1, please open a new issue.
While attempting to download files from an ultra-large (355 GB) zip archive, I got the following:
The archive link can be obtained by downloading a Kaggle dataset from here.
Unfortunately, I can't provide a direct link without exposing my Kaggle credentials.