offset out of range for 65536-byte buffer #14
Thanks for the report, @sergeyvilov. I tried to get a URL to this dataset using my own Kaggle account, but it seems like you either have to use their library/API, or save cookies from the browser session and use wget/curl. How did you manage to get a simple URL to use with unzip-http?
Hi @saulpw, and thanks for the fast reply. I just clicked the Download All button under the Data Explorer on the Data tab, then cancelled the download and chose Copy download link by clicking on the cancelled download in the Firefox downloads window. The link starts with https://storage.googleapis.com/kaggle-... and ends with .zip, and it works then. The problem with kaggle-api is that it can't download directories, and when one tries to download files one by one, one gets a 'Too many requests' error after some time (this dataset contains very many small .dcm files, which triggers that error). I thought that unzip-http could help with this.
Thanks, that allowed me to get the URL too. When I use this code it works fine for me:
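A minimal sketch of that kind of usage, assuming unzip_http's RemoteZipFile interface (the URL and member name below are placeholders):

```python
# Sketch only: assumes unzip_http's RemoteZipFile, which reads individual
# members of a remote .zip via HTTP Range requests. URL/member are placeholders.
import unzip_http

url = "https://storage.googleapis.com/kaggle-.../archive.zip"  # placeholder
rzf = unzip_http.RemoteZipFile(url)

fp = rzf.open("some/member.dcm")   # placeholder member name
data = fp.read()                   # only this member's bytes are fetched
print(len(data))
```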
So I'm not sure why you're seeing that error. We did fix an issue like this a while ago for 64-bit .zip files. Are you using the most recent version of unzip_http? You're right, though, that unzip-http makes a separate HTTP request per file. It wouldn't be impossible to make unzip_http able to download multiple contiguous files with one request, as you suggest, but it would be somewhat complicated and I unfortunately don't have the time at the moment to pull it together. If you're interested in making it happen, I'd certainly review a PR for it.
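For anyone picking this up, the core of that optimization would be coalescing the byte ranges of adjacent members before issuing requests. A rough sketch of that bookkeeping (a hypothetical helper, not part of unzip_http):

```python
# Hypothetical helper, not part of unzip_http: coalesce per-file byte ranges
# so that contiguous members could be fetched with a single HTTP Range request.
from typing import Iterable, List, Tuple

def merge_ranges(ranges: Iterable[Tuple[int, int]], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Merge (offset, length) pairs whose regions touch, overlap,
    or lie within max_gap bytes of each other into larger spans."""
    merged: List[Tuple[int, int]] = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    # Return as (offset, length) pairs again.
    return [(s, e - s) for s, e in merged]

# Example: two members stored back-to-back collapse into one request.
print(merge_ranges([(0, 100), (100, 50), (200, 10)]))  # [(0, 150), (200, 10)]
```

Each merged span could then be fetched with one Range header and split back into the individual members locally.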
Thank you very much for testing. Looks very strange. I'm using a clean conda environment with unzip-http 0.4 and Python 3.11 (also tried 3.8) and execute the same code as you; the issue is observed on macOS Big Sur (my home laptop), Ubuntu 18.04.6, and Rocky Linux 8.8 (remote servers). Concerning the number of requests, I think it might be useful to merge individual requests for contiguous regions in future releases. I'm pretty sure the Kaggle server is not the only one that limits the request rate, so many users may run into this issue when trying to download many files from an archive.
I get the same error. I am trying to download parts of the DocLayNet_extra.zip from here:
Same here. Would it be enough to change the 65536 to a larger number when needed (I have no idea if that makes sense)?
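For context on where the 65536 comes from: ZIP readers typically locate the end-of-central-directory (EOCD) record by scanning the last 64 KiB of the file, since the EOCD's trailing comment can be up to 65535 bytes long. Over HTTP that is usually one suffix Range request, roughly like this (an illustration only, not unzip_http's actual code; the URL is a placeholder):

```python
# Illustration only: fetch the tail of a remote .zip and locate the EOCD
# signature (PK\x05\x06). Not unzip_http's actual implementation.
import urllib.request

url = "https://example.com/archive.zip"  # placeholder

req = urllib.request.Request(url, headers={"Range": "bytes=-65536"})
tail = urllib.request.urlopen(req).read()

eocd = tail.rfind(b"PK\x05\x06")
print("EOCD signature at tail offset:", eocd)
```

So enlarging the buffer mostly matters for archives with unusually long comments; for very large archives the relevant offsets live in the zip64 EOCD record, which is presumably what the 64-bit fix mentioned above addressed.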
So, I tried all these cases myself from the CLI and they seem to work fine with v0.5.1. For example:
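The commands would have been along these lines, with placeholder URL and member name (assuming the CLI's -l flag for listing):

```bash
# Placeholder URL/member; -l lists the remote archive without downloading it.
unzip-http -l 'https://storage.googleapis.com/kaggle-.../archive.zip'

# Extract just the named member(s) via ranged requests.
unzip-http 'https://storage.googleapis.com/kaggle-.../archive.zip' 'path/inside/archive.dcm'
```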
I think the problem is that we never pushed v0.5.1 to PyPI, so these problems should be fixed once we do that (hopefully tonight). If any errors like this still happen with v0.5.1, please open a new issue.
While attempting to download files from an ultra-large (355 GB) zip archive, I got the following:
The archive link can be obtained by downloading a Kaggle dataset from here.
Unfortunately, I can't provide a direct link without exposing my Kaggle credentials.