Content-Range header for multiple part request #2248
Cool idea to implement a lazy tar parser on top of HF Hub!! What's the context/goals there? Re. support for multiple ranges in a single Range request, I think I remember @Kakulukian took a look at this at some point (was this you @Kakulukian?)
In essence, the idea is as follows. Our requirement is to swiftly retrieve a set of specific files from datasets on Hugging Face. These datasets typically comprise numerous (e.g., 1k) tar archives, each containing numerous image files; the archive in which an image resides depends on the image's id modulo 1000. Notably, one such dataset is nyanko7/danbooru2023, containing roughly 8 million images spread across 2k+ archive files.

In our practical application, we often begin by querying images based on metadata like tags, obtaining a list of required image ids (often over 1k, sometimes exceeding 100k), then fetching all images based on these ids to build a dataset. For this purpose, we're developing a library called cheesechaser. Though still a work in progress, it already supports the aforementioned danbooru2023 dataset.

Based on our current tests, downloading 10k specified images (with consecutive ids spread across 1000 archive files), totaling approximately 18 GB, took about 17 minutes using 12 threads and roughly 10k download requests. This performance is satisfactory: it is significantly faster than downloading and decompressing approximately 9 TB of complete tar archives, and it uses minimal local disk space.

However, we've identified areas for improvement. Primarily, because of the large volume of download requests and the relatively small file sizes, most of the time is spent establishing connections rather than downloading. Additionally, as the number of downloaded files increases, the excessive requests strain Hugging Face's CDN resources. Therefore, supporting multi-part range requests could significantly boost performance and relieve pressure on the CDN by enabling simultaneous downloads of multiple files within the same archive. Furthermore, after raising this issue and attempting to use multi-part ranges, we encountered some more problems:
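For context, here is a minimal sketch of the single-range fetch described above, assuming a hypothetical index that maps an image id to its archive, byte offset, and size (the offset and size values below are placeholders, not the real cheesechaser API):

```python
import requests

# Hypothetical index entry for one image: which archive it lives in,
# plus its byte offset and size inside that tar file (placeholder values).
archive_url = 'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar'
offset, size = 1200, 170

# A single-range request downloads just this member, not the whole archive.
resp = requests.get(archive_url, headers={'Range': f'bytes={offset}-{offset + size - 1}'})
assert resp.status_code == 206, 'expected Partial Content'
image_bytes = resp.content  # exactly `size` bytes of the member's data
```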
When you request multiple ranges, the response uses the multipart/byteranges content type and includes a boundary. Each requested range corresponds to a block separated by this boundary, each with its own Content-Range header (https://www.rfc-editor.org/rfc/rfc7233#page-21).
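For example, for a multi-range request like the one in the reproduce code below, the 206 response would look roughly like this (a sketch based on the RFC 7233 format; the boundary string, part Content-Type, and total length TOTAL are illustrative, not actual server output):

```http
HTTP/1.1 206 Partial Content
Content-Type: multipart/byteranges; boundary=BOUNDARY

--BOUNDARY
Content-Type: application/octet-stream
Content-Range: bytes 0-99/TOTAL

...first 100 bytes of the file...
--BOUNDARY
Content-Type: application/octet-stream
Content-Range: bytes 1200-1369/TOTAL

...bytes 1200 through 1369...
--BOUNDARY--
```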
I just reproduced this.

Reproduce code:

```python
import time
from pprint import pprint

import requests

# ranges to get
ranges = [
    (0, 99),
    (1200, 1369),
    (2000, 2209),
    (2146660100, 2146660200),
]

# get ranges with standalone requests
datas = []
for i, (x, y) in enumerate(ranges):
    start_time = time.time()
    resp = requests.get(
        'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
        headers={
            'Range': f'bytes={x}-{y}'
        },
    )
    print(f'Range {i}, response: {resp!r}, length: {len(resp.content)}, time cost: {time.time() - start_time:.3f}s')
    datas.append(bytes(resp.content))
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# get all the data with one request
start_time = time.time()
resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': f'bytes={",".join(map(lambda ix: f"{ix[0]}-{ix[1]}", ranges))}'
    },
)
print(f'Multipart response: {resp!r}')
print(f'Time cost: {time.time() - start_time:.3f}s')
print('Headers:')
pprint(dict(resp.headers))
print(f'Content length: {len(resp.content)}')
assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# walk through the multipart/byteranges body part by part
full_bytes = resp.content
start_pos = 0
current_i = 0
while True:
    try:
        # each part's headers end with a blank line (CRLF CRLF)
        next_sep = full_bytes.index(b'\r\n\r\n', start_pos)
    except ValueError:
        break
    lines = list(filter(bool, full_bytes[start_pos: next_sep].decode().splitlines(keepends=False)))
    # skip the boundary delimiter line, which contains no ':'
    pairs = [line.split(':', maxsplit=1) for line in lines if ':' in line]
    headers = {
        key.strip(): value.strip()
        for key, value in pairs
    }
    # 'Content-Range: bytes START-END/TOTAL' -> (START, END)
    start_bytes, end_bytes = headers['Content-Range'].split(' ')[-1].split('/')[0].split('-', maxsplit=1)
    start_bytes, end_bytes = int(start_bytes), int(end_bytes)
    length = end_bytes - start_bytes + 1
    current_data = full_bytes[next_sep + 4: next_sep + 4 + length]
    start_pos = next_sep + 4 + length
    print(f'Multipart, range {current_i}, headers: {headers!r}, byte-ranges: {(start_bytes, end_bytes)}')
    assert current_data == datas[current_i], f'Range {current_i} not match!'
    print(f'Range {current_i} matched!')
    current_i += 1

if current_i < len(datas):
    print(f'Ranges {list(range(current_i, len(datas)))} not matched!')
else:
    print('Match success!')
```

On my local machine: when I run this in my local environment, the multipart request is really slow, but the result is correct and the status code is 206 as expected.
On a Hugging Face Space: when I run this code on a Hugging Face Space (I deployed a JupyterLab there), it fails and the entire file is returned.
So, 2 problems:

1. Even when the multipart range request works (as it does locally), it is much slower than the separate single-range requests.
2. In some environments (e.g. on a Hugging Face Space), the multi-range request is not honored at all and the entire file is returned.
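As an aside, instead of splitting on `b'\r\n\r\n'` by hand as in the reproduce code above, the stdlib email parser can split a multipart/byteranges body. Here is a minimal sketch, assuming the server returns a well-formed multipart body with the boundary declared in the Content-Type header (`parse_byteranges` is a hypothetical helper, not part of any library):

```python
from email.parser import BytesParser

def parse_byteranges(content_type: str, body: bytes):
    """Split a multipart/byteranges body into (content_range, data) pairs."""
    # Wrap the body in a minimal MIME document so BytesParser can pick up
    # the boundary declared in the Content-Type header.
    raw = b'Content-Type: ' + content_type.encode() + b'\r\n\r\n' + body
    msg = BytesParser().parsebytes(raw)
    return [
        (part['Content-Range'], part.get_payload(decode=True))
        for part in msg.get_payload()  # one sub-message per requested range
    ]

# usage with the response from the reproduce code above:
# for content_range, data in parse_byteranges(resp.headers['Content-Type'], resp.content):
#     print(content_range, len(data))
```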
I'm developing a library to download files from tar archives in Hugging Face repositories. It is based on the `Range` header of the HTTP request: downloading a tar archive with `Range: bytes=xxx-yyy` will download only the specific file instead of the full archive.

In some cases, we need to download many files from different tar archives, and many of them are from the same archive. So I'm considering using `Range: bytes=xxx-yyy,zzz-ttt` to download all of them with only one HTTP request. This could greatly improve the performance of batch downloading, and could also reduce the pressure on the Hugging Face CDN.

But in my test, when using multi-part ranges, the `Content-Range` header seems to be gone from the response: no `Content-Range` is found in the output. The length of the content seems okay, but I don't know what the ranges of each part are. This header information is really important. So can it be added? Or is there an alternative solution to download multiple parts in one request and save each part to a different file?
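Until multi-part ranges are supported, one assumed mitigation for the connection-establishment overhead mentioned above is to reuse a single keep-alive connection across many single-range requests. A minimal sketch (not equivalent to a true multipart request, just a possible workaround):

```python
import requests

url = 'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar'

# A Session reuses the underlying TCP/TLS connection (keep-alive),
# so only the first request pays the connection-setup cost.
with requests.Session() as session:
    for start, end in [(0, 99), (1200, 1369), (2000, 2209)]:
        resp = session.get(url, headers={'Range': f'bytes={start}-{end}'})
        assert resp.status_code == 206
        # each part can be saved to its own file here
        with open(f'part_{start}-{end}.bin', 'wb') as f:
            f.write(resp.content)
```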