Bulk write speed using zarr and GCS #619
@rabernat I noticed that you have a PR (#252) which implements a version of GCSStore using the google storage API! After noticing the performance difference, I took the important bits from that PR to write this GCSStore:

```python
from collections.abc import MutableMapping

from google.cloud import storage


class GCSStore(MutableMapping):
    def __init__(self, bucket, prefix):
        # project() and credentials() are project-local helpers
        client = storage.Client(project=project(), credentials=credentials())
        self.bucket = storage.Bucket(client, bucket)
        if not prefix.endswith("/"):
            prefix = prefix + "/"
        self.prefix = prefix

    def _full_name(self, key):
        return f"{self.prefix}{key}"

    def __contains__(self, key):
        name = self._full_name(key)
        return self.bucket.get_blob(name) is not None

    def __iter__(self):
        blobs = self.bucket.list_blobs(prefix=self.prefix)
        for blob in blobs:
            yield blob.name[len(self.prefix):]

    def __setitem__(self, key, value):
        name = self._full_name(key)
        # hsize() is a local helper that parses "1gb" into a byte count
        blob = self.bucket.blob(name, chunk_size=hsize("1gb"))
        blob.upload_from_string(value, content_type="application/octet-stream")

    def __len__(self):
        iterator = self.bucket.list_blobs(prefix=self.prefix)
        return sum(1 for _ in iterator)

    def __getitem__(self, key):
        name = self._full_name(key)
        blob = self.bucket.get_blob(name)
        if not blob:
            raise KeyError(f"Blob {name} not found")
        return blob.download_as_bytes()

    def __delitem__(self, key):
        name = self._full_name(key)
        self.bucket.delete_blob(name)
```

Running the code with this change:

```python
store = GCSStore(str(_bucketname), "zarr-test/data.zarr")
root = zarr.group(store=cache)
```

I get the following timings:
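For anyone wanting to reproduce this kind of measurement, a minimal timing harness over any `MutableMapping` store might look like the following sketch. `DictStore` is a hypothetical in-memory stand-in for `GCSStore` (which needs real GCS credentials), and the payload is sized down from the 1 GiB used in the issue so it runs quickly:

```python
import time
from collections.abc import MutableMapping


class DictStore(MutableMapping):
    """In-memory stand-in for GCSStore, so the harness runs anywhere."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def time_write(store, key, payload):
    """Return seconds taken to push payload into the store via __setitem__."""
    start = time.perf_counter()
    store[key] = payload
    return time.perf_counter() - start


store = DictStore()
payload = bytes(16 * 1024 * 1024)  # 16 MiB here; the issue used 1 GiB
elapsed = time_write(store, "data.zarr/0.0", payload)
print(f"wrote {len(payload)} bytes in {elapsed:.4f}s")
```

Swapping `DictStore` for the real `GCSStore` (or an fsspec mapper) gives directly comparable numbers for the same payload.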
Since this is a significant speed-up over gcsfs (which is taking ~21secs), I am going to continue using this in my project. (Note that I am passing 1G as the chunk size in my blob to get the best throughput for my data.) @martindurant I spent a significant part of the afternoon going over the code in gcsfs (and fsspec), and I think some of the abstractions there are fantastic. I think the majority of the time is spent in this call (not sure if there are knobs to make this faster):
I don't think this call is affected by
Whilst it can be good to have a specific and simple implementation that does the job well, I would prefer if we can backport the performance improvement to gcsfs, and I hope you can help with this, @skgbanga. Two points to note: zarr writes to the backend via `__setitem__`, which currently calls open/write, but ought to call pipe, as follows:

```diff
--- a/fsspec/mapping.py
+++ b/fsspec/mapping.py
@@ -148,8 +148,7 @@ class FSMap(MutableMapping):
         """Store value in key"""
         key = self._key_to_str(key)
         self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
-        with self.fs.open(key, "wb") as f:
-            f.write(value)
+        self.fs.pipe_file(key, value)
 
     def __iter__(self):
         return (self._str_to_key(x) for x in self.fs.find(self.root))
```

Secondly, see #606 for application of concurrent/async methods for zarr - but this is only read-only. The same thing should/will be done for writing too, and would help you. Finally, it is a shame that GCS does not support chunkwise uploads - each new chunk can only start when the previous one is finished.
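The intent of that patch can be illustrated with a toy filesystem that counts backend operations. This is only a sketch of the open/write-versus-pipe distinction; the method names (`open_write`, `pipe_file`) are illustrative and do not reflect fsspec's real internals:

```python
class StubFS:
    """Toy filesystem that logs each backend operation it performs."""

    def __init__(self):
        self.calls = []
        self.files = {}

    def open_write(self, key):
        # Open-for-write path: one call to create the handle...
        self.calls.append("open")
        fs = self

        class Handle:
            def write(self, value):
                fs.calls.append("write")  # ...at least one more to push data...
                fs.files[key] = value

            def __enter__(self):
                return self

            def __exit__(self, *exc):
                fs.calls.append("close")  # ...and a flush/commit on close
                return False

        return Handle()

    def pipe_file(self, key, value):
        # Pipe path: a single one-shot upload of the whole value
        self.calls.append("pipe")
        self.files[key] = value


fs = StubFS()
with fs.open_write("a") as f:
    f.write(b"data")
print(fs.calls)  # open/write/close: three backend operations

fs2 = StubFS()
fs2.pipe_file("a", b"data")
print(fs2.calls)  # a single operation
```

For a remote object store, every extra operation on the write path can mean an extra network round trip, which is why piping the whole value at once matters.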
btw: the first thing
@martindurant Thanks for the reply. I will take a more detailed look, but when I was stepping through all the code, one of the first things I wanted to do was make fsspec's default block size 100MB. With the default size, I was getting numbers ~14secs with GCSStore.
This suggests that it is not memory copies that are at fault, at least not within the gcsfs code.
Yeah, I was pretty sure it was not memory copies, that's why I mentioned this:
I have a naive question though. Is there any reason you are not using google cloud
Originally, it was because the google code was so convoluted, a situation which has improved somewhat since. We find very valuable some features that can only be achieved with a separate implementation.
In addition, gcloud's dependencies are hard to manage, and we want as few direct hooks into their auth as possible. I cannot immediately find any references to asyncio having performance issues for streaming an upload.
I will do some more testing over the weekend.
So I added time calls around this function:

```python
import time

now = time.time()
async with self.session.request(
    method=method,
    url=path,
    params=kwargs,
    json=jsonin,
    headers=headers,
    data=datain,
    timeout=self.requests_timeout,
) as r:
    spent = time.time() - now
    print("Time spent in request call is", spent)
```

and can say that out of ~21 seconds, more than 18s are spent in this function. Unfortunately, this is the first time I am looking at async stuff, so I don't think I will be able to help further :( I think it might just come down to how the two methods are doing their network calls. For the google storage API, I noticed this via strace:

For gcsfs, I noticed this (doing strace on this method is slightly tricky because it uses threading):

No clue why the asyncio HTTP request method is calling epoll_wait every time it sends some data.
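For what it's worth, this kind of ad-hoc instrumentation can be packaged into a small reusable async timing helper. The sketch below uses only the standard library; `fake_request` is a hypothetical stand-in for the real `session.request(...)` call:

```python
import asyncio
import time
from contextlib import asynccontextmanager


@asynccontextmanager
async def timed(label, results):
    """Record wall time spent inside an `async with` block into results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


async def fake_request():
    # Stand-in for session.request(...); sleeps instead of doing real I/O.
    await asyncio.sleep(0.05)
    return "ok"


async def main():
    results = {}
    async with timed("request", results):
        body = await fake_request()
    print(f"request took {results['request']:.3f}s, got {body!r}")
    return results


results = asyncio.run(main())
```

Because the timer is a context manager, it can be dropped around any `async with` or `await` site without restructuring the surrounding code.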
I don't think there's a way for me to delve into the innards of aiohttp to change the network layer :| I may have a closer look at the gcloud code - I'm pretty surprised if it's using a nonstandard transport. I do notice that the gcsfs writes are apparently larger, so maybe the wait happens in the gcloud stack too, but after many more writes - and the relative sizes of the network buffers might be important here. Did you try after replacement of the
hey @martindurant, sorry for the late reply.
I did notice that. Those writes are even bigger than the 65K TCP window size, so there is definitely heavy flow control going on (I think I will do a tcp capture to verify that). I will try to do that over the weekend.
Can we consider this issue resolved? It has become quite broad...
Please consider this closed from my end. @martindurant Unfortunately, I haven't been able to work on your stuff.
This question is similar to #595, but instead of streaming, it concerns zarr array writes using GCS as a store backend.
Consider the following code:
I am creating a 1G numpy array and assigning that to a zarr array (which has the same shape and only 1 chunk of the same size).
When I run the above code, I get the following numbers:
Compare this time with manually uploading the zarr file (created with store as DirectoryStore) directly:
This is surprising.
To make the comparison more 'fair', I used google's python api to upload this chunk of data using following code:
Running the above gives the following result:
To summarize:
@rabernat Did you ever observe something similar?