
quick overview example not working with to_zarr function with gcs store #4556

Closed
skgbanga opened this issue Oct 30, 2020 · 4 comments · Fixed by fsspec/filesystem_spec#480

@skgbanga

Hello,

Consider the following code:

import os

import xarray as xr
import numpy as np
import zarr
import gcsfs

from .helpers import project, credentials, bucketname   # project specific

def make_store(key):
    if key == "memory":
        return zarr.MemoryStore()
    if key == "disc":
        return zarr.DirectoryStore("example.zarr")
    if key == "gcs":
        gcs = gcsfs.GCSFileSystem(project=project(), token=credentials())
        root = os.path.join(bucketname, "xarray-testing")
        return gcsfs.GCSMap(root, gcs=gcs, check=False)

    raise ValueError(f"{key} not supported")


data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "baz": np.pi})
ds.to_zarr(make_store("gcs"), consolidated=True, mode="w")

The example dataset is from the quick overview example.

The above code works fine with both the MemoryStore and the DirectoryStore. When run with the 'gcs' key, it raises a rather long exception; the important part is:

> /home/sandeep/.venv/valkyrie/lib/python3.8/site-packages/gcsfs/core.py(1004)_pipe_file()
   1002         consistency = consistency or self.consistency
   1003         bucket, key = self.split_path(path)
-> 1004         size = len(data)
   1005         out = None
   1006         if size < 5 * 2 ** 20:

ipdb> p data
array(3.14159265)

np.pi is the value associated with the baz key.
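The failure can be reproduced in isolation: zarr represents a scalar variable as a zero-dimensional numpy array, and `len()` is undefined for such arrays, which is exactly where gcsfs's `_pipe_file` trips at `size = len(data)`. A minimal sketch:

```python
import numpy as np

# zarr stores a scalar variable (like baz = np.pi) as a 0-d array.
scalar = np.array(np.pi)
print(scalar.shape)  # ()
print(scalar.ndim)   # 0

# len() is undefined for 0-d arrays, which is where
# gcsfs's _pipe_file fails at `size = len(data)`.
try:
    len(scalar)
except TypeError as err:
    print(err)  # len() of unsized object
```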

I have also implemented a custom zarr store (details are in this zarr issue), which gives more insight into the problem:

~/.venv/valkyrie/lib/python3.8/site-packages/zarr/core.py in set_basic_selection(self, selection, value, fields)                                                             
   1212         # handle zero-dimensional arrays
   1213         if self._shape == ():
-> 1214             return self._set_basic_selection_zd(selection, value, fields=fields)                                                                                     
   1215         else:
   1216             return self._set_basic_selection_nd(selection, value, fields=fields)                                                                                     

~/.venv/valkyrie/lib/python3.8/site-packages/zarr/core.py in _set_basic_selection_zd(self, selection, value, fields)                                                         
   1497         # encode and store
   1498         cdata = self._encode_chunk(chunk)
-> 1499         self.chunk_store[ckey] = cdata
   1500
   1501     def _set_basic_selection_nd(self, selection, value, fields=None):

~gcsstore.py in __setitem__(self, key, value)                                                                                
     30         name = self._full_name(key)
     31         blob = self.bucket.blob(name, chunk_size=human_size("1gib"))
---> 32         blob.upload_from_string(value, content_type="application/octet-stream") 

~/.venv/valkyrie/lib/python3.8/site-packages/google/cloud/storage/blob.py in upload_from_string(self, data, content_type, client, predefined_acl, if_generation_match, if_generation_not_match, if_metageneration_match, if_metageneration_not_match, timeout, checksum)
   2437             "md5", "crc32c" and None. The default is None.
   2438         """
-> 2439         data = _to_bytes(data, encoding="utf-8")
   2440         string_buffer = BytesIO(data)
   2441         self.upload_from_file(

~/.venv/valkyrie/lib/python3.8/site-packages/google/cloud/_helpers.py in _to_bytes(value, encoding)
    368         return result
    369     else:
--> 370         raise TypeError("%r could not be converted to bytes" % (value,))
    371 
    372 

TypeError: array(3.14159265) could not be converted to bytes

It seems to me that zarr is not converting the data into its serialized representation (via its codec pipeline) before writing, and is instead passing the raw numpy scalar to the MutableMapping. The exception follows because the Google client libraries don't know how to convert the passed value (a 0-d numpy array holding np.pi) into bytes.
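One store-side workaround is to coerce whatever zarr hands to `__setitem__` into raw bytes before uploading. The helper below is a hypothetical sketch (the name `ensure_bytes` and the exact set of cases are assumptions, in the spirit of the fix later made in fsspec), not the actual fsspec patch:

```python
import numpy as np

def ensure_bytes(value):
    # Hypothetical helper: coerce the values a zarr store may receive
    # (bytes, bytearray, memoryview, or a numpy array) into raw bytes
    # before handing them to an upload API that expects bytes.
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    if isinstance(value, memoryview):
        return value.tobytes()
    if isinstance(value, np.ndarray):
        return value.tobytes()
    raise TypeError(f"cannot convert {type(value)!r} to bytes")

# The 0-d scalar from the traceback now round-trips cleanly.
print(ensure_bytes(np.array(np.pi)) == np.array(np.pi).tobytes())  # True
```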

ipdb> u
> gcsstore.py(32)__setitem__()
     30         name = self._full_name(key)
     31         blob = self.bucket.blob(name, chunk_size=human_size("1gib"))
---> 32         blob.upload_from_string(value, content_type="application/octet-stream")
     33 
     34     def __len__(self):

ipdb> p key
'baz/0'
ipdb> p value
array(3.14159265)

Please let me know if you think I should raise this issue in the zarr project rather than here.

Versions of xarray and zarr:

xarray                   0.16.1
zarr                     2.5.0
@rabernat
Contributor

@martindurant, do you have any idea what could be going on here?

@martindurant
Contributor

Looks like a special case of a numpy scalar. I can catch this in fsspec - please wait.

@raybellwaves
Contributor

@skgbanga can this be closed with the latest xarray, zarr, fsspec, and gcsfs?

@dcherian
Contributor

Please reopen if this is not resolved.
