multiple zarr files + fsspec.get_mapper #286

Closed
Mikejmnez opened this issue Apr 24, 2020 · 2 comments

Comments

@Mikejmnez

I have a sequence of zarr files distributed across different nodes that I want to read in parallel, while only providing a string (glob-like) path.

The behavior I want to emulate:
For netcdf-files, we can do this using

url = fsspec.open_local(paths)

where paths is given by

paths= '/directoryA/*/subdirectoryB/*.nc'

such that
len(glob(paths)) == len(url)
e.g. 5 (five nc-files spread across different directories). The url is then passed as an argument to xarray.open_mfdataset.

The problem
Zarr stores are opened through a mapper (url = fsspec.get_mapper(paths), with url passed as an argument to xarray.open_zarr), and a glob-like path does not work as compactly there as it does with fsspec.open_local() and nc-files. That is, given

paths= '/directoryA/*/subdirectoryB/*'

(where the zarr stores appear as directories) we get

len(fsspec.get_mapper(paths)) == 0

That is, the mapper is empty even though len(glob(paths)) > 0.

One solution is to pass the glob-like path directly to _open_zarr (with suitable modifications to the _open_zarr function, much like xarray.open_mfdataset). I am just wondering whether fsspec.get_mapper(paths) can take a glob-like path string and I just haven't figured out how yet...

@martindurant
Member

This falls between some concepts:

  • zarr has a very well-defined spec, and would not, I think, be interested in changing its open functions to allow for multiple mappers
  • fsspec could take a glob in get_mapper and produce a set of mappers, but it's not clear how they would be collected into one output; for zarr to read this, it would also need a "virtual" .zgroup file

So indeed, intake-xarray could do this (glob -> list of mappers -> list of xarrays to be joined), or xarray itself could do this, like mfdataset. Note that since zarr may lean more on fsspec in the future (zarr-developers/zarr-python#546), it may make sense to discuss this with them and/or xarray.
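The "glob -> list of mappers" step could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (find_zarr_stores, default_mapper, open_many) are hypothetical and not part of fsspec, zarr, xarray, or intake-xarray, the paths are assumed to be local, and a directory is assumed to be a zarr (v2) store if it contains a .zgroup, .zarray, or .zmetadata file.

```python
import glob
import os


def find_zarr_stores(pattern):
    """Expand a local glob and keep only directories that look like zarr stores.

    Assumption: a store is recognised by a zarr v2 marker file at its root.
    """
    markers = (".zgroup", ".zarray", ".zmetadata")
    return sorted(
        path
        for path in glob.glob(pattern)
        if os.path.isdir(path)
        and any(os.path.exists(os.path.join(path, m)) for m in markers)
    )


def default_mapper(path):
    """Build one fsspec mapper for one store (fsspec is imported lazily)."""
    import fsspec

    return fsspec.get_mapper(path)


def open_many(pattern, make_mapper=default_mapper):
    """glob -> list of store paths -> list of mappers, one mapper per store."""
    return [make_mapper(path) for path in find_zarr_stores(pattern)]
```

Each mapper in the returned list could then be handed to xarray.open_zarr, and the resulting datasets joined (for example with xarray.concat or xarray.combine_by_coords), which is the "list of xarrays to be joined" step. How the datasets should be combined depends on the data and is not something this sketch decides.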

@Mikejmnez
Author

Mikejmnez commented Apr 24, 2020

Thanks @martindurant , this is very helpful. I agree that it would be nice to follow up with zarr developers.

One thing:
When calling xarray.open_mfdataset through intake-xarray, my understanding after going through the code is that the glob path is interpreted at the intake-xarray level. Even though xarray.open_mfdataset can accept a glob path or a list of paths directly, it is in intake.netcdf.py that the glob path is turned into a list before being passed to xarray. This happens in the definition of _open_dataset, lines 50-64:

url = fsspec.open_local(self.urlpath, **self.storage_options)

If url is originally a glob, fsspec.open_local returns a list which is then passed to xarray.open_mfdataset.

I wrote an xarray.open_mzarr that emulates the behavior of xarray.open_mfdataset and can also read multiple stores from a glob path or a list of paths. That is how I stumbled upon this issue. Note that fsspec.open_local does not work with zarr stores, since that function interprets them as directories.

Finally
It is possible, as you say, to leave the interpretation of the glob path to the xarray level. This could be done by writing, in intake.xzarr.py within the definition of _open_dataset, something like

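    # A glob pattern or a list of paths is passed through unchanged,
    # so the opener (not fsspec) is responsible for expanding it.
    if isinstance(self.urlpath, list) or "*" in self.urlpath:
        self._mapper = self.urlpath
    else:
        # A single store is wrapped in an fsspec mapper, as before.
        self._mapper = fsspec.get_mapper(self.urlpath, **self.storage_options)
    self._ds = _open_zarr(self._mapper, chunks=self.chunks, **kwargs)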

with self._mapper being passed to xarray. If I do this there is no problem: xarray.open_mzarr creates the dataset as intended (much like xarray.open_mfdataset), but I wonder if something is missing by doing this...
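One thing the glob branch above does not use is self.storage_options, so a raw glob string would presumably only work where the opener can expand it itself (e.g. local paths). A possible alternative, sketched here under assumptions, keeps fsspec in the loop: expand the glob on the filesystem object and build one mapper per match. The function name mappers_for is hypothetical, fs is assumed to be an fsspec filesystem instance (anything exposing glob and get_mapper), and whether xarray.open_mzarr accepts a list of mappers is not something the thread confirms.

```python
def mappers_for(urlpath, fs):
    """Return a list of mappers for a glob or a list, else a single mapper.

    `fs` is assumed to behave like an fsspec filesystem: it must provide
    `glob(pattern)` and `get_mapper(root)`.
    """
    if isinstance(urlpath, (list, tuple)):
        # An explicit list of stores: one mapper per entry, order preserved.
        return [fs.get_mapper(path) for path in urlpath]
    if "*" in urlpath:
        # A glob pattern: let the filesystem expand it, in sorted order.
        return [fs.get_mapper(path) for path in sorted(fs.glob(urlpath))]
    # A single store: same result shape as the existing single-store branch.
    return fs.get_mapper(urlpath)
```

With fsspec installed, fs could be created with something like fsspec.filesystem("file") for local paths, passing the storage options for the relevant backend when the filesystem is constructed.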
