HTTP Store #373
Please also consider the HTTP implementation in fsspec, which comes with a …
@martindurant - last I checked (a few months ago), fsspec had no tests and was not considered ready for production. I see that things have moved along quite a bit since. Perhaps my effort here, like with #252, is obsolete.
As with google or azure, there may well be a useful benefit to having a simpler storage class which only does the minimum required for zarr, rather than trying to be a whole file-system interface. However, it is also nice to have things in one place and with a consistent design/API. I have been working on documentation and preparing fsspec for (alpha) release, but it has not been my main focus. The tests are only fairly rudimentary, and work is certainly needed. If I successfully make compatibility code for arrow-hdfs, s3fs and gcsfs (fsspec/gcsfs#116), then it can probably already be considered as the backend for dask, swapping out the code currently in dask.bytes.
Honestly, I'd be happy to see this integrated into Zarr. The code is very simple, and I also know of some use cases where cloud storage is not involved and this is a perfect fit.
Naive question: how hard would it be to set up a read-only web server for testing? I suspect that server could fetch data from a …?
This is the fixture fsspec uses (same as the command above, but with some retry/shutdown logic). Note that the command was different in Python 2, if you wanted to support that.
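For reference, a rough sketch of such a fixture under the pattern described here (not the actual fsspec code; the port number is arbitrary): serve a temporary directory over HTTP for the test session, retrying until the server is up.

```python
# Hypothetical pytest fixture modelled on the pattern described above.
import subprocess
import sys
import time

import pytest
import requests

PORT = 8999  # arbitrary test port (assumption, not from fsspec)


@pytest.fixture(scope="session")
def http_server(tmp_path_factory):
    serve_dir = tmp_path_factory.mktemp("data")
    # Launch a plain read-only HTTP server over the temp directory.
    proc = subprocess.Popen(
        [sys.executable, "-m", "http.server", str(PORT)],
        cwd=str(serve_dir),
    )
    # Retry until the server responds before handing it to tests.
    for _ in range(50):
        try:
            requests.get("http://localhost:%d/" % PORT)
            break
        except requests.ConnectionError:
            time.sleep(0.1)
    yield serve_dir, "http://localhost:%d" % PORT
    # Shutdown: terminate the server process at the end of the session.
    proc.terminate()
    proc.wait()
```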
This back-end is very cool, and is likely what my team would need for our imaging use case. However, there is one thing that I want to make sure I understand -- when I run the code in the PR and attempt to discover any nested groups, it notes that this store only works with "consolidated metadata". The imaging use cases that I represent require nested groups, and I can't figure out if this feature is supported but in a different way, or not supported by this store type. If it's not supported, could you think of ways to add it? Our existing solution (not using zarr) involves explicitly storing a JSON key:value map in each group that specifies the location of any sub-groups.
In limited cases of well-behaved servers giving directory-hierarchy links, the following may work for you: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.http.HTTPFileSystem. The file system has a get_mapper() method that gives you the kind of object that zarr needs.
Zarr stores subgroups in exactly the way you suggest. "Consolidated metadata" means gathering up all of that information throughout a dataset into a single file, so that the number of connections to the remote is minimised during the parsing phase. It is a useful optimisation. My implementation, if it works for your system, does not need this, but it's still a good idea.
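For example, a minimal sketch of that approach (the URL is a placeholder; it assumes the dataset was written with consolidated metadata, as discussed below):

```python
# Open a zarr dataset over plain HTTP via fsspec's HTTP filesystem.
import zarr
from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
# get_mapper() returns a mapping view over URLs -- the kind of object zarr needs.
mapper = fs.get_mapper("http://example.com/data/example.zarr")
root = zarr.open_consolidated(mapper)  # one GET for .zmetadata, then lazy chunk reads
```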
I believe that if you run …
Yes, nested groups would be supported. Just to elaborate a bit on what Martin said, the workflow for using the HTTPStore currently would be:

1. Create the full hierarchy of groups and arrays and store it as files on a local file system (using a DirectoryStore).
2. Run zarr.consolidate_metadata(), which packs all the metadata into a single .zmetadata file in the root folder.
3. Serve those files via HTTP.
4. Access the data via HTTP using HTTPStore and zarr.open_consolidated().

Using consolidated metadata avoids the need for the HTTP server to support directory listing. There is no standard way to get a directory listing via HTTP, so this is probably the only option.

If the HTTP server additionally supported the PROPFIND method from WebDAV, then you could use that to get directory listings and wouldn't need consolidated metadata. But that's probably beyond the scope of the HTTPStore (maybe a WebDAVStore).

Hth.
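A minimal sketch of those four steps, using only released zarr and fsspec APIs (the HTTPStore from this PR would stand in for the fsspec mapper in step 4; paths, port, and array names are placeholders):

```python
import numpy as np
import zarr
import fsspec  # only needed for step 4

# Step 1: create a nested hierarchy of groups and arrays on the local file system.
store = zarr.DirectoryStore("data/example.zarr")
root = zarr.group(store=store, overwrite=True)
grp = root.create_group("images/scene1")
grp.create_dataset("raw", data=np.arange(100).reshape(10, 10), chunks=(5, 5))

# Step 2: pack all group/array metadata into a single .zmetadata file.
zarr.consolidate_metadata(store)

# Step 3 (in a shell): serve the directory over HTTP, e.g.
#   cd data && python -m http.server 8000

# Step 4: open read-only over HTTP; only .zmetadata and the chunks actually
# accessed are fetched, so no directory listing is ever needed.
mapper = fsspec.get_mapper("http://localhost:8000/example.zarr")
remote = zarr.open_consolidated(mapper)
print(remote["images/scene1/raw"][:5, :5])
```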
Right. My implementation of …
Has anyone tried this with a …?
WebDAVStore is something very desirable! If this PR implements HTTPStore compatibly, then 🎆
Thanks @aparamon for the comment. I believe WebDAV is an extension of plain HTTP, and this store only uses the GET method, so it should be able to read data from a WebDAV server or a plain HTTP server. I think the path to getting this PR complete would be just to add some tests. It's a read-only store, and a local HTTP server would need to be run, so the tests would need some special setup, but should be fairly straightforward. Support for write operations via WebDAV is out of scope for this PR I think, but if anyone wanted to implement a full WebDAVStore …
I am closing this PR and offering a sketch of the way forward.

It is not feasible to implement every possible type of remote storage protocol within the zarr-python package. Zarr does allow us to bring our own storage classes as mutable mappings; however, this has some limitations: mutable mappings don't necessarily have the other methods that more full-featured zarr storage classes have, such as reporting file sizes.

Since this discussion started, fsspec has matured a lot. I think we should consider making fsspec an optional zarr dependency. We should write a zarr storage class for a generic fsspec filesystem which can take advantage of more features of the fsspec API than just a mutable mapping. Then we should hook into fsspec's resolver mechanism. That would allow us to do things like …

I will try to work on implementing this in a separate PR.
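A rough sketch of what such a generic storage class might look like (the class name and details here are hypothetical, not a final design; real code would need more care around prefixes, listing, and error handling):

```python
# Hypothetical generic fsspec-backed zarr store.
from collections.abc import MutableMapping

import fsspec


class FSSpecStore(MutableMapping):
    """Expose any fsspec filesystem as a zarr store."""

    def __init__(self, url, **storage_options):
        # fsspec's resolver picks the filesystem implementation from the
        # URL scheme, e.g. "s3://...", "gcs://...", "http://...".
        self.fs, self.root = fsspec.core.url_to_fs(url, **storage_options)

    def _path(self, key):
        return self.root + "/" + key

    def __getitem__(self, key):
        try:
            return self.fs.cat(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with self.fs.open(self._path(key), "wb") as f:
            f.write(value)

    def __delitem__(self, key):
        self.fs.rm(self._path(key))

    def __iter__(self):
        for path in self.fs.find(self.root):
            yield path[len(self.root) + 1:]

    def __len__(self):
        return len(self.fs.find(self.root))

    # Beyond the MutableMapping interface, extra methods can surface more of
    # the fsspec API, e.g. object sizes for storage-info reporting.
    def getsize(self, key):
        return self.fs.size(self._path(key))
```

With something along these lines, zarr.open(FSSpecStore("s3://bucket/path")) would work for any protocol fsspec knows about, without zarr needing a dedicated store class per backend.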
@rabernat Let me know if you need any help. I helped with some work over at Kedro (https://github.com/quantumblacklabs/kedro) to make our DataSets (read: storage classes) use fsspec. It made sense for us because it cut down on needing to duplicate every file format n times, with minor changes, for n storage mediums (e.g. CSV would need CSVBlob, CSVS3, CSVGCP, etc.); instead we have one CSV storage class which loads and saves data based on the filepath provided, using fsspec.
Obviously, let me know when …
Indeed @martindurant, we actually have an open issue about this here: https://github.com/quantumblacklabs/kedro/issues/26
To be clear, if someone else (@ZainPatelQB or @martindurant) wants to take the lead on this, I would be thrilled. I unfortunately exist in a state of extreme overcommitment at this time and have no clear idea when I can actually find the time for this. It is not a hard task; probably 100 lines of code max. I'd be happy to review PRs. @ZainPatelQB - thanks for sharing Kedro. It looks amazing!
This is what Dask does; it's essentially just one line: https://github.com/dask/dask/blob/master/dask/array/core.py#L2804
In keeping with the holiday theme of new stores (e.g. #299), I have created a bare-bones implementation of an HTTP store. This is something I have been thinking about for a long time: the simplest possible way to access zarr data over the internet. Basically, the idea is to just issue HTTP GET requests for all the desired data. This store only makes sense as a read-only store, and with consolidated metadata, since HTTP does not support directory listing. However, for public data, this drastically simplifies the process of accessing remote data, bypassing the need for external libraries such as s3fs, gcsfs, etc. It also opens the door to decentralized peer-to-peer sharing of zarr data: just fire up a web server in front of your consolidated DirectoryStore.
I feel like this is a promising path forward towards incorporating some sort of built-in remote dataset access within zarr. We have two long-pending PRs (#293 and #252) which implement custom classes for Azure Blob Storage and GCS. Given the overlap with @martindurant's s3fs, gcsfs, etc., it's not obvious that it's worth the effort of maintaining these sorts of stores within zarr. The HTTP store is a middle path: if you just want read-only access to public data, zarr can provide that; otherwise, you need the third-party libraries.
I'm not sure of the best way to test this, since it is fundamentally a read-only store and there are no existing examples of that to copy. Suggestions welcome. However, the following code works:
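A sketch of the kind of usage described, assuming the HTTPStore(base_url) constructor proposed in this PR, a placeholder dataset URL, and a hypothetical array name:

```python
# Sketch only: HTTPStore is the class proposed in this PR, and the dataset
# URL and array name are placeholders.
import zarr
from zarr.storage import HTTPStore  # as added by this PR (assumed location)

store = HTTPStore("https://example.com/data/example.zarr")
root = zarr.open_consolidated(store)  # single GET for .zmetadata
arr = root["some_array"]              # hypothetical array name
print(arr[:10])                       # each chunk read is a plain HTTP GET
```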
TODO:
- Docs build locally (e.g., run tox -e docs)