Virtual Zarr Support #11

Open
alxmrs opened this issue Jun 17, 2024 · 2 comments

Comments

@alxmrs

alxmrs commented Jun 17, 2024

This is quite a catalog of weather data! Congrats, I’m really impressed.

If you wanted this catalog to be truly overpowered (e.g., useful to weather researchers, geospatial scientists, etc.), I recommend finding a way to make all this data accessible via Zarr. With Zarr V3 around the corner, you should be able to add a few metadata files at the top of each bucket to make everything Zarr-accessible (in a language-agnostic way). Here are a few pointers to get started:

Patrick, please reach out to me over email — I’d love to collaborate with you on what you’re building. (al(at)merose(dot)com).

@patrick-zippenfenig
Member

Hi @alxmrs! Thanks for sharing.

I am not sure how feasible it is to access om files from this data repository directly. The file format is highly specialized for fast and efficient gridded time-series storage. For improved performance, Open-Meteo uses the Swift programming language with bindings to C code.

It would be feasible to write client libraries for other programming languages, but the om file format is not intended as a general purpose format. I am thinking about some extensions to make it more generic (more data dimensions, metadata attributes), but it is still very domain-specific.

For VirtualiZarr, is this a server-side implementation that returns data in chunks (like Apache Arrow Feather)? Or is the general idea to read om files directly in Python?

Additionally, this data repository requires meta attributes which are hard-coded in the primary Open-Meteo repository, e.g. information about the data grid, time resolution and length of each time-chunk. This would be easy to expose as a JSON file.
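
As a rough illustration of what such a JSON file might contain, here is a minimal Python sketch. Every key name and value below (`grid`, `time_resolution_seconds`, `time_chunk_length`, the grid dimensions) is a hypothetical placeholder and not the actual Open-Meteo layout; the helper `time_chunk_index` likewise assumes that time-chunk files are numbered by integer division of the time step, which the thread does not confirm.

```python
import json

# Hypothetical metadata for one model domain. All keys and values are
# illustrative assumptions, not the real Open-Meteo constants.
domain_meta = {
    "grid": {"nx": 1440, "ny": 721, "lat_min": -90.0, "lon_min": -180.0,
             "dx": 0.25, "dy": 0.25},
    "time_resolution_seconds": 3600,   # one value per hour (assumed)
    "time_chunk_length": 168,          # time steps per chunk file (assumed)
}


def dump_meta(meta: dict) -> str:
    """Serialize the metadata to a JSON string suitable for a bucket object."""
    return json.dumps(meta, indent=2, sort_keys=True)


def time_chunk_index(meta: dict, timestamp_seconds: int) -> int:
    """Return the index of the time-chunk that would hold a Unix timestamp,
    assuming chunks are numbered by integer division of the time step."""
    step = timestamp_seconds // meta["time_resolution_seconds"]
    return step // meta["time_chunk_length"]
```

A client in any language could fetch this one file and then work out which chunk file to request for a given time, without needing the constants compiled into the server.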

@alxmrs
Author

alxmrs commented Jun 23, 2024

Hey Patrick,

Thanks for your response! A few thoughts:

> I am not sure how feasible it is to access om files from this data repository directly.

It may not be feasible... yet. But, I think it could be. The beauty of Zarr is that it's more of an array protocol than a file format. Given a few modifications (namely ZEP003), I think it will be possible to read om data directly from clients (across various languages).

In this scenario, VirtualiZarr would be run up-front in a batch setting to provide metadata files to a bucket somewhere (say, this open data on S3). From there, Zarr clients should be able to read the files directly. And, ideally, folks would be able to read om files directly in Python.
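
To make the "metadata files in a bucket" idea concrete, here is a minimal sketch of a chunk-reference manifest, loosely modeled on the Kerchunk-style layout where each chunk key maps to a `[url, offset, length]` byte range. It does not use VirtualiZarr itself. The function name `build_references`, the bucket URL, the variable name and the byte offsets are all hypothetical, and the sketch assumes the byte range of each chunk inside an om file can be listed ahead of time; decoding the om compression on the client side is a separate problem not covered here.

```python
import json


def build_references(array_name: str, file_url: str, chunk_index: dict) -> dict:
    """Build a reference manifest mapping chunk keys to byte ranges.

    chunk_index maps chunk-grid coordinates (a tuple of ints) to
    (offset, length) within the file at file_url.
    """
    refs = {}
    for coords, (offset, length) in sorted(chunk_index.items()):
        # Chunk keys use "." between grid coordinates, e.g. "temperature_2m/0.1"
        key = f"{array_name}/" + ".".join(str(c) for c in coords)
        refs[key] = [file_url, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical example: two chunks stored back to back in one file.
manifest = build_references(
    "temperature_2m",
    "s3://example-bucket/data/chunk_0.om",  # placeholder URL
    {(0, 0): (0, 100), (0, 1): (100, 80)},
)
manifest_json = json.dumps(manifest, indent=2)
```

The batch job would write `manifest_json` next to the data once; readers then issue ranged requests against the original files instead of downloading a converted copy.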

I'm happy to hear that the open-meteo constants can be exposed in JSON; that makes me suspect such an integration with Zarr is more feasible.

> the om file format is not intended as a general purpose format

That may be. But, I think you've addressed a really important access pattern that I suspect folks in the Zarr community would want to integrate with (xref: google-research/arco-era5#12).

One major benefit I see with having om be Zarr-readable is that it would make reorganizing the data for different access patterns a matter of rechunking, which is well understood.
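
A small sketch of why chunk layout matters for access patterns: it counts how many chunks a read touches under two chunkings of the same array. The array shape `(time, lat, lon) = (8760, 721, 1440)` and both chunk shapes are illustrative assumptions, not the layout of any real dataset, and the helper assumes the selection starts on a chunk boundary.

```python
import math


def chunks_touched(selection_shape, chunk_shape):
    """Number of chunks overlapped by a selection that starts at a chunk
    boundary, computed per dimension and multiplied together."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection_shape, chunk_shape))


# Two typical reads on a (time, lat, lon) array of shape (8760, 721, 1440).
point_series = (8760, 1, 1)     # one year of hourly values at one grid cell
single_map = (1, 721, 1440)     # one global field at one time step

# Two hypothetical chunkings of the same data.
map_oriented = (1, 721, 1440)   # one chunk per time step
series_oriented = (8760, 8, 8)  # long in time, small in space

series_cost_map = chunks_touched(point_series, map_oriented)        # 8760
series_cost_series = chunks_touched(point_series, series_oriented)  # 1
map_cost_map = chunks_touched(single_map, map_oriented)             # 1
map_cost_series = chunks_touched(single_map, series_oriented)       # 16380
```

The time-series read goes from 8760 chunk fetches to 1 when the data is chunked along time, while the map read gets correspondingly worse; rechunking is the operation that converts one layout into the other.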
