ZEP0004 Review - Zarr Conventions #262

Draft: wants to merge 4 commits into base: main
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ docs/_build

# pycharm
.idea

.DS_Store
141 changes: 141 additions & 0 deletions docs/conventions/index.rst
@@ -0,0 +1,141 @@
===========
Conventions
===========

Why Conventions?
~~~~~~~~~~~~~~~~

Zarr Conventions provide a mechanism to standardize metadata and layout of Zarr data
in order to meet domain-specific application needs without changes to the
core Zarr data model and specification, and without specification extensions.

Conventions must fit entirely within the Zarr data and metadata model of groups, arrays, and their attributes, requiring
no changes or extensions to the specification.
A Zarr implementation should not even need to be aware of the existence of the convention.
The line between a convention and an extension may be blurry in some cases.
The key distinction lies in the implementation: responsibility for interpreting a *convention* rests entirely with downstream,
domain-specific software, while an *extension* must be handled by the Zarr implementation itself.
A good rule of thumb is that a user should be able to safely ignore the convention and still be able to interact with the data via the core Zarr library,
even if some domain-specific context or functionality is missing.
If the data are completely meaningless or unintelligible without the convention, then it should be an extension instead.

Conventions can also help users switch between different storage libraries more flexibly.
Since Zarr and HDF5 implement nearly identical data models, a single convention can be applied to both formats.
This allows downstream software to maintain better separation of concerns between storage and domain-specific logic.

Conventions are modular and composable. A single group or array can conform to multiple conventions.


Describing Conventions
~~~~~~~~~~~~~~~~~~~~~~

Conventions Document
--------------------

Conventions are described by a *convention document*.
TODO: say more about the structure and format of this document

Explicit Conventions
--------------------

The preferred way of identifying the presence of a convention in a Zarr group or array is via the ``zarr_conventions`` attribute.
This attribute must be an array of strings; each string is an identifier for the convention.
Multiple conventions may be present.

For example, a group metadata JSON document with conventions present might look like this:

.. code-block:: json

   {
       "zarr_format": 3,
       "node_type": "group",
       "attributes": {
           "zarr_conventions": ["units-v1", "foo"]
       }
   }

where ``units-v1`` and ``foo`` are the convention identifiers.

Review comment (Choice of schema-language):

    Wanted to create a separate thread since the prior one is overloaded. If the idea of "reference/contain a schema for a convention" is generally accepted, it might be worth looking into defining it in https://linkml.io/ instead of JSON Schema, since it is 1. more human readable/friendly and 2. can be converted to JSON Schema (or pydantic, or ...; see https://linkml.io/linkml/intro/overview.html#feature-rich-modeling-language).

    It might be easier to establish such schemas. Not yet sure whether it would be easier to use in some cases, so it might be worthwhile to accompany the convention with both LinkML and JSON Schema URLs. Sorry if I am adding another level of complexity right away - but I wanted to establish the "target horizon" right away ;-)
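
As a rough illustration of how downstream software might detect an explicit convention, the sketch below reads the ``zarr_conventions`` attribute with the zarr-python library. The store path and the fallback to an empty list are illustrative assumptions, not part of this proposal.

.. code-block:: python

   import zarr

   # Open an existing group read-only (the path is illustrative).
   group = zarr.open_group("data.zarr", mode="r")

   # When present, the attribute is a list of convention identifiers.
   conventions = group.attrs.get("zarr_conventions", [])

   if "units-v1" in conventions:
       ...  # hand the group to units-aware, domain-specific code
   else:
       ...  # core Zarr access still works; only the extra context is missing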


Legacy Conventions
------------------

A legacy convention is a convention already in use that predates this ZEP.
Data conforming to legacy conventions will not have the ``zarr_conventions`` attribute.
The conventions document must therefore specify how software can identify the presence of the convention through a series of rules or tests.

For those comfortable with the terminology, a legacy convention can be thought of as defining a "conformance class" with a corresponding "conformance test".
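
For illustration only, a conformance test is typically a small predicate that downstream software (not the Zarr implementation) runs against a group. In the sketch below the rule, the ``my_convention_marker`` attribute, and the store path are all hypothetical, and the zarr-python library is assumed:

.. code-block:: python

   import zarr

   def detects_my_convention(group: zarr.Group) -> bool:
       """Hypothetical conformance test for an imaginary legacy convention.

       A real conventions document would spell out the actual rules; this
       sketch only shows the shape such a test tends to take: inspect the
       attributes of the group and its arrays and return True on a match.
       """
       return all(
           "my_convention_marker" in array.attrs
           for _, array in group.arrays()
       )

   group = zarr.open_group("data.zarr", mode="r")
   if detects_my_convention(group):
       ...  # apply the domain-specific interpretation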

Namespacing
-----------

Conventions may choose to store their attributes under a specific namespace.
This ZEP does not specify how namespacing works; that is up to the convention.
For example, the namespace may be specified as a prefix on attribute names, e.g.

.. code-block:: json

   {
       "attributes": {"units-v1:units": "m^2"}
   }


or via a nested JSON object, e.g.

.. code-block:: json

   {
       "attributes": {"units-v1": {"units": "m^2"}}
   }

The use of namespacing is optional and is up to the convention to decide.


Proposing Conventions
~~~~~~~~~~~~~~~~~~~~~

New conventions are proposed via a pull request to the ``zarr-specs`` repository which adds a new conventions document.
If the convention is already documented elsewhere, the convention document can simply reference the external documentation.
The author of the PR is expected to convene the relevant domain community to review and discuss the proposed convention.
This includes posting a link to the PR on relevant forums, mailing lists, and social-media platforms.

The goal of the discussion is to reach a *consensus* among the domain community regarding the convention.
The Zarr steering council, together with the PR author, will determine if a consensus has been reached, at which point the PR
can be merged and the convention published on the website.
If a consensus cannot be reached, the steering council may still decide to publish the convention, accompanied by a
disclaimer that it is not a consensus, and noting any objections that were raised during the discussion.

It is also possible that multiple, competing conventions exist in the same domain. While not ideal, it's not up to
the Zarr community to resolve such domain-specific debates.
These conventions should still be documented in a central location, which hopefully helps move towards alignment.

Conventions should be versioned using incremental integers, starting from 1.
Alternatively, if the community already has an existing versioning system for its convention (e.g. the CF Conventions), that can be used instead.
The community is free to update their convention via a pull request using the same consensus process described above.
The conventions document should include a changelog.
Details of how to manage changes and backwards compatibility are left to the domain community.


Existing Conventions
~~~~~~~~~~~~~~~~~~~~


This page lists the Zarr conventions. The proposal to formalize the conventions is introduced in `ZEP0004 <https://zarr.dev/zeps/draft/ZEP0004.html>`_.

Some of the widely used conventions are:

- `GDAL <https://gdal.org/drivers/raster/zarr.html>`_
- `OME-NGFF <https://ngff.openmicroscopy.org/>`_
- `NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_
- `Xarray <https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html>`_

Any new conventions accepted by the `ZEP <https://zarr.dev/zeps/active/ZEP0000.html>`_ process will be listed here.

.. toctree::
:glob:
:maxdepth: 1
:titlesonly:
:caption: Contents:

xarray

99 changes: 99 additions & 0 deletions docs/conventions/xarray.rst
@@ -0,0 +1,99 @@
======================
Xarray Zarr Convention
======================

+---------------------+----------------------+
| Convention Type | Legacy |
+---------------------+----------------------+
| Zarr Spec Versions | V2 |
+---------------------+----------------------+
| Status | Active |
+---------------------+----------------------+
| Active Dates | 2018 - present |
+---------------------+----------------------+
| Version | 1 |
+---------------------+----------------------+

See also `Zarr Encoding Specification <https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html>`_
in the Xarray docs.


Description
-----------

`Xarray`_ is a Python library for working with labeled multi-dimensional arrays.
Xarray was originally designed to work with `NetCDF`_ files, but has since added support for
other formats.
In implementing support for the `Zarr <https://zarr.dev>`_ storage format, Xarray developers
made some *ad hoc* choices about how to store NetCDF-style data in Zarr.
These choices have become a de facto convention for mapping the Zarr data model to the
`NetCDF data model <https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html>`_.

First, Xarray can only read and write Zarr groups. There is currently no support
for reading / writing individual Zarr arrays. Zarr groups are mapped to
Xarray ``Dataset`` objects, which correspond to NetCDF-4 / HDF5 groups.

Second, from Xarray's point of view, the key difference between
NetCDF and Zarr is that all NetCDF arrays have *dimension names* while Zarr
arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must
somehow encode and decode the name of each array's dimensions.

To accomplish this, Xarray developers decided to define a special Zarr array
attribute: ``_ARRAY_DIMENSIONS``. The value of this attribute is a list of
dimension names (strings), for example ``["time", "lon", "lat"]``. When writing
data to Zarr, Xarray sets this attribute on all variables based on the variable
dimensions. When reading a Zarr group, Xarray looks for this attribute on all
arrays, raising an error if it can't be found. The attribute is used to define
the variable dimension names and then removed from the attributes dictionary
returned to the user.

Because of these choices, Xarray cannot read arbitrary array data, but only
Zarr data with valid ``_ARRAY_DIMENSIONS`` attributes on each array.
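
For illustration, the sketch below hand-writes the attribute with zarr-python so that Xarray can decode the dimension names. It assumes zarr-python 3.x (for the ``zarr_format`` argument and ``create_array`` method) and Zarr format 2, where this convention applies; the store path, array name, and shape are arbitrary.

.. code-block:: python

   import xarray as xr
   import zarr

   # Create a plain Zarr v2 group and annotate the array by hand.
   group = zarr.open_group("example.zarr", mode="w", zarr_format=2)
   arr = group.create_array("temperature", shape=(4, 2, 3), dtype="float64")
   arr.attrs["_ARRAY_DIMENSIONS"] = ["time", "lat", "lon"]

   # Xarray decodes the attribute into dimension names and strips it from
   # the attributes dictionary returned to the user.
   ds = xr.open_zarr("example.zarr", consolidated=False)
   print(ds["temperature"].dims)   # ('time', 'lat', 'lon')
   print(ds["temperature"].attrs)  # {} -- the attribute has been consumed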

After decoding the ``_ARRAY_DIMENSIONS`` attribute and assigning the variable
dimensions, Xarray proceeds to (optionally) decode each variable using the
standard `CF Conventions`_ decoding machinery it also uses for NetCDF data.

Finally, it's worth noting that Xarray writes (and attempts to read)
"consolidated metadata" by default (the ``.zmetadata`` file), which is another
non-standard Zarr extension, albeit one implemented upstream in Zarr-Python.

.. _Xarray: http://xarray.dev
.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf
.. _CF Conventions: http://cfconventions.org


Identifying the Presence of this Convention
-------------------------------------------

In implementing this convention, Xarray developers made the unfortunate choice of not
including any explicit identifier in the Zarr metadata. Therefore, the only way to
determine whether the convention is being used is to examine the contents of the
Zarr hierarchy and look for the following properties (a sketch of such a check follows the list):

* A single flat group containing one or more arrays
* The presence of the ``_ARRAY_DIMENSIONS`` attribute on each array, whose contents are
a list of dimension names (strings)
* If a dimension name corresponds to another array name within the group, that array is
  assumed to be a dimension coordinate. Dimension coordinate arrays must be 1D
  and have the same length as the corresponding dimension.
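
A minimal sketch of such a check, assuming the zarr-python library and an illustrative store path:

.. code-block:: python

   import zarr

   def looks_like_xarray_zarr(group: zarr.Group) -> bool:
       """Heuristic conformance test for the Xarray Zarr convention."""
       # A single flat group: no subgroups expected.
       if next(group.groups(), None) is not None:
           return False
       arrays = dict(group.arrays())
       if not arrays:
           return False
       for array in arrays.values():
           dims = array.attrs.get("_ARRAY_DIMENSIONS")
           # Every array must carry a list of dimension names, one per axis.
           if not isinstance(dims, list) or len(dims) != array.ndim:
               return False
           for axis, dim in enumerate(dims):
               coord = arrays.get(dim)
               # A same-named array is a dimension coordinate: 1D, matching length.
               if coord is not None:
                   if coord.ndim != 1 or coord.shape[0] != array.shape[axis]:
                       return False
       return True

   group = zarr.open_group("example.zarr", mode="r")
   print(looks_like_xarray_zarr(group))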


CF Conventions
--------------

It is common for data stored in Zarr using the Xarray convention to also follow
the `Climate and Forecast (CF) Metadata Conventions <http://cfconventions.org>`_.

A high-level description of these conventions, quoted from the CF documentation, is as follows:

    The NetCDF library [NetCDF] is designed to read and write data that has been structured
    according to well-defined rules and is easily ported across various computer platforms.
    The netCDF interface enables but does not require the creation of self-describing datasets.
    The purpose of the CF conventions is to require conforming datasets to contain sufficient
    metadata that they are self-describing in the sense that each variable in the file has an
    associated description of what it represents, including physical units if appropriate,
    and that each value can be located in space (relative to earth-based coordinates) and time.

The CF Conventions are massive and cover a wide range of topics. Readers should consult the
`CF Conventions`_ documentation for more information.
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -7,7 +7,7 @@ A good starting point is the :ref:`zarr-core-specification-v3.0`.
.. toctree::

Home <https://zarr.dev>
specs
conventions
ZEPs <https://zarr.dev/zeps>
Implementations <https://github.com/zarr-developers/zarr_implementations>

6 changes: 6 additions & 0 deletions docs/specs.rst
@@ -12,6 +12,12 @@ Specifications
v3/stores
v3/array-storage-transformers

.. toctree::
:maxdepth: 1
:caption: Conventions

Conventions <conventions/index>

.. toctree::
:maxdepth: 1
:caption: v2