Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Mixed formats in catalog #414

Closed
aulemahal opened this issue Dec 14, 2021 · 3 comments · Fixed by #416
Closed

Mixed formats in catalog #414

aulemahal opened this issue Dec 14, 2021 · 3 comments · Fixed by #416

Comments

@aulemahal
Copy link
Contributor

Is your feature request related to a problem? Please describe.
I work at a climate service provider. We do most of our work on local servers and a lot of the data is local. We want to use intake_esm to handle these in-house datasets, at two levels : the organisation-level and the project-level. Having a somewhat clean catalog at the project level is possible, but it is a complicated task at the org-level : i want to include a maximum of what we have in a single catalog and it's messy.

This means both netCDFs and zarr, as we are trying to move from the first to the second, but it's evidently impossible to completely abandon netCDF.

Currently, "assets.format" is mandatory and must be a single string ("netcdf" or "zarr"). The catalog and DataSource logic implies all url/paths use the same xarray "engine".

Describe the solution you'd like
The esm-collection-spec has "assets.format_column_name" and it says :" The column name which contains the data format, allowing for variable data types in one catalog. Mutually exclusive with format."
I'd like this to be reflected in intake_esm.

Describe alternatives you've considered
Having 2 catalogs. Or implementing some messy workaround in a custom subclass of esm_datastore.

Additional context
I can suggest a PR. My idea is the following:

  • Making Assets.format optional. Add None as a valid enum item of DataFormat.
  • Add format_column_name to ESMDataSource and pass it in esm_datastore.__getitem__.
  • In ESMDataSource.__init__: If data_format is not None, set a new "_data_format" column to that, if it is None, rename the format_column_name to "_data_format".
  • Use that new column in ESMDataSource._open_dataset.

That's the most elegant way I could think of, but I am open to suggestion if you are ok with this addition.

@andersy005
Copy link
Member

The esm-collection-spec has "assets.format_column_name" and it says :" The column name which contains the data format, allowing for variable data types in one catalog. Mutually exclusive with format."
I'd like this to be reflected in intake_esm.

👍🏽 . The current behavior is definitely a bug which was introduced in #368. We should be able to alter Assets to ensure that format and format_column_name are mutually exclusive as mentioned in the SPEC.

class Assets(pydantic.BaseModel):
column_name: pydantic.StrictStr
format: DataFormat
format_column_name: typing.Optional[pydantic.StrictStr]

I'm happy to look into this unless you want to work on it :)

@aulemahal
Copy link
Contributor Author

I have time to work on this, I can try to come up with a PR quite soon.

@andersy005
Copy link
Member

Awesome!

I think the necessary changes are straightforward:

  1. Make format an optional argument in Asset
  2. Update the validator to ensure that either format or format_column_name is set to a valid value:
    @pydantic.root_validator
    def _validate_data_format(cls, values):
    data_format, format_column_name = values.get('format'), values.get('format_column_name')
    if data_format is not None and format_column_name is not None:
    raise ValueError('Cannot set both format and format_column_name')
    return values
  3. Ensure the right code path is taken in ESMDataSource when format_column_name is set. Here are some relevant lines:
    self.data_format = data_format.value
    and
    self.xarray_open_kwargs = _get_xarray_open_kwargs(self.data_format, xarray_open_kwargs)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants