Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

First draft of the data loading system #122

Merged
merged 185 commits into from
Jan 30, 2023
Merged

First draft of the data loading system #122

merged 185 commits into from
Jan 30, 2023

Conversation

davidorme
Copy link
Collaborator

@davidorme davidorme commented Nov 28, 2022

Description

This is quite a big PR providing a first draft of the data loading system. More docs needed, but the module docstrings and the intro data.md show the main idea.

The basic idea is:

  • The Data object is just an extended dictionary of xarray.DataArray objects - handy indexing and selection methods etc. It exposes the dictionary interface for getting values (data['var'] gets the DataArray for var) but also contains other data and methods.
  • But the stored DataArrays need to be validated/checked using load_dataarray before they get added. There is registry that allows the definition of dimension specific mapping and validation - just implemented for spatial dimensions at present, but we'll need to apply it to temporal once this PR and Main timing loop #121 are all in merged into develop.
  • There is then a wrapper around load_dataarray to give load_from_file. That uses another registry that allows people to define file format specific recipes to turn .xyz into a data array.
  • Finally, there is another wrapper around load_from_file that allows a config file to be used to add $N \in (1, 2, ...)$ files.

There are some things that need fixing - particularly in setting which dimensions a variable has to match. We also need to hash out the group of core fixed dimensions: space, time, soil layers, etc.

Fixes #104

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@davidorme
Copy link
Collaborator Author

@alexdewar Thanks for that review. I've made the following changes:

  • Data.load_from_file is now readers.load_to_dataarray, as suggested to separate data reading mechanism from insertion into Data object. Docs and tests updated to match.
  • The optional data_var_name in data configuration and related functions has been removed. Users are now expected to use canonical variable names in input files. Docs and tests updated
  • Spatial axis validation no longer supports handling datasets that are a superset of the grid (more cell_ids or xy points). Users are expected to provide spatial datasets that have an unambiguous one-to-one mapping onto grid cells. This removes the strict argument to core.grid.map_xy_to_cell_indexing.

@vgro - does that sound sensible? We now have a systematically strict approach on matching data in inputs onto the configuration.

@alexdewar - are you happy to re-review? None of this changes the wider structure/function/docs but does make for a cleaner slimmed down codebase.

I have not removed the multi-format support in readers.py as I'm expecting to want to use it fairly shortly.

@davidorme
Copy link
Collaborator Author

I had missed a fix in the docs, so sphinx was failing on RTD, but this is now building correctly.

@TaranRallings and @jacobcook1995 - could you review the HTML pages on RTD? Links again:

Descriptions for users:
Axes
Data
API docs with rendered docstrings
Data
Axes
Readers

@alexdewar Do those code changes seem ok?

@davidorme
Copy link
Collaborator Author

The data API and user description have the biggest changes (no more Data.load_from_file and no more alternative variable names in the data configuration). The axes changes are just forcing spatial data to always have a one-to-one mapping onto the grid cells, and not allow oversized spatial data, so small differences in the circumstances of when the validators fail.

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look good to me, had a few small suggestions and a query, but nothing major

Copy link
Collaborator

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only had a quick look-through this time, but it's looking good to me!

@davidorme davidorme merged commit de9bcc4 into develop Jan 30, 2023
@davidorme davidorme deleted the feature/data_system branch January 30, 2023 13:39
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementation of data loading system.
8 participants