From e343945a7682a8d0d7d5abf1ef0804640f528ed6 Mon Sep 17 00:00:00 2001
From: mgrover1
Date: Fri, 16 Sep 2022 14:58:01 -0500
Subject: [PATCH 1/5] add section on intake-esm keys

---
 .../understand-keys-and-how-to-change-them.md | 146 ++++++++++++++++++
 docs/source/index.md                          |   1 +
 2 files changed, 147 insertions(+)
 create mode 100644 docs/source/how-to/understand-keys-and-how-to-change-them.md

diff --git a/docs/source/how-to/understand-keys-and-how-to-change-them.md b/docs/source/how-to/understand-keys-and-how-to-change-them.md
new file mode 100644
index 00000000..b2dd336e
--- /dev/null
+++ b/docs/source/how-to/understand-keys-and-how-to-change-them.md
@@ -0,0 +1,146 @@
+---
+jupytext:
+  text_representation:
+    format_name: myst
+kernelspec:
+  display_name: Python 3
+  name: python3
+---
+
+# Understanding Intake-ESM Keys and How to Use Them
+
+Intake-ESM helps with aggregating your datasets using some `keys`. Here, we dig into what exactly these keys are, how they are constructed, and how you can change them. Understanding how this works will help you control how your datasets are merged together, and remove the mystery behind these strings of text.
+
+## Imports and Spin Up a Dask Cluster
+
+We start first with importing `intake` and a `LocalCluster` from `dask.distributed`
+
+```{code-cell} ipython3
+import intake
+from distributed import LocalCluster
+
+cluster = LocalCluster()
+```
+
+## Investigate a CMIP6 Catalog
+
+Let's start with a sample CMIP6 catalog! This is a fairly large dataset.
+
+```{code-cell} ipython3
+url ="https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
+catalog = intake.open_esm_datastore(url)
+catalog.df.head()
+```
+
+Typically, the next step would be to search and load your datasets using `.to_dataset_dict()` or `.to_datatree()`
+
+```{code-cell} ipython3
+catalog_subset = catalog.search(variable_id='ua')
+dsets = catalog_subset.to_dataset_dict()
+print(dsets)
+```
+
+### Investigating the Keys
+
+The keys for these datasets include some helpful information - but you might be wondering what this all means and where this text comes from...
+
+```{code-cell} ipython3
+print(list(dsets))
+```
+
+When `intake-esm` aggregates these datasets, it uses some pre-determined metadata, defined in the catalog file. We can look at which fields are used for aggregation, or merging of the datasets, using the following
+
+```{code-cell} ipython3
+print(catalog.esmcat.aggregation_control.groupby_attrs)
+```
+
+Let's go back to our data catalog... and find these fields. You'll notice they are all column labels! These are key components of the metadata.
+
+```{code-cell} ipython3
+catalog_subset.df
+```
+
+## Using keys_info()
+
+These groupby attributes are columns in our catalog! This means that the datasets will be merged using the hierarchy:
+
+```
+activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label
+```
+
+A clearer way of looking at these aggregation variables is to use the `.keys_info()` method of the catalog:
+
+```{code-cell} ipython3
+catalog_subset.keys_info()
+```
+
+## Change our Groupby/Aggregation Controls
+
+If we wanted to instead aggregate our datasets at the member_id level, we can change that using the following method:
+
+```{code-cell} ipython3
+original_groupby_attributes = catalog.esmcat.aggregation_control.groupby_attrs
+new_groupby_attributes = original_groupby_attributes + ["member_id"]
+print(new_groupby_attributes)
+```
+
+Now that we have our new groupby attributes, we can assign these to our catalog subset.
+
+```{code-cell} ipython3
+catalog_subset.esmcat.aggregation_control.groupby_attrs = new_groupby_attributes
+```
+
+Let's check our new keys! You'll notice we now have 97 keys, aggregated on
+
+```
+activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label --> member_id
+```
+
+```{code-cell} ipython3
+catalog_subset.keys_info()
+```
+
+### Load our Datasets with the New Keys
+
+We can now load our new datasets to our dictionary of datasets using:
+
+```{code-cell} ipython3
+dsets = catalog_subset.to_dataset_dict()
+```
+
+And if we only wanted the first key, we could use the following to grab the first key in the list. Notice how we now have our member_id at the end!
+
+```{code-cell} ipython3
+first_key = catalog_subset.keys()[0]
+first_key
+```
+
+And the .to_dask() method to load our dataset into our notebook.
+
+```{code-cell} ipython3
+ds = catalog_subset[first_key].to_dask()
+ds
+```
+
+### Compare this Dataset with the Original Catalog Configuration
+
+Compare this to our original catalog, which aggregated one level higher, placing all of the `member_id`s into the same dataset.
+
+**Notice how our metadata now mentions there are 65 member_ids in this dataset, compared to 1 in the previous dataset**
+
+```{code-cell} ipython3
+original_ds = catalog[catalog.keys()[0]].to_dask()
+original_ds
+```
+
+## Conclusion
+
+These `intake-esm` keys can be a bit abstract when first accessing your data, but understanding them is essential to understand **how** `intake-esm` aggregates your data, and how you can change these aggregation controls for your desired analysis! We hope this helped demystify these keys.
+
+```{code-cell} ipython3
+---
+tags: [hide-input, hide-output]
+---
+import intake_esm  # just to display version information
+intake_esm.show_versions()
+```
diff --git a/docs/source/index.md b/docs/source/index.md
index 7f72d695..6e36a60f 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -89,6 +89,7 @@ how-to/define-and-use-derived-variable-registry.md
 how-to/use-catalogs-with-assets-containing-multiple-variables.md
 how-to/filter-catalog-by-substring-and-regex-criteria.md
 how-to/enforce-search-query-criteria-via-require-all-on.md
+how-to/understand-keys-and-how-to-change-them.md
 how-to/manipulate-catalog.md
 ```

From e1c23fe7033c79c325b0e28ba279a579d8d51ea6 Mon Sep 17 00:00:00 2001
From: Max Grover
Date: Fri, 16 Sep 2022 15:09:56 -0500
Subject: [PATCH 2/5] Update docs/source/how-to/understand-keys-and-how-to-change-them.md

Co-authored-by: Anderson Banihirwe
---
 docs/source/how-to/understand-keys-and-how-to-change-them.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/how-to/understand-keys-and-how-to-change-them.md b/docs/source/how-to/understand-keys-and-how-to-change-them.md
index b2dd336e..2367b19b 100644
--- a/docs/source/how-to/understand-keys-and-how-to-change-them.md
+++ b/docs/source/how-to/understand-keys-and-how-to-change-them.md
@@ -7,7 +7,7 @@ kernelspec:
   name: python3
 ---

-# Understanding Intake-ESM Keys and How to Use Them
+# Understanding intake-ESM keys and how to use them

 Intake-ESM helps with aggregating your datasets using some `keys`. Here, we dig into what exactly these keys are, how they are constructed, and how you can change them. Understanding how this works will help you control how your datasets are merged together, and remove the mystery behind these strings of text.

From 1b7e60057f33c09851602bfd760f7b8ef2457855 Mon Sep 17 00:00:00 2001
From: Max Grover
Date: Fri, 16 Sep 2022 15:20:59 -0500
Subject: [PATCH 3/5] Apply suggestions from Anderson

Co-authored-by: Anderson Banihirwe
---
 .../how-to/understand-keys-and-how-to-change-them.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/source/how-to/understand-keys-and-how-to-change-them.md b/docs/source/how-to/understand-keys-and-how-to-change-them.md
index 2367b19b..18cd4fe1 100644
--- a/docs/source/how-to/understand-keys-and-how-to-change-them.md
+++ b/docs/source/how-to/understand-keys-and-how-to-change-them.md
@@ -11,7 +11,7 @@ kernelspec:

 Intake-ESM helps with aggregating your datasets using some `keys`. Here, we dig into what exactly these keys are, how they are constructed, and how you can change them. Understanding how this works will help you control how your datasets are merged together, and remove the mystery behind these strings of text.

-## Imports and Spin Up a Dask Cluster
+## Import packages and spin up a Dask cluster

 We start first with importing `intake` and a `LocalCluster` from `dask.distributed`

@@ -22,7 +22,7 @@ from distributed import LocalCluster
 cluster = LocalCluster()
 ```

-## Investigate a CMIP6 Catalog
+## Investigate a CMIP6 catalog

 Let's start with a sample CMIP6 catalog! This is a fairly large dataset.

@@ -40,7 +40,7 @@ dsets = catalog_subset.to_dataset_dict()
 print(dsets)
 ```

-### Investigating the Keys
+### Investigating the keys

 The keys for these datasets include some helpful information - but you might be wondering what this all means and where this text comes from...

@@ -74,7 +74,7 @@ A more clear of taking a look at these aggregation variables using the `.keys_in
 catalog_subset.keys_info()
 ```

-## Change our Groupby/Aggregation Controls
+## Change our groupby/aggregation controls

 If we wanted to instead aggregate our datasets at the member_id level, we can change that using the following method:

@@ -100,7 +100,7 @@ activity_id --> institution_id --> source_id --> experiment_id --> table_id -->
 catalog_subset.keys_info()
 ```

-### Load our Datasets with the New Keys
+### Load our datasets with the new keys

 We can now load our new datasets to our dictionary of datasets using:

@@ -122,7 +122,7 @@ ds = catalog_subset[first_key].to_dask()
 ds
 ```

-### Compare this Dataset with the Original Catalog Configuration
+### Compare this dataset with the original catalog configuration

 Compare this to our original catalog, which aggregated one level higher, placing all of the `member_id`s into the same dataset.

From 9f46af22969df785b237f7ad5196af07f83d4405 Mon Sep 17 00:00:00 2001
From: Max Grover
Date: Fri, 16 Sep 2022 15:40:31 -0500
Subject: [PATCH 4/5] Apply more suggestions from Anderson

Co-authored-by: Anderson Banihirwe
---
 docs/source/how-to/understand-keys-and-how-to-change-them.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/how-to/understand-keys-and-how-to-change-them.md b/docs/source/how-to/understand-keys-and-how-to-change-them.md
index 18cd4fe1..d19c2982 100644
--- a/docs/source/how-to/understand-keys-and-how-to-change-them.md
+++ b/docs/source/how-to/understand-keys-and-how-to-change-them.md
@@ -32,7 +32,7 @@ catalog = intake.open_esm_datastore(url)
 catalog.df.head()
 ```

-Typically, the next step would be to search and load your datasets using `.to_dataset_dict()` or `.to_datatree()`
+Typically, the next step would be to search and load your datasets using {py:meth}`~intake_esm.core.esm_datastore.to_dataset_dict` or {py:meth}`~intake_esm.core.esm_datastore.to_datatree`

 ```{code-cell} ipython3
 catalog_subset = catalog.search(variable_id='ua')
@@ -62,7 +62,7 @@ catalog_subset.df

 ## Using keys_info()

-These groupby attributes are columns in our catalog! This means that the datasets will be merged using the hierarchy:
+These groupby attributes are columns in our catalog! This means that the datasets will be aggregated using the hierarchy:

 ```
 activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label

From 1f6781e0d7764b78ae421de31f939aa8bee4d0c6 Mon Sep 17 00:00:00 2001
From: Max Grover
Date: Fri, 16 Sep 2022 15:46:50 -0500
Subject: [PATCH 5/5] Apply few suggestions from Anderson

Co-authored-by: Anderson Banihirwe
---
 .../how-to/understand-keys-and-how-to-change-them.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/source/how-to/understand-keys-and-how-to-change-them.md b/docs/source/how-to/understand-keys-and-how-to-change-them.md
index d19c2982..61165b3b 100644
--- a/docs/source/how-to/understand-keys-and-how-to-change-them.md
+++ b/docs/source/how-to/understand-keys-and-how-to-change-them.md
@@ -17,9 +17,9 @@ We start first with importing `intake` and a `LocalCluster` from `dask.distribut

 ```{code-cell} ipython3
 import intake
-from distributed import LocalCluster
+from distributed import Client

-cluster = LocalCluster()
+cluster = Client()
 ```

 ## Investigate a CMIP6 catalog
@@ -126,7 +126,9 @@ ds

 Compare this to our original catalog, which aggregated one level higher, placing all of the `member_id`s into the same dataset.

-**Notice how our metadata now mentions there are 65 member_ids in this dataset, compared to 1 in the previous dataset**
+```{note}
+Notice how our metadata now mentions there are 65 member_ids in this dataset, compared to 1 in the previous dataset
+```

 ```{code-cell} ipython3
 original_ds = catalog[catalog.keys()[0]].to_dask()
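
Review note: the page added by this series describes keys as the values of the groupby attributes strung together, and shows that appending `member_id` yields more, finer-grained keys. The idea can be sketched in plain Python without touching the real catalog. This is only an illustration under stated assumptions, not intake-esm's implementation: the `rows` data, the `build_keys` helper, and the `"."` separator are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict

# Hypothetical catalog rows (made-up values, not read from the real catalog).
rows = [
    {"activity_id": "CMIP", "institution_id": "BCC", "source_id": "BCC-ESM1",
     "experiment_id": "historical", "table_id": "Amon", "grid_label": "gn",
     "member_id": "r1i1p1f1"},
    {"activity_id": "CMIP", "institution_id": "BCC", "source_id": "BCC-ESM1",
     "experiment_id": "historical", "table_id": "Amon", "grid_label": "gn",
     "member_id": "r2i1p1f1"},
    {"activity_id": "CMIP", "institution_id": "NCAR", "source_id": "CESM2",
     "experiment_id": "historical", "table_id": "Amon", "grid_label": "gn",
     "member_id": "r1i1p1f1"},
]


def build_keys(rows, groupby_attrs, sep="."):
    """Group rows by the joined values of groupby_attrs (illustrative helper).

    The separator is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(str(row[attr]) for attr in groupby_attrs)
        groups[key].append(row)
    return dict(groups)


default_attrs = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]

# Grouping on the six attributes puts both BCC members under one key.
default_groups = build_keys(rows, default_attrs)

# Appending member_id splits each member into its own key.
member_groups = build_keys(rows, default_attrs + ["member_id"])

print(sorted(default_groups))
print(sorted(member_groups))
```

With these three made-up rows, the six-attribute grouping produces two keys and the seven-attribute grouping produces three, which mirrors the jump in key count the page reports after adding `member_id`.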