diff --git a/ci/environment.yml b/ci/environment.yml index fcb0e3ff..c299c98e 100644 --- a/ci/environment.yml +++ b/ci/environment.yml @@ -11,6 +11,7 @@ dependencies: - h5netcdf>=0.8.1 - intake>=0.6.0 - ipython + - matplotlib - netcdf4>=1.5.5 - pip - pooch diff --git a/docs/environment.yml b/docs/environment.yml index bb00625d..0e1e3eaa 100644 --- a/docs/environment.yml +++ b/docs/environment.yml @@ -13,6 +13,7 @@ dependencies: - sphinx-copybutton - sqlalchemy==1.3.12 - watermark + - python-graphviz - zarr - pip: - gcsfs diff --git a/docs/source/user-guide/ecgtools-history-files-example.ipynb b/docs/source/user-guide/ecgtools-history-files-example.ipynb new file mode 100644 index 00000000..4197dd4a --- /dev/null +++ b/docs/source/user-guide/ecgtools-history-files-example.ipynb @@ -0,0 +1,841 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "starting-spokesman", + "metadata": { + "tags": [] + }, + "source": [ + "# Building an Intake-esm catalog from CESM2 History Files\n", + "\n", + "In this example, we will cover how to build a data catalog from Community Earth System Model (CESM) output. One of the requirements for using intake-esm is having a catalog, which consists of two pieces:\n", + "* A table of the relevant metadata (e.g. file path, variable, stream)\n", + "* A JSON file describing the dataset, including how to aggregate the variables\n", + "\n", + "Typically, these pieces are constructed \"manually\" using information within the file path, on a very ad-hoc basis. Also, these catalogs are typically only created for \"larger\" community datasets, and are not necessarily used within smaller model runs or daily workflows. A new package (currently a prototype) called [ecgtools](https://ecgtools.readthedocs.io/en/latest/) works to solve the issues of generating these intake-esm catalogs. Ecgtools stands for Earth System Model (ESM) Catalog Generation tools. 
The current catalog generation tools supported are:\n", + "* CMIP6 models\n", + "* CESM \"history\" files\n", + "* CESM \"timeseries\" files\n", + "\n", + "This example provides an overview of using [ecgtools](https://ecgtools.readthedocs.io/en/latest/) for parsing CESM timeseries file model output, and reading in the data using \n", + "[Intake-ESM](https://intake-esm.readthedocs.io/en/latest/). In this example, we use sample CESM data within the test directory for Intake-ESM.\n", + "\n", + "## Downloading ecgtools\n", + "You can install [ecgtools](https://github.com/NCAR/ecgtools) through [PyPI](https://pypi.org/project/docs/) or [conda-forge](https://conda-forge.org/docs/). Examples of the syntax are provided below:\n", + "\n", + "\n", + "### Installing with Conda-Forge\n", + "\n", + "Within your [conda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), run the following: \n", + "\n", + "\n", + "```bash\n", + "conda install -c conda-forge ecgtools\n", + "```\n", + "\n", + "\n", + "### Installing with Pip\n", + "Alternatively, you could use [PyPI](https://pypi.org/project/docs/), with the `pip install` command shown below.\n", + "\n", + "```bash\n", + "pip install ecgtools\n", + "```\n", + "\n", + "## Imports\n", + "The only parts of ecgtools we need are the `Builder` object and the `parse_cesm_timeseries` parser from the CESM parsers! We import `glob` to take a look at the files we are parsing." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "nominated-battle", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "\n", + "import intake\n", + "from ecgtools import Builder\n", + "from ecgtools.parsers.cesm import parse_cesm_timeseries" + ] + }, + { + "cell_type": "markdown", + "id": "czech-stylus", + "metadata": {}, + "source": [ + "### Understanding the Directory Structure\n", + "\n", + "The first step to setting up the `Builder` object is determining where your files are stored. As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in the `/tests/sample_data` directory of this repository.\n", + "\n", + "Taking a look at that directory, we see that there is a single case, `g.e11_LENS.GECOIAF.T62_g16.009`" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "south-nancy", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ECOSYS_XKW.024901-031612.nc',\n", + " '../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2.024901-031612.nc',\n", + " '../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ecosys.nday1.CaCO3_form_zint.02490101-03161231.nc']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "glob.glob('../../../tests/sample_data/cesm/*')" + ] + }, + { + "cell_type": "markdown", + "id": "sunset-animation", + "metadata": {}, + "source": [ + "Now that we understand the directory structure, let's make the catalog." 
+ ] + }, + { + "cell_type": "markdown", + "id": "growing-anthony", + "metadata": {}, + "source": [ + "## Build the catalog!\n", + "\n", + "Let's start by inspecting the builder object" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "promising-installation", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[0;31mInit signature:\u001b[0m\n", + "\u001b[0mBuilder\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mroot_path\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mUnion\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpydantic\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDirectoryPath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpydantic\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDirectoryPath\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mextension\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'.nc'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mdepth\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mexclude_patterns\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mnjobs\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
+ "\u001b[0;31mDocstring:\u001b[0m \n", + "Generates a catalog from a list of files.\n", + "\n", + "Parameters\n", + "----------\n", + "root_path : str or list\n", + " Path(s) of root directory.\n", + "extension : str, optional\n", + " File extension, by default None. If None, the builder will look for files with\n", + " \"*.nc\" extension.\n", + "depth : int, optional\n", + " Recursion depth. Recursively crawl `root_path` up to a specified depth, by default 0\n", + "exclude_patterns : list, optional\n", + " Directory, file patterns to exclude during catalog generation.\n", + " These could be substring or regular expressions. by default None\n", + "njobs : int, optional\n", + " The maximum number of concurrently running jobs,\n", + " by default -1 meaning all CPUs are used.\n", + "\u001b[0;31mFile:\u001b[0m ~/anaconda3/envs/intake-esm-dev/lib/python3.9/site-packages/ecgtools/builder.py\n", + "\u001b[0;31mType:\u001b[0m type\n", + "\u001b[0;31mSubclasses:\u001b[0m \n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "?Builder" + ] + }, + { + "cell_type": "markdown", + "id": "ccf85513-d76a-4b31-9908-dc6af39afc12", + "metadata": {}, + "source": [ + "
\n", + "

Info

\n", + " Note that as of 21 June, 2021, the parsing_func parameter is now used in the .build() method!\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "abroad-taxation", + "metadata": {}, + "outputs": [], + "source": [ + "b = Builder(\n", + " # Directory with the output\n", + " '../../../tests/sample_data/cesm/',\n", + " # Depth of 1 since we are sending it to the case output directory\n", + " depth=1,\n", + " # Exclude the timeseries and restart directories\n", + " exclude_patterns=[\"*/tseries/*\", \"*/rest/*\"],\n", + " # Number of jobs to execute - should be equal to # threads you are using\n", + " njobs=5,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "painful-syria", + "metadata": {}, + "source": [ + "Double check the object is set up..." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "studied-jersey", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Builder(root_path=PosixPath('../../../tests/sample_data/cesm'), extension='.nc', depth=1, exclude_patterns=['*/tseries/*', '*/rest/*'], njobs=5)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b" + ] + }, + { + "cell_type": "markdown", + "id": "american-exhibit", + "metadata": {}, + "source": [ + "We are good to go! Let's build the catalog by calling `.build()` on the object, passing in the `parse_cesm_history` parser! By default, it will use the `LokyBackend` which is described in the [Joblib documentation](https://joblib.readthedocs.io/en/latest/parallel.html)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "false-bristol", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.\n", + "[Parallel(n_jobs=5)]: Done 1 out of 1 | elapsed: 0.4s finished\n", + "[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.\n", + "/Users/mgrover/anaconda3/envs/intake-esm-dev/lib/python3.9/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions\n", + " warnings.warn('Using the default frequency definitions')\n", + "/Users/mgrover/anaconda3/envs/intake-esm-dev/lib/python3.9/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions\n", + " warnings.warn('Using the default frequency definitions')\n", + "/Users/mgrover/anaconda3/envs/intake-esm-dev/lib/python3.9/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions\n", + " warnings.warn('Using the default frequency definitions')\n", + "[Parallel(n_jobs=5)]: Done 3 out of 3 | elapsed: 1.2s remaining: 0.0s\n", + "[Parallel(n_jobs=5)]: Done 3 out of 3 | elapsed: 1.2s finished\n" + ] + } + ], + "source": [ + "b = b.build(parse_cesm_timeseries)" + ] + }, + { + "cell_type": "markdown", + "id": "caring-harrison", + "metadata": {}, + "source": [ + "## Inspect the Catalog" + ] + }, + { + "cell_type": "markdown", + "id": "hydraulic-developer", + "metadata": {}, + "source": [ + "Now that the catalog is built, we can inspect the dataframe which is used to create the catalog by calling `.df` on the builder object" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "composite-insider", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | component | stream | case | member_id | variable | start_time | end_time | time_range | long_name | units | vertical_levels | frequency | path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ocn | pop.h | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | ECOSYS_XKW | 0249-01 | 0316-12 | 024901-031612 | XKW for ecosys fluxes | cm/s | 1 | month_1 | ../../../tests/sample_data/cesm/g.e11_LENS.GEC... |
| 1 | ocn | pop.h.ecosys.nday1 | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | CaCO3_form_zint | 0249-01-01 | 0316-12-31 | 02490101-03161231 | CaCO3 Formation Vertical Integral | mmol/m^3 cm/s | 1 | day_1 | ../../../tests/sample_data/cesm/g.e11_LENS.GEC... |
| 2 | ocn | pop.h | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | O2 | 0249-01 | 0316-12 | 024901-031612 | Dissolved Oxygen | mmol/m^3 | 1 | month_1 | ../../../tests/sample_data/cesm/g.e11_LENS.GEC... |
\n", + "
" + ], + "text/plain": [ + " component stream case member_id \\\n", + "0 ocn pop.h g.e11_LENS.GECOIAF.T62_g16.009 009 \n", + "1 ocn pop.h.ecosys.nday1 g.e11_LENS.GECOIAF.T62_g16.009 009 \n", + "2 ocn pop.h g.e11_LENS.GECOIAF.T62_g16.009 009 \n", + "\n", + " variable start_time end_time time_range \\\n", + "0 ECOSYS_XKW 0249-01 0316-12 024901-031612 \n", + "1 CaCO3_form_zint 0249-01-01 0316-12-31 02490101-03161231 \n", + "2 O2 0249-01 0316-12 024901-031612 \n", + "\n", + " long_name units vertical_levels \\\n", + "0 XKW for ecosys fluxes cm/s 1 \n", + "1 CaCO3 Formation Vertical Integral mmol/m^3 cm/s 1 \n", + "2 Dissolved Oxygen mmol/m^3 1 \n", + "\n", + " frequency path \n", + "0 month_1 ../../../tests/sample_data/cesm/g.e11_LENS.GEC... \n", + "1 day_1 ../../../tests/sample_data/cesm/g.e11_LENS.GEC... \n", + "2 month_1 ../../../tests/sample_data/cesm/g.e11_LENS.GEC... " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b.df" + ] + }, + { + "cell_type": "markdown", + "id": "fitted-audio", + "metadata": {}, + "source": [ + "The resultant dataframe includes the:\n", + "* Component\n", + "* Stream\n", + "* Case\n", + "* Date\n", + "* Frequency\n", + "* Variables\n", + "* Path" + ] + }, + { + "cell_type": "markdown", + "id": "single-storm", + "metadata": {}, + "source": [ + "## Save the Catalog" + ] + }, + { + "cell_type": "markdown", + "id": "b182c26e-4455-476b-ba09-eae29f78599f", + "metadata": {}, + "source": [ + "We can also check to see which files ***were not*** parsed by calling `.invalid_assets`" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "d0560280-113c-464d-928b-071780cf2b92", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: []\n", + "Index: []" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b.invalid_assets" + ] + }, + { + "cell_type": "markdown", + "id": "5faf9882-566d-4106-ae3a-a1ae9e5b1be4", + "metadata": {}, + "source": [ + "This is empty, as expected!" + ] + }, + { + "cell_type": "markdown", + "id": "b5bd5184-3892-4032-971d-dd3245d33490", + "metadata": {}, + "source": [ + "## Save the Catalog\n", + "Now that we have our data catalog, we can save it, by specifying the path to the comma separated values file (`csv`) or compressed csv (`csv.gz`)." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5a984e10-1704-467f-869b-43ba0b67b2c0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved catalog location: cesm_sample_data.json and cesm_sample_data.csv\n" + ] + } + ], + "source": [ + "b.save(\n", + " 'cesm_sample_data.csv',\n", + " # Column name including filepath\n", + " path_column_name='path',\n", + " # Column name including variables\n", + " variable_column_name='variable',\n", + " # Data file format - could be netcdf or zarr (in this case, netcdf)\n", + " data_format=\"netcdf\",\n", + " # Which attributes to groupby when reading in variables using intake-esm\n", + " groupby_attrs=[\"component\", \"stream\", \"case\"],\n", + " # Aggregations which are fed into xarray when reading in data using intake\n", + " aggregations=[\n", + " {'type': 'union', 'attribute_name': 'variable'},\n", + " {\n", + " \"type\": \"join_existing\",\n", + " \"attribute_name\": \"time_range\",\n", + " \"options\": {\"dim\": \"time\", \"coords\": \"minimal\", \"compat\": \"override\"},\n", + " },\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "naval-external", + "metadata": {}, + "source": [ + "### Use the catalog to read in data" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + 
"id": "equal-ordering", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

catalog with 2 dataset(s) from 3 asset(s):

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
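| | unique |
| --- | --- |
| component | 1 |
| stream | 2 |
| case | 1 |
| member_id | 1 |
| variable | 3 |
| start_time | 2 |
| end_time | 2 |
| time_range | 2 |
| long_name | 3 |
| units | 3 |
| vertical_levels | 1 |
| frequency | 2 |
| path | 3 |
| derived_variable | 0 |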
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "data_catalog = intake.open_esm_datastore(\n", + " \"cesm_sample_data.json\",\n", + ")\n", + "data_catalog" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "insured-conviction", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--> The keys in the returned dictionary of datasets are constructed as follows:\n", + "\t'component.stream.case'\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " \n", + " 100.00% [2/2 00:00<00:00]\n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dsets = data_catalog.to_dataset_dict()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "meaningful-journalism", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009': \n", + " Dimensions: (nlat: 5, nlon: 5, time: 12)\n", + " Coordinates:\n", + " ULONG (nlat, nlon) float64 dask.array\n", + " TLONG (nlat, nlon) float64 dask.array\n", + " ULAT (nlat, nlon) float64 dask.array\n", + " TLAT (nlat, nlon) float64 dask.array\n", + " * time (time) object 0249-01-02 00:00:00 ... 0249-01-13 00:00:00\n", + " Dimensions without coordinates: nlat, nlon\n", + " Data variables:\n", + " CaCO3_form_zint (time, nlat, nlon) float32 dask.array\n", + " Attributes: (12/15)\n", + " intake_esm_vars: ['CaCO3_form_zint']\n", + " intake_esm_attrs/component: ocn\n", + " intake_esm_attrs/stream: pop.h.ecosys.nday1\n", + " intake_esm_attrs/case: g.e11_LENS.GECOIAF.T62_g16.009\n", + " intake_esm_attrs/member_id: 9\n", + " intake_esm_attrs/variable: CaCO3_form_zint\n", + " ... ...\n", + " intake_esm_attrs/long_name: CaCO3 Formation Vertical Integral\n", + " intake_esm_attrs/units: mmol/m^3 cm/s\n", + " intake_esm_attrs/vertical_levels: 1\n", + " intake_esm_attrs/frequency: day_1\n", + " intake_esm_attrs/path: ../../../tests/sample_data/cesm/g.e11_...\n", + " intake_esm_dataset_key: ocn.pop.h.ecosys.nday1.g.e11_LENS.GECO...,\n", + " 'ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009': \n", + " Dimensions: (nlat: 5, nlon: 5, time: 12, sigma: 5)\n", + " Coordinates:\n", + " TLAT (nlat, nlon) float64 dask.array\n", + " ULONG (nlat, nlon) float64 dask.array\n", + " ULAT (nlat, nlon) float64 dask.array\n", + " TLONG (nlat, nlon) float64 dask.array\n", + " * time (time) object 0249-02-01 00:00:00 ... 
0250-01-01 00:00:00\n", + " * sigma (sigma) float64 23.4 23.45 23.5 23.55 23.6\n", + " Dimensions without coordinates: nlat, nlon\n", + " Data variables:\n", + " ECOSYS_XKW (time, nlat, nlon) float32 dask.array\n", + " O2 (time, sigma, nlat, nlon) float32 dask.array\n", + " Attributes:\n", + " intake_esm_attrs/component: ocn\n", + " intake_esm_attrs/stream: pop.h\n", + " intake_esm_attrs/case: g.e11_LENS.GECOIAF.T62_g16.009\n", + " intake_esm_attrs/member_id: 9\n", + " intake_esm_attrs/start_time: 0249-01\n", + " intake_esm_attrs/end_time: 0316-12\n", + " intake_esm_attrs/time_range: 024901-031612\n", + " intake_esm_attrs/vertical_levels: 1\n", + " intake_esm_attrs/frequency: month_1\n", + " intake_esm_dataset_key: ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009}" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dsets" + ] + }, + { + "cell_type": "markdown", + "id": "center-orbit", + "metadata": {}, + "source": [ + "Let's plot a quick figure from the dataset!" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "comic-hampshire", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "dsets['ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009'].CaCO3_form_zint.isel(time=0).plot();" + ] + }, + { + "cell_type": "markdown", + "id": "facial-picking", + "metadata": {}, + "source": [ + "## Conclusion\n", + "Having the ability to easily create intake-esm catalogs from history files can be a powerful tool in your analysis toolkit. These data can be read in relatively quickly, easing the ability to quickly take a look at model output or even share your data with others! For more updates on [ecgtools](https://github.com/NCAR/ecgtools), be sure to follow [the ecgtools repository](https://github.com/NCAR/ecgtools) on Github! Have an idea for another helpful parser? Submit an issue!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "documentary-cable", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "author": "Max Grover", + "date": "2021-06-04", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + }, + "tags": "intake,cesm", + "title": "Building an Intake-esm catalog from CESM2 History Files" + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/user-guide/graphviz_example.ipynb b/docs/source/user-guide/graphviz_example.ipynb new file mode 100644 index 00000000..49cac058 --- /dev/null +++ b/docs/source/user-guide/graphviz_example.ipynb @@ -0,0 +1,1870 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6834b397-d190-4214-a0d4-b91b202517eb", + "metadata": {}, + "source": [ + "# Creating Visualizations of Intake-ESM Catalogs" + ] + }, + { + "cell_type": 
"markdown", + "id": "88f5fad0-c002-4e58-858a-378af84fa961", + "metadata": {}, + "source": [ + "A common initial task when working with a new dataset is figuring out what data is available. This is especially true when working with climate ensembles with several components and time-frequency output (ex. Community Earth System Model Large Ensemble, CESM-LE). Here, we will examine different methods of investigating this catalog." + ] + }, + { + "cell_type": "markdown", + "id": "d244a6a4-dea7-4872-b2ae-ee9ce82fcdd4", + "metadata": {}, + "source": [ + "## Imports\n", + "\n", + "Here, we will use [intake-esm](https://intake-esm.readthedocs.io/en/latest/) and [graphviz](https://graphviz.readthedocs.io/en/stable/manual.html), which can be installed using the following (including jupyterlab too!)\n", + "\n", + "```\n", + "conda install -c conda-forge jupyterlab intake-esm graphviz\n", + "```\n", + "\n", + "Once you install these packages, open jupyterlab!" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "234fbcdc-68d3-48e1-ac7d-3719b87fd95d", + "metadata": {}, + "outputs": [], + "source": [ + "import intake\n", + "from graphviz import Digraph" + ] + }, + { + "cell_type": "markdown", + "id": "feef9176-0cee-4879-8a64-5eab032de164", + "metadata": {}, + "source": [ + "## Read in intake-esm catalog" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "909babf1-c9dc-4e43-b2cf-cd0f15e455ca", + "metadata": {}, + "outputs": [], + "source": [ + "col = intake.open_esm_datastore(\n", + " 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "da14d8ea-06c0-4992-9a0a-e44c971779aa", + "metadata": {}, + "source": [ + "Typically, the process is to read in the dataframe containing the metadata, but this can be tough to read/understand what data is all there" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": 
"cefeb1cb-79a8-4b31-bf8b-89de56818841", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FLNS | net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... |
| 1 | FLNSC | clearsky net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... |
| 2 | FLUT | upwelling longwave flux at top of model | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... |
| 3 | FSNS | net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... |
| 4 | FSNSC | clearsky net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | WVEL | vertical velocity | ocn | RCP85 | monthly | 60.0 | global_ocean | centimeter/s | 2006-01-16 12:00:00 | 2100-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-W... |
| 438 | NaN | NaN | ocn | CTRL | static | NaN | global_ocean | NaN | NaN | NaN | s3://ncar-cesm-lens/ocn/static/grid.zarr |
| 439 | NaN | NaN | ocn | HIST | static | NaN | global_ocean | NaN | NaN | NaN | s3://ncar-cesm-lens/ocn/static/grid.zarr |
| 440 | NaN | NaN | ocn | RCP85 | static | NaN | global_ocean | NaN | NaN | NaN | s3://ncar-cesm-lens/ocn/static/grid.zarr |
| 441 | NaN | NaN | ocn | 20C | static | NaN | global_ocean | NaN | NaN | NaN | s3://ncar-cesm-lens/ocn/static/grid.zarr |
\n", + "

442 rows × 11 columns

\n", + "
" + ], + "text/plain": [ + " variable long_name component experiment \\\n", + "0 FLNS net longwave flux at surface atm 20C \n", + "1 FLNSC clearsky net longwave flux at surface atm 20C \n", + "2 FLUT upwelling longwave flux at top of model atm 20C \n", + "3 FSNS net solar flux at surface atm 20C \n", + "4 FSNSC clearsky net solar flux at surface atm 20C \n", + ".. ... ... ... ... \n", + "437 WVEL vertical velocity ocn RCP85 \n", + "438 NaN NaN ocn CTRL \n", + "439 NaN NaN ocn HIST \n", + "440 NaN NaN ocn RCP85 \n", + "441 NaN NaN ocn 20C \n", + "\n", + " frequency vertical_levels spatial_domain units \\\n", + "0 daily 1.0 global W/m2 \n", + "1 daily 1.0 global W/m2 \n", + "2 daily 1.0 global W/m2 \n", + "3 daily 1.0 global W/m2 \n", + "4 daily 1.0 global W/m2 \n", + ".. ... ... ... ... \n", + "437 monthly 60.0 global_ocean centimeter/s \n", + "438 static NaN global_ocean NaN \n", + "439 static NaN global_ocean NaN \n", + "440 static NaN global_ocean NaN \n", + "441 static NaN global_ocean NaN \n", + "\n", + " start_time end_time \\\n", + "0 1920-01-01 12:00:00 2005-12-31 12:00:00 \n", + "1 1920-01-01 12:00:00 2005-12-31 12:00:00 \n", + "2 1920-01-01 12:00:00 2005-12-31 12:00:00 \n", + "3 1920-01-01 12:00:00 2005-12-31 12:00:00 \n", + "4 1920-01-01 12:00:00 2005-12-31 12:00:00 \n", + ".. ... ... \n", + "437 2006-01-16 12:00:00 2100-12-16 12:00:00 \n", + "438 NaN NaN \n", + "439 NaN NaN \n", + "440 NaN NaN \n", + "441 NaN NaN \n", + "\n", + " path \n", + "0 s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... \n", + "1 s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... \n", + "2 s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... \n", + "3 s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... \n", + "4 s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... \n", + ".. ... \n", + "437 s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-W... 
\n", + "438 s3://ncar-cesm-lens/ocn/static/grid.zarr \n", + "439 s3://ncar-cesm-lens/ocn/static/grid.zarr \n", + "440 s3://ncar-cesm-lens/ocn/static/grid.zarr \n", + "441 s3://ncar-cesm-lens/ocn/static/grid.zarr \n", + "\n", + "[442 rows x 11 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "col.df" + ] + }, + { + "cell_type": "markdown", + "id": "5288fa3a-d226-4254-98c4-37e6f01e8aa0", + "metadata": {}, + "source": [ + "You can search the catalog via intake-esm, using the following syntax:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "8381f59a-13c0-412f-adb9-2a7a1bf991bd", + "metadata": {}, + "outputs": [], + "source": [ + "cat = col.search(experiment='20C', frequency='monthly')" + ] + }, + { + "cell_type": "markdown", + "id": "0508624e-9815-4d98-a985-9ec93aa567a1", + "metadata": {}, + "source": [ + "Here again, it is tough to see everything that is available. The search also requires knowing in advance which experiments are in the dataset and which frequency you are looking for." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "8257f661-747d-47c9-95fb-4bfc140f2a7e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " variable long_name component experiment \\\n", + "0 FLNS net longwave flux at surface atm 20C \n", + "1 FLNSC clearsky net longwave flux at surface atm 20C \n", + "2 FLUT upwelling longwave flux at top of model atm 20C \n", + "3 FSNS net solar flux at surface atm 20C \n", + "4 FSNSC clearsky net solar flux at surface atm 20C \n", + ".. ... ... ... ... \n", + "61 VNT flux of heat in grid-y direction ocn 20C \n", + "62 VVEL velocity in grid-y direction ocn 20C \n", + "63 WTS salt flux across top face ocn 20C \n", + "64 WTT heat flux across top face ocn 20C \n", + "65 WVEL vertical velocity ocn 20C \n", + "\n", + " frequency vertical_levels spatial_domain units \\\n", + "0 monthly 1.0 global W/m2 \n", + "1 monthly 1.0 global W/m2 \n", + "2 monthly 1.0 global W/m2 \n", + "3 monthly 1.0 global W/m2 \n", + "4 monthly 1.0 global W/m2 \n", + ".. ... ... ... ... \n", + "61 monthly 60.0 global_ocean degC/s \n", + "62 monthly 60.0 global_ocean centimeter/s \n", + "63 monthly 60.0 global_ocean gram/kilogram/s \n", + "64 monthly 60.0 global_ocean degC/s \n", + "65 monthly 60.0 global_ocean centimeter/s \n", + "\n", + " start_time end_time \\\n", + "0 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "1 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "2 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "3 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "4 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + ".. ... ... \n", + "61 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "62 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "63 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "64 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "65 1920-01-16 12:00:00 2005-12-16 12:00:00 \n", + "\n", + " path \n", + "0 s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLN... \n", + "1 s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLN... \n", + "2 s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLU... \n", + "3 s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FSN... 
\n", + "4 s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FSN... \n", + ".. ... \n", + "61 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-VNT... \n", + "62 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-VVE... \n", + "63 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WTS... \n", + "64 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WTT... \n", + "65 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WVE... \n", + "\n", + "[66 rows x 11 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cat.df" + ] + }, + { + "cell_type": "markdown", + "id": "6ebcf22f-3ce9-4d65-82d4-4f1405fcf100", + "metadata": {}, + "source": [ + "## Using Graphviz in a Jupyter Notebook\n", + "\n", + "[Graphviz](https://graphviz.readthedocs.io/en/stable/manual.html) offers an interface to create network graphs.\n", + "\n", + "### Main \"components\" of [Graphviz](https://graphviz.readthedocs.io/en/stable/manual.html)\n", + "* Digraph class\n", + " * This is the main class used to build the visualization. It is typically assigned to a variable named `dot`, but you can use any variable name you like! \n", + "* Node\n", + " * The \"bubbles\", each of which has a numbered label (ex. '1') and a label (ex. 'HIST')\n", + " * These can be connected together - the **numbered label must be unique** (in the examples here, it is an integer passed as a string)\n", + "* Edge\n", + " * Edges connect the different nodes, using the numbered indices (ex. 
`.edge('1', '3')` would connect the first and third nodes)" + ] + }, + { + "cell_type": "markdown", + "id": "de2bee88-ecc5-4d3b-8233-5edd6e6d7879", + "metadata": {}, + "source": [ + "### Example of case visualization" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "ff030012-2a8c-4eef-be01-e5bb5c4a4ac6", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "[Figure: graph with HIST at the top, linked to ocn (children: monthly, daily) and atm (children: monthly, weekly)]" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create Digraph object\n", + "dot = Digraph()\n", + "\n", + "# Create the first node which serves as the main parent\n", + "dot.node('1', label='HIST')\n", + "\n", + "# Add an ocn component node and connect to experiment parent\n", + "dot.node('2', label='ocn')\n", + "dot.edge('1', '2')\n", + "\n", + "# Add a monthly child from the ocn component parent\n", + "dot.node('3', label='monthly')\n", + "dot.edge('2', '3')\n", + "\n", + "# Add a daily child from the ocn component parent\n", + "dot.node('4', label='daily')\n", + "dot.edge('2', '4')\n", + "\n", + "# Add an atm component node and connect to experiment parent\n", + "dot.node('5', label='atm')\n", + "dot.edge('1', '5')\n", + "\n", + "# Add a monthly child from the atm component parent\n", + "dot.node('6', 
label='monthly')\n", + "dot.edge('5', '6')\n", + "\n", + "# Add a weekly child from the atm component parent\n", + "dot.node('7', label='weekly')\n", + "dot.edge('5', '7')\n", + "\n", + "# Visualize the graph\n", + "dot" + ] + }, + { + "cell_type": "markdown", + "id": "990b2f8c-aaf2-43d3-99c0-fd8298d6a61a", + "metadata": {}, + "source": [ + "### Looping through the CESM-LE catalog" + ] + }, + { + "cell_type": "markdown", + "id": "c84eb005-b283-49e5-bb6b-52b33d73f1f1", + "metadata": {}, + "source": [ + "Let's apply this to our data catalog, assigning the dataframe with dataset attributes to `df`" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "0fdc759e-9f3a-488d-8c95-40ccbe6938e4", + "metadata": {}, + "outputs": [], + "source": [ + "df = col.df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "3dd39b0a-bac1-4104-a0fd-d4762dcabd93", + "metadata": {}, + "outputs": [], + "source": [ + "# Create Digraph object - use the left to right orientation instead of vertical\n", + "dot = Digraph(graph_attr={'rankdir': 'LR'})\n", + "\n", + "# Start counting at one for node numbers\n", + "num_node = 1\n", + "\n", + "# Loop through the different experiments\n", + "for experiment in df.experiment.unique():\n", + " exp_i = num_node\n", + " dot.node(str(exp_i), label=experiment)\n", + " num_node += 1\n", + "\n", + " # Loop through the different components in each experiment\n", + " for component in df.loc[df.experiment == experiment].component.unique():\n", + " comp_i = num_node\n", + " dot.node(str(comp_i), label=component)\n", + " dot.edge(str(exp_i), str(comp_i))\n", + " num_node += 1\n", + "\n", + " # Loop through the frequency in each component within each experiment\n", + " for frequency in df.loc[\n", + " (df.experiment == experiment) & (df.component == component)\n", + " ].frequency.unique():\n", + " freq_i = num_node\n", + " dot.node(str(freq_i), label=frequency)\n", + " dot.edge(str(comp_i), str(freq_i))\n", + " num_node += 1\n", + " 
comp_i += 1\n", + " exp_i += 1" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b6de404f-1160-4c8a-8b23-585330c3489b", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "[Figure: left-to-right graph with 80 nodes; each experiment (20C, CTRL, HIST, RCP85) links to its components (atm, ice_nh, ice_sh, lnd, ocn), and each component links to its available frequencies (daily, monthly, static, plus several hourly6 ranges for atm in 20C and RCP85)]" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dot" + ] + }, + { + "cell_type": "markdown", + "id": "34fbf1b7-7d1d-4e0e-8b33-523df9bd10cb", + "metadata": {}, + "source": [ + "## Conclusion\n", + "[Graphviz](https://graphviz.readthedocs.io/en/stable/manual.html) can be a helpful tool when visualizing what data is within your data catalog - I hope this provides a good starting point in terms of using this with [intake-esm](https://intake-esm.readthedocs.io/en/latest/) catalogs!" 
+ ] + } + ], + "metadata": { + "author": "Max Grover", + "date": "2021-06-11", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + }, + "tags": "visualization,intake,cesm", + "title": "Creating Visualizations of Intake-ESM Catalogs" + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/user-guide/index.md b/docs/source/user-guide/index.md index ba8d0e57..0b088a09 100644 --- a/docs/source/user-guide/index.md +++ b/docs/source/user-guide/index.md @@ -8,8 +8,10 @@ maxdepth: 2 caption: Core Guide --- overview.ipynb +ecgtools-history-files-example.ipynb search.ipynb multi-variable-assets.ipynb +intake-esm-derived-variables.ipynb ``` ```{toctree} @@ -19,4 +21,7 @@ caption: Tutorials & Examples --- cmip6-tutorial.ipynb manipulating-dataframe.ipynb +intake_cmip6_debug.md +graphviz_example.ipynb + ``` diff --git a/docs/source/user-guide/intake-esm-derived-variables.ipynb b/docs/source/user-guide/intake-esm-derived-variables.ipynb new file mode 100644 index 00000000..f6c1f9e5 --- /dev/null +++ b/docs/source/user-guide/intake-esm-derived-variables.ipynb @@ -0,0 +1,568 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "29947174", + "metadata": {}, + "source": [ + "# Using [Intake-ESM's](https://intake-esm.readthedocs.io/en/latest/) Derived Variable Functionality\n", + "\n", + "## What is a \"Derived Variable\"\n", + "A \"derived variable\" in this case is a variable that doesn't itself exist in an [intake-esm](https://intake-esm.readthedocs.io/en/latest/) catalog, but can be computed (i.e., \"derived\") from variables that do exist in the catalog. 
\n", + "Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within **the same dataset**. \n", + "\n", + "An example of a derived variable could be temperature in degrees Fahrenheit. Oftentimes, climate models write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit!\n", + "This is a really simple example; derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.\n", + "\n", + "A traditional workflow for \"derived variables\" might consist of the following:\n", + "* Load the data\n", + "* Apply some function to the loaded datasets\n", + "* Plot the output\n", + "\n", + "\n", + "But what if we could couple those first two steps? What if we could have some set of **variable definitions**, consisting of variable requirements, such as `dependent variables`, and a function which derives the quantity? This is what the `derived_variable` functionality offers in `intake-esm`! This enables users to share a \"registry\" of derived variables across catalogs!\n", + "\n", + "Let's get started with an example!" + ] + }, + { + "cell_type": "markdown", + "id": "fa009c07", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "bd9f0efc", + "metadata": {}, + "outputs": [], + "source": [ + "import intake\n", + "import numpy as np\n", + "\n", + "from intake_esm.derived import DerivedVariableRegistry" + ] + }, + { + "cell_type": "markdown", + "id": "0f372c63", + "metadata": { + "tags": [] + }, + "source": [ + "## How to add a Derived Variable\n", + "Let's compute a derived variable - wind speed! This can be derived from the zonal (`U`) and meridional (`V`) components of the wind." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "30117abd", + "metadata": {}, + "outputs": [], + "source": [ + "def calc_wind_speed(u, v):\n", + " return np.sqrt(u ** 2 + v ** 2)" + ] + }, + { + "cell_type": "markdown", + "id": "5cbba3cf", + "metadata": {}, + "source": [ + "### Creating our Derived Variable Registry\n", + "We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable `dvr` for this (**D**erived**V**ariable**R**egistry)." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "7e9ed122", + "metadata": {}, + "outputs": [], + "source": [ + "dvr = DerivedVariableRegistry()" + ] + }, + { + "cell_type": "markdown", + "id": "bb3379f6", + "metadata": {}, + "source": [ + "In order to register this derived variable we need to add a [decorator](https://www.python.org/dev/peps/pep-0318/) for our function, as seen below. This allows us to define our derived variable, dependent variables, and the function associated with the calculation." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "430894dd", + "metadata": {}, + "outputs": [], + "source": [ + "@dvr.register(variable='wind_speed', query={'variable': ['U', 'V']})\n", + "def calc_wind_speed(ds):\n", + " ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)\n", + " return ds" + ] + }, + { + "cell_type": "markdown", + "id": "1a316f5a", + "metadata": {}, + "source": [ + "You'll notice `dvr` now has a registered variable, `wind_speed`, which was defined in the cell above!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "6166f217", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DerivedVariableRegistry({'wind_speed': DerivedVariable(func=, variable='wind_speed', query={'variable': ['U', 'V']})})" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dvr" + ] + }, + { + "cell_type": "markdown", + "id": "6604a1ce-df45-4e93-a212-769648ed4d68", + "metadata": {}, + "source": [ + "### Loading in a Derived Variable Registry from File\n", + "You can load your derived variables from a module as well!\n", + "\n", + "Within the file `my_registry.py`, we have the following, which mirrors the previous few cells:\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "from intake_esm.derived import DerivedVariableRegistry\n", + "\n", + "dvr = DerivedVariableRegistry()\n", + "\n", + "@dvr.register(variable='wind_speed', query={'variable': ['U', 'V']})\n", + "def calc_wind_speed(ds):\n", + " ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)\n", + " return ds\n", + "```\n", + "\n", + "We can now load `my_registry.py` into our notebook! We use the following syntax to load in that registry:\n", + "\n", + "```python\n", + "dvr = DerivedVariableRegistry.load('my_registry')\n", + "dvr\n", + "```\n", + "\n", + "```\n", + "DerivedVariableRegistry({'wind_speed': DerivedVariable(func=, variable='wind_speed', query={'variable': ['U', 'V']})})\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "14b2049b", + "metadata": {}, + "source": [ + "## Read in Data with our Registry\n", + "In this case, we will use data from the CESM Large Ensemble (LENS). This is a climate model ensemble, a subset of which is hosted on the AWS Cloud. 
If you are interested in learning more about this dataset, check out the [LENS on AWS documentation page](https://ncar.github.io/cesm-lens-aws/).\n", + "\n", + "We open the catalog and pass in our derived variable registry using the `registry` argument." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "08eb271e", + "metadata": {}, + "outputs": [], + "source": [ + "data_catalog = intake.open_esm_datastore(\n", + " 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',\n", + " registry=dvr,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "16ed4f25", + "metadata": {}, + "source": [ + "You'll notice we have a new field - `derived_variable` which has 1 unique value." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f511212e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

<p>aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):</p>\n",
+ "<table>\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>unique</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>variable</th><td>78</td></tr>\n",
+ "    <tr><th>long_name</th><td>75</td></tr>\n",
+ "    <tr><th>component</th><td>5</td></tr>\n",
+ "    <tr><th>experiment</th><td>4</td></tr>\n",
+ "    <tr><th>frequency</th><td>6</td></tr>\n",
+ "    <tr><th>vertical_levels</th><td>3</td></tr>\n",
+ "    <tr><th>spatial_domain</th><td>5</td></tr>\n",
+ "    <tr><th>units</th><td>25</td></tr>\n",
+ "    <tr><th>start_time</th><td>12</td></tr>\n",
+ "    <tr><th>end_time</th><td>13</td></tr>\n",
+ "    <tr><th>path</th><td>427</td></tr>\n",
+ "    <tr><th>derived_variable</th><td>1</td></tr>\n",
+ "  </tbody>\n",
+ "</table>
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "data_catalog" + ] + }, + { + "cell_type": "markdown", + "id": "d16014fa", + "metadata": {}, + "source": [ + "Let's also subset for monthly frequency, as well as the 20th century (20C) and RCP 8.5 (RCP85) experiments." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "3c4f6206", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

<p>aws-cesm1-le catalog with 7 dataset(s) from 14 asset(s):</p>\n",
+ "<table>\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>unique</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>variable</th><td>2</td></tr>\n",
+ "    <tr><th>long_name</th><td>2</td></tr>\n",
+ "    <tr><th>component</th><td>1</td></tr>\n",
+ "    <tr><th>experiment</th><td>4</td></tr>\n",
+ "    <tr><th>frequency</th><td>4</td></tr>\n",
+ "    <tr><th>vertical_levels</th><td>1</td></tr>\n",
+ "    <tr><th>spatial_domain</th><td>1</td></tr>\n",
+ "    <tr><th>units</th><td>1</td></tr>\n",
+ "    <tr><th>start_time</th><td>7</td></tr>\n",
+ "    <tr><th>end_time</th><td>7</td></tr>\n",
+ "    <tr><th>path</th><td>14</td></tr>\n",
+ "    <tr><th>derived_variable</th><td>1</td></tr>\n",
+ "  </tbody>\n",
+ "</table>
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "catalog_subset = data_catalog.search(\n", + " variable=['wind_speed'], frequency='monthly', experiment='RCP85'\n", + ")\n", + "\n", + "catalog_subset" + ] + }, + { + "cell_type": "markdown", + "id": "b8669c17", + "metadata": { + "tags": [] + }, + "source": [ + "### Calling `to_dataset_dict` to Load in the Data\n", + "We load in the data, which lazily adds our calculation for `wind_speed` to the datasets!" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "03cb05ad-2a8c-44eb-9a91-91d9bb467ecb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--> The keys in the returned dictionary of datasets are constructed as follows:\n", + "\t'component.experiment.frequency'\n", + "█\r" + ] + }, + { + "data": { + "text/plain": [ + "dict_keys(['atm.RCP85.monthly', 'atm.HIST.monthly', 'atm.20C.monthly', 'atm.RCP85.hourly6-2071-2080', 'atm.CTRL.monthly', 'atm.RCP85.hourly6-2026-2035', 'atm.20C.hourly6-1990-2005'])" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dsets = catalog_subset.to_dataset_dict(\n", + " xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}\n", + ")\n", + "dsets.keys()" + ] + }, + { + "cell_type": "markdown", + "id": "cd5a1adc-382d-452d-9fd7-e049ea5fd40d", + "metadata": {}, + "source": [ + "Let's look at single dataset from this dictionary of datasets... 
using the key `atm.CTRL.monthly`" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "c947191f-a7fe-4e69-9ceb-61e084a02db2", + "metadata": {}, + "outputs": [], + "source": [ + "ds = dsets['atm.CTRL.monthly']" + ] + }, + { + "cell_type": "markdown", + "id": "41aa48af", + "metadata": {}, + "source": [ + "## Apply an Annual Mean Calculation\n", + "Let's apply an annual average to the data - since all the years are 365 days long, we do not need any special weighting" + ] + }, + { + "cell_type": "markdown", + "id": "16a70995-1b55-47b0-9874-4a3761ee7503", + "metadata": {}, + "source": [ + "You'll notice upon reading in the dataset, we have three variables:\n", + "* `U`\n", + "* `V`\n", + "* `wind_speed`" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "51d57505", + "metadata": {}, + "outputs": [], + "source": [ + "annual_mean = ds.groupby('time.year').mean('time')" + ] + }, + { + "cell_type": "markdown", + "id": "4d1197a2", + "metadata": { + "tags": [] + }, + "source": [ + "### Plot the Output\n", + "We can use the plotting functions within [`Xarray`](http://xarray.pydata.org/en/stable/) to view the output!" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "18ad9e8f", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "ds.wind_speed.isel(member_id=0, time=0, lev=-1).plot();" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d8aba00", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "author": "Max Grover", + "date": "2021-10-22", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + }, + "tags": "intake,diagnostics,cesm", + "title": "Using Intake-ESM's New Derived Variable Functionality" + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/user-guide/intake_cmip6_debug.md b/docs/source/user-guide/intake_cmip6_debug.md new file mode 100644 index 00000000..3030a433 --- /dev/null +++ b/docs/source/user-guide/intake_cmip6_debug.md @@ -0,0 +1,131 @@ +--- +author: Max Grover +date: 2021-5-14 +tags: intake, cmip6, debug +--- + +# Debugging Intake-ESM Process for Reading in CMIP6 + +This post was motivated by a post from Steve Yeager [@sgyeager](https://github.com/sgyeager), who ran into an error when attempting to read in CMIP6 data via intake-esm. + +For those who are unfamiliar with intake-esm, be sure to read over [the documentation](https://intake-esm.readthedocs.io/en/latest/index.html)! The user guide even includes [an entire portion of their site](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) on looking at CMIP6 data. These resources would be a great place to start. 
+ +The specific workflow for using this package to read in Sea Water Salinity (`so`) and +Sea Water Potential Temperature (`thetao`) is given below. + +## Reproducing the Error + +```python +import intake +``` + +Set a path to the CMIP catalog. In this case, since we are using the glade file system, we point to the catalog file stored there and open it using the intake-esm extension. + +```python +# Set the path for the catalog file +catalog_file = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json' + +# Open the catalog using intake +col = intake.open_esm_datastore(catalog_file) +``` + +Now, we set which experiment to use. In this case, we are interested in the Ocean Model Intercomparison Project (OMIP) since we are looking at ocean data, so we set + +```python +experiment_id = ['omip1'] +``` + +In terms of the variables of interest, we want Sea Water Salinity (`so`) and Sea Water Potential Temperature (`thetao`), so we set + +```python +variable_id = ['so', 'thetao'] +``` + +The last variable to change is the `table_id`, which corresponds to the temporal frequency. Since we are interested in monthly data, we set + +```python +table_id = 'Omon' +``` + +Putting all that together, we search the catalog and call `to_dataset_dict`, which assembles a dictionary of datasets + +```python +cat = col.search( + experiment_id=['omip1'], + variable_id=['thetao', 'so'], + table_id='Omon' +) +dset_dict = cat.to_dataset_dict() +``` + +But this returns the following error: + +``` +AggregationError: + Failed to merge multiple datasets in group with key=OMIP.CNRM-CERFACS.CNRM-CM6-1.omip1.Omon.gn into a single xarray Dataset as variables. 
+ + *** Arguments passed to xarray.merge() ***: + + - objs: a list of 2 datasets + - kwargs: {} + + ******************************************** +``` + +## Determining the Cause of the Error + +At this point, let's go back and inspect the subset catalog by looking at its `df` attribute + +```python +cat.df +``` + +Notice that the `source_id` named in the error is `CNRM-CM6-1`, so let's subset for that and investigate the problem... + +```python +cat.df[cat.df.source_id == 'CNRM-CM6-1'] +``` + +Another option for looking at the subset for the problematic key (this returns a pandas DataFrame) is: + +```python +cat['OMIP.CNRM-CERFACS.CNRM-CM6-1.omip1.Omon.gn'] +``` + +Here, we focus on the `time_range` column, noticing that the last time step for `so` is `194912` while the last time step for datasets with `thetao` is `199912`, with the dates formatted `YYYYMM`. + +Some of the files are missing here, which is the fundamental issue. + +When `intake` attempts to merge these two datasets, it fails and returns an error, since some of the data is missing on the system. Reading the variables separately is a workaround, although it should be noted that, at the end of the day, the missing files are the core problem. 
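The `time_range` comparison described above can also be done programmatically. The following is a minimal sketch, assuming a small toy table in place of the real `cat.df` subset: the rows are made up for illustration, and only the two final time steps (`194912` and `199912`) mirror the values quoted in the text. It also assumes that `time_range` is stored as a `YYYYMM-YYYYMM` string.

```python
import pandas as pd

# Toy stand-in for the `cat.df` subset. The rows are illustrative, not real
# catalog entries; only the final time steps (194912 for `so`, 199912 for
# `thetao`) mirror the values quoted in the text.
df = pd.DataFrame(
    {
        "variable_id": ["so", "so", "thetao", "thetao", "thetao"],
        "time_range": [
            "185001-189912",
            "190001-194912",
            "185001-189912",
            "190001-194912",
            "195001-199912",
        ],
    }
)

# Assumes `time_range` is formatted as "YYYYMM-YYYYMM"; keep the end date only
df["end"] = df["time_range"].str.split("-").str[-1]

# Fixed-width YYYYMM strings sort chronologically, so max() gives the last step
last_step = df.groupby("variable_id")["end"].max()
print(last_step)

# More than one distinct end date means the variables cover different periods
has_mismatch = last_step.nunique() > 1
print(has_mismatch)
```

With a real catalog, the same two steps could be applied to `cat.df[cat.df.source_id == 'CNRM-CM6-1']` in place of the toy table, provided the `time_range` column follows the assumed format.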
+ +## Applying our "Workaround" + +If you would still like to work with the data, you can work around this problem by reading in the datasets using separate queries, as shown below + +```python +# Search and read in dataset for so +cat_so = col.search( + experiment_id=['omip1'], + variable_id=['so'], + table_id='Omon' +) +dset_dict_so = cat_so.to_dataset_dict() + +# Search and read in dataset for thetao +cat_thetao = col.search( + experiment_id=['omip1'], + variable_id=['thetao'], + table_id='Omon' +) +dset_dict_thetao = cat_thetao.to_dataset_dict() +``` + +Another option is to turn off aggregation within `to_dataset_dict()`, using the following syntax + +```python +dsets = cat.to_dataset_dict(aggregate=False) +``` + +This will return a dataset for **every** file in the archive, and the keys in `dsets` will be constructed using all the fields in the catalog. + +Since `aggregate=False` will yield a large number of individual datasets, the output can be overwhelming, which makes it difficult to pinpoint the problem.
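One way to keep the unaggregated output manageable is to filter the dictionary keys down to the model named in the error. The sketch below is only an illustration under stated assumptions: the dictionary is a hypothetical stand-in for the result of `to_dataset_dict(aggregate=False)`, the keys are invented, and the dot-separated key layout is an assumption that may not match what intake-esm actually produces.

```python
# Hypothetical stand-in for the dictionary returned with aggregate=False.
# Real values would be xarray Datasets, and the real key layout may differ:
# here we assume the keys are dot-separated catalog fields.
dsets = {
    "OMIP.CNRM-CERFACS.CNRM-CM6-1.omip1.Omon.so.gn": "placeholder dataset",
    "OMIP.CNRM-CERFACS.CNRM-CM6-1.omip1.Omon.thetao.gn": "placeholder dataset",
    "OMIP.CNRM-CERFACS.CNRM-CM6-1-HR.omip1.Omon.so.gn": "placeholder dataset",
}

model = "CNRM-CM6-1"

# Match whole fields rather than substrings so that "CNRM-CM6-1-HR" is excluded
subset = {key: ds for key, ds in dsets.items() if model in key.split(".")}

for key in sorted(subset):
    print(key)
```

Comparing whole fields instead of using a plain substring test avoids accidentally picking up models whose names merely start with the same text.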