
Create Zarr function to create json references and LRA feature #1068

Merged: 35 commits from zarro into main on Jul 25, 2024

Conversation

@oloapinivad (Collaborator) commented Apr 9, 2024

PR description:

This develops support for Zarr access within AQUA, using kerchunk to create dedicated JSON reference files, following the discussion in #1034. First tests showed that this is quite easy, but different JSON files must be created when the data have different chunking. For LRA files this can be achieved quite easily with two sub-selections.

More will follow soon.

Issues closed by this pull request:

Close #1034


  • Tests are included if a new feature is added.
  • Documentation is included if a new feature is added.
  • Docstrings are updated if needed.
  • Changelog is updated.
  • environment.yml and pyproject.toml are updated if needed, together with the lumi installation tool.

@oloapinivad added the "framework" label (functions common to all diagnostics) on Apr 9, 2024
@oloapinivad (Collaborator, author) commented Apr 10, 2024

The current PR includes a new create_zarr() function that, given a list of NetCDF files, returns a JSON reference file. All the files must share the same time dimension except the last one, due to how the Zarr references are constructed (see fsspec/kerchunk#447).
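
For context, here is a minimal sketch of this kind of reference creation with kerchunk (file names are illustrative, and this is a sketch rather than the actual create_zarr() internals):

    import json
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    # build a reference dict for each NetCDF (HDF5-based) file
    netcdf_files = ["lra_202001.nc", "lra_202002.nc"]  # illustrative names
    refs = []
    for path in netcdf_files:
        with fsspec.open(path, "rb") as f:
            refs.append(SingleHdf5ToZarr(f, path).translate())

    # a single-file reference set can be dumped straight to JSON
    with open("lra-single.json", "w") as f:
        json.dump(refs[0], f)

Multiple reference sets then need to be combined along time before writing, which is where the constraint on the time dimension comes from.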

This is not something we can guarantee with the LRA, since by construction we have both monthly and yearly files when the LRA is created. This is a serious limitation, which I tried to circumvent by creating two different JSON files and combining them afterwards with intake. An example of the resulting LRA entry is shown below:

  lra-r100-monthly-zarr:
    driver: zarr
    description: LRA data monthly at r100 on zarr
    args:
      consolidated: false
      combine: nested
      urlpath:
      - reference::/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json
      - reference::/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-partial.json
    metadata:
      source_grid_name: lon-lat
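
For reference, such a JSON can also be opened directly with fsspec and xarray, which is handy for debugging outside intake (a sketch using one of the paths above):

    import fsspec
    import xarray as xr

    # expose the JSON references as a read-only zarr store
    fs = fsspec.filesystem(
        "reference",
        fo="/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json",
    )
    ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)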

However, with such an approach, Zarr is massively SLOWER than NetCDF.

If I run a test script like this:

import timeit
from aqua import Reader

# compare the NetCDF-based source with the JSON-reference ("zarr") source
reader1 = Reader('IFS-FESOM', 'story-2017-historical', 'lra-r100-monthly')
reader2 = Reader('IFS-FESOM', 'story-2017-historical', 'lra-r100-monthly-zarr')

start_time = timeit.default_timer()
data1 = reader1.retrieve()
end_time = timeit.default_timer()
print(f"Execution time for reader1.retrieve(): {end_time - start_time} seconds")

start_time = timeit.default_timer()
data2 = reader2.retrieve()
end_time = timeit.default_timer()
print(f"Execution time for reader2.retrieve(): {end_time - start_time} seconds")

I get:

Execution time for reader1.retrieve(): 1.5544872460886836 seconds
Execution time for reader2.retrieve(): 11.423419699072838 seconds

However, if I remove the partial JSON, Zarr goes back to being massively faster:

Execution time for reader1.retrieve(): 2.5065057780593634 seconds
Execution time for reader2.retrieve(): 0.0351319620385766 seconds

I see only two possible alternatives: restrict Zarr to work only with the full JSON, to keep the efficient speed-up it provides, or find a better concatenation strategy with xarray through the intake options.

Any thoughts/suggestions?
Tagging a few experienced people: @koldunovn @lkluft @mnurisso @jhardenberg

@lkluft (Collaborator) commented Apr 10, 2024

Hey Paolo,

here are a couple of thoughts on this ☺️

  1. I think one should be clear about what is happening in this function, which is to create kerchunk references. These are often stored in the form of a Zarr store in a reference file system, but this is not Zarr.
  2. At the moment, the JSON files are quite small and it might be fine not to consolidate them. However, for future-proof performance, one might consider consolidating the references. Here is an example of how we do this in gribscan:
    from gribscan.gribscan import consolidate_metadata
    import json
    from os.path import basename


    filelist = (
        "/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json",
        "/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-partial.json",
    )

    for fname in filelist:
        print(basename(fname))
        with open(fname, "r") as fp_in, open(basename(fname), "w") as fp_out:
            refs = json.load(fp_in)["refs"]
            # attach a consolidated ".zmetadata" key so zarr can read all
            # metadata in one go instead of fetching every .zarray/.zattrs
            refs[".zmetadata"] = consolidate_metadata(refs)
            json.dump(refs, fp_out)
  3. It looks like there is a spurious time dependence in the plev coordinate. I am not sure how this appeared, but it is not present in the original NetCDF files. This might be the reason why the internal merging of the datasets is much more complicated than necessary (i.e. a nested merge instead of a concat). So one might investigate where the plev(time, plev) in the JSONs originates; a quick check is sketched right below this list.
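
A quick way to check this is to inspect the _ARRAY_DIMENSIONS attribute stored for plev inside the reference JSON (file name illustrative):

    import json

    with open("lra-r100-monthly-full.json") as f:
        refs = json.load(f)["refs"]

    # kerchunk stores the zarr metadata as JSON strings;
    # a healthy plev should list only ["plev"], not ["time", "plev"]
    print(json.loads(refs["plev/.zattrs"])["_ARRAY_DIMENSIONS"])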

Please let me know if anything is unclear or if I can be of any help!

Cheers
Lukas

@lkluft (Collaborator) commented Apr 10, 2024

In terms of performance, point 2 is likely not much of an issue, because the full JSON needs to be loaded and parsed in both cases. Point 3, however, likely causes the performance degradation.

In general, I am wondering what the final goal of this functionality is. Do you want to avoid the overhead of open_mfdataset with too many NetCDF files? In that case, I would recommend aggregating the actual NetCDF files (as you are running the jobs anyhow) instead of virtually concatenating them.
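
For example, physically concatenating the monthly files once could look like this (paths and patterns illustrative):

    import xarray as xr

    # merge the monthly files into a single physical NetCDF file,
    # paying the open_mfdataset cost once instead of at every read
    ds = xr.open_mfdataset("lra-r100-monthly-2020??.nc", combine="by_coords")
    ds.to_netcdf("lra-r100-2020.nc")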

@oloapinivad (Collaborator, author)

Hi @lkluft,

thanks a lot for your reply!

I was probably not very clear above: the idea is to collect the metadata using kerchunk to create JSON references and speed up the loading of the data. We are not aiming at writing Zarr files for now, although this might be an intriguing solution for the future.

I will investigate how to consolidate the metadata, thanks a lot for the example! However (and this also answers your most important question), the idea is to create JSON references to speed up the reading of the NetCDF files which are produced every night by the LRA generator runs, so the notion of "consolidated" is not a great fit here.

A while ago we tried to follow the idea of creating single files for each variable, but it is a bit trickier to implement, since many safety checks are performed on the data (which is processed in monthly chunks by the LRA generator tool), and it would likely need a larger rewrite.
Since the number of NetCDF files keeps growing and the runs are still ongoing, an automatic procedure to create JSON references would be a very good way to simplify the loading of the data; it works pretty well when I consider only full 12-month files, as shown by the speed test above.

I will dig into the JSON references to see where the problem with plev comes from; perhaps this is the culprit of the slowdown, as you suggest.

@oloapinivad (Collaborator, author)

The slowdown was due to plev not being included in identical_dims. We will probably need to make this more flexible, so that the shared dimensions are read from the original NetCDF files. I will investigate in the next days the best way to concatenate the two files...
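
For the record, a sketch of the fix with kerchunk's combiner (variable names illustrative):

    from kerchunk.combine import MultiZarrToZarr

    # single_refs: list of per-file reference dicts (or JSON paths)
    mzz = MultiZarrToZarr(
        single_refs,
        concat_dims=["time"],
        # without "plev" listed here the combiner assumes it may vary
        # per file, producing the spurious plev(time, plev) dependence
        identical_dims=["lon", "lat", "plev"],
    )
    combined = mzz.translate()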

@oloapinivad added the "run tests" label (set this up to let tests run) on Apr 11, 2024
github-actions bot commented Apr 11, 2024

Coverage report: 76%

File statements missing excluded branches partial coverage
aqua/__init__.py 8 0 0 0 0 100%
aqua/accessor.py 28 5 0 5 0 85%
aqua/cli/diagnostic_config.py 1 0 0 0 0 100%
aqua/cli/main.py 396 40 0 156 30 87%
aqua/cli/parser.py 39 0 0 0 0 100%
aqua/data_models/__init__.py 3 0 0 0 0 100%
aqua/data_models/cfcoords.py 94 12 0 34 7 82%
aqua/data_models/cfunits.py 33 0 0 12 0 100%
aqua/data_models/datamodels.py 2 0 0 0 0 100%
aqua/exceptions.py 16 8 0 0 0 50%
aqua/graphics/__init__.py 5 0 0 0 0 100%
aqua/graphics/hovmoller.py 95 88 0 38 0 5%
aqua/graphics/multiple_maps.py 104 17 0 38 14 77%
aqua/graphics/single_map.py 143 26 0 56 21 74%
aqua/graphics/timeseries.py 118 39 0 46 16 62%
aqua/gsv/__init__.py 6 2 0 0 0 67%
aqua/gsv/intake_gsv.py 337 62 0 128 26 76%
aqua/gsv/timeutil.py 83 34 0 36 14 53%
aqua/logger.py 74 4 0 26 5 91%
aqua/lra_generator/__init__.py 2 0 0 0 0 100%
aqua/lra_generator/lra_generator.py 314 50 0 118 32 79%
aqua/lra_generator/lra_util.py 48 2 0 14 2 94%
aqua/reader/__init__.py 4 0 0 0 0 100%
aqua/reader/catalog.py 78 14 0 48 9 80%
aqua/reader/fixer.py 463 100 0 222 34 75%
aqua/reader/reader.py 487 61 0 271 43 83%
aqua/reader/reader_utils.py 77 29 0 61 6 60%
aqua/reader/regrid.py 198 19 0 60 12 87%
aqua/reader/streaming.py 48 3 0 22 5 89%
aqua/reader/timmean.py 70 13 0 29 6 77%
aqua/slurm/__init__.py 2 0 0 0 0 100%
aqua/slurm/slurm.py 105 88 0 36 0 12%
aqua/util/__init__.py 17 0 0 0 0 100%
aqua/util/config.py 166 33 0 79 25 74%
aqua/util/coord.py 37 6 0 22 4 83%
aqua/util/eccodes.py 70 5 0 32 3 92%
aqua/util/graphics.py 132 21 0 28 6 82%
aqua/util/output_saver.py 96 11 0 30 9 84%
aqua/util/sci_util.py 91 19 0 52 12 78%
aqua/util/time.py 68 13 0 18 5 77%
aqua/util/util.py 154 39 0 55 4 73%
aqua/util/yaml.py 119 20 0 58 12 80%
aqua/util/zarr.py 28 5 0 4 1 81%
Total 4459 888 0 1834 363 76%

Missing Lines

Filename Missing
__init__.py
accessor.py 31, 35, 51, 55, 59
exceptions.py 8-9, 16-17, 24-25, 32-33
logger.py 23, 96, 103-104
cli/diagnostic_config.py
cli/main.py 75, 82, 126, 212-213, 222-223, 228-229, 238, 244-245, 248-249, 260-261, 266-267, 283, 288, 349, 403-407, 417-419, 443-445, 504-505, 517, 561, 606, 620, 624, 656, 662
cli/parser.py
data_models/__init__.py
data_models/cfcoords.py 48, 50, 53, 55, 73, 79, 191-197
data_models/cfunits.py
data_models/datamodels.py
graphics/__init__.py
graphics/hovmoller.py 68-203
graphics/multiple_maps.py 78, 82, 114-115, 131-133, 150, 161, 182, 208-209, 236-238, 244-245
graphics/single_map.py 94-96, 109-110, 129, 145, 169-170, 174-175, 199, 208-210, 263, 287-289, 293, 297-298, 320, 329-331
graphics/timeseries.py 53, 62, 64-65, 76, 78-79, 82-95, 98-111, 171, 175-176, 183-190, 221, 228-230
gsv/__init__.py 8-9
gsv/intake_gsv.py 20-25, 75, 110, 168, 189-215, 234, 359-361, 400-429, 443-447, 453, 465, 471-473, 514, 567, 574, 580, 593
gsv/timeutil.py 32, 50-59, 84, 87, 90, 93, 105-106, 152, 154, 156, 159, 168, 171, 180-186, 194-195, 221, 230-234
lra_generator/__init__.py
lra_generator/lra_generator.py 96-97, 111, 116, 121, 126, 131, 142, 169, 204, 226-227, 263-265, 313-315, 326, 340-342, 372-379, 429, 467-468, 475, 574, 583-587, 593, 602-603, 621, 637, 641-642, 655-659
lra_generator/lra_util.py 44, 141
reader/__init__.py
reader/catalog.py 99-101, 104-105, 120-121, 127, 137-138, 142-143, 147-148
reader/fixer.py 57-60, 84-86, 98-105, 132, 194-196, 206, 226-240, 298-300, 339-340, 363, 390, 425, 488-493, 503, 523-533, 548-556, 585-587, 613-616, 678, 694, 710, 726, 742-743, 756, 770-781, 811, 813-815, 853-857, 860-863, 872, 909-910, 914-917, 924-929, 968
reader/reader.py 191, 225, 279, 319, 376, 386, 442, 459-460, 492, 517, 569, 600, 602, 609, 622, 627-628, 667, 674, 789-815, 847, 878, 882, 890, 902, 949, 1006, 1010, 1015-1018, 1022, 1057-1062, 1069, 1115-1116, 1123, 1138-1141
reader/reader_utils.py 22-44, 59-66, 103, 119, 125, 166-168
reader/regrid.py 94-95, 161, 170, 172, 219, 224-228, 235, 242, 284, 318-321, 360, 389-390
reader/streaming.py 71, 78, 108
reader/timmean.py 32, 40-41, 61-65, 90, 102, 123-126
slurm/__init__.py
slurm/slurm.py 40-47, 65-73, 87-97, 118-134, 166-197, 240-305, 320-328
util/__init__.py
util/config.py 108, 117, 132, 146, 149, 163-165, 168, 171, 192, 197, 203, 220, 244, 255-267, 278, 286, 290, 306, 309, 332-339
util/coord.py 36, 59, 62, 89-91
util/eccodes.py 128-130, 172-176
util/graphics.py 31, 116, 188-191, 210-212, 216-218, 238, 284-287, 307-310
util/output_saver.py 100-102, 158-160, 208-209, 213, 265-266, 270
util/sci_util.py 65, 76-77, 108, 114-116, 124, 126, 134-136, 159, 162, 164, 177-178, 180-181
util/time.py 42, 65, 67, 69, 131, 174-182
util/util.py 49-52, 105-106, 151-153, 161-162, 208, 212-213, 228, 276-308, 334-337
util/yaml.py 35-41, 74-77, 100, 110-111, 121-122, 145, 147, 244, 247, 259
util/zarr.py 42-45, 50
TOTAL

@oloapinivad changed the title from "[WIP] Create Zarr function and LRA feature" to "Create Zarr function and LRA feature" on Apr 11, 2024
@oloapinivad changed the title from "Create Zarr function and LRA feature" to "Create Zarr function to create json references and LRA feature" on Apr 11, 2024
@oloapinivad (Collaborator, author)

> We will probably need to make this more flexible, so that the shared dimensions are read from the original NetCDF files.

This has been addressed by reading the NetCDF files beforehand with xarray. This implies a slower generation, but should be safer. I am doing some offline tests, and a test is already included in the PR.
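
A hypothetical sketch of the discovery step (not the exact implementation):

    import xarray as xr

    # peek at the first NetCDF file to learn which dimensions are shared
    # across files and should be passed to kerchunk as identical_dims
    ds = xr.open_dataset(netcdf_files[0])  # netcdf_files is illustrative
    identical_dims = [str(d) for d in ds.dims if d != "time"]
    ds.close()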

@oloapinivad (Collaborator, author) commented Apr 11, 2024

I made some tests and was able to create a Zarr reference entry for the NextGEMS Cycle 4 data.

The speed-up in retrieving the data is about 50x, which I consider satisfactory.

Execution time for reader2.retrieve(): 0.4457592135295272 seconds
Execution time for reader1.retrieve(): 17.28819192480296 seconds

Still, the way I am passing the xarray args to intake is somewhat based on trial and error, and I am not sure I am doing things in the best way. @lkluft do you have any tips on this? Thanks!

  lra-r100-monthly-zarr:
    driver: zarr
    description: LRA data monthly at r100 on zarr
    args:
      consolidated: false
      combine: nested
      compat: override
      urlpath:
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-full.json
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-partial.json
    metadata:
      source_grid_name: lon-lat
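
If it helps others, my understanding (a sketch, not the intake internals) is that with a list of urlpaths these args end up more or less in an xarray multi-file open, along these lines:

    import fsspec
    import xarray as xr

    urls = [
        "/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-full.json",
        "/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-partial.json",
    ]
    stores = [fsspec.get_mapper("reference://", fo=u) for u in urls]

    # concat_dim is assumed here; combine="nested" requires one
    ds = xr.open_mfdataset(
        stores, engine="zarr", consolidated=False,
        combine="nested", concat_dim="time", compat="override",
    )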

BTW, I suspect the current implementation is not working as it should and might create empty records even if the data are available...

@mnurisso (Collaborator)

This is very nice! If you understand this block better:

consolidated: false
combine: nested
compat: override

can you document it somewhere (advanced topic?) so we can use it also in sources other than the LRA?

@oloapinivad (Collaborator, author) commented Apr 11, 2024

> This is very nice! If you understand this block better:
>
>     consolidated: false
>     combine: nested
>     compat: override
>
> can you document it somewhere (advanced topic?) so we can use it also in sources other than the LRA?

I think this is documented here: https://docs.xarray.dev/en/stable/generated/xarray.merge.html. The problem is to find the right configuration for us, and this might not be trivial.
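
For what it is worth, my reading of the three options: consolidated: false tells the zarr engine not to expect a consolidated .zmetadata key; combine: nested makes xarray combine the references with combine_nested along an explicit dimension order instead of inferring it from coordinates; compat: override skips the compatibility checks and takes the variables from the first object. Finding the combination that is safe for every source is exactly the non-trivial part.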

@oloapinivad (Collaborator, author)

This still does not work. Despite my attempts, there is no way to get a good Zarr reference from a source as complicated as the NextGEMS Cycle 4 data. The only option I still see is to allow Zarr generation only when complete files are available.

@oloapinivad (Collaborator, author)

The final solution I am investigating is to create a Zarr reference only for complete sources that have a full set of monthly and yearly files. This has to be done by adding an error trap for the case where this condition is not met.
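
Something along these lines (purely illustrative names, not the final implementation):

    # hypothetical guard: refuse zarr reference creation for incomplete sources
    for year, files in files_by_year.items():
        found = {var_from_filename(f) for f in files}
        if found != expected_vars:
            raise ValueError(
                f"Incomplete LRA source for {year}: missing {expected_vars - found}; "
                "zarr reference creation is disabled"
            )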

@lkluft (Collaborator) commented Jul 22, 2024

> This still does not work. Despite my attempts, there is no way to get a good Zarr reference from a source as complicated as the NextGEMS Cycle 4 data. The only option I still see is to allow Zarr generation only when complete files are available.

Do you refer to the IFS output? AFAIK, all ICON output for Cycle 4 is already in Zarr and should therefore be straightforward to include.

> The final solution I am investigating is to create a Zarr reference only for complete sources that have a full set of monthly and yearly files. This has to be done by adding an error trap for the case where this condition is not met.

What is the problem with incomplete datasets? I see that it is annoying to create them too often, but technically it should be possible, no?

@oloapinivad (Collaborator, author) commented Jul 23, 2024

Hi @lkluft, the problem I am having is that these datasets (NetCDF-based LRA low-resolution monthly output, produced for monitoring purposes) are not "complete", meaning that some years have the full set of variables and some do not. The LRA produces monthly files and, once a year is complete, packs them into a yearly file. This is not something Zarr likes. We therefore end up with a weird mixture of monthly and yearly NetCDF files, one per variable, with some variables present for all timesteps and some not. Not the way to go for Zarr.
To have it working properly we should probably refactor the LRA NetCDF generation, and this is beyond our current goal.

I am waiting for #1278 to be merged to adapt some of the new implementation; then, if the tests pass, I think we can merge.

PS: good to hear that the ICON NextGEMS Cycle 4 data are in Zarr. @mnurisso and I will come to Wageningen in October; we could think about integrating them into the AQUA catalog too!

@lkluft (Collaborator) commented Jul 23, 2024

> [...] these datasets (NetCDF-based LRA low-resolution monthly output, produced for monitoring purposes) are not "complete", meaning that some years have the full set of variables and some do not. [...] We therefore end up with a weird mixture of monthly and yearly NetCDF files, one per variable, with some variables present for all timesteps and some not.

Okay, I see, this is definitely a problem. However, I would see this as a side effect of the rather hand-wavy initial phase of Climate DT model output and hope for more consistent output in the future. I understand that you have to find some workaround to make what is available now accessible, but for future development I would push for more consistent datasets throughout the entire simulated period.

> PS: good to hear that the ICON NextGEMS Cycle 4 data are in Zarr. @mnurisso and I will come to Wageningen in October; we could think about integrating them into the AQUA catalog too!

Nice! We can definitely check how to achieve this ☺️

@oloapinivad (Collaborator, author)

I added a few updates and error traps; hopefully this can be merged.

@oloapinivad added the "ready to merge" label (this PR is ready to merge in the opinion of the author) on Jul 24, 2024
@mnurisso (Collaborator) commented Jul 25, 2024

I did a bit of linting and, as discussed offline, I decided to set Zarr creation to False by default.
The config file has the two fields that enable Zarr creation and verification set to False, and the CLI defaults to False as well; to enable the tool, the config file needs zarr: True and verify_zarr: True. No CLI flag is present at the current stage.
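
In other words, something like this in the config file (field names from above; the surrounding structure is illustrative):

    # LRA generator config excerpt: both must be True to enable the tool
    zarr: True
    verify_zarr: True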

If the tests are OK, we can merge as far as I am concerned.

In my successful tests, there is a factor of 3 to 8 performance increase when accessing data and plotting it with the Zarr implementation.

@oloapinivad (Collaborator, author)

Thanks a lot. I am merging this!

@oloapinivad merged commit 468c9ef into main on Jul 25, 2024 (1 check passed)
@mnurisso deleted the zarro branch on July 25, 2024 at 14:06
Labels: data source; framework (functions common to all diagnostics); ready to merge (this PR is ready to merge in the opinion of the author); run tests (set this up to let tests run)
Projects: none yet
Development: successfully merging this pull request may close the issue "Function to create zarr from netcdf" (#1034)
3 participants