
Create Zarr function to create json references and LRA feature #1068

Merged: 35 commits from zarro into main on Jul 25, 2024

Conversation

@oloapinivad (Collaborator) commented Apr 9, 2024

PR description:

This develops support for Zarr access within AQUA, using kerchunk to create dedicated JSON reference files, following the discussion in #1034. First tests showed that this is quite easy, but different JSON files must be created when the data have different chunking. For LRA files this can be achieved quite easily with two sub-selections.

More will follow soon.

Issues closed by this pull request:

Close #1034


  • Tests are included if a new feature is added.
  • Documentation is included if a new feature is added.
  • Docstrings are updated if needed.
  • Changelog is updated.
  • environment.yml and pyproject.toml are updated if needed, together with the lumi installation tool.

@oloapinivad added the "framework" label (functions common to all diagnostics) on Apr 9, 2024
@oloapinivad (Collaborator, author) commented Apr 10, 2024

The current PR includes a new create_zarr() function that, given a list of NetCDF files, returns a JSON reference file. All the files must share the same time dimension except the last one, due to how the Zarr references are constructed (see fsspec/kerchunk#447).
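
For context, here is a minimal sketch of this kind of reference creation with kerchunk (file names are illustrative, and this is a sketch rather than the actual create_zarr() internals):

    import json
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    # build a reference dict for each NetCDF (HDF5-based) file
    netcdf_files = ["lra_202001.nc", "lra_202002.nc"]  # illustrative names
    refs = []
    for path in netcdf_files:
        with fsspec.open(path, "rb") as f:
            refs.append(SingleHdf5ToZarr(f, path).translate())

    # a single-file reference set can be dumped straight to JSON
    with open("lra-single.json", "w") as f:
        json.dump(refs[0], f)

Multiple reference sets then need to be combined along time before writing, which is where the constraint on the time dimension comes from.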

This is not something we can guarantee with the LRA, since by construction we have both monthly and yearly files when the LRA is created. This is a serious limitation, which I tried to circumvent by creating two different JSON files and combining them afterwards with intake. An example of the resulting LRA entry is shown below:

  lra-r100-monthly-zarr:
    driver: zarr
    description: LRA data monthly at r100 on zarr
    args:
      consolidated: false
      combine: nested
      urlpath:
      - reference::/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json
      - reference::/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-partial.json
    metadata:
      source_grid_name: lon-lat
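
For reference, such a JSON can also be opened directly with fsspec and xarray, which is handy for debugging outside intake (a sketch using one of the paths above):

    import fsspec
    import xarray as xr

    # expose the JSON references as a read-only zarr store
    fs = fsspec.filesystem(
        "reference",
        fo="/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json",
    )
    ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)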

However, with such an approach, Zarr is massively SLOWER than NetCDF.

If I run a test script like this:

import timeit
from aqua import Reader

# compare the NetCDF-based source with the JSON-reference ("zarr") source
reader1 = Reader('IFS-FESOM', 'story-2017-historical', 'lra-r100-monthly')
reader2 = Reader('IFS-FESOM', 'story-2017-historical', 'lra-r100-monthly-zarr')

start_time = timeit.default_timer()
data1 = reader1.retrieve()
end_time = timeit.default_timer()
print(f"Execution time for reader1.retrieve(): {end_time - start_time} seconds")

start_time = timeit.default_timer()
data2 = reader2.retrieve()
end_time = timeit.default_timer()
print(f"Execution time for reader2.retrieve(): {end_time - start_time} seconds")

I get:

Execution time for reader1.retrieve(): 1.5544872460886836 seconds
Execution time for reader2.retrieve(): 11.423419699072838 seconds

However, if I remove the partial JSON, Zarr goes back to being massively faster:

Execution time for reader1.retrieve(): 2.5065057780593634 seconds
Execution time for reader2.retrieve(): 0.0351319620385766 seconds

I see only two possible alternatives: restrict Zarr to work only with the full JSON, to keep the efficient speed-up it provides, or find a better concatenation strategy with xarray through the intake options.

Any thoughts/suggestions?
Tagging a few experienced people: @koldunovn @lkluft @mnurisso @jhardenberg

@lkluft (Collaborator) commented Apr 10, 2024

Hey Paolo,

here are a couple of thoughts on this ☺️

  1. I think one should be clear about what is happening in this function, which is to create kerchunk references. These are often stored in the form of a Zarr store in a reference file system, but this is not Zarr.
  2. At the moment, the JSON files are quite small and it might be fine not to consolidate them. However, for future-proof performance, one might consider consolidating the references. Here is an example of how we do this in gribscan:
    from gribscan.gribscan import consolidate_metadata
    import json
    from os.path import basename


    filelist = (
        "/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-full.json",
        "/home/b/b382076/work/LRA-lumi/IFS-FESOM/story-2017-historical/r100/monthly/lra-r100-monthly-partial.json",
    )

    for fname in filelist:
        print(basename(fname))
        with open(fname, "r") as fp_in, open(basename(fname), "w") as fp_out:
            refs = json.load(fp_in)["refs"]
            # attach a consolidated ".zmetadata" key so zarr can read all
            # metadata in one go instead of fetching every .zarray/.zattrs
            refs[".zmetadata"] = consolidate_metadata(refs)
            json.dump(refs, fp_out)
  3. It looks like there is a spurious time dependence in the plev coordinate. I am not sure how this appeared, but it is not present in the original NetCDF files. This might be the reason why the internal merging of the datasets is much more complicated than necessary (i.e. a nested merge instead of a concat). So one might investigate where the plev(time, plev) in the JSONs originates; a quick check is sketched right below this list.
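
A quick way to check this is to inspect the _ARRAY_DIMENSIONS attribute stored for plev inside the reference JSON (file name illustrative):

    import json

    with open("lra-r100-monthly-full.json") as f:
        refs = json.load(f)["refs"]

    # kerchunk stores the zarr metadata as JSON strings;
    # a healthy plev should list only ["plev"], not ["time", "plev"]
    print(json.loads(refs["plev/.zattrs"])["_ARRAY_DIMENSIONS"])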

Please let me know if anything is unclear or if I can be of any help!

Cheers
Lukas

@lkluft (Collaborator) commented Apr 10, 2024

In terms of performance, point 2 is likely not much of an issue, because the full JSON needs to be loaded and parsed in both cases. Point 3, however, likely causes the performance degradation.

In general, I am wondering what the final goal of this functionality is. Do you want to avoid the overhead of open_mfdataset with too many NetCDF files? In that case, I would recommend aggregating the actual NetCDF files (as you are running the jobs anyhow) instead of virtually concatenating them.
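
For example, physically concatenating the monthly files once could look like this (paths and patterns illustrative):

    import xarray as xr

    # merge the monthly files into a single physical NetCDF file,
    # paying the open_mfdataset cost once instead of at every read
    ds = xr.open_mfdataset("lra-r100-monthly-2020??.nc", combine="by_coords")
    ds.to_netcdf("lra-r100-2020.nc")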

@oloapinivad (Collaborator, author)

Hi @lkluft,

thanks a lot for your reply!

I was probably not very clear above: the idea is to collect the metadata using kerchunk to create JSON references and speed up the loading of the data. We are not aiming at writing Zarr files for now, although this might be an intriguing solution for the future.

I will investigate how to consolidate the metadata, thanks a lot for the example! However (and this also answers your most important question), the idea is to create JSON references to speed up the reading of the NetCDF files which are produced every night by the LRA generator runs, so the notion of "consolidated" is not a great fit here.

A while ago we tried to follow the idea of creating single files for each variable, but it is a bit trickier to implement, since many safety checks are performed on the data (which is processed in monthly chunks by the LRA generator tool), and it would likely need a larger rewrite.
Since the number of NetCDF files keeps growing and the runs are still ongoing, an automatic procedure to create JSON references would be a very good way to simplify the loading of the data; it works pretty well when I consider only full 12-month files, as shown by the speed test above.

I will dig into the JSON references to see where the problem with plev comes from; perhaps this is the culprit of the slowdown, as you suggest.

@oloapinivad (Collaborator, author)

The slowdown was due to plev not being included in identical_dims. We will probably need to make this more flexible, so that the shared dimensions are read from the original NetCDF files. I will investigate in the next days the best way to concatenate the two files...
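
For the record, a sketch of the fix with kerchunk's combiner (variable names illustrative):

    from kerchunk.combine import MultiZarrToZarr

    # single_refs: list of per-file reference dicts (or JSON paths)
    mzz = MultiZarrToZarr(
        single_refs,
        concat_dims=["time"],
        # without "plev" listed here the combiner assumes it may vary
        # per file, producing the spurious plev(time, plev) dependence
        identical_dims=["lon", "lat", "plev"],
    )
    combined = mzz.translate()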

@oloapinivad added the "run tests" label (set this up to let tests run) on Apr 11, 2024
github-actions bot commented Apr 11, 2024

Coverage report: 76%

File statements missing excluded branches partial coverage
aqua/__init__.py 8 0 0 0 0 100%
aqua/accessor.py 28 5 0 5 0 85%
aqua/cli/diagnostic_config.py 1 0 0 0 0 100%
aqua/cli/main.py 396 40 0 156 30 87%
aqua/cli/parser.py 39 0 0 0 0 100%
aqua/data_models/__init__.py 3 0 0 0 0 100%
aqua/data_models/cfcoords.py 94 12 0 34 7 82%
aqua/data_models/cfunits.py 33 0 0 12 0 100%
aqua/data_models/datamodels.py 2 0 0 0 0 100%
aqua/exceptions.py 16 8 0 0 0 50%
aqua/graphics/__init__.py 5 0 0 0 0 100%
aqua/graphics/hovmoller.py 95 88 0 38 0 5%
aqua/graphics/multiple_maps.py 104 17 0 38 14 77%
aqua/graphics/single_map.py 143 26 0 56 21 74%
aqua/graphics/timeseries.py 118 39 0 46 16 62%
aqua/gsv/__init__.py 6 2 0 0 0 67%
aqua/gsv/intake_gsv.py 337 62 0 128 26 76%
aqua/gsv/timeutil.py 83 34 0 36 14 53%
aqua/logger.py 74 4 0 26 5 91%
aqua/lra_generator/__init__.py 2 0 0 0 0 100%
aqua/lra_generator/lra_generator.py 314 50 0 118 32 79%
aqua/lra_generator/lra_util.py 48 2 0 14 2 94%
aqua/reader/__init__.py 4 0 0 0 0 100%
aqua/reader/catalog.py 78 14 0 48 9 80%
aqua/reader/fixer.py 463 100 0 222 34 75%
aqua/reader/reader.py 487 61 0 271 43 83%
aqua/reader/reader_utils.py 77 29 0 61 6 60%
aqua/reader/regrid.py 198 19 0 60 12 87%
aqua/reader/streaming.py 48 3 0 22 5 89%
aqua/reader/timmean.py 70 13 0 29 6 77%
aqua/slurm/__init__.py 2 0 0 0 0 100%
aqua/slurm/slurm.py 105 88 0 36 0 12%
aqua/util/__init__.py 17 0 0 0 0 100%
aqua/util/config.py 166 33 0 79 25 74%
aqua/util/coord.py 37 6 0 22 4 83%
aqua/util/eccodes.py 70 5 0 32 3 92%
aqua/util/graphics.py 132 21 0 28 6 82%
aqua/util/output_saver.py 96 11 0 30 9 84%
aqua/util/sci_util.py 91 19 0 52 12 78%
aqua/util/time.py 68 13 0 18 5 77%
aqua/util/util.py 154 39 0 55 4 73%
aqua/util/yaml.py 119 20 0 58 12 80%
aqua/util/zarr.py 28 5 0 4 1 81%
Total 4459 888 0 1834 363 76%

Missing Lines

Filename Missing
__init__.py
accessor.py 31, 35, 51, 55, 59
exceptions.py 8-9, 16-17, 24-25, 32-33
logger.py 23, 96, 103-104
cli/diagnostic_config.py
cli/main.py 75, 82, 126, 212-213, 222-223, 228-229, 238, 244-245, 248-249, 260-261, 266-267, 283, 288, 349, 403-407, 417-419, 443-445, 504-505, 517, 561, 606, 620, 624, 656, 662
cli/parser.py
data_models/__init__.py
data_models/cfcoords.py 48, 50, 53, 55, 73, 79, 191-197
data_models/cfunits.py
data_models/datamodels.py
graphics/__init__.py
graphics/hovmoller.py 68-203
graphics/multiple_maps.py 78, 82, 114-115, 131-133, 150, 161, 182, 208-209, 236-238, 244-245
graphics/single_map.py 94-96, 109-110, 129, 145, 169-170, 174-175, 199, 208-210, 263, 287-289, 293, 297-298, 320, 329-331
graphics/timeseries.py 53, 62, 64-65, 76, 78-79, 82-95, 98-111, 171, 175-176, 183-190, 221, 228-230
gsv/__init__.py 8-9
gsv/intake_gsv.py 20-25, 75, 110, 168, 189-215, 234, 359-361, 400-429, 443-447, 453, 465, 471-473, 514, 567, 574, 580, 593
gsv/timeutil.py 32, 50-59, 84, 87, 90, 93, 105-106, 152, 154, 156, 159, 168, 171, 180-186, 194-195, 221, 230-234
lra_generator/__init__.py
lra_generator/lra_generator.py 96-97, 111, 116, 121, 126, 131, 142, 169, 204, 226-227, 263-265, 313-315, 326, 340-342, 372-379, 429, 467-468, 475, 574, 583-587, 593, 602-603, 621, 637, 641-642, 655-659
lra_generator/lra_util.py 44, 141
reader/__init__.py
reader/catalog.py 99-101, 104-105, 120-121, 127, 137-138, 142-143, 147-148
reader/fixer.py 57-60, 84-86, 98-105, 132, 194-196, 206, 226-240, 298-300, 339-340, 363, 390, 425, 488-493, 503, 523-533, 548-556, 585-587, 613-616, 678, 694, 710, 726, 742-743, 756, 770-781, 811, 813-815, 853-857, 860-863, 872, 909-910, 914-917, 924-929, 968
reader/reader.py 191, 225, 279, 319, 376, 386, 442, 459-460, 492, 517, 569, 600, 602, 609, 622, 627-628, 667, 674, 789-815, 847, 878, 882, 890, 902, 949, 1006, 1010, 1015-1018, 1022, 1057-1062, 1069, 1115-1116, 1123, 1138-1141
reader/reader_utils.py 22-44, 59-66, 103, 119, 125, 166-168
reader/regrid.py 94-95, 161, 170, 172, 219, 224-228, 235, 242, 284, 318-321, 360, 389-390
reader/streaming.py 71, 78, 108
reader/timmean.py 32, 40-41, 61-65, 90, 102, 123-126
slurm/__init__.py
slurm/slurm.py 40-47, 65-73, 87-97, 118-134, 166-197, 240-305, 320-328
util/__init__.py
util/config.py 108, 117, 132, 146, 149, 163-165, 168, 171, 192, 197, 203, 220, 244, 255-267, 278, 286, 290, 306, 309, 332-339
util/coord.py 36, 59, 62, 89-91
util/eccodes.py 128-130, 172-176
util/graphics.py 31, 116, 188-191, 210-212, 216-218, 238, 284-287, 307-310
util/output_saver.py 100-102, 158-160, 208-209, 213, 265-266, 270
util/sci_util.py 65, 76-77, 108, 114-116, 124, 126, 134-136, 159, 162, 164, 177-178, 180-181
util/time.py 42, 65, 67, 69, 131, 174-182
util/util.py 49-52, 105-106, 151-153, 161-162, 208, 212-213, 228, 276-308, 334-337
util/yaml.py 35-41, 74-77, 100, 110-111, 121-122, 145, 147, 244, 247, 259
util/zarr.py 42-45, 50
TOTAL

@oloapinivad changed the title from "[WIP] Create Zarr function and LRA feature" to "Create Zarr function and LRA feature" on Apr 11, 2024
@oloapinivad changed the title from "Create Zarr function and LRA feature" to "Create Zarr function to create json references and LRA feature" on Apr 11, 2024
@oloapinivad (Collaborator, author)

> We will probably need to make this more flexible, so that the shared dimensions are read from the original NetCDF files.

This has been addressed by reading the NetCDF files beforehand with xarray. This implies a slower generation, but should be safer. I am doing some offline tests, and a test is already included in the PR.
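
A hypothetical sketch of the discovery step (not the exact implementation):

    import xarray as xr

    # peek at the first NetCDF file to learn which dimensions are shared
    # across files and should be passed to kerchunk as identical_dims
    ds = xr.open_dataset(netcdf_files[0])  # netcdf_files is illustrative
    identical_dims = [str(d) for d in ds.dims if d != "time"]
    ds.close()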

@oloapinivad (Collaborator, author) commented Apr 11, 2024

I made some tests and was able to create a Zarr reference entry for the NextGEMS Cycle 4 data.

The speed-up in retrieving the data is about 50x, which I consider satisfactory.

Execution time for reader2.retrieve(): 0.4457592135295272 seconds
Execution time for reader1.retrieve(): 17.28819192480296 seconds

Still, the way I am passing the xarray args to intake is somewhat based on trial and error, and I am not sure I am doing things in the best way. @lkluft do you have any tips on this? Thanks!

  lra-r100-monthly-zarr:
    driver: zarr
    description: LRA data monthly at r100 on zarr
    args:
      consolidated: false
      combine: nested
      compat: override
      urlpath:
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-full.json
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-partial.json
    metadata:
      source_grid_name: lon-lat
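
If it helps others, my understanding (a sketch, not the intake internals) is that with a list of urlpaths these args end up more or less in an xarray multi-file open, along these lines:

    import fsspec
    import xarray as xr

    urls = [
        "/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-full.json",
        "/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-partial.json",
    ]
    stores = [fsspec.get_mapper("reference://", fo=u) for u in urls]

    # concat_dim is assumed here; combine="nested" requires one
    ds = xr.open_mfdataset(
        stores, engine="zarr", consolidated=False,
        combine="nested", concat_dim="time", compat="override",
    )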

BTW, I suspect the current implementation is not working as it should and might create empty records even if the data are available...

@mnurisso (Collaborator)

This is very nice! If you understand this block better:

consolidated: false
combine: nested
compat: override

can you document it somewhere (advanced topic?) so we can use it also in sources other than the LRA?

@oloapinivad (Collaborator, author) commented Apr 11, 2024

> This is very nice! If you understand this block better:
>
>     consolidated: false
>     combine: nested
>     compat: override
>
> can you document it somewhere (advanced topic?) so we can use it also in sources other than the LRA?

I think this is documented here: https://docs.xarray.dev/en/stable/generated/xarray.merge.html. The problem is to find the right configuration for us, and this might not be trivial.
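
For what it is worth, my reading of the three options: consolidated: false tells the zarr engine not to expect a consolidated .zmetadata key; combine: nested makes xarray combine the references with combine_nested along an explicit dimension order instead of inferring it from coordinates; compat: override skips the compatibility checks and takes the variables from the first object. Finding the combination that is safe for every source is exactly the non-trivial part.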

@oloapinivad (Collaborator, author)

This still does not work. Despite my attempts, there is no way to get a good Zarr reference from a source as complicated as the NextGEMS Cycle 4 data. The only option I still see is to allow Zarr generation only when complete files are available.

@oloapinivad (Collaborator, author)

The final solution I am investigating is to create a Zarr reference only for complete sources that have a full set of monthly and yearly files. This has to be done by adding an error trap for the case where this condition is not met.
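
Something along these lines (purely illustrative names, not the final implementation):

    # hypothetical guard: refuse zarr reference creation for incomplete sources
    for year, files in files_by_year.items():
        found = {var_from_filename(f) for f in files}
        if found != expected_vars:
            raise ValueError(
                f"Incomplete LRA source for {year}: missing {expected_vars - found}; "
                "zarr reference creation is disabled"
            )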

@lkluft (Collaborator) commented Jul 22, 2024

> This still does not work. Despite my attempts, there is no way to get a good Zarr reference from a source as complicated as the NextGEMS Cycle 4 data. The only option I still see is to allow Zarr generation only when complete files are available.

Do you refer to the IFS output? AFAIK, all ICON output for Cycle 4 is already in Zarr and should therefore be straightforward to include.

> The final solution I am investigating is to create a Zarr reference only for complete sources that have a full set of monthly and yearly files. This has to be done by adding an error trap for the case where this condition is not met.

What is the problem with incomplete datasets? I see that it is annoying to create them too often, but technically it should be possible, no?

@oloapinivad (Collaborator, author) commented Jul 23, 2024

Hi @lkluft, the problem I am having is that these datasets (NetCDF-based LRA low-resolution monthly output, produced for monitoring purposes) are not "complete", meaning that some years have the full set of variables and some do not. The LRA produces monthly files and, once a year is complete, packs them into a yearly file. This is not something Zarr likes. We therefore end up with a weird mixture of monthly and yearly NetCDF files, one per variable, with some variables present for all timesteps and some not. Not the way to go for Zarr.
To have it working properly we should probably refactor the LRA NetCDF generation, and this is beyond our current goal.

I am waiting for #1278 to be merged to adapt some of the new implementation; then, if the tests pass, I think we can merge.

PS: good to hear that the ICON NextGEMS Cycle 4 data are in Zarr. @mnurisso and I will come to Wageningen in October; we could think about integrating them into the AQUA catalog too!

@lkluft (Collaborator) commented Jul 23, 2024

> [...] these datasets (NetCDF-based LRA low-resolution monthly output, produced for monitoring purposes) are not "complete", meaning that some years have the full set of variables and some do not. [...] We therefore end up with a weird mixture of monthly and yearly NetCDF files, one per variable, with some variables present for all timesteps and some not.

Okay, I see, this is definitely a problem. However, I would see this as a side effect of the rather hand-wavy initial phase of Climate DT model output and hope for more consistent output in the future. I understand that you have to find some workaround to make what is available now accessible, but for future development I would push for more consistent datasets throughout the entire simulated period.

> PS: good to hear that the ICON NextGEMS Cycle 4 data are in Zarr. @mnurisso and I will come to Wageningen in October; we could think about integrating them into the AQUA catalog too!

Nice! We can definitely check how to achieve this ☺️

@oloapinivad (Collaborator, author)

I added a few updates and error traps; hopefully this can be merged.

@oloapinivad added the "ready to merge" label (this PR is ready to merge in the opinion of the author) on Jul 24, 2024
@mnurisso (Collaborator) commented Jul 25, 2024

I did a bit of linting and, as discussed offline, I decided to set Zarr creation to False by default.
The config file has the two fields that enable Zarr creation and verification set to False, and the CLI defaults to False as well; to enable the tool, the config file needs zarr: True and verify_zarr: True. No CLI flag is present at the current stage.
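
In other words, something like this in the config file (field names from above; the surrounding structure is illustrative):

    # LRA generator config excerpt: both must be True to enable the tool
    zarr: True
    verify_zarr: True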

If the tests are OK, we can merge as far as I am concerned.

In my successful tests, there is a factor of 3 to 8 performance increase when accessing data and plotting it with the Zarr implementation.

@oloapinivad (Collaborator, author)

Thanks a lot. I am merging this!

@oloapinivad merged commit 468c9ef into main on Jul 25, 2024 (1 check passed)
@mnurisso deleted the zarro branch on July 25, 2024 at 14:06
Labels: data source; framework (functions common to all diagnostics); ready to merge (this PR is ready to merge in the opinion of the author); run tests (set this up to let tests run)
Projects: none yet
Development: successfully merging this pull request may close the issue "Function to create zarr from netcdf" (#1034)
3 participants