Create Zarr function to create json references and LRA feature #1068
Conversation
So the current PR includes a new Zarr reference feature, but it assumes a homogeneous set of files. This is not something we can guarantee with the LRA, since by definition we have a mixture of monthly and yearly files once the LRA is created. This is a serious limitation, which I tried to circumvent by creating two different JSON references and combining them afterwards with intake. An example of the resulting LRA entry can be seen below:
However, with such an approach, Zarr is massively SLOWER than NetCDF. If I run a test script like this
I get:
However, if I remove the partial JSON, Zarr goes back to being massively faster.
I see only two possible alternatives: restrict Zarr to work only with a full JSON, keeping the efficient speed-up, or find a better concatenation strategy with xarray through the intake options. Any thoughts/suggestions?
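For context, the kind of timing comparison described above can be sketched with a small stdlib-only harness. The commented usage lines (catalog entry name, paths, and reader calls) are hypothetical placeholders, not the actual AQUA test script:

```python
import time

def time_open(label, open_fn, repeats=3):
    """Return (label, best wall-clock time) for a zero-argument callable.

    Taking the best of several runs reduces noise from filesystem caching
    when comparing plain NetCDF access against the Zarr/kerchunk reference.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        open_fn()  # e.g. a lambda wrapping the dataset-opening call
        best = min(best, time.perf_counter() - start)
    return label, best

# Hypothetical usage against the two access paths discussed in this thread:
# print(time_open("netcdf", lambda: xr.open_mfdataset("lra/*.nc")))
# print(time_open("zarr-ref", lambda: cat["lra-r100-monthly-zarr"].to_dask()))
```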
Hey Paolo, here are a couple of thoughts on this
Please let me know if anything is unclear or if I can be of any help! Cheers
In terms of performance, 2. is likely not so much of an issue, because the full JSON needs to be loaded and parsed in both cases. 3., however, likely causes the performance degradation. In general, I am wondering what the final goal of this functionality is. Do you want to avoid the overhead of
Hi @lkluft, thanks a lot for your reply! I was probably not very clear above: the idea is to collect the metadata using kerchunk to create JSON references and speed up the loading of the data. We are not aiming at writing Zarr files for now, although this might be an intriguing solution for the future. I will investigate how to consolidate the metadata, thanks a lot for the example! However, and this also answers your most important question, the idea is to create JSON references to speed up the reading of the NetCDF files which are produced every night by the LRA generator runs, so the notion of "consolidated" is not very well suited here. A while ago we tried the idea of creating a single file for each variable, but it is a bit trickier to implement since there are many safety checks performed on the data (which is processed in monthly chunks by the LRA generator tool), and it would likely need a larger rewrite. I will dig into the JSON references to see where the problem comes from.
The slowdown was due to
Coverage report: 76%
This has been addressed by reading the NetCDF files beforehand.
I made some tests and was able to create a Zarr reference entry for the NextGEMS Cycle 4 data. The speed-up in retrieving the data is about 50 times, which I consider satisfactory.
Still, the way in which I am passing the xarray args to intake is somewhat based on trial and error, and I am not sure I am doing things in the best way. @lkluft do you have any tips on this? Thanks!

```yaml
lra-r100-monthly-zarr:
  driver: zarr
  description: LRA data monthly at r100 on zarr
  args:
    consolidated: false
    combine: nested
    compat: override
    urlpath:
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-full.json
      - reference::/work/bb1153/b382075/aqua/lra/IFS-FESOM/ssp370-ng4/r100/monthly/lra-r100-monthly-partial.json
  metadata:
    source_grid_name: lon-lat
```

BTW, I suspect the current implementation is not working as it should and might create empty records even if the data is available...
This is very nice! If you understand this block better:

```yaml
consolidated: false
combine: nested
compat: override
```

can you document it somewhere (advanced topic?) so we can use it also in sources other than the LRA?
I think this is documented here: https://docs.xarray.dev/en/stable/generated/xarray.merge.html. The problem is setting up the right configuration for our case, and this might not be trivial.
This still does not work. Despite my attempts, there is no way to get a good Zarr from a source as complicated as the NextGEMS Cycle 4 data. The only option I still see is to allow Zarr generation only if complete files are available.
The final solution I am investigating is to create a Zarr reference only for complete sources, i.e. those with a full set of monthly and yearly files. This has to be done by adding an error trap for the cases where this condition is not met.
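A minimal sketch of such an error trap, assuming (hypothetically) that LRA files are named `<var>_<YYYY>.nc` for yearly files and `<var>_<YYYYMM>.nc` for monthly ones; the naming pattern would need to be adapted to the actual LRA scheme:

```python
import re

def check_lra_complete(filenames, years):
    """Raise ValueError if the LRA file set is incomplete.

    A year is considered covered either by one yearly file or by all
    twelve monthly files. Filename patterns here are an assumption for
    illustration, not the actual AQUA convention.
    """
    yearly = set()
    monthly = {}
    for name in filenames:
        m = re.match(r".*_(\d{4})(\d{2})?\.nc$", name)
        if not m:
            continue  # not an LRA data file, ignore
        year, month = m.group(1), m.group(2)
        if month is None:
            yearly.add(year)
        else:
            monthly.setdefault(year, set()).add(month)
    missing = [y for y in years
               if y not in yearly and len(monthly.get(y, set())) < 12]
    if missing:
        raise ValueError(
            f"Cannot build Zarr reference, incomplete years: {missing}")
```

The Zarr generation step would call this check first and skip (or fail loudly) whenever the source is incomplete, instead of producing a broken reference.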
Do you refer to the IFS output? AFAIK, all ICON output for Cycle 4 is already in Zarr and should therefore be straightforward to include.
What is the problem with incomplete datasets? I see that it is annoying to create them too often, but technically this should be possible, no?
Hi @lkluft, the problem I am having is that these datasets (NetCDF-based LRA low-res monthly output produced for monitoring purposes) are not "complete", meaning that some years have the full set of variables and some do not. The LRA produces monthly files and, once a year is complete, it packs them up into a yearly file. This is not something Zarr likes. Therefore we end up with a weird mixture of monthly and yearly NetCDF files, one per variable, with some variables present for all timesteps and some not: not the way to go for Zarr. I am waiting for #1278 to be merged to adapt some of the new implementation, and then I think that if tests pass we could merge.
Okay, I see, this is definitely a problem. However, I would see this as a side effect of the rather hand-wavy initial phase of Climate DT model output and hope for more consistent output in the future. I understand that you have to find some workaround to make what is available now accessible, but for future development I would push for more consistent datasets throughout the entire simulated period.
Nice! We can definitely check how to achieve this
Added a few updates and error traps, hopefully this can be merged.
I did a bit of linting and, as discussed offline, I decided to set Zarr creation to False by default. If tests are ok, for me we can merge. In my successful tests there is a factor of 3-8 performance increase when accessing data and plotting them with the Zarr implementation.
Thanks a lot. I am merging this!
PR description:
This develops support for Zarr references within AQUA, making use of kerchunk to create specific Zarr stores, following what is discussed in #1034. First tests showed that it is quite easy, but different JSON files need to be created if the data have different chunking. This can be achieved quite easily with two subselections of the LRA files.
More will follow soon.
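For readers unfamiliar with the kerchunk output, the generated JSON follows the fsspec "version 1" reference format: a mapping from Zarr store keys to either inline metadata strings or `[url, offset, length]` pointers into the original NetCDF files. A minimal hand-built example (the variable name, path, offsets and lengths are invented for illustration; kerchunk generates these automatically):

```python
import json

# A tiny "version 1" reference: Zarr metadata is inlined as JSON strings,
# while each chunk entry points at a byte range inside the source NetCDF.
reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/.zarray": json.dumps({
            "chunks": [1, 180, 360],
            "shape": [12, 180, 360],
            "dtype": "<f4",
            "compressor": None,
            "fill_value": None,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # one entry per chunk: [url, byte offset, byte length] (made-up values)
        "tas/0.0.0": ["lra/tas_202001.nc", 8192, 259200],
    },
}

with open("lra-reference.json", "w") as f:
    json.dump(reference, f)
```

Opening such a file through `reference::` urlpaths, as in the catalog entry above, lets fsspec serve the original NetCDF bytes as if they were a Zarr store, which is where the speed-up comes from.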
Issues closed by this pull request:
Close #1034