Create a MultiZarr json file from netcdf files of unequal time length. #447
Comments
No, unfortunately you cannot "subchunk" data whose chunk sizes are greater than 1. The sole exception is completely uncompressed/unencoded data, which I assume is not your situation. Explanation: a compressed chunk is stored as a single opaque byte range, so a reference file cannot point into part of it; the whole chunk must be read and decompressed together.
OK, thanks a lot, that's very clear. I will therefore proceed by creating two different JSON files, one for the 12-step chunks and one for the 1-step chunks, and then merge them when opening. In principle this should work!
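The merge-on-open step described above can be sketched with plain xarray. Here, two in-memory datasets stand in for the datasets you would get by opening the two kerchunk JSON files separately (the kerchunk steps themselves need real netCDF files); the variable name `t2m` and the `lat` dimension are hypothetical placeholders:

```python
# Sketch of the "two reference sets, merge on open" idea. The two
# Datasets below stand in for datasets opened from two separate
# kerchunk JSON files (one with 12-step chunks, one with 1-step chunks).
import numpy as np
import xarray as xr

# Stand-in for the dataset from the 12-step-chunk reference file.
ds12 = xr.Dataset(
    {"t2m": (("time", "lat"), np.random.rand(12, 3))},  # hypothetical variable
    coords={"time": np.arange(12)},
)

# Stand-in for the dataset from the 1-step-chunk reference file.
ds1 = xr.Dataset(
    {"t2m": (("time", "lat"), np.random.rand(1, 3))},
    coords={"time": np.arange(12, 13)},
)

# Concatenating along time happens in xarray rather than inside a single
# Zarr store, so Zarr's equal-chunk-size requirement is never violated.
combined = xr.concat([ds1, ds12], dim="time").sortby("time")
print(combined.sizes["time"])  # 13
```

The same pattern applies once the two JSON files exist: open each reference set into its own Dataset, then `xr.concat` along `time`.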
Hi there,
I am very new to kerchunk, but I am trying to create a JSON file using Zarr, starting from a series of netCDF files whose time lengths may be unequal (either 1 or 12). I do not want to replicate the data with Zarr, for both storage and backward-compatibility reasons. Below is a very basic example; I can attach the data, but I think it is clear what is done here.
This fails with:
Browsing various issues in the repository (such as #430 (comment)), it seems this is due to a known limitation of Zarr that does not allow unequal chunk sizes, which goes beyond kerchunk. However, I am wondering if there is a way to force the chunking when accessing the data, so that if I set `chunks={"time": 1}`, as one can for example do with xarray, I could still load the data. Thanks a lot for any hint!