In Memory netcdf subdatasets do not persist order when buffer is closed #1388

Open
lagamura opened this issue Nov 17, 2024 · 8 comments
@lagamura

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue
    netcdf4 1.7.1 nompi_py311hae66bec_102 conda-forge

  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.)
    OS: Almalinux-9.3, python: 3.11

  • a description of the issue with the steps needed to reproduce it:
    When variables (GDAL subdatasets) are written to an in-memory netCDF dataset, their order changes once the buffer is written out as a netCDF file. A minimal example follows:

import numpy as np
from netCDF4 import Dataset
from osgeo import gdal

list_of_subds = ["first_subdataset", "c_subdataset", "b_subdataset"]

ds = Dataset(
    "dump_ds.nc", mode="w", memory=1028, format="NETCDF4"
) 

ds.createDimension("lon", 100)
ds.createDimension("lat", 100)
ds.createDimension("time", None)

for subds in list_of_subds:

    data = ds.createVariable(
        subds,
        "f8",
        ("time", "lat", "lon"),
        zlib=True,
        fill_value=-1,
    )
    data[0, :, :] = np.arange(100)

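print(ds)
# With memory set, close() returns a buffer holding the netCDF file contents.
nc_buf = ds.close()
# Write the raw buffer to disk without going through the netCDF library.
with open("dump_ds.nc", "wb") as f:
    f.write(nc_buf)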

print(gdal.Info("dump_ds.nc"))

The output of print(ds) still lists the subdatasets in creation order:

root group (NETCDF4 data model, file format HDF5):
dimensions(sizes): lon(100), lat(100), time(1)
variables(dimensions): float64 first_subdataset(time, lat, lon), float64 c_subdataset(time, lat, lon), float64 b_subdataset(time, lat, lon)
groups:

Output of gdal.Info after dumping the nc file, where the subdatasets are now in alphabetical order:

Subdatasets:
SUBDATASET_1_NAME=NETCDF:"dump_ds.nc":b_subdataset
SUBDATASET_1_DESC=[1x100x100] b_subdataset (64-bit floating-point)
SUBDATASET_2_NAME=NETCDF:"dump_ds.nc":c_subdataset
SUBDATASET_2_DESC=[1x100x100] c_subdataset (64-bit floating-point)
SUBDATASET_3_NAME=NETCDF:"dump_ds.nc":first_subdataset
SUBDATASET_3_DESC=[1x100x100] first_subdataset (64-bit floating-point)

@jswhit
Collaborator

jswhit commented Nov 17, 2024

I don't know how gdal chooses the order of the variables. Maybe alphabetically? I don't believe this is a bug in netcdf4-python.

@jswhit
Collaborator

jswhit commented Nov 17, 2024

Looks like the order of the variables does change when the memory buffer is written out and re-read (ncdump shows the same thing as gdal). I don't know if the order should be preserved - perhaps @DennisHeimbigner would know.

@lagamura
Author

> Looks like the order of the variables does change when the memory buffer is written out and re-read (ncdump shows the same thing as gdal). I don't know if the order should be preserved - perhaps @DennisHeimbigner would know.

Thanks for the quick look.
In my opinion the order should be preserved for consistency, as it is when the Dataset class is used in the usual way to write a netCDF file to disk. Currently, when the in-memory mode is used, the subdatasets are written in alphabetical order, as you pointed out.

@jswhit
Collaborator

jswhit commented Nov 18, 2024

Just curious: why does the order matter for your use case?

@lagamura
Author

To stay compatible with the previous version of the product we are working on. More specifically, someone might open two netCDF files of the same product and try to compare the subdatasets by index.
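For the comparison use case, pairing subdatasets by variable name instead of by index avoids depending on the order. Below is a rough sketch that parses `SUBDATASET_n_NAME` lines like the ones in the gdal.Info output above. The helper `index_by_variable` and the `old_ds.nc` listing are hypothetical, made up for illustration.

```python
import re

# GDAL netCDF subdataset names look like: NETCDF:"file.nc":variable
_SUBDS_NAME = re.compile(r'^SUBDATASET_(\d+)_NAME=NETCDF:"[^"]*":(.+)$')


def index_by_variable(info_text):
    """Map variable name -> subdataset index from gdal.Info-style text."""
    mapping = {}
    for line in info_text.splitlines():
        match = _SUBDS_NAME.match(line.strip())
        if match:
            mapping[match.group(2)] = int(match.group(1))
    return mapping


# Listing taken from the report (alphabetical order).
new_listing = """\
SUBDATASET_1_NAME=NETCDF:"dump_ds.nc":b_subdataset
SUBDATASET_2_NAME=NETCDF:"dump_ds.nc":c_subdataset
SUBDATASET_3_NAME=NETCDF:"dump_ds.nc":first_subdataset
"""
# Hypothetical listing of an older product written in creation order.
old_listing = """\
SUBDATASET_1_NAME=NETCDF:"old_ds.nc":first_subdataset
SUBDATASET_2_NAME=NETCDF:"old_ds.nc":c_subdataset
SUBDATASET_3_NAME=NETCDF:"old_ds.nc":b_subdataset
"""

old_idx = index_by_variable(old_listing)
new_idx = index_by_variable(new_listing)
# Pair the indices of each variable present in both products.
pairs = {
    name: (old_idx[name], new_idx[name])
    for name in sorted(old_idx.keys() & new_idx.keys())
}
print(pairs)
```

Each entry gives the (old, new) index of the same variable, so a comparison can open matching subdatasets even when the two files list them differently.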

@jswhit
Collaborator

jswhit commented Nov 18, 2024

netcdf-c keeps track of creation order, and preserves that order when a dataset is written to disk. Since you are bypassing the C library when writing the memory buffer to disk directly, my guess is that the logic that preserves creation order is also bypassed. Unfortunately, I don't see any way to tell the C library to write the memory buffer to disk preserving the creation order.

@lagamura
Author

Just to clarify the use case: we want to combine the in-memory feature with writing the resulting buffer directly to S3.
It should also be possible to write directly to S3 storage using the netCDF driver in gdal, so I will make a minimal example and check the subdataset order.
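For context, the direct upload of the in-memory buffer could look roughly like the sketch below. It assumes a boto3 S3 client, whose `put_object(Bucket=..., Key=..., Body=...)` call takes the object body as bytes. The helper `upload_netcdf_buffer` and the bucket and key names are hypothetical, and the client is passed in so the function itself does not import boto3.

```python
def upload_netcdf_buffer(s3_client, nc_buf, bucket, key):
    """Upload the buffer returned by Dataset.close() as one S3 object."""
    body = bytes(nc_buf)  # the memoryview from close() becomes plain bytes
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)


# Example call (requires boto3 and valid credentials; names are placeholders):
# import boto3
# upload_netcdf_buffer(boto3.client("s3"), nc_buf, "my-bucket", "products/dump_ds.nc")
```

This uploads the buffer exactly as produced, so any reordering already present in the buffer is carried over to the stored object.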

@lagamura
Author

Apparently the netCDF driver in gdal does not support writing a file directly to S3 (/vsis3).
