netcdf-c creates fixed length unicode string attributes in netcdf4 files #298
Thanks Jeff; I saw the reference on the python issue list but hadn't had a chance to chime in there yet. We're a bit resource-constrained at the moment, but I'll take a look at this as soon as I can!
The basic problem is that no UTF-8 string can be guaranteed to be fixed length.
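For a concrete illustration of why a byte width cannot be fixed for arbitrary UTF-8 text, consider this small Python example:

```python
# UTF-8 is a variable-width encoding: the same number of characters can
# occupy different numbers of bytes once encoded.
print(len(u'abcde'.encode('utf-8')))  # 5 bytes for 5 ASCII characters
print(len(u'h\xebllo'.encode('utf-8')))  # 6 bytes for 5 characters ('hëllo')
```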
This same issue with string encoding also affects arrays. Using the Python API, I wrote a netCDF4 file with an NC_CHAR array, an NC_STRING array, an NC_CHAR attribute and an NC_STRING attribute. I've attached the file in strings.nc.zip, and saved the output of h5dump to this gist. All of these strings are written with the UTF-8 character set.

The netCDF4 format specification doesn't say anything about character sets, but per the documentation on H5Tset_cset, the character set of an HDF5 array can be set to either ASCII or UTF-8. My preferred resolution would be to switch arrays/attributes written with NC_CHAR to use ASCII instead of UTF-8, because I believe this has better compatibility with tools for working with HDF5. For example, h5py doesn't even support reading fixed-size UTF-8. In any case, it would be nice to clarify this in the spec, for both NC_CHAR and NC_STRING.
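As a concrete check, here is a sketch of inspecting the stored character set through h5py's low-level API (the variable name `char_var` is an assumption about the attached file's contents):

```python
# Sketch: check which character set netCDF-C assigned to a stored variable.
# 'char_var' is a hypothetical variable name; strings.nc is the attached file.
import h5py

with h5py.File('strings.nc', 'r') as f:
    tid = f['char_var'].id.get_type()  # the on-disk HDF5 datatype
    if isinstance(tid, h5py.h5t.TypeStringID):
        cset = tid.get_cset()
        print('UTF-8' if cset == h5py.h5t.CSET_UTF8 else 'ASCII')
```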
I did this deliberately. As noted, any character-typed variable or attribute may contain UTF-8 characters, so the character set is marked as UTF-8.
An additional note: in looking at the spec, NC_CHAR in one place is defined to be …
Thanks for clarifying @DennisHeimbigner, that makes sense. My concern about putting fixed-size characters in UTF-8 is that this makes reading them a bit trickier for other HDF5 libraries, which can't necessarily assume that all fixed-size UTF-8 is actually all one-byte characters in the way that netCDF4 uses it. Instead, the "right thing to do" becomes to decode the strings as UTF-8.

If NumPy had a native fixed-size UTF-8 dtype like HDF5, this would be fine. We'd just coerce the dtype back to an array of characters and move on. But unfortunately, NumPy doesn't support this, which means that the "right thing to do" for a generic HDF5 tool in Python (namely, h5py) is to convert an array of fixed-size UTF-8 into a NumPy "object" array of native Python strings. This is doable, but it entails a lot more work than simply copying or memory-mapping the underlying array.

In some sense, storing strings as fixed-size UTF-8 is already an abuse of the character set, since the data are really just arrays of single bytes. So as long as we're abusing the character set anyway, ASCII might be a better choice, because then it is definitely still memory-mappable into arrays of fixed-size strings.

About fixed-length ASCII strings, h5py writes:

…
I verified this with the following script (Python 2). The extra characters round trip in h5py, but netCDF4-python currently does not read them properly (it decodes to unicode):
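A minimal sketch of that kind of round-trip check, assuming h5py and NumPy (the file name and the non-ASCII byte below are illustrative, not taken from the original script):

```python
# Write a fixed-width byte string containing a byte that is not valid
# ASCII, then read it back and confirm the bytes survive unchanged.
import h5py
import numpy as np

data = np.array([b'ab\xff'], dtype='S3')  # 3-byte fixed-width strings

with h5py.File('roundtrip.h5', 'w') as f:
    f.create_dataset('x', data=data)  # stored as a fixed-length string type

with h5py.File('roundtrip.h5', 'r') as f:
    out = f['x'][:]

# h5py hands back the raw bytes, so the extra byte round-trips intact.
assert out[0] == b'ab\xff'
```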
I realize that this probably seems like a lot of work to handle Python's quirks, but given how tricky unicode can be, my guess is that Python/NumPy is not the only system that would have these sorts of issues. I would guess that a lot of HDF5 interfaces don't even bother with UTF-8 at all.
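For illustration, a sketch of the per-element decode described above (the array contents are made up):

```python
# A generic Python HDF5 tool reading fixed-width UTF-8 must decode each
# element into a Python string, producing a NumPy object array rather
# than a memory-mappable fixed-width buffer.
import numpy as np

raw = np.array([b'foo', b'b\xc3\xa4r'], dtype='S4')  # fixed-width UTF-8 bytes
decoded = np.array([s.decode('utf-8') for s in raw], dtype=object)
print(decoded)  # ['foo', 'bär']
```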
My Googling suggests that HDF5 does not actually distinguish H5T_NATIVE_CHAR from H5T_STD_I8LE (one-byte integers) in files; e.g., see …
Then perhaps using H5T_NATIVE_CHAR is the right solution.
We can certainly coerce H5T_STD_I8LE arrays to NumPy character arrays (netCDF3 style), but then we would be stuck without any way to tell NC_BYTE apart from NC_CHAR, short of adding an additional metadata field. My tests show NC_BYTE already ends up as H5T_STD_I8LE in netCDF4 files.

Side note: I think I found a typo in the netCDF4 format spec docs. It says:

…

I think NC_UBYTE is actually mapped to H5T_NATIVE_UCHAR.
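To illustrate the ambiguity, a small sketch (the byte values are made up): if NC_CHAR data were also stored as H5T_STD_I8LE, readers could only guess at the intended interpretation.

```python
# The same on-disk bytes admit two readings with no metadata to pick one.
import numpy as np

stored = np.array([104, 105], dtype=np.int8)  # the on-disk bytes
as_bytes = stored             # NC_BYTE interpretation: [104, 105]
as_chars = stored.view('S1')  # NC_CHAR interpretation: [b'h', b'i']
print(as_bytes, as_chars)
```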
Ok, then the test that needs checking is: …
netcdf-c reports that attributes written as fixed-width Unicode by HDF5 are of type NC_CHAR, but clearly they're saved as UTF-8 in the HDF5 file.
See Unidata/netcdf4-python#575 for more details.