
netcdf-c creates fixed length unicode string attributes in netcdf4 files #298

Closed
jswhit opened this issue Aug 4, 2016 · 10 comments · Fixed by #316

Comments

@jswhit

jswhit commented Aug 4, 2016

netcdf-c reports that attributes written as fixed-width Unicode by HDF5 are of type NC_CHAR, but clearly they're saved as UTF-8 in the HDF5 file.

See Unidata/netcdf4-python#575 for more details.

@WardF
Member

WardF commented Aug 4, 2016

Thanks Jeff; I saw the reference on the python issue list but hadn't had a chance to chime in there yet. We're a bit resource constrained at the moment but I'll take a look at this as soon as I can!

@DennisHeimbigner
Collaborator

The basic problem is that no UTF-8 string can be guaranteed to be fixed length.
Hence storing characters into a fixed-length string will only work if the characters
are all restricted to the 7-bit ASCII subset of UTF-8. So to say that HDF5 is saving them
as UTF-8 is technically incorrect; that is not possible.
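Dennis's point can be illustrated without HDF5 at all: UTF-8 is a variable-width encoding, so the byte length of an n-character string only equals n when every character is in the 7-bit ASCII range. A minimal stdlib-only sketch:

```python
# UTF-8 is variable-width: only 7-bit ASCII characters encode to one byte each.
ascii_text = "plain ascii"
assert len(ascii_text.encode("utf-8")) == len(ascii_text)  # 1 byte per char

non_ascii = "café"        # 4 characters...
encoded = non_ascii.encode("utf-8")
assert len(encoded) == 5  # ...but 5 bytes: 'é' encodes as b'\xc3\xa9'

# So a "fixed length" UTF-8 string of N bytes does not hold N characters
# in general; that guarantee only exists for the ASCII subset.
```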

@shoyer
Contributor

shoyer commented Aug 7, 2016

This same issue with string encoding also affects arrays.

Using the Python API, I wrote a netCDF4 file with a NC_CHAR array, a NC_STRING array, a NC_CHAR attribute and a NC_STRING attribute. I've attached the file in strings.nc.zip, and saved the output of h5dump to this gist.

All of these strings are written with CSET H5T_CSET_UTF8. As expected, the array and attribute written with NC_CHAR are fixed size, but those written with NC_STRING have STRSIZE H5T_VARIABLE.

The netCDF4 format specification doesn't say anything about character sets, but per the documentation on H5Tset_cset, the character set of an HDF5 array can be set to either ASCII or UTF-8.

My preferred resolution would be to switch arrays/attributes written with NC_CHAR to use ASCII instead of UTF-8, because I believe this has better compatibility with tools for working with HDF5. For example, h5py doesn't even support reading fixed size UTF-8.

In any case, it would be nice to clarify this in the spec, for both NC_CHAR and NC_STRING.

@DennisHeimbigner
Collaborator

I did this deliberately. As noted, any character-typed variable
cannot actually store UTF-8 characters; only the ASCII subset would be legal.
However, it is desirable for historical reasons to allow use of the full 8 bits of a character
rather than restricting to 7-bit ASCII. Ideally, HDF5 would support, say, ISO-8859-1
so that we could use the full 8 bits. Since that is not possible, I chose to use UTF-8
as the character set. The experiment that needs to be carried out is to change to ASCII,
try to write, say, ISO-8859-1
characters into an NC_CHAR attribute, and see if the full 8 bits are preserved.
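The property that makes ISO-8859-1 attractive here can be checked in plain Python: every one of the 256 byte values maps to exactly one Latin-1 character, so an 8-bit char can round-trip any byte value, which strict 7-bit ASCII cannot. A stdlib-only sketch:

```python
# ISO-8859-1 (Latin-1) maps each of the 256 byte values to exactly one
# character, so it round-trips the full 8 bits of a char -- unlike strict
# 7-bit ASCII, which rejects bytes above 127.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")           # always succeeds
assert text.encode("latin-1") == all_bytes   # full 8 bits preserved

try:
    all_bytes.decode("ascii")
except UnicodeDecodeError:
    pass  # bytes >= 0x80 are not valid ASCII
```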

@DennisHeimbigner
Collaborator

An additional note: looking at the spec, NC_CHAR is in one place defined to be
an 8-bit unsigned value, but elsewhere it is also defined as ASCII.

@shoyer
Contributor

shoyer commented Aug 8, 2016

Thanks for clarifying @DennisHeimbigner, that makes sense.

My concern about putting fixed-size characters in UTF-8 is that this makes reading them a bit trickier for other HDF5 libraries, which can't necessarily assume that all fixed-size UTF-8 is actually one-byte characters in the way that netCDF4 uses it. Instead, the "right thing to do" becomes to decode the strings as UTF-8.

If NumPy had a native fixed-size UTF-8 dtype like HDF5, this would be fine. We'd just coerce the dtype back to an array of characters and move on. But unfortunately, NumPy doesn't support this, which means that the "right thing to do" for a generic HDF5 tool in Python (namely, h5py) is to convert an array of fixed-size UTF-8 into a NumPy "object" array of native Python strings. This is doable, but it entails a lot more work than simply copying or memory mapping the underlying array.
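The extra work described above can be sketched in plain Python, no h5py required: a buffer of fixed-size UTF-8 fields has to be sliced, stripped, and decoded field by field, and the decoded strings no longer share a uniform byte width. (The 8-byte field width and the sample strings below are made up for illustration.)

```python
# Hypothetical buffer of three 8-byte fixed-size UTF-8 fields, null-padded,
# laid out the way an HDF5 fixed-size UTF-8 dataset would store them.
FIELD = 8
buf = (b"abc".ljust(FIELD, b"\x00")
       + "déjà".encode("utf-8").ljust(FIELD, b"\x00")   # 6 bytes of UTF-8
       + b"xyz".ljust(FIELD, b"\x00"))

# Reading requires slicing, stripping padding, and decoding each field --
# producing Python str objects of varying byte width, not a flat char array.
fields = [buf[i:i + FIELD].rstrip(b"\x00").decode("utf-8")
          for i in range(0, len(buf), FIELD)]
assert fields == ["abc", "déjà", "xyz"]
assert {len(s.encode("utf-8")) for s in fields} == {3, 6}  # widths differ
```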

In some sense, storing strings as H5T_NATIVE_CHAR could be cleaner if you want to use all 8 bits, but then of course it's not marked as a string any more, so many tools will probably read it as one-byte integers instead.

So as long as we're abusing the character set anyways, ASCII might be a better choice, because then it is definitely still memory mappable into arrays of fixed sized strings.
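The memory-mapping point can be illustrated in plain Python: with a one-byte-per-character encoding, a flat buffer splits into fixed-size fields by simple slicing, with no decode step changing any field's width. (The 4-byte field width and contents are invented for the sketch.)

```python
# With ASCII, one byte == one character, so fixed-size fields can be viewed
# directly (e.g. memory-mapped) without a decode pass altering their width.
FIELD = 4
buf = b"NC_CSTR PAD "  # three hypothetical 4-byte fields
fields = [buf[i:i + FIELD] for i in range(0, len(buf), FIELD)]
assert fields == [b"NC_C", b"STR ", b"PAD "]
# Every field is exactly FIELD bytes -- no variable-width surprises.
assert all(len(f) == FIELD for f in fields)
```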

About fixed-length ASCII strings, h5py writes:

Technically, these strings are supposed to store only ASCII-encoded text, although in practice anything you can store in NumPy will round-trip. But for compatibility with other programs using HDF5 (IDL, MATLAB, etc.), you should use ASCII only.

I verified this with the following script (Python 2). The extra characters round trip in h5py, but netCDF4-python currently does not read them properly (it decodes to unicode):

In [1]: import h5py

In [2]: import numpy as np

In [3]: f = h5py.File('test-characters.h5', 'w')

In [4]: all_chars = ''.join(chr(n) for n in range(256))

In [5]: all_chars
Out[5]: '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

In [6]: f.attrs['all_chars'] = np.string_(all_chars)

In [7]: f.attrs['all_chars'] == all_chars
Out[7]: True

In [8]: f.close()

In [9]: ! h5dump test-characters.h5
HDF5 "test-characters.h5" {
GROUP "/" {
   ATTRIBUTE "all_chars" {
      DATATYPE  H5T_STRING {
         STRSIZE 256;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "\000\001\002\003\004\005\006\007
           \013
           \016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177\37777777600\37777777601\37777777602\37777777603\37777777604\37777777605\37777777606\37777777607\37777777610\37777777611\37777777612\37777777613\37777777614\37777777615\37777777616\37777777617\37777777620\37777777621\37777777622\37777777623\37777777624\37777777625\37777777626\37777777627\37777777630\37777777631\37777777632\37777777633\37777777634\37777777635\37777777636\37777777637\37777777640\37777777641\37777777642\37777777643\37777777644\37777777645\37777777646\37777777647\37777777650\37777777651\37777777652\37777777653\37777777654\37777777655\37777777656\37777777657\37777777660\37777777661\37777777662\37777777663\37777777664\37777777665\37777777666\37777777667\37777777670\37777777671\37777777672\37777777673\37777777674\37777777675\37777777676\37777777677\37777777700\37777777701\37777777702\37777777703\37777777704\37777777705\37777777706\37777777707\37777777710\37777777711\37777777712\37777777713\37777777714\37777777715\37777777716\37777777717\37777777720\37777777721\37777777722\37777777723\37777777724\37777777725\37777777726\37777777727\37777777730\37777777731\37777777732\37777777733\37777777734\37777777735\37777777736\37777777737\37777777740\37777777741\37777777742\37777777743\37777777744\37777777745\37777777746\37777777747\37777777750\37777777751\37777777752\37777777753\37777777754\37777777755\37777777756\37777777757\37777777760\37777777761\37777777762\37777777763\37777777764\37777777765\37777777766\37777777767\37777777770\37777777771\37777777772\37777777773\37777777774\37777777775\37777777776\37777777777"
      }
   }
}
}

In [10]: import netCDF4

In [11]: ds = netCDF4.Dataset('test-characters.h5')

In [12]: ds.all_chars
Out[12]: u'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'

In [13]: ds.close()

I realize that this probably seems like a lot of work to handle Python's quirks, but given how tricky Unicode can be, my guess is that Python/NumPy is not the only system that would have these sorts of issues. I would guess that a lot of HDF5 interfaces don't even bother with UTF-8 at all.

@shoyer
Contributor

shoyer commented Aug 8, 2016

My Googling suggests that HDF5 does not actually distinguish H5T_NATIVE_CHAR from H5T_STD_I8LE in files (one byte integers), e.g., see
areaDetector/ADCore#176 and http://hdfeos.org/workshops/ws10/presentations/day1/H5Datatypes.doc

@DennisHeimbigner
Collaborator

Then perhaps using H5T_NATIVE_CHAR is the right solution.


@shoyer
Contributor

shoyer commented Aug 8, 2016

We can certainly coerce H5T_STD_I8LE arrays to NumPy character arrays (netCDF3 style), but then we would be stuck without any way to tell apart NC_BYTE from NC_CHAR without adding an additional metadata field. My tests show NC_BYTE already ends up as H5T_STD_I8LE in netCDF4 files.

Side note: I think I found a typo in the netCDF4 format spec docs. It says:

NC_BYTE = H5T_NATIVE_SCHAR
NC_UBYTE = H5T_NATIVE_SCHAR

I think NC_UBYTE is actually mapped to H5T_NATIVE_UCHAR.

@DennisHeimbigner
Collaborator

Ok, then the test that needs to be run is:

  1. make the type H5T_CSET_ASCII
  2. store non-ASCII characters (i.e. above 127)
  3. read back and see if HDF5 masked off the top bit.
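The pass/fail condition for step 3 reduces to a round-trip check on the byte values above 127. A sketch in plain Python, where `write_and_read_attr` is a hypothetical callable standing in for whatever HDF5 write/read path is under test (not a real API):

```python
def top_bit_preserved(write_and_read_attr):
    """Return True if bytes above 127 survive the round trip intact.

    `write_and_read_attr` is a hypothetical callable that writes the given
    bytes into an H5T_CSET_ASCII attribute and reads them back.
    """
    data = bytes(range(128, 256))  # every non-ASCII byte value
    result = write_and_read_attr(data)
    return result == data          # False if HDF5 masked off the top bit

# Simulated outcomes for the two possible behaviors:
assert top_bit_preserved(lambda b: b) is True                            # preserved
assert top_bit_preserved(lambda b: bytes(x & 0x7F for x in b)) is False  # masked
```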
