
netcdf-c creates fixed length unicode string attributes in netcdf4 files #298

Closed
jswhit opened this issue Aug 4, 2016 · 10 comments · Fixed by #316

Comments

@jswhit

jswhit commented Aug 4, 2016

netcdf-c reports that attributes written as fixed-width Unicode by HDF5 are of type NC_CHAR, but clearly they're saved as UTF-8 in the HDF5 file.

See Unidata/netcdf4-python#575 for more details.

@WardF
Member

WardF commented Aug 4, 2016

Thanks Jeff; I saw the reference on the python issue list but hadn't had a chance to chime in there yet. We're a bit resource constrained at the moment but I'll take a look at this as soon as I can!

@DennisHeimbigner
Collaborator

The basic problem is that no UTF-8 string can be guaranteed to be fixed length.
Hence storing characters into a fixed-length string will only work if the characters
are all restricted to the 7-bit ASCII subset of UTF-8. So to say that HDF5 is saving them
as UTF-8 is technically incorrect; that is not possible.
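Dennis's point can be illustrated without HDF5 at all: UTF-8 is a variable-width encoding, so the byte length of an n-character string only equals n when every character is in the 7-bit ASCII range. A minimal stdlib-only sketch:

```python
# UTF-8 is variable-width: only 7-bit ASCII characters encode to one byte each.
ascii_text = "plain ascii"
assert len(ascii_text.encode("utf-8")) == len(ascii_text)  # 1 byte per char

non_ascii = "café"        # 4 characters...
encoded = non_ascii.encode("utf-8")
assert len(encoded) == 5  # ...but 5 bytes: 'é' encodes as b'\xc3\xa9'

# So a "fixed length" UTF-8 string of N bytes does not hold N characters
# in general; that guarantee only exists for the ASCII subset.
```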

@shoyer
Contributor

shoyer commented Aug 7, 2016

This same issue with string encoding also affects arrays.

Using the Python API, I wrote a netCDF4 file with a NC_CHAR array, a NC_STRING array, a NC_CHAR attribute and a NC_STRING attribute. I've attached the file in strings.nc.zip, and saved the output of h5dump to this gist.

All of these strings are written with CSET H5T_CSET_UTF8. As expected, the array and attribute written with NC_CHAR are fixed size, but those written with NC_STRING have STRSIZE H5T_VARIABLE.

The netCDF4 format specification doesn't say anything about character sets, but per the documentation on H5Tset_cset, the character set of an HDF5 array can be set to either ASCII or UTF-8.

My preferred resolution would be to switch arrays/attributes written with NC_CHAR to use ASCII instead of UTF-8, because I believe this has better compatibility with tools for working with HDF5. For example, h5py doesn't even support reading fixed size UTF-8.

In any case, it would be nice to clarify this in the spec, for both NC_CHAR and NC_STRING.

@DennisHeimbigner
Collaborator

I did this deliberately. As noted, any character-typed variable
cannot actually store UTF-8 characters; only the ASCII subset would be legal.
However, it is desirable for historical reasons to allow use of the full 8 bits of a character
rather than restricting to 7-bit ASCII. Ideally, HDF5 would support, say, ISO-8859-1
so that we could use the full 8 bits. Since that is not possible, I chose to use UTF-8
as the character set. The experiment that needs to be carried out is to change to ASCII,
try to write, say, ISO-8859-1
characters into an NC_CHAR attribute, and see if the full 8 bits are preserved.
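The property that makes ISO-8859-1 attractive here can be checked in plain Python: every one of the 256 byte values maps to exactly one Latin-1 character, so an 8-bit char can round-trip any byte value, which strict 7-bit ASCII cannot. A stdlib-only sketch:

```python
# ISO-8859-1 (Latin-1) maps each of the 256 byte values to exactly one
# character, so it round-trips the full 8 bits of a char -- unlike strict
# 7-bit ASCII, which rejects bytes above 127.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")           # always succeeds
assert text.encode("latin-1") == all_bytes   # full 8 bits preserved

try:
    all_bytes.decode("ascii")
except UnicodeDecodeError:
    pass  # bytes >= 0x80 are not valid ASCII
```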

@DennisHeimbigner
Collaborator

An additional note: looking at the spec, NC_CHAR is in one place defined to be
an 8-bit unsigned value, but elsewhere it is also defined as ASCII.

@shoyer
Contributor

shoyer commented Aug 8, 2016

Thanks for clarifying @DennisHeimbigner, that makes sense.

My concern about putting fixed-size characters in UTF-8 is that this makes reading them a bit trickier for other HDF5 libraries, which can't necessarily assume that all fixed-size UTF-8 is actually one-byte characters in the way that netCDF4 uses it. Instead, the "right thing to do" becomes to decode the strings as UTF-8.

If NumPy had a native fixed-size UTF-8 dtype like HDF5, this would be fine. We'd just coerce the dtype back to an array of characters and move on. But unfortunately, NumPy doesn't support this, which means that the "right thing to do" for a generic HDF5 tool in Python (namely, h5py) is to convert an array of fixed-size UTF-8 into a NumPy "object" array of native Python strings. This is doable, but it entails a lot more work than simply copying or memory mapping the underlying array.
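The extra work described above can be sketched in plain Python, no h5py required: a buffer of fixed-size UTF-8 fields has to be sliced, stripped, and decoded field by field, and the decoded strings no longer share a uniform byte width. (The 8-byte field width and the sample strings below are made up for illustration.)

```python
# Hypothetical buffer of three 8-byte fixed-size UTF-8 fields, null-padded,
# laid out the way an HDF5 fixed-size UTF-8 dataset would store them.
FIELD = 8
buf = (b"abc".ljust(FIELD, b"\x00")
       + "déjà".encode("utf-8").ljust(FIELD, b"\x00")   # 6 bytes of UTF-8
       + b"xyz".ljust(FIELD, b"\x00"))

# Reading requires slicing, stripping padding, and decoding each field --
# producing Python str objects of varying byte width, not a flat char array.
fields = [buf[i:i + FIELD].rstrip(b"\x00").decode("utf-8")
          for i in range(0, len(buf), FIELD)]
assert fields == ["abc", "déjà", "xyz"]
assert {len(s.encode("utf-8")) for s in fields} == {3, 6}  # widths differ
```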

In some sense, storing strings as H5T_NATIVE_CHAR could be cleaner if you want to use all 8 bits, but then of course it's not marked as a string any more, so many tools will probably read it as one-byte integers instead.

So as long as we're abusing the character set anyways, ASCII might be a better choice, because then it is definitely still memory mappable into arrays of fixed sized strings.
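The memory-mapping point can be illustrated in plain Python: with a one-byte-per-character encoding, a flat buffer splits into fixed-size fields by simple slicing, with no decode step changing any field's width. (The 4-byte field width and contents are invented for the sketch.)

```python
# With ASCII, one byte == one character, so fixed-size fields can be viewed
# directly (e.g. memory-mapped) without a decode pass altering their width.
FIELD = 4
buf = b"NC_CSTR PAD "  # three hypothetical 4-byte fields
fields = [buf[i:i + FIELD] for i in range(0, len(buf), FIELD)]
assert fields == [b"NC_C", b"STR ", b"PAD "]
# Every field is exactly FIELD bytes -- no variable-width surprises.
assert all(len(f) == FIELD for f in fields)
```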

About fixed-length ASCII strings, h5py writes:

Technically, these strings are supposed to store only ASCII-encoded text, although in practice anything you can store in NumPy will round-trip. But for compatibility with other programs using HDF5 (IDL, MATLAB, etc.), you should use ASCII only.

I verified this with the following script (Python 2). The extra characters round trip in h5py, but netCDF4-python currently does not read them properly (it decodes to unicode):

In [1]: import h5py

In [2]: import numpy as np

In [3]: f = h5py.File('test-characters.h5', 'w')

In [4]: all_chars = ''.join(chr(n) for n in range(256))

In [5]: all_chars
Out[5]: '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

In [6]: f.attrs['all_chars'] = np.string_(all_chars)

In [7]: f.attrs['all_chars'] == all_chars
Out[7]: True

In [8]: f.close()

In [9]: ! h5dump test-characters.h5
HDF5 "test-characters.h5" {
GROUP "/" {
   ATTRIBUTE "all_chars" {
      DATATYPE  H5T_STRING {
         STRSIZE 256;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "\000\001\002\003\004\005\006\007
           \013
           \016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177\37777777600\37777777601\37777777602\37777777603\37777777604\37777777605\37777777606\37777777607\37777777610\37777777611\37777777612\37777777613\37777777614\37777777615\37777777616\37777777617\37777777620\37777777621\37777777622\37777777623\37777777624\37777777625\37777777626\37777777627\37777777630\37777777631\37777777632\37777777633\37777777634\37777777635\37777777636\37777777637\37777777640\37777777641\37777777642\37777777643\37777777644\37777777645\37777777646\37777777647\37777777650\37777777651\37777777652\37777777653\37777777654\37777777655\37777777656\37777777657\37777777660\37777777661\37777777662\37777777663\37777777664\37777777665\37777777666\37777777667\37777777670\37777777671\37777777672\37777777673\37777777674\37777777675\37777777676\37777777677\37777777700\37777777701\37777777702\37777777703\37777777704\37777777705\37777777706\37777777707\37777777710\37777777711\37777777712\37777777713\37777777714\37777777715\37777777716\37777777717\37777777720\37777777721\37777777722\37777777723\37777777724\37777777725\37777777726\37777777727\37777777730\37777777731\37777777732\37777777733\37777777734\37777777735\37777777736\37777777737\37777777740\37777777741\37777777742\37777777743\37777777744\37777777745\37777777746\37777777747\37777777750\37777777751\37777777752\37777777753\37777777754\37777777755\37777777756\37777777757\37777777760\37777777761\37777777762\37777777763\37777777764\37777777765\37777777766\37777777767\37777777770\37777777771\37777777772\37777777773\37777777774\37777777775\37777777776\37777777777"
      }
   }
}
}

In [10]: import netCDF4

In [11]: ds = netCDF4.Dataset('test-characters.h5')

In [12]: ds.all_chars
Out[12]: u'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'

In [13]: ds.close()

I realize that this probably seems like a lot of work to handle Python's quirks, but given how tricky Unicode can be, my guess is that Python/NumPy is not the only system that would have these sorts of issues. I would guess that a lot of HDF5 interfaces don't even bother with UTF-8 at all.

@shoyer
Contributor

shoyer commented Aug 8, 2016

My Googling suggests that HDF5 does not actually distinguish H5T_NATIVE_CHAR from H5T_STD_I8LE in files (one byte integers), e.g., see
areaDetector/ADCore#176 and http://hdfeos.org/workshops/ws10/presentations/day1/H5Datatypes.doc

@DennisHeimbigner
Collaborator

Then perhaps using H5T_NATIVE_CHAR is the right solution.


@shoyer
Contributor

shoyer commented Aug 8, 2016

We can certainly coerce H5T_STD_I8LE arrays to NumPy character arrays (netCDF3 style), but then we would be stuck without any way to tell apart NC_BYTE from NC_CHAR without adding an additional metadata field. My tests show NC_BYTE already ends up as H5T_STD_I8LE in netCDF4 files.

Side note: I think I found a typo in the netCDF4 format spec docs. It says:

NC_BYTE = H5T_NATIVE_SCHAR
NC_UBYTE = H5T_NATIVE_SCHAR

I think NC_UBYTE is actually mapped to H5T_NATIVE_UCHAR.

@DennisHeimbigner
Collaborator

Ok, then the test that needs to be run is:

  1. make the type H5T_CSET_ASCII
  2. store non-ASCII characters (i.e. above 127)
  3. read back and see if HDF5 masked off the top bit.
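The pass/fail condition for step 3 reduces to a round-trip check on the byte values above 127. A sketch in plain Python, where `write_and_read_attr` is a hypothetical callable standing in for whatever HDF5 write/read path is under test (not a real API):

```python
def top_bit_preserved(write_and_read_attr):
    """Return True if bytes above 127 survive the round trip intact.

    `write_and_read_attr` is a hypothetical callable that writes the given
    bytes into an H5T_CSET_ASCII attribute and reads them back.
    """
    data = bytes(range(128, 256))  # every non-ASCII byte value
    result = write_and_read_attr(data)
    return result == data          # False if HDF5 masked off the top bit

# Simulated outcomes for the two possible behaviors:
assert top_bit_preserved(lambda b: b) is True                            # preserved
assert top_bit_preserved(lambda b: bytes(x & 0x7F for x in b)) is False  # masked
```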
