
Dataset class should support encoding parameter to override global attribute #654

Closed

thehesiod opened this issue Apr 30, 2017 · 20 comments

@thehesiod
Contributor

thehesiod commented Apr 30, 2017

In fact, the whole idea of having a global encoding property is a bad one, because then you can't support multiple Datasets with different encodings. The real fix should be to deprecate netCDF4.encoding.

As an example, the MADIS mesonet files are encoded in what appears to be cp1252.
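
To illustrate the problem (a sketch; the file names are made up, and default_encoding is the module-level global referenced later in this thread):

```python
import netCDF4

# Whatever the module-wide default is, it applies to *every* open Dataset:
madis = netCDF4.Dataset("madis_mesonet.nc")  # hypothetical file, cp1252 text
other = netCDF4.Dataset("other_model.nc")    # hypothetical file, utf-8 text
# Strings read from either file are decoded with the same global encoding;
# there is no way to tell the library "cp1252 for madis, utf-8 for other".
```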

@thehesiod
Contributor Author

If someone from the team can give feedback on how they'd like this achieved, I can work on it.

@thehesiod thehesiod changed the title Dataset class should support encoding parameter to override global param Dataset class should support encoding parameter to override global attribute Apr 30, 2017
@jswhit
Collaborator

jswhit commented Apr 30, 2017

I suggest adding a set_encoding method to override the global value. There are already too many kwargs in Dataset.__init__. The new method should have a kwarg to optionally override the global value of unicode_error also.

@thehesiod
Contributor Author

thehesiod commented Apr 30, 2017

Not possible, because we need the encoding during __init__, for example when it calls _get_dims. Will look into unicode_error.

@thehesiod
Contributor Author

OK, added support for encoding_errors; looking into why the unit tests are failing.

@shoyer
Contributor

shoyer commented May 1, 2017

Does netCDF really support arbitrary encodings for strings but not have any way of indicating them in the data model? That seems like a disaster waiting to happen...

@thehesiod
Contributor Author

thehesiod commented May 1, 2017

From what I understand, yes! I ran into this with a MADIS mesonet dataset that had the string "Annœullin" in it (it contains the byte \x9c), and from what I saw there was nothing specifying the encoding :( After some grepping around, it seemed this was most likely CP1252, from someone generating the files on a Windows box.
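
For reference, the offending byte is easy to reproduce in pure Python (a quick demonstration; only the string itself comes from the file):

```python
raw = "Annœullin".encode("cp1252")  # b'Ann\x9cullin' -- 0x9c is 'œ' in cp1252
print(raw.decode("cp1252"))         # Annœullin
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)                      # invalid start byte 0x9c in position 3
```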

Some info: http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Strings-and-Variables-of-type-char

Basically, netCDF 3.x was not designed with a true "string" type; one was only added in netCDF 4.x. For a "self-describing" format, this was a huge oversight.

@jswhit
Collaborator

jswhit commented May 2, 2017

Even with the netcdf-4 NC_STRING type there is no concept of an encoding in the data model. A string is just stored as a sequence of bytes in the file, and the client has to know how to decode it into a string. I can't find anything in the CF metadata standard that relates to string encoding, so there's no standard way for the client to figure out how to decode it. The Python interface always returns a string, and the encoding is currently defined by a global module variable. I gather the purpose of pull request #655 is to at least allow the user to change the encoding on a per-Dataset basis, so multiple Datasets can be accessed at once with different encodings specified.

When it comes to names of variables, dimensions, attributes, groups, and types netcdf-c always uses UTF-8 encoding.

@jswhit
Collaborator

jswhit commented May 2, 2017

Just noticed this at http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations

and here http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_utilities_guide.html

The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.

which suggests that for NC_CHAR variables (and attributes?) we should look for an _Encoding attribute.
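
For what it's worth, writing that attribute is straightforward even before anything enforces it (a minimal sketch; the file and variable names are made up):

```python
import numpy as np
import netCDF4

ds = netCDF4.Dataset("out.nc", "w")      # hypothetical output file
ds.createDimension("nchar", 8)
station = ds.createVariable("station", "S1", ("nchar",))
station.setncattr("_Encoding", "UTF-8")  # the NUG-reserved attribute
station[:] = netCDF4.stringtochar(np.array(["KORD"], dtype="S8"))[0]
ds.close()
```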

@thehesiod
Contributor Author

thehesiod commented May 2, 2017

Hmm, we still need to support files which don't specify _Encoding. But it sounds like honoring _Encoding needs to be added to netcdf4-python? In my particular case there was no _Encoding attribute.

@jswhit
Collaborator

jswhit commented May 2, 2017

I doubt that it is really used much. Perhaps we should check for it though. @WardF or @DennisHeimbigner - if you get a chance, could you read through this thread and comment?

@thehesiod
Contributor Author

BTW, another thing: my PR applies the encoding to all the places default_encoding was used before. Based on what you found, it sounds like some things, like attribute names, should always be UTF-8 with some restrictions (which I'm guessing aren't checked for in the Cython code).

@DennisHeimbigner
Collaborator

You are correct: all netCDF names are assumed to be UTF-8, except that the character '/' is always disallowed. When names occur inside, say, a CDL file, certain other characters must be backslash-escaped.

The default encoding for strings (as opposed to characters) is UTF-8, and ncdump, for example, will assume that in the absence of any _Encoding attribute. The _Encoding attribute is currently ignored in the netcdf-c code (reminder to self: add an issue about at least recognizing it).

Character data is the real problem. Technically, it also defaults to UTF-8, which means the ASCII subset of UTF-8. In practice, characters can have any 8-bit bit pattern.

@shoyer
Contributor

shoyer commented May 2, 2017

For character data, ASCII with errors='surrogateescape' can be a good option for decoding into Python strings, when there is some possibility that somebody has stuffed arbitrary bytes in there. At least then, you can safely encode back into bytes and decode with the right encoding.
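
A quick illustration of that round trip (plain Python; the byte string is the one from the example above):

```python
raw = b"Ann\x9cullin"                               # bytes of unknown encoding
s = raw.decode("ascii", errors="surrogateescape")   # never raises
assert s.encode("ascii", errors="surrogateescape") == raw   # bytes recovered
print(s.encode("ascii", errors="surrogateescape").decode("cp1252"))  # Annœullin
```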

@DennisHeimbigner
Collaborator

One more point: character-typed and string-typed attributes must always be UTF-8, because there is no way to specify _Encoding for an attribute.

@ethanrd
Member

ethanrd commented May 2, 2017

The _Encoding attribute was under discussion recently on the CF mailing list. @rsignell-usgs and I were just this morning discussing this topic and decided to create an issue on the Unidata/netcdf-c repo to update the NUG wording around the _Encoding attribute.

@thehesiod
Contributor Author

thehesiod commented May 2, 2017

Hah, as a side note, I just found that some MADIS mesonet files are NOT in cp1252, as they fail to decode with that encoding. So it seems the files are a mixture of encodings, without any indication of what those encodings are :(

Update: even worse, one file has garbage data, as it doesn't seem to be in any reasonable encoding... Another idea, then, is to add encoding validation when setting string data.

@jswhit
Collaborator

jswhit commented May 3, 2017

So to summarize...

  1. We should look for an _Encoding attribute for NC_CHAR variable data and use it when decoding; otherwise, either use a default value or the value specified by a new set_encoding Dataset method (sketched below).
  2. For decoding character data into Python strings we should use ASCII with errors='surrogateescape' (i.e. in the chartostring utility function). In stringtochar we should encode using ASCII.
  3. For names of variables, dimensions, attributes, and groups, always use UTF-8.
  4. For attributes (either string or character), Dennis suggests we should always use UTF-8.

Does this sound reasonable?

For (1) do we really need a set_encoding method, or should we just rely on an _Encoding attribute and use UTF-8 if it's not there?

For (4), could we look for a Dataset or Variable _Encoding attribute and assume it applies to NC_STRING and NC_CHAR attributes?
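
As a sketch of the lookup order in (1) (hypothetical helper, not the library's API; getncattr is the existing attribute accessor):

```python
def pick_encoding(var, default="utf-8"):
    """Return the variable's _Encoding attribute if present, else a default."""
    try:
        return var.getncattr("_Encoding")
    except AttributeError:
        return default
```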

@thehesiod
Contributor Author

thehesiod commented May 3, 2017

Feedback on the parts affecting me:

  1. I prefer an encoding __init__ parameter, as the current code relies on the encoding during initialization. It also makes more sense: I don't think you should be able to change a Dataset's encoding after it's been opened; logically it should be fixed once the file is open.
  2. Not sure what this means. I know of CDF "classic" files that aren't UTF-8, and doing this will make life harder for consumers of those files (they'll have to parse each string twice) compared to simply passing an encoding parameter; or perhaps it could be named fallback_encoding (used when _Encoding is not specified).
  3 + 4. After doing this, I think it will greatly simplify where encoding is used.

@jswhit
Collaborator

jswhit commented May 19, 2017

This issue is addressed by pull request #665, which adds detection of the _Encoding attribute and adds an encoding kwarg to chartostring.
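
With that change, usage looks roughly like this (the encoding kwarg is the one #665 adds; assuming it defaults to utf-8):

```python
import numpy as np
from netCDF4 import chartostring, stringtochar

chars = stringtochar(np.array(["foo", "bar"], dtype="S3"))  # (2, 3) array of S1
print(chartostring(chars))                     # default utf-8 decoding
print(chartostring(chars, encoding="cp1252"))  # per-call override from #665
```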

@jswhit
Collaborator

jswhit commented May 19, 2017

Pull request #665 merged, closing for now.
