
ArrayChar should be backed by byte[] and honor the "_Encoding" attribute #788

Open
cwardgar opened this issue Mar 26, 2017 · 1 comment

@cwardgar
Contributor

Background

  1. In NetCDF, the NC_CHAR type is 8 bits.
  2. In NetCDF-Java, we read data from NC_CHAR variables into the ArrayChar class.
  3. The backing storage of ArrayChar is a Java array of type char[].
  4. However, the char type in Java is 16 bits, not 8 bits. It is interpreted by the JVM as a UTF-16 code unit.

So, we have a size mismatch. This isn't a problem when reading from a file: the 8-bit value is simply widened (zero-extended, i.e. left-padded with zeros) to 16 bits. But what about when we need to narrow the character from 16 bits to 8 bits for writing? How should that conversion be done?
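For illustration, a minimal sketch of the widening (not the library's actual read path):

```java
// Reading: widen an 8-bit NC_CHAR value to a 16-bit Java char.
byte raw = (byte) 0xE9;              // NC_CHAR byte as stored in the file
char widened = (char) (raw & 0xFF);  // zero-extend: '\u00E9' ('é')
// Note: the & 0xFF mask matters; a bare (char) raw would sign-extend
// negative bytes, turning 0xE9 into '\uFFE9'.
```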

Currently, we do the conversion the same way for both NetCDF-3 and NetCDF-4: the chars are simply cast to bytes. The cast discards the upper 8 bits and keeps the lower 8 bits.

That means that if the Java char is in the range 0000-00FF, no loss of data occurs and the UTF-16 code unit is effectively converted to an ISO-8859-1 character. This works because ISO-8859-1 was incorporated as the first 256 code points of Unicode.
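For example (again just a sketch of the cast's behavior):

```java
// Writing: narrow a 16-bit Java char to 8 bits with a cast.
char c = 'é';        // U+00E9, within the 0000-00FF range
byte b = (byte) c;   // keeps the low 8 bits: 0xE9, which is 'é' in ISO-8859-1
```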

Problem

However, if the Java char is outside of that range (i.e. it can't fit into 8 bits), there will be data loss. No replacement character (e.g. ?) is emitted; instead we just spit out the low 8 bits. Is this the best solution? Perhaps not. Instead, we could convert to ASCII with replacement characters, as that encoding is the most portable across platforms. Plus, in netcdf-c, NC_CHAR on NetCDF-4 is now interpreted as ASCII.
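To make the loss concrete, a sketch using a character outside Latin-1; String.getBytes is just one way to do replacement-based encoding:

```java
import java.nio.charset.StandardCharsets;

char pi = 'π';           // U+03C0: doesn't fit in 8 bits
byte lossy = (byte) pi;  // keeps only the low byte, 0xC0 ('À' in ISO-8859-1):
                         // silent corruption, no '?' substituted

// The replacement-based alternative: String.getBytes(Charset) substitutes
// the charset's default replacement byte ('?' for US-ASCII) for any
// unmappable character.
byte[] replaced = String.valueOf(pi).getBytes(StandardCharsets.US_ASCII);  // { 0x3F }, i.e. "?"
```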

Proposed solution

A better approach is to change the backing storage of ArrayChar from char[] to byte[]. That way, we would avoid the need to convert 16-bit characters to 8 bits altogether.

With this change, we should mostly just treat ArrayChar as a bunch of bytes and leave it at that. When we are required to interpret the bytes (e.g. in ncdump), we should look for the special variable attribute _Encoding (see CDL Data Types, last paragraph) and decode the bytes accordingly. If the attribute is missing, we should interpret the bytes as US-ASCII.
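A rough sketch of what that lookup might look like; charsetFor is a hypothetical helper (it doesn't exist in the library), while Variable, Attribute, and findAttribute are existing NetCDF-Java APIs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import ucar.nc2.Attribute;
import ucar.nc2.Variable;

/** Hypothetical helper: choose the charset for interpreting ArrayChar bytes. */
static Charset charsetFor(Variable var) {
  Attribute enc = var.findAttribute("_Encoding");
  if (enc != null && enc.isString()) {
    return Charset.forName(enc.getStringValue());  // honor the attribute
  }
  return StandardCharsets.US_ASCII;  // default when _Encoding is absent
}
```

Something like ncdump could then render the data with new String(bytes, charsetFor(var)).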

@lesserwhirls
Collaborator

Things to do:

  • notify users about the issue and that we are thinking about it
  • see how many users are using ISO-8859-1
  • take this opportunity to get the C and Java libraries lined up in how they treat the issue

Task for THREDDS v6.
