
ArrayChar should be backed by byte[] and honor the "_Encoding" attribute #788

Open
cwardgar opened this issue Mar 26, 2017 · 1 comment

@cwardgar
Contributor

Background

  1. In NetCDF, the NC_CHAR type is 8 bits.
  2. In NetCDF-Java, we read data from NC_CHAR variables into the ArrayChar class.
  3. The backing storage of ArrayChar is a Java array of type char[].
  4. However, the char type in Java is 16 bits, not 8 bits. It is interpreted by the JVM as a UTF-16 code unit.

So, we have a size mismatch. This isn't a problem when reading from a file: the 8-bit value is simply widened (zero-extended, i.e. left-padded with zeros) to 16 bits. But what about when we need to narrow the character from 16 bits to 8 bits for writing? How should that conversion be done?
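For illustration, a minimal sketch of the widening (not the library's actual read path):

```java
// Reading: widen an 8-bit NC_CHAR value to a 16-bit Java char.
byte raw = (byte) 0xE9;              // NC_CHAR byte as stored in the file
char widened = (char) (raw & 0xFF);  // zero-extend: '\u00E9' ('é')
// Note: the & 0xFF mask matters; a bare (char) raw would sign-extend
// negative bytes, turning 0xE9 into '\uFFE9'.
```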

Currently, we do the conversion the same way for both NetCDF-3 and NetCDF-4: the chars are simply cast to bytes. The cast discards the upper 8 bits and keeps the lower 8 bits.

That means that if the Java char is in the range 0000-00FF, no loss of data occurs and the UTF-16 code unit is effectively converted to an ISO-8859-1 character. This works because ISO-8859-1 was incorporated as the first 256 code points of Unicode.
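For example (again just a sketch of the cast's behavior):

```java
// Writing: narrow a 16-bit Java char to 8 bits with a cast.
char c = 'é';        // U+00E9, within the 0000-00FF range
byte b = (byte) c;   // keeps the low 8 bits: 0xE9, which is 'é' in ISO-8859-1
```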

Problem

However, if the Java char is outside of that range (i.e. it can't fit into 8 bits), there will be data loss. No replacement character (e.g. ?) is emitted; instead we just spit out the low 8 bits. Is this the best solution? Perhaps not. Instead, we could convert to ASCII with replacement characters, as that encoding is the most portable across platforms. Plus, in netcdf-c, NC_CHAR on NetCDF-4 is now interpreted as ASCII.
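To make the loss concrete, a sketch using a character outside Latin-1; String.getBytes is just one way to do replacement-based encoding:

```java
import java.nio.charset.StandardCharsets;

char pi = 'π';           // U+03C0: doesn't fit in 8 bits
byte lossy = (byte) pi;  // keeps only the low byte, 0xC0 ('À' in ISO-8859-1):
                         // silent corruption, no '?' substituted

// The replacement-based alternative: String.getBytes(Charset) substitutes
// the charset's default replacement byte ('?' for US-ASCII) for any
// unmappable character.
byte[] replaced = String.valueOf(pi).getBytes(StandardCharsets.US_ASCII);  // { 0x3F }, i.e. "?"
```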

Proposed solution

A better approach is to change the backing storage of ArrayChar from char[] to byte[]. That way, we would avoid the need to convert 16-bit characters to 8 bits altogether.

With this change, we should mostly just treat ArrayChar as a bunch of bytes and leave it at that. When we are required to interpret the bytes (e.g. in ncdump), we should look for the special variable attribute _Encoding (see CDL Data Types, last paragraph) and decode the bytes accordingly. If the attribute is missing, we should interpret the bytes as US-ASCII.
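A rough sketch of what that lookup might look like; charsetFor is a hypothetical helper (it doesn't exist in the library), while Variable, Attribute, and findAttribute are existing NetCDF-Java APIs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import ucar.nc2.Attribute;
import ucar.nc2.Variable;

/** Hypothetical helper: choose the charset for interpreting ArrayChar bytes. */
static Charset charsetFor(Variable var) {
  Attribute enc = var.findAttribute("_Encoding");
  if (enc != null && enc.isString()) {
    return Charset.forName(enc.getStringValue());  // honor the attribute
  }
  return StandardCharsets.US_ASCII;  // default when _Encoding is absent
}
```

Something like ncdump could then render the data with new String(bytes, charsetFor(var)).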

@lesserwhirls
Collaborator

Things to do:

  • notify users about the issue and that we are thinking about it
  • see how many users are using ISO-8859-1
  • take this opportunity to get the C and Java libraries lined up in how they treat the issue

Task for THREDDS v6.
