Background

The NC_CHAR type is 8 bits. In NetCDF-Java, we read data from NC_CHAR variables into the ArrayChar class. The backing storage of ArrayChar is a Java array of type char[]. However, the char type in Java is 16 bits, not 8 bits; it is interpreted by the JVM as a UTF-16 code unit.
So, we have a size mismatch. This isn't a problem when reading from a file: the 8-bit value is simply widened (left-padded) to 16 bits. But what about when we need to narrow the character from 16 bits to 8 bits for writing? How should that conversion be done?
Currently, we do the conversion the same way for both NetCDF-3 and NetCDF-4: the chars are simply cast to bytes. The cast discards the upper 8 bits and keeps the lower 8 bits. That means that if the Java char is in the range 0000-00FF, no loss of data occurs and the UTF-16 code unit is effectively converted to an ISO-8859-1 character. This works because ISO-8859-1 was incorporated as the first 256 code points of Unicode.
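To make the conversion concrete, here is a minimal, self-contained sketch (plain Java, not NetCDF-Java internals) of the widening read and the narrowing write cast, including the silent data loss:

```java
public class CharConversionDemo {
    public static void main(String[] args) {
        // Reading: an 8-bit NC_CHAR value is widened to a 16-bit Java char.
        byte fromFile = (byte) 0xE9;              // 'é' in ISO-8859-1
        char widened = (char) (fromFile & 0xFF);  // zero-extended to U+00E9
        System.out.println("read back: " + widened);  // é

        // Writing: the cast keeps only the low 8 bits of the char.
        char fits = '\u00E9';   // é: in 0000-00FF, survives the cast
        char lost = '\u20AC';   // €: outside the range, upper bits discarded
        byte b1 = (byte) fits;  // 0xE9, the ISO-8859-1 byte for é
        byte b2 = (byte) lost;  // 0xAC, silently wrong; no replacement char
        System.out.printf("e -> 0x%02X, euro -> 0x%02X%n", b1 & 0xFF, b2 & 0xFF);
    }
}
```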
Problem

However, if the Java char is outside of that range (i.e. it can't fit into 8 bits), there will be data loss. No replacement character (e.g. ?) is emitted; instead we just spit out the low 8 bits. Is this the best solution? Perhaps not. Instead, we could be converting to ASCII with replacement characters, as that encoding is the most portable across platforms. Plus, in netcdf-c, NC_CHAR on NetCDF-4 is now interpreted as ASCII.
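For comparison, standard java.nio charsets already provide the replacement-character behavior described above (this is a sketch of the alternative, not what NetCDF-Java currently does):

```java
import java.nio.charset.StandardCharsets;

public class AsciiWithReplacement {
    public static void main(String[] args) {
        // getBytes(Charset) substitutes the charset's default replacement
        // byte ('?' for US-ASCII) for every unmappable character.
        byte[] bytes = "température: 20€".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
        // prints: temp?rature: 20?
    }
}
```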
Proposed solution
A better approach is to change the backing storage of ArrayChar from char[] to byte[]. That way, we would avoid the need to convert 16-bit characters to 8-bit altogether.
With this change, we should mostly just treat ArrayChar as a bunch of bytes, and leave it at that. When we are required to interpret the bytes – e.g. in ncdump – we should look for the special variable attribute _Encoding (see CDL Data Types, last paragraph) and process the bytes accordingly. If the attribute is missing, we should interpret the bytes as US-ASCII.
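A minimal sketch of that decoding rule (the decode helper is hypothetical; in practice the encoding name would come from the variable's _Encoding attribute, e.g. via Variable.findAttribute):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: encodingName is the value of the variable's
    // _Encoding attribute, or null when the attribute is absent.
    static String decode(byte[] data, String encodingName) {
        Charset cs = (encodingName != null)
                ? Charset.forName(encodingName)   // honor _Encoding
                : StandardCharsets.US_ASCII;      // proposed default
        return new String(data, cs);
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, "UTF-8"));  // héllo
        System.out.println(decode(utf8, null));     // 'é' bytes become U+FFFD
    }
}
```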