Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add per Dataset encoding support #655

Closed
wants to merge 14 commits into from

Conversation

thehesiod
Copy link
Contributor

@thehesiod thehesiod commented Apr 30, 2017

@thehesiod
Copy link
Contributor Author

thehesiod commented Apr 30, 2017

btw to make it more backwards compatible we instead change the default encoding to None, with a dynamic fallback in the encoding.__get__ lookup to default_encoding instead of fixing it during __init__. thoughts?

@thehesiod
Copy link
Contributor Author

thehesiod commented May 1, 2017

that recursion bug was a doozy, put a warning for future devs

@thehesiod
Copy link
Contributor Author

cool, looks like this PR is ready for review!

@jswhit
Copy link
Collaborator

jswhit commented May 9, 2017

I'm waiting on this to see how the discussion on netcdf-c plays out (Unidata/netcdf-c#402)

@jswhit
Copy link
Collaborator

jswhit commented May 15, 2017

What if different NC_STRING variables within the same dataset contain data with different encodings? I suppose this could be handled with the proposed _Encoding attribute.

The whole thing is a mess and any solution I can think of (including this one) seems fragile and kludgy. At least we have a solution now that works as long as you know the encoding and you only are reading data from one Dataset at a time.

@thehesiod
Copy link
Contributor Author

thehesiod commented May 15, 2017

ya, we need per variable encoding fallbacks. Let me know how you'd like to handle that. Some ideas:

  • user specified callback that takes one parameter (variable name) and returns encoding to use in case _Encoding not specified and it's a char type.
  • a Dataset.get_variable method which you can use to specify and encoding.

I can then code something up for you to look at. Right now I'm using a custom branch as I need to be able to specify the encoding. As I open multiple datasets in parallel.

@jswhit
Copy link
Collaborator

jswhit commented May 16, 2017

Is your use case (for specifying an encoding) mainly for attributes, or variable data?

@thehesiod
Copy link
Contributor Author

variable data, based on the conversations in the netcdf-c thread it sounds like attribute should all be forced to utf-8 (which I can change in this PR too).

@jswhit
Copy link
Collaborator

jswhit commented May 16, 2017

The reason I ask is if you are mainly concerned about attributes, we could add a kwarg encoding to getncattr. You would then have to use getncattr to retrieve string attributes, but you could specify the encoding on a per attribute basis (no need for a Dataset encoding parameter).

@jswhit
Copy link
Collaborator

jswhit commented May 16, 2017

Could you post one of the MADIS files that you are dealing with?

@thehesiod
Copy link
Contributor Author

thehesiod commented May 16, 2017

https://madis-data.cprk.ncep.noaa.gov/madisPublic1/data/archive/2017/02/01/LDAD/mesonet/netCDF/20170201_0000.gz

found it for variables['stationName'][77]

[b'F' b'1' b'L' b'X' b'J' b'-' b'1' b'3' b' ' b'A' b'n' b'n' b'\x9c' b'u'
 b'l' b'l' b'i' b'n' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' '
 b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b'F' b'R' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'']

which does not decode to UTF-8.

presumably: 'F1LXJ-13 Annœullin FR ' in CP1252 encoding

@jswhit
Copy link
Collaborator

jswhit commented May 16, 2017

thanks @thehesiod. I've created an alternate 'solution' in the 'encoding' branch. Instead of setting the encoding as a Dataset init parameter, I look for an _Encoding attribute for character and vlen string variables. For string variables, if _Encoding is not set, 'utf-8' is used. For character arrays, if _Encoding is set, then a numpy array of fixed length strings is returned by automatically calling chartostring (the rightmost dimension of the variable is assumed to the the length of the strings). If _Encoding is not set, you get the previous behavior (an array of single characters is returned). So, in your case you would have to add _Encoding="cp1252" as a variable attribute, either using NCO or by opening the file in append mode and adding the attribute before you read the data.

For attributes, I added a 'encoding' kwarg to getncattr.

I know adding a new attribute to all the files is probably not a good solution for you, but I'm still a bit confused about what the problem is you are trying to solve. With the current master, you can read the variable stationName and convert it to an array of strings using chartostring - but it will use the global module variable 'default_encoding'. Wouldn't simply adding a kwarg 'encoding' to chartostring solve your problem? (this is also done in the 'encoding' branch).

To be specific, here's what I'm suggesting (using the encoding branch):

from netCDF4 import Dataset, chartostring
nc = Dataset('20170201_0000')
chararr = nc['stationName'][:]
strarr = chartostring(chararr,encoding='cp1252')
print strarr[77]
nc.close()

or alternately

from netCDF4 import Dataset
nc = Dataset('20170201_0000','a')
nc['stationName']._Encoding = 'cp1252'
strarr = nc['stationName'][:]
print strarr[77]
nc.close()

@thehesiod
Copy link
Contributor Author

thehesiod commented May 16, 2017

ya thinking about it more doesn't make sense to need an encoding init param for char variable data, not sure why I thought I needed this. Closing this PR

@thehesiod thehesiod closed this May 16, 2017
@thehesiod thehesiod deleted the encoding branch June 14, 2017 22:48
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset class should support encoding parameter to override global attribute
2 participants