
Update languagecodes (or better drop language validation?) #90

Closed
dpprdan opened this issue Jul 31, 2019 · 2 comments

Comments

@dpprdan
Member

dpprdan commented Jul 31, 2019

I looked into how we can support language=native. What looked like an easy fix sent me down the rabbit hole of whether and how we should validate language codes.

At the moment we are rather strict about which language codes are allowed, i.e. we validate them against the list provided by data("languagecodes") supplied with the package. (Contrary to what is stated in our documentation, these are not ISO 639-2 but ISO 639-1 alpha-2 codes, I think.)

However, Nominatim and hence OpenCage use more language codes than that. (Apparently OpenCage does not change the language codes, does it @freyfogle?) Here are some examples from Nominatim with names in the respective languages. OpenCage returns these as well (e.g. https://api.opencagedata.com/geocode/v1/json?q=Oder&language=hsb returns river: Wódra), but it is not possible to send such queries with the opencage package at the moment.

So we would need a new list against which we can validate the queries?! What would that be?

According to the OpenCage documentation, the API supports IETF format language codes. More formally, this is the BCP 47 specification.

Unfortunately, I have not been able to find a list that contains all the language codes mentioned above.

These first two are not IETF language code lists, but ISO 639-2 and ISO 639-3 lists (on which BCP 47 builds), so it is not surprising that they are not complete.

  • The ISO 639-2 list of the Library of Congress does not contain any of the extensions (e.g. it only has zh for Chinese) or languages like szl (Silesian). It also lacks codes like de-CH or pt-BR.
  • The ISOcodes::ISO_639_2 or ISOcodes::ISO_639_3 lists also do not contain the extensions.

The next two are supposed to be (based on) the IETF codes (I think), but still do not cover all codes used by Nominatim/OpenCage apparently:

Re the IANA list: There are some libraries in other languages to work with the IANA list: https://github.com/Alhadis/Record-Jar (JavaScript), https://github.com/mattcg/language-tags (JavaScript), https://github.com/OnroerendErfgoed/language-tags (Python), https://github.com/r12a/app-subtags (PHP/JavaScript). There is also the IANA list in JSON format: https://github.com/mattcg/language-subtag-registry. I guess one could build on that, but I don't want to get into that. For R I have only found this question on SO, and there is an IETF language tag parser (without validation) in {NLP}.

It also seems that OSM does not validate language tags. This means, for example, that the same language can have different tags. E.g. Taiwan has two als tags on Nominatim: "Republik China uf Taiwan (alt_name:als)" and "Nationalchina (old_name:als)". These correspond to the Alemannic names, see https://als.wikipedia.org/wiki/Republik_China. However, the IETF/ISO 639 language tag for Alemannic is not "als" but "gsw" ("als" is the IETF subtag for Tosk Albanian, see also https://meta.wikimedia.org/wiki/Special_language_codes#Subdomains_that_do_not_conform_to_a_valid_ISO_639_language_code). Other Nominatim records use both the als and gsw tags, e.g. for "Züri", and I suppose there are others that only use gsw. Similarly, bat-smg is used for Samogitian, whereas the IETF tag is supposed to be sgs. Finally, zh_pinyin is not a correct IETF language code either; that should be zh-Latn-pinyin or bo-Latn-pinyin.
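To illustrate why a purely syntactic check would not help much here, below is a minimal sketch of a BCP 47 well-formedness test. It is written in Python for illustration only and is not the package's code. The function name `is_well_formed` is hypothetical, and the pattern is a simplified subset of the BCP 47 grammar (language, extlang, script, region and variant subtags only; extensions, private-use and grandfathered tags are left out). It checks the shape of a tag, not whether the subtags are registered with IANA.

```python
import re

# Simplified subset of the BCP 47 "langtag" grammar (assumption: extensions,
# private-use subtags and grandfathered tags are deliberately not handled).
_LANGTAG = re.compile(
    r"^(?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})"  # language (+ extlang subtags)
    r"(?:-[a-z]{4})?"                                # optional script, e.g. Latn
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"                   # optional region, e.g. CH or 419
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*$",    # zero or more variants
    re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if `tag` has the shape of a BCP 47 language tag.

    This says nothing about whether the subtags exist in the IANA registry
    or whether they mean what the data source intends them to mean.
    """
    return _LANGTAG.match(tag) is not None


for tag in ["hsb", "de-CH", "zh-Latn-pinyin", "zh_pinyin", "als", "bat-smg"]:
    print(tag, is_well_formed(tag))
```

Only zh_pinyin is rejected by this check, because of the underscore. Both als (well-formed, but meaning Tosk Albanian rather than Alemannic) and bat-smg (well-formed in shape, but not the registered tag for Samogitian) pass, so a syntax check alone would not catch the mismatches described above.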

Long story short: Given that none of the lists is complete and that there are non-IETF language codes used in Nominatim and therefore OpenCage anyway, I am reluctant to validate the language parameter at all. We could explain in the vignette how one could find out the correct IETF code (e.g. via https://r12a.github.io/app-subtags/) and whether it is actually used within OSM/Nominatim/OpenCage (via Nominatim).
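A minimal sketch of what "no validation" could look like, again in Python for illustration only rather than the package's R code: the language value is passed through to the query string unchanged and is simply omitted when it is not set. The function name `build_query_url` and the placeholder key are hypothetical; the endpoint is the one from the example URL above, and the `key` query parameter is an assumption not mentioned in this issue.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL earlier in this issue.
BASE_URL = "https://api.opencagedata.com/geocode/v1/json"


def build_query_url(placename, key, language=None):
    """Build a request URL, passing `language` through without validation.

    `build_query_url` is a hypothetical helper; `key` is assumed to be the
    API key parameter.
    """
    params = {"q": placename, "key": key}
    if language is not None:
        # No lookup against a code list: whatever the caller supplies is sent.
        params["language"] = language
    return BASE_URL + "?" + urlencode(params)


# "YOUR-API-KEY" is a placeholder, not a real key.
print(build_query_url("Oder", "YOUR-API-KEY", language="hsb"))
print(build_query_url("Oder", "YOUR-API-KEY"))
```

With this approach codes such as hsb, szl or language=native need no special handling on the client side; whether a given code yields localized names is left to the API.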

@freyfogle
Contributor

Agree. Do not waste any time/effort on validating language codes. I know of no library in any language that does this. We (OpenCage) do not do this. If you send a bad language code, no results will be found for it and the response will just default to English.

dpprdan added a commit that referenced this issue Oct 22, 2019
* remove language parameter validation, enable language = "native", closes #90

* enable NAs for language et al

* gotta keep that coverage up: test opencage_key

* test oc_config for real
@dpprdan
Member Author

dpprdan commented Oct 24, 2019

closed with #92

@dpprdan dpprdan closed this as completed Oct 24, 2019

2 participants