Update languagecodes (or better drop language validation?) #90

dpprdan · 2019-07-31T22:31:38Z

I looked into how we can support language=native. What looked like an easy fix sent me down the rabbit hole of whether and how we should validate language codes.

At the moment we are rather strict w.r.t. the language codes allowed, i.e. we validate language codes against the list provided by data("languagecodes") supplied with the package. (Contrary to what is stated in our documentation, these are not ISO 639-2 but ISO 639-1 alpha-2 codes, I think.)

However, there are more language codes used by Nominatim and hence OpenCage. (Apparently OpenCage does not change the language codes, does it @freyfogle?) Here are some examples from Nominatim with names in the respective languages. OpenCage returns these as well (e.g. https://api.opencagedata.com/geocode/v1/json?q=Oder&language=hsb returns river: Wódra), but with the opencage package it is not possible to send such queries at the moment.

China: Кіта́й (name:be-tarask), China (name:tw), 中国 (name:zh-Hans), 中國 (name:zh-Hant), Zhōngguó (name:zh_pinyin; note the underscore instead of a hyphen here), Kėnėjės Liaudėis Respoblėka (name:bat-smg, which is Samogitian, ISO 639-3: sgs) https://nominatim.openstreetmap.org/details.php?place_id=198177448 (For language:be, OpenCage returns the "short_name" КНР (short_name:be), not the "name" Кітай (name:be), nor the "official_name" Кітайская Народная Рэспубліка (official_name:be). The same holds for bat-smg.)
Guadeloupe: Gwadloup (name:gcf) https://nominatim.openstreetmap.org/details.php?place_id=198069217
Oder: Uodra (name:szl), Wódra (name:hsb), https://nominatim.openstreetmap.org/details.php?place_id=198218400
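To make the Oder example concrete, here is a minimal sketch of how such a request URL is assembled when the language code is passed through without any validation. It is written in Python purely for illustration and is not the package's R code; the function name `build_query_url` and the placeholder key are assumptions. The endpoint and the `q` and `language` parameters are taken from the URL above, and a real request additionally needs an API key.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URL above.
BASE_URL = "https://api.opencagedata.com/geocode/v1/json"


def build_query_url(query, language, key="YOUR-API-KEY"):
    """Build a request URL, passing the language code through unchanged.

    `key` is a placeholder; a real key is required to actually send the request.
    """
    params = {"q": query, "language": language, "key": key}
    return f"{BASE_URL}?{urlencode(params)}"


# Upper Sorbian (hsb) is not in the package's current list of language codes.
url = build_query_url("Oder", "hsb")
print(url)
```

The sketch only builds the URL and does not send it, so it runs without network access or a valid key.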

So we would need a new list against which we can validate the queries?! What would that be?

According to the OpenCage documentation, the API supports IETF format language codes. More formally, this is the BCP47 specification.
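BCP47 defines a syntax for tags in addition to the registry of valid subtags, so whether a tag is well-formed can be checked without any list. The following Python sketch illustrates that idea; it is not the package's R code. The pattern is my own simplification of the grammar: it only covers language, script, region and variant subtags, and deliberately leaves out extlang, extension, private-use and grandfathered tags. The names `TAG_PATTERN` and `parse_tag` are hypothetical.

```python
import re

# Simplified BCP47 shape: language[-script][-region][-variant]*
# Assumption: extlang, extensions, private-use and grandfathered tags are omitted.
TAG_PATTERN = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
)


def parse_tag(tag):
    """Return the subtags of a well-formed tag as a dict, or None otherwise."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn "-tarask-foo..." into a list of variant subtags.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts


for tag in ["hsb", "de-CH", "zh-Hans", "be-tarask", "zh-Latn-pinyin",
            "zh_pinyin", "bat-smg"]:
    print(tag, parse_tag(tag))
```

A check like this only says whether a tag has the right shape, not whether its subtags are registered or used in OSM. With this reduced pattern, zh_pinyin is rejected because of the underscore, and bat-smg is rejected only because extlang subtags were left out.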

Unfortunately, I have not been able to find a list that contains all the language codes mentioned above.

These first two are not the IETF language codes, but ISO 639-2 and ISO 639-3 lists (on which BCP47 builds), so it is not surprising that they are not complete.

The ISO 639-2 list of the Library of Congress does not contain any of the extensions (e.g. only zh for Chinese), nor languages like szl (Silesian). It also lacks codes like de-CH or pt-BR.
The ISOcodes::ISO_639_2 or ISOcodes::ISO_639_3 lists also do not contain the extensions.

The next two are supposed to be (based on) the IETF codes (I think), but still do not cover all codes used by Nominatim/OpenCage apparently:

The Unicode Common Locale Data Repository. Language list via https://datahub.io/core/language-codes. Pretty close, but does not contain e.g. zh-tw (IETF language tag for Taiwanese or could that also be zh-hant?) or zh_pinyin (https://en.wikipedia.org/wiki/Pinyin).
The IANA language subtag registry does not contain e.g. bat-smg or zh_pinyin. It is also in record-jar format, which is hard to parse. It also contains all subtags individually (e.g. language de and region CH), which would have to be matched. It is not obvious to me how to judge which tags go together and which don't.
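For reference, the record-jar layout and the subtag matching described above can be sketched as follows. This is a Python illustration under stated assumptions, not a proposal for the package: `SAMPLE` is a hand-written, abridged excerpt in the registry's layout rather than a copy of the real file, and the function names are hypothetical.

```python
import re

# Hand-written sample in the registry's layout (abridged, illustrative only).
SAMPLE = """\
Type: language
Subtag: de
Description: German
%%
Type: region
Subtag: CH
Description: Switzerland
%%
Type: language
Subtag: gsw
Description: Swiss German
Description: Alemannic
Description: Alsatian
"""


def parse_record_jar(text):
    """Parse record-jar text into a list of {field: [values]} dicts."""
    records = []
    # Records are separated by lines consisting of "%%".
    for chunk in re.split(r"^%%[ \t]*$", text, flags=re.MULTILINE):
        record, last_key = {}, None
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and last_key is not None:
                # Folded line: continues the previous field's value.
                record[last_key][-1] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Fields such as Description may repeat, so keep a list.
            record.setdefault(last_key, []).append(value.strip())
        if record:
            records.append(record)
    return records


def subtags_by_type(records):
    """Index lower-cased subtags by their Type (language, region, ...)."""
    index = {}
    for rec in records:
        if "Type" in rec and "Subtag" in rec:
            index.setdefault(rec["Type"][0], set()).add(rec["Subtag"][0].lower())
    return index


def is_known_language_region(tag, index):
    """Check a 'language' or 'language-region' tag subtag by subtag."""
    language, _, region = tag.partition("-")
    known_language = language.lower() in index.get("language", set())
    known_region = not region or region.lower() in index.get("region", set())
    return known_language and known_region


index = subtags_by_type(parse_record_jar(SAMPLE))
for tag in ["de-CH", "gsw", "zh_pinyin"]:
    print(tag, is_known_language_region(tag, index))
```

Note that this checks each subtag on its own, so any registered language combined with any registered region passes. It does not answer the question of which subtags sensibly go together.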

Re the IANA list: There are some libraries in other languages to work with the IANA list: https://github.com/Alhadis/Record-Jar (JavaScript), https://github.com/mattcg/language-tags (JavaScript), https://github.com/OnroerendErfgoed/language-tags (Python), https://github.com/r12a/app-subtags (PHP/JavaScript). There is also the IANA list in JSON format: https://github.com/mattcg/language-subtag-registry. I guess one could build on that, but I don't want to get into that. For R I have only found this question on SO, and there is an IETF language tag parser (without validation) in {NLP}.

It also seems that language tags are not validated by OSM. This means, for example, that the same language can have different tags. E.g. Taiwan has two als tags on Nominatim: "Republik China uf Taiwan (alt_name:als)" and "Nationalchina (old_name:als)". These correspond to the Alemannic names, see https://als.wikipedia.org/wiki/Republik_China. However, the IETF/ISO 639 language tag for Alemannic is not "als" but "gsw" ("als" is the IETF subtag for Tosk Albanian, see also https://meta.wikimedia.org/wiki/Special_language_codes#Subdomains_that_do_not_conform_to_a_valid_ISO_639_language_code). Other Nominatim records use both the als and gsw tags, e.g. for "Züri", and there are probably others that use only gsw. Similarly, bat-smg is used for Samogitian, whereas the IETF tag is supposed to be sgs. Finally, zh_pinyin is yet another incorrect IETF language code; that should be zh-Latn-pinyin or bo-Latn-pinyin.
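One consequence of these mismatches is that a user who knows the correct IETF tag may still have to try the tag that OSM actually uses. A minimal Python sketch of that lookup, for illustration only: the mapping contains just the three pairs mentioned in this issue, and `OSM_ALIASES` and `candidate_tags` are hypothetical names, not part of any package.

```python
# IETF tag -> non-standard tags seen in OSM/Nominatim (pairs from this issue only).
OSM_ALIASES = {
    "gsw": ["als"],
    "sgs": ["bat-smg"],
    "zh-Latn-pinyin": ["zh_pinyin"],
}


def candidate_tags(ietf_tag):
    """Return the IETF tag followed by any known OSM-style aliases to try."""
    return [ietf_tag] + OSM_ALIASES.get(ietf_tag, [])


for tag in ["gsw", "sgs", "zh-Latn-pinyin", "hsb"]:
    print(tag, candidate_tags(tag))
```

Such a table would have to be maintained by hand, and "als" stays ambiguous because it is also a valid subtag for Tosk Albanian.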

Long story short: Given that none of the lists is complete and that there are non-IETF language codes used in Nominatim and therefore OpenCage anyway, I am reluctant to validate the language parameter at all. We could explain in the vignette how one could find out the correct IETF code (e.g. via https://r12a.github.io/app-subtags/) and whether these are actually used within OSM/Nominatim/OpenCage (via Nominatim).

freyfogle · 2019-07-31T22:39:21Z

agree. do not waste any time/effort on validating language codes. I know of no library in any language that does this. We (OpenCage) do not do this. if you send a bad language code no results will be found and it will then just default to English.

* remove language parameter validation, enable language = "native", closes #90
* enable NAs for language et al
* gotta keep that coverage up: test opencage_key
* test oc_config for real

dpprdan · 2019-10-24T16:44:34Z

closed with #92

dpprdan mentioned this issue Aug 29, 2019

Drop language validation, enable language = "native" #92

Merged

dpprdan closed this as completed Oct 24, 2019
