I looked into how we can support language=native. What looked like an easy fix sent me down the rabbit hole of whether and how we should validate language codes.
At the moment we are rather strict about which language codes are allowed: we validate them against the list provided by data("languagecodes"), which ships with the package. (Contrary to what our documentation states, these are not ISO 639-2 codes but ISO 639-1 Alpha-2 codes, I think.)
However, Nominatim, and hence OpenCage, uses more language codes than that. (Apparently OpenCage does not change the language codes, does it @freyfogle?) Here are some examples from Nominatim with names in the respective languages. OpenCage returns these as well (e.g. https://api.opencagedata.com/geocode/v1/json?q=Oder&language=hsb returns river: Wódra), but at the moment it is not possible to send such queries with opencage.
China: Кіта́й (name:be-tarask), China (name:tw), 中国 (name:zh-Hans), 中國 (name:zh-Hant), Zhōngguó (name:zh_pinyin; note the underscore instead of a hyphen here), Kėnėjės Liaudėis Respoblėka (name:bat-smg, which is Samogitian, ISO 639-3: sgs), see https://nominatim.openstreetmap.org/details.php?place_id=198177448. (For language:be, OpenCage returns the "short_name" КНР (short_name:be), not the "name" Кітай (name:be) or the "official_name" Кітайская Народная Рэспубліка (official_name:be). The same holds for bat-smg.)
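To make the pass-through idea concrete, here is a minimal sketch (in Python, purely for illustration, not the package's R code) of building such a request URL without checking the language code at all. The endpoint and the `q` and `language` parameters come from the example URL above; the helper name `build_query_url` and the placeholder key are assumptions.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this thread.
BASE_URL = "https://api.opencagedata.com/geocode/v1/json"


def build_query_url(placename, language, key="YOUR-API-KEY"):
    """Build a forward-geocoding URL (hypothetical helper).

    `language` is passed through unchecked, so codes such as "hsb",
    "bat-smg" or "zh_pinyin" reach the API as typed.
    `key` is a placeholder and must be replaced with a real API key.
    """
    params = {"q": placename, "language": language, "key": key}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("Oder", "hsb"))
```

Whether a given code actually yields localized names is then left to the API rather than to a client-side list.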
Unfortunately, I have not been able to find a list that contains all the language codes mentioned above.
These first two are not IETF language code lists but ISO 639-2 and ISO 639-3 lists (on which BCP47 builds), so it is not surprising that they are incomplete.
The ISO 639-2 list of the Library of Congress does not contain any of the extensions (e.g. it only has zh for Chinese), nor languages like szl (Silesian). It also lacks codes like de-CH or pt-BR.
The ISOcodes::ISO_639_2 and ISOcodes::ISO_639_3 lists do not contain the extensions either.
The next two are supposed to be (based on) the IETF codes (I think), but apparently still do not cover all codes used by Nominatim/OpenCage:
The IANA language subtag registry does not contain e.g. bat-smg or zh_pinyin. It is also in record-jar format, which is hard to parse. In addition, it lists all subtags individually (e.g. language de and region CH), so they would have to be matched up. It is not obvious to me how to judge which subtags go together and which don't.
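For what it is worth, the record-jar format can be read with a few lines of code. The following is a minimal Python sketch, not a full implementation: it assumes records are separated by lines containing only `%%`, fields look like `Name: value`, and continuation lines start with whitespace. The sample text is a retyped, trimmed excerpt in the registry's format (fields such as the date added are left out), so it should be checked against the real file; `parse_record_jar` is a hypothetical helper name.

```python
# Retyped, trimmed sample in the registry's format (an assumption; verify
# against the real IANA file before relying on it).
SAMPLE = """\
%%
Type: language
Subtag: de
Description: German
%%
Type: region
Subtag: CH
Description: Switzerland
%%
Type: language
Subtag: sgs
Description: Samogitian
%%
Type: variant
Subtag: pinyin
Description: Pinyin romanization
Prefix: zh-Latn
Prefix: bo-Latn
"""


def parse_record_jar(text):
    """Parse record-jar text into a list of {field name: [values]} dicts."""
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store the finished record, start a new one.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[:1] in (" ", "\t") and last_field is not None:
            # Continuation line: fold it into the previous field's last value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            # Fields may repeat (e.g. Prefix), so every field holds a list.
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


records = parse_record_jar(SAMPLE)
subtags = {(r["Type"][0], r["Subtag"][0]) for r in records if "Subtag" in r}
pinyin = next(r for r in records if r.get("Subtag") == ["pinyin"])
print(sorted(subtags))
print(pinyin["Prefix"])
```

In this sample the `Prefix` field of the variant record is the only hint about which subtags go together; for language and region subtags there is no such hint, which is exactly the matching problem described above.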
It also seems that language tags are not validated by OSM. This means, for example, that the same language can have different tags. E.g. Taiwan has two als tags on Nominatim: "Republik China uf Taiwan (alt_name:als)" and "Nationalchina (old_name:als)". These correspond to the Alemannic names, see https://als.wikipedia.org/wiki/Republik_China. However, the IETF/ISO 639 language tag for Alemannic is not "als" but "gsw" ("als" is the IETF subtag for Tosk Albanian, see also https://meta.wikimedia.org/wiki/Special_language_codes#Subdomains_that_do_not_conform_to_a_valid_ISO_639_language_code). Other Nominatim records use both the als and gsw tags, e.g. for "Züri", and I suppose there are others that only use gsw. Similarly, bat-smg is used for Samogitian, whereas the IETF tag is supposed to be sgs. Finally, zh_pinyin is yet another incorrect IETF language code; that should be zh-Latn-pinyin or bo-Latn-pinyin.
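To illustrate why a single list cannot settle this, here is a small Python sketch of the mismatches named above. The alias table contains only the three pairs mentioned in this thread and is not a complete mapping; the names `OSM_ALIASES` and `tags_to_try` are made up for this example.

```python
# IETF tag -> non-standard tags seen on OSM/Nominatim for the same language.
# Only the pairs mentioned in this thread; deliberately incomplete.
OSM_ALIASES = {
    "gsw": ["als"],                  # Alemannic; "als" is really Tosk Albanian
    "sgs": ["bat-smg"],              # Samogitian
    "zh-Latn-pinyin": ["zh_pinyin"], # Pinyin romanization
}


def tags_to_try(ietf_tag):
    """Return the IETF tag followed by any known non-standard aliases."""
    return [ietf_tag] + OSM_ALIASES.get(ietf_tag, [])


print(tags_to_try("gsw"))
print(tags_to_try("de"))
```

Note that the reverse direction is ambiguous: "als" may mean Alemannic on OSM but Tosk Albanian in the IETF registry, so a validator cannot tell a mistake from an intentional alias.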
Long story short: Given that none of the lists is complete and that non-IETF language codes are used in Nominatim, and therefore OpenCage, anyway, I am reluctant to validate the language parameter at all. We could explain in the vignette how one can find the correct IETF code (e.g. via https://r12a.github.io/app-subtags/) and how to check whether it is actually used within OSM/Nominatim/OpenCage (via Nominatim).
agree. do not waste any time/effort on validating language codes. I know of no library in any language that does this. We (OpenCage) do not do this. if you send a bad language code no results will be found and it will then just default to English.
* remove language parameter validation, enable language = "native", closes #90
* enable NAs for language et al
* gotta keep that coverage up: test opencage_key
* test oc_config for real
So we would need a new list against which we can validate the queries?! What would that be?
According to the opencage documentation, the API supports IETF format language codes. More formally, this is the BCP47 specification.
… zh-tw (IETF language tag for Taiwanese, or could that also be zh-hant?) or zh_pinyin (https://en.wikipedia.org/wiki/Pinyin).