Transliteration not proper for few characters in Tamil #11

vrindaprabhu · 2016-10-06T12:20:16Z

Please find below the code I used for transliterating from Tamil to English and back. The round trip does not return the original text.

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text = u'ஒன்றுமட்டுமல்லாது'
lang='ta'
input_text = ItransTransliterator.to_itrans(input_text,lang)
print input_text
#OUTPUT : .oऩRumaTTumallAtu

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
lang='ta'
x=ItransTransliterator.from_itrans(input_text,lang)
print x
#OUTPUT :  ஒனறுமட்டுமல்லாது
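To pinpoint what the round trip changes, the two Tamil strings above can be compared code point by code point. The following is a minimal sketch for Python 3 using only the standard library; `dropped_chars` is a hypothetical helper, not part of indicnlp. The last lines also show that the stray Devanagari letter in the ITRANS output sits at the same block offset as the Tamil letter ன. Whether the library actually maps characters by block offset is an assumption here, not something confirmed in this thread.

```python
import difflib
import unicodedata

original = "ஒன்றுமட்டுமல்லாது"    # input string from the report
round_trip = "ஒனறுமட்டுமல்லாது"   # from_itrans() output from the report


def dropped_chars(before, after):
    """Return the characters present in `before` but missing from `after`.

    Hypothetical helper for illustration only.
    """
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            dropped.extend(before[i1:i2])
    return dropped


# Name every character that did not survive the round trip.
for ch in dropped_chars(original, round_trip):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# Assumption: an offset-based script mapping. Tamil starts at U+0B80 and
# Devanagari at U+0900, so the same offset gives the Devanagari counterpart.
tamil_nnna = "\u0ba9"  # ன
devanagari_same_offset = chr(0x0900 + (ord(tamil_nnna) - 0x0B80))
print(unicodedata.name(tamil_nnna), "->", unicodedata.name(devanagari_same_offset))
```

If the strings are as reported, the only dropped character should be the pulli (virama) that follows ன, which would match the maintainer's guess that this letter lacks an ITRANS mapping.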


anoopkunchukuttan · 2016-10-06T12:24:27Z

Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.

arcturusannamalai · 2016-10-22T01:00:04Z

I wonder how this transliteration compares to the open-tamil package.

Anoop, will you be publishing this package on the Python package repository? Also, where are the unit tests for this project? I can't seem to find them.

vrindaprabhu · 2016-10-28T14:55:54Z

The open-tamil package also has some problems handling Unicode. You have to type the text out explicitly in Tamil to get the best results. The discrepancy I faced looks like this:

unicode("தொ","utf-8")
#OUTPUT : u'\u0ba4\u0bc6\u0bbe'

tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
#OUTPUT : u'\u0ba4\u0bca'

I used the open-tamil package. In the two cases the letters came from different sources, i.e. different texts.

arcturusannamalai · 2016-10-29T03:06:20Z

@vrindaprabhu - please create a suitable issue and we can address it.
Also http://libindic.org/ has interesting code bits.

arcturusannamalai · 2016-10-30T00:51:56Z

@vrindaprabhu - I checked on Python 3 with Open-Tamil version 0.51, and I'm not seeing the issue you report. get_letters() returns a list with just one letter as its element.

vrindaprabhu · 2016-11-02T10:36:00Z

Strange. As I mentioned, it probably depends on how "தொ" is written. I did not face the issue all the time either, only with a few particular sentences in the corpus.

arcturusannamalai · 2016-11-03T05:45:47Z

@vrindaprabhu - there were Unicode normalization issues, and these are fixed in version 0.65.
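The two representations of "தொ" reported earlier in this thread are canonically equivalent in Unicode, so normalizing both sides before comparing makes the discrepancy disappear. Below is a minimal sketch for Python 3 using the standard `unicodedata` module; `same_text` is a hypothetical helper and is not taken from open-tamil.

```python
import unicodedata

# The two encodings of "தொ" reported above.
decomposed = "\u0ba4\u0bc6\u0bbe"  # TA + VOWEL SIGN E + VOWEL SIGN AA
composed = "\u0ba4\u0bca"          # TA + VOWEL SIGN O


def same_text(a, b):
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == composed)           # False: the code points differ
print(same_text(decomposed, composed))  # True: equal once normalized
```

Normalizing a corpus to NFC once, before tokenizing or transliterating, is one way to avoid depending on how each source text happened to be typed.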
