Transliteration not proper for few characters in Tamil #11

Open
vrindaprabhu opened this issue Oct 6, 2016 · 7 comments
Comments

@vrindaprabhu

vrindaprabhu commented Oct 6, 2016

Below is the code I used to transliterate from Tamil to ITRANS (romanized) and back. The round trip does not return the original text.

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

lang = 'ta'
input_text = u'ஒன்றுமட்டுமல்லாது'

# Tamil -> ITRANS
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
# OUTPUT: .oऩRumaTTumallAtu   (contains the Devanagari letter ऩ instead of a Latin sequence)

# ITRANS -> Tamil
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
# OUTPUT:  ஒனறுமட்டுமல்லாது   (the pulli after ன is missing: ன்று became னறு)

@anoopkunchukuttan
Owner

Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.
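
For anyone following along, here is a minimal sketch of one way to spot the unmapped character in the reported ITRANS output. It uses only Python 3's standard `unicodedata` module and does not call the library; the helper name `non_ascii_chars` is hypothetical. ITRANS output is expected to be plain ASCII, so any non-ASCII character left in the string is a candidate for a missing mapping.

```python
import unicodedata


def non_ascii_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# The ITRANS string reported in this issue, with the stray character escaped.
itrans_output = ".o\u0929RumaTTumallAtu"

leftovers = non_ascii_chars(itrans_output)
for ch, code_point, name in leftovers:
    print(code_point, name)

# The Tamil letter from the input, for comparison.
tamil_nnna = "\u0ba9"
print(unicodedata.name(tamil_nnna))
```

The stray character is U+0929 (DEVANAGARI LETTER NNNA), which sits at the same offset in the Devanagari block as U+0BA9 (TAMIL LETTER NNNA) does in the Tamil block. That is consistent with the letter having no ITRANS entry, though the exact cause inside the library is an assumption here.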

@arcturusannamalai

I wonder how this transliteration compares to the open-tamil package.

Anoop, will you be publishing this package on the Python package repository? Also, where are the unit tests for this project? I can't seem to find them.

@vrindaprabhu
Author

vrindaprabhu commented Oct 28, 2016

The open-tamil package also has some problems handling Unicode. You have to type the text out explicitly in Tamil to get the best results. The discrepancy I ran into looks like this:

from tamil import utf8  # open-tamil

unicode("தொ", "utf-8")
# OUTPUT: u'\u0ba4\u0bc6\u0bbe'   (three code points: த + ெ + ா)

tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
# OUTPUT: u'\u0ba4\u0bca'   (two code points: த + ொ)

I used the open-tamil package here. In the two cases the letters came from different sources, i.e. different texts.

@arcturusannamalai

@vrindaprabhu - please create a suitable issue and we can address it.
Also, http://libindic.org/ has some interesting code.

@arcturusannamalai

@vrindaprabhu - I checked with Python 3 and open-tamil version 0.51, and I'm not seeing the issue you reported. get_letters() returns a list with just one letter as its only element.

@vrindaprabhu
Author

Strange. As I mentioned, it probably depends on how "தொ" is written. I did not hit the issue every time either, only with a few particular sentences in the corpus.

@arcturusannamalai

@vrindaprabhu - there were Unicode normalization issues, and these are fixed in version 0.65.
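
To illustrate the normalization point with the two sequences reported earlier in this thread, here is a minimal sketch using Python 3's standard `unicodedata` module. It is not how open-tamil 0.65 implements its fix (that is not shown in this thread); the helper name `nfc` is hypothetical. The idea is that the vowel sign ொ (U+0BCA) is canonically equivalent to ெ (U+0BC6) followed by ா (U+0BBE), so normalizing both inputs to NFC before comparing or splitting into letters removes the discrepancy.

```python
import unicodedata


def nfc(text):
    """Normalize text to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# The two spellings of "தொ" reported above.
decomposed = "\u0ba4\u0bc6\u0bbe"  # த + ெ + ா (three code points)
composed = "\u0ba4\u0bca"          # த + ொ (two code points)

print(decomposed == composed)            # raw strings differ
print(nfc(decomposed) == nfc(composed))  # equal after normalization
print(len(decomposed), len(nfc(decomposed)))
```

Normalizing corpus text once at load time, before any transliteration or letter splitting, is one way to avoid results that depend on how the text was typed.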
