Cyrillic "й" is usually better represented with "y" and "ё" with "yo" #2

Open
rudyryk opened this issue Dec 5, 2016 · 6 comments

rudyryk commented Dec 5, 2016

Now:

>>> print(unidecode(u"Ёж Василий"))
Iozh Vasilii

Better: Yozh Vasiliy

BTW, is it possible to tweak the mapping manually?

kmike (Owner) commented Dec 5, 2016

text-unidecode runs a Perl script to get the list of transliteration rules; the bug tracker for the Perl library is https://rt.cpan.org/Public/Dist/Display.html?Name=Text-Unidecode.

You can also monkey-patch the rules. Check the module source code, it is very short. I think you can do something like this:

import text_unidecode

# The table is indexed by code point minus one; u'' literals keep this working on Python 2.
text_unidecode._replaces[ord(u'ё') - 1] = 'yo'
text_unidecode._replaces[ord(u'Ё') - 1] = 'Yo'
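
The snippet above covers only "ё"; the issue also asks for "й" as "y". A minimal sketch of the same monkey-patch generalized to a small override table follows. The names OVERRIDES and patch_replaces are hypothetical (not part of text-unidecode), and it assumes, as the snippet above does, that the replacement for a character sits at index ord(ch) - 1 of the private _replaces list.

```python
# Hypothetical override table for the characters discussed in this issue.
OVERRIDES = {u"ё": "yo", u"Ё": "Yo", u"й": "y", u"Й": "Y"}


def patch_replaces(replaces, overrides=OVERRIDES):
    """Overwrite entries of a text_unidecode-style table in place.

    Assumes the replacement for a character is stored at index
    ord(ch) - 1, which is what the monkey-patch above relies on.
    """
    for ch, replacement in overrides.items():
        replaces[ord(ch) - 1] = replacement
    return replaces


# Usage, assuming text_unidecode is installed:
#   import text_unidecode
#   patch_replaces(text_unidecode._replaces)
#   print(text_unidecode.unidecode(u"Ёж Василий"))
```

Because _replaces is a private name, this can break with any release; it also changes the output for every caller in the same process, not only for Russian text.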

If you only work with Russian, there are Russian-specific transliteration packages available, e.g. https://github.com/j2a/pytils.

rudyryk (Author) commented Dec 6, 2016

Thanks @kmike! What if we go further than the Perl version? I suppose we could patch and regenerate data.bin to fix some mappings in the Python version.

kmike (Owner) commented Dec 6, 2016

@rudyryk could you please explain in more detail the approach you're thinking about? Fixing the Perl library -> getting the fix merged upstream -> regenerating the data.bin file looks like the ideal case.

rudyryk (Author) commented Dec 7, 2016

@kmike I would try to dump the patched _replaces from Python to data.bin.

Fixing the Perl library is also possible, I suppose, but I'm not sure where to upload fixes; the CPAN version seems to have been unmaintained for some years.
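
A rough sketch of what dumping a patched table could look like. It assumes data.bin is a UTF-8 file of NUL-separated replacement strings, one per code point; that layout is an assumption here and should be checked against the module source before use. The helper names load_replaces and dump_replaces are hypothetical.

```python
import os
import tempfile


def load_replaces(path):
    """Read a table, assuming NUL-separated UTF-8 strings."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8").split("\x00")


def dump_replaces(replaces, path):
    """Write a table back in the same assumed layout."""
    with open(path, "wb") as f:
        f.write("\x00".join(replaces).encode("utf-8"))


if __name__ == "__main__":
    # Round-trip a tiny stand-in table through a temporary file.
    table = ["a", "", "yo"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        dump_replaces(table, path)
        assert load_replaces(path) == table
```

A file written this way would be overwritten the next time data.bin is regenerated from the Perl library, so the overrides would have to be reapplied after every regeneration.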

rudyryk (Author) commented Dec 7, 2016

BTW, how did you generate data.bin? :) As far as I can see from the source code of Perl's Text-Unidecode, the mappings are defined in *.pm files.

@pombredanne

@rudyryk See the usage comment in _dump.pl:

# usage: perl _dump.pl > src/text_unidecode/data.bin
