Cyrillic "й" is usually better represented with "y" and "ё" with "yo" #2

rudyryk · 2016-12-05T17:22:01Z

Now:

>>> print(unidecode(u"Ёж Василий"))
Iozh Vasilii

Better: Yozh Vasiliy

BTW, is it possible to tweak the mapping manually?

kmike · 2016-12-05T17:37:55Z

text-unidecode runs a Perl script to get a list of transliteration rules; bug tracker for the Perl library is https://rt.cpan.org/Public/Dist/Display.html?Name=Text-Unidecode.

You can also monkey-patch the rules - check the module source code, it is very short. I think you can do something like this:

import text_unidecode
# _replaces is indexed by code point minus one, hence the "-1"
text_unidecode._replaces[ord('ё')-1] = 'yo'
text_unidecode._replaces[ord('Ё')-1] = 'Yo'

If you only work with Russian there are Russian-specific transliteration packages available, e.g. https://github.com/j2a/pytils.
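If patching a private attribute feels too fragile, the same effect can be had with a thin wrapper that applies a few overrides and hands every other character to the library. The sketch below is only an illustration: `OVERRIDES`, `transliterate` and `toy_fallback` are hypothetical names, and `toy_fallback` is a tiny stand-in so the example runs without the package installed.

```python
# Hypothetical per-character overrides, using the spellings suggested in this issue.
OVERRIDES = {
    "ё": "yo", "Ё": "Yo",
    "й": "y", "Й": "Y",
}


def transliterate(text, fallback, overrides=OVERRIDES):
    """Use an override when one exists, otherwise call fallback on the character."""
    return "".join(
        overrides[ch] if ch in overrides else fallback(ch)
        for ch in text
    )


# Stand-in for a real transliterator: covers only the letters used in the demo
# and passes everything else through unchanged.
_TOY_TABLE = {"ж": "zh", "В": "V", "а": "a", "с": "s", "и": "i", "л": "l"}


def toy_fallback(ch):
    return _TOY_TABLE.get(ch, ch)


result = transliterate("Ёж Василий", toy_fallback)
print(result)  # Yozh Vasiliy
```

With the package installed, passing `text_unidecode.unidecode` as `fallback` should behave the same way for these letters, since that function also works one character at a time.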

rudyryk · 2016-12-06T07:19:14Z

Thanks @kmike! What if we go further than Perl's version? I suppose we may patch and regenerate data.bin to fix some stuff for Python's version.

kmike · 2016-12-06T14:01:09Z

@rudyryk could you please explain in more detail the approach you're thinking about? Fix the Perl library -> get the fix merged upstream -> regenerate the data.bin file looks like the ideal case.

rudyryk · 2016-12-07T07:46:01Z

@kmike I would try to dump the patched _replaces from python to data.bin.
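A rough sketch of what such a dump could look like, assuming data.bin is simply the table entries encoded as UTF-8 and joined with NUL bytes (that is how the module's loader appears to read it, but please check the source before relying on it). `dump_replaces` and `load_replaces` are hypothetical helpers, and the table here is a toy list rather than the real `_replaces`.

```python
import os
import tempfile


def dump_replaces(replaces, path):
    """Write the entries as NUL-separated UTF-8 (assumed data.bin layout)."""
    if any("\x00" in entry for entry in replaces):
        raise ValueError("entries must not contain NUL, it is the separator")
    with open(path, "wb") as f:
        f.write("\x00".join(replaces).encode("utf8"))


def load_replaces(path):
    """Read the entries back, mirroring the assumed loader."""
    with open(path, "rb") as f:
        return f.read().decode("utf8").split("\x00")


# Toy table standing in for the real one; patch one entry, then round-trip it.
table = ["a", "io", "", "zh"]
table[1] = "yo"

path = os.path.join(tempfile.mkdtemp(), "data.bin")
dump_replaces(table, path)
restored = load_replaces(path)
```

The drawback is that a file patched this way would be overwritten the next time data.bin is regenerated from the Perl side.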

Fixing the Perl library is also possible, I suppose, but I'm not sure where to upload fixes; the CPAN version seems to have been unsupported for some years.

rudyryk · 2016-12-07T08:04:04Z

BTW, how did you generate data.bin? :) As far as I can see from the source code of Perl's Text-Unidecode, the mappings are defined in *.pm files.

pombredanne · 2017-01-01T17:04:27Z

@rudyryk See text-unidecode/_dump.pl, line 10 in e5655a9:

# usage: perl _dump.pl > src/text_unidecode/data.bin
