-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add transliteration option. #219
base: master
Are you sure you want to change the base?
Conversation
…er versions of Node.
…LE flag. Other small code tweaks.
…ble, but the automated testing on travis-ci.com didn't like it.
…er than build a string in a for loop.
By the way, even if you approve of what you see here, please hold off a little while before publishing anything with these changes. I've started working on an improved version of unidecode — the original unidecode hasn't been updated for four years now. What I've got so far runs about twice as fast as the original, and I'm expanding the transliterations to cover a lot of popular emojis not handled by the original. Since it's been so long since unidecode has been updated (making me unsure if anyone is actively maintaining it) I'll publish my own version as a separate npm package rather than wait for a response to merge my changes into the original. I'll then modify the code in this pull request to use my new version of unidecode as an option to using the original. |
…ng and German options. Remove conflicts in unit testing where unidecode and unidecode-plus transliterate differently.
…ther unidecode-plus in available or not.
The work I was doing on an improved version of unidecode is done now, and is incorporated in this PR now. |
First of all, I'm impressed by the quality of the pull requests you produce: improved README, tests, the code itself are all top notch. Great job Kerry! In the past, I struggled with adding transliteration functionality to iconv-lite because I didn't have a good grasp of intended use cases. Could you describe what use cases do you have in mind here, so that I could understand it better? One vague concern I have is that this PR introduces some coupling with the unidecode[-plus] library, e.g. "german" option and smart spacing functionality. If there would be additional improvements to unidecode library interface, they will have to be mirrored in iconv-lite, which could become a pain in case the versions drift. An idea I had initially, with regard to the transliteration interface would be to create a callback like this:
This would trivially allow all three major options of what to do with un-encodable characters: 1) replace them with "?", 2) error out (just throw exception from the callback) 3) transliterate to something smarter. Now there's a question for you - would one unicode point be enough context to do a reasonable transliteration job? Combining characters (e.g. diacritical marks or ZWJ/modified emojis) might be problematic, does unidecode handle them? I think we can adjust the callback signature to provide more context. Wdyt about it? |
var MAX_PENDING = 16384; | ||
|
||
TransliterationWrapper.prototype.write = function(str) { | ||
str = iconv.transliterate(str, this.encoder.encodingName, this.options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question I had is - would this handle surrogate chars? I.e. when a high surrogate is at the end of the previous block and a low surrogate is at the beginning of the current one? It's a valid situation when encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's taken care of. If the original unidecode is used, it only outputs ASCII, so there are no surrogates to worry about. If unidecode-plus is used, it processes a codepoint at a time, rather than a character at a time, so it keeps the surrogates together.
It is possible for surrogates to be split by TransliterationWrapper
, in this code which is currently at line 187 of index.js:
// Split the text being written out at a safe place where smart spacing won't
// need to make any changes.
var $ = /^(.*)([^\s\x80\x81][\s\x80\x81].*)$/.exec(str);
...but the logic works in such a way that they'll always get put back together again properly.
That's a good question. Mainly it's because I saw that transliteration was a feature of node-iconv, that you'd explicitly mentioned that iconv-lite didn't handle that, and I thought it would be an interesting challenge to make it possible. (Yes, I need to get a life. I know! 🤓) My own use case (which I'd handled previously with a much simpler transliteration function) has been for a geographic database I use for my astronomy web site, skyviewcafe.com. I transliterate place names name to create plain ASCII search keys. Then it doesn't matter if users type accented characters or not, I can find places by name either way.
I'm not sure how to address that, other than to guess that interface changes probably wouldn't happen too often.
I tried something like that on the way to the current version of my code. It worked, but it was a bit on the slow side calling unidecode one character at a time for transliterations, and it also didn't allow for the smart spacing option. Your suggestion does have the benefit of being very flexible, however.
I didn't change anything about how combining diacriticals are handled from the way the way the original unidecode handles that -- it simply keeps the unaccented characters from a combining pair and turns the diacritical into an empty string, which produces the same end result as converting a single-codepoint accented character into its unaccented equivalent. (Which does remind me, however, that my "german" mode wouldn't currently handle combining umlauts.) So, for that level of functionality, yes, one unicode point is enough. For ZWJ/modified emojis... nope! That's a whole other can of worms that would definitely require more context than one codepoint, and I certainly haven't tried to deal with that issue for transliteration myself. |
…d unit testing for streaming transliteration.
…d unit testing for streaming transliteration.
84ee650
to
9aa082f
Compare
This pull request would add the ability to provide transliteration in iconv-lite. To avoid bloating the code, and adding a dependency that many users wouldn't require, the transliteration capability only becomes available if the user optionally adds the "unidecode" package to their project separately.
The updated README contains the details about how this transliteration option would work. The main improvement over using unidecode directly is that unidecode transliterates everything that isn't ASCII into ASCII, whereas this implementation selectively taps into unidecode only for characters that don't exist in a particular target encoding.