Adding support for languages with discernible delimiters #40

arnavkapoor · 2020-08-26T08:23:33Z

Languages without delimiters: Japanese, Chinese (Simplified and Traditional), and possibly other East Asian languages don't put any delimiter between number words, e.g. 九千九百九十九 (9999 in Japanese). Their structure is actually very similar to English, but the lack of a delimiter makes them tougher to parse.
Also, German and Dutch don't use a delimiter as such (up to a certain number).

One approach I have in mind for the missing delimiter is to read the input character by character and, as soon as the characters read so far match one of the known number words, insert a space. After this pre-processing step we can follow the same logic as for delimited languages. This does increase the complexity to O(string_length ^ 2), which I believe shouldn't be a major issue. (We can use this function only for the languages without delimiters.)

Concrete example

five thousand nine hundred and thirteen - English (5913) 
fünftausendneunhundertdreizehn - German (5913)

nine hundred and thirteen - English (913)
negenhonderddertien - Dutch (913)

To handle this, we first check f, fü, fün and finally hit fünf = 5 (and similarly negen = 9 for Dutch), insert a space, and then start again from the next character.
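The pre-processing step described above could be sketched roughly as follows. This is only an illustration: the word sets and the function name `split_number_words` are hypothetical, not part of the library. It also deviates from the description in one respect: it takes the longest matching word at each position instead of stopping at the first match, so that "dreizehn" is not split into "drei" and "zehn".

```python
# Hypothetical vocabularies for illustration only; a real word list would be
# much larger and would come from the library's own language data.
GERMAN_WORDS = {"fünf", "neun", "drei", "zehn", "dreizehn", "hundert", "tausend"}
DUTCH_WORDS = {"negen", "honderd", "dertien"}


def split_number_words(text, vocabulary):
    """Insert spaces between known number words in an undelimited string."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so "dreizehn" wins over "drei".
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in vocabulary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No vocabulary word starts at this position.
            raise ValueError(f"no known word at position {pos} in {text!r}")
    return " ".join(tokens)


print(split_number_words("fünftausendneunhundertdreizehn", GERMAN_WORDS))
# fünf tausend neun hundert dreizehn
print(split_number_words("negenhonderddertien", DUTCH_WORDS))
# negen honderd dertien
```

Because each position only tries candidates up to the longest vocabulary word, the work grows with the string length times that maximum word length rather than with the square of the string length. A greedy scan like this does not backtrack, so a vocabulary with awkward overlaps could still need a smarter strategy.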


noviluni · 2020-08-26T09:07:16Z

Just to give another approach for German and Dutch: depending on the number of unique tokens, we could do the inverse process and try to match the tokens with their numbers.

As an example (I haven't thought about how to implement it, it's just an idea):

>>> s = 'fünftausendneunhundertdreizehn'   
>>> s.replace('fünf', '5*').replace('tausend', '1000+').replace('neun', '9*').replace('hundert', '100+').replace('dreizehn', '13')
'5*1000+9*100+13'

This could reduce the complexity for long numbers.
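One possible way to turn this idea into code, as a rough sketch only: instead of chaining `replace` calls and building an expression string, match known tokens with a regular expression and accumulate the value directly. The token tables and the name `compound_to_number` are hypothetical, and the tables cover just the words from the example.

```python
import re

# Hypothetical token tables for illustration; not the library's actual data.
UNITS = {"drei": 3, "fünf": 5, "neun": 9, "zehn": 10, "dreizehn": 13}
MULTIPLIERS = {"hundert": 100, "tausend": 1000}

# Longest alternatives first so "dreizehn" is matched before "drei".
_TOKEN_RE = re.compile(
    "|".join(
        re.escape(token)
        for token in sorted(list(UNITS) + list(MULTIPLIERS), key=len, reverse=True)
    )
)


def compound_to_number(text):
    """Evaluate an undelimited number word such as 'neunhundertdreizehn'."""
    total = 0    # completed groups, e.g. 5 * 1000
    current = 0  # group still being built, e.g. 9 * 100 + 13
    pos = 0
    while pos < len(text):
        match = _TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unknown token at position {pos} in {text!r}")
        token = match.group()
        pos = match.end()
        if token in UNITS:
            current += UNITS[token]
        elif MULTIPLIERS[token] == 100:
            # "hundert" scales the group in progress; a bare "hundert" is 100.
            current = max(current, 1) * 100
        else:
            # Larger multipliers close the group and move it into the total.
            total += max(current, 1) * MULTIPLIERS[token]
            current = 0
    return total + current


print(compound_to_number("fünftausendneunhundertdreizehn"))  # 5913
```

Each token is consumed in a single left-to-right pass. German compounds that use "und" (for example "einundzwanzig") are not handled here and would need extra rules.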

Tejasvinarora0110 · 2020-10-18T15:55:51Z

Why can't we just translate all other languages to English and then convert them to numbers?
I guess this would reduce the effort. Translation can be done using Googletrans.

noviluni · 2020-12-28T16:42:44Z

Hi @Tejasvinarora0110, sorry for the late answer.

There are multiple reasons to avoid using Google Translate:

This library is meant to work offline.
We want to keep the list of dependencies as small as possible.
Keeping each language independent from the others (like English) allows developing concrete solutions for each one.
Avoiding external services will allow improving the performance.
