Adding support for languages with discernible delimiters #40

Open
arnavkapoor opened this issue Aug 26, 2020 · 3 comments
Comments

@arnavkapoor
Collaborator

Languages without delimiters: Japanese and Chinese (Simplified and Traditional), and possibly other East Asian languages, don't put any delimiter between number words, e.g. 九千九百九十九 (9999 in Japanese). Their structure is actually very similar to English, but the lack of a delimiter makes parsing harder.
German and Dutch also write number words without a delimiter (up to a certain number).

One approach to the delimiter problem is to read the input character by character and insert a space as soon as the characters read so far match a known number word. After this pre-processing step we can follow the same logic as for the other languages. This does increase the complexity to O(string_length ^ 2), which I believe shouldn't be a major issue. (We can use this function only for the languages without delimiters.)

Concrete example

five thousand nine hundred and thirteen - English (5913) 
fünftausendneunhundertdreizehn - German (5913)

nine hundred and thirteen - English (913)
negenhonderddertien - Dutch (913)

To handle this, we first check f, fü, fün and finally hit fünf = 5 (and similarly negen = 9 for Dutch), insert a space, and then start again from the next character.
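The prefix-matching idea above could be sketched roughly as follows. This is only an illustration, not code from the library: `insert_delimiters`, `GERMAN_WORDS` and `DUTCH_WORDS` are hypothetical names, and the vocabularies contain just the words from the examples.

```python
# Hypothetical vocabularies: only the number words used in the examples above.
GERMAN_WORDS = {"fünf", "neun", "dreizehn", "tausend", "hundert"}
DUTCH_WORDS = {"negen", "honderd", "dertien"}


def insert_delimiters(text, vocabulary):
    """Grow a prefix one character at a time and cut at the first known word."""
    words = []
    start = 0
    while start < len(text):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in vocabulary:
                words.append(candidate)
                start = end  # start again from the next character
                break
        else:
            # No prefix starting here is a known word.
            raise ValueError(f"cannot segment {text[start:]!r}")
    return " ".join(words)


print(insert_delimiters("fünftausendneunhundertdreizehn", GERMAN_WORDS))
print(insert_delimiters("negenhonderddertien", DUTCH_WORDS))
```

One caveat with cutting at the first match: if the vocabulary also contained "drei" and "zehn", "dreizehn" would be split into two words, so a real implementation would probably need longest-match or backtracking.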

@noviluni
Contributor

noviluni commented Aug 26, 2020

Just to give another approach for German and Dutch: depending on the number of unique tokens, we could do the inverse process and replace each token with the number it stands for.

As an example (I haven't thought about how to implement it, it's just an idea):

>>> s = 'fünftausendneunhundertdreizehn'   
>>> s.replace('fünf', '5*').replace('tausend', '1000+').replace('neun', '9*').replace('hundert', '100+').replace('dreizehn', '13')
'5*1000+9*100+13'

This could reduce the complexity for long numbers.
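A slightly fuller sketch of the same idea, under some assumptions: the token table `GERMAN_TOKENS` and the helpers `to_expression` and `evaluate` are made-up names for illustration, the table only covers the words in the example, and the resulting string is evaluated by splitting on `+` and `*` instead of calling `eval()`.

```python
# Hypothetical token table: multipliers end in "*", scale words end in "+",
# mirroring the replace() chain above.
GERMAN_TOKENS = {
    "dreizehn": "13",
    "tausend": "1000+",
    "hundert": "100+",
    "fünf": "5*",
    "neun": "9*",
}


def to_expression(text, tokens):
    # Replace longer words first so a short token cannot split a longer one.
    for word in sorted(tokens, key=len, reverse=True):
        text = text.replace(word, tokens[word])
    # Drop a dangling operator, e.g. "5*100+" for "fünfhundert".
    return text.rstrip("+*")


def evaluate(expression):
    # Evaluate a sum of products without eval(): split on "+", then on "*".
    total = 0
    for term in expression.split("+"):
        product = 1
        for factor in term.split("*"):
            product *= int(factor)
        total += product
    return total


expression = to_expression("fünftausendneunhundertdreizehn", GERMAN_TOKENS)
print(expression, "=", evaluate(expression))
```

Any word missing from the table is left in the string and makes `int()` raise a `ValueError`, which at least fails loudly on unsupported input.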

@Tejasvinarora0110

Why can't we just translate all other languages to English and then convert the result to numbers?
I guess this would reduce the effort. Translation can be done using Googletrans.

@noviluni
Contributor

Hi @Tejasvinarora0110, sorry for the late answer.

There are multiple reasons to avoid using Google Translate:

  1. This library is meant to work offline.
  2. We want to keep the list of dependencies as small as possible.
  3. Keeping each language independent of the others (such as English) allows us to develop language-specific solutions.
  4. Avoiding external services improves performance.
