Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

FST lookup should support more scripts #64

Closed
valeriansaliou opened this issue Mar 19, 2019 · 3 comments
Closed

FST lookup should support more scripts #64

valeriansaliou opened this issue Mar 19, 2019 · 3 comments
Assignees
Labels
enhancement Enhancement to an existing feature
Milestone

Comments

@valeriansaliou
Copy link
Owner

valeriansaliou commented Mar 19, 2019

Currently 'lookup_begins()' in the FST manager only implements the Latin unicode range in its Regex, for performance reasons. It has been found at scale that queries take 20% more time if we support a wider alphabet in the regex (not sure why!).

We should map all Unicode ranges per script and use the first letter of the suggest word to build the regex that matches the provided word alphabet.

LOOKUP_REGEX_RANGE_LATIN will have siblings: LOOKUP_REGEX_RANGE_CYRILLIC, etc.

Use the following range database: http://kourge.net/projects/regexp-unicode-block

@valeriansaliou valeriansaliou self-assigned this Mar 19, 2019
@valeriansaliou valeriansaliou added the enhancement Enhancement to an existing feature label Mar 19, 2019
@valeriansaliou
Copy link
Owner Author

See code:

pub fn lookup_begins(&self, word: &str) -> Result<FSTStream<Regex>, ()> {

@valeriansaliou valeriansaliou added this to the v1.1.1 milestone Mar 23, 2019
@valeriansaliou
Copy link
Owner Author

valeriansaliou commented Mar 23, 2019

Here's how I would do it code-wise:

  1. Take input word in; detect the script of the first letter of the word (if it's Latin, Mandarin, etc.; if unknown then fallback on Latin)
  2. Depending on the script (use this Enum: https://docs.rs/whatlang/0.7.0/whatlang/enum.Script.html), route to the proper Regex with proper Unicode range for FST lookup as per http://kourge.net/projects/regexp-unicode-block
  3. Should be it!

@valeriansaliou
Copy link
Owner Author

For starters: the FST is a kind of graph that stores all words contained in the index; and that is handy to suggest / auto-complete an input word, or correct it for typo. In this issue, we're just looking at the incomplete word auto-complete system. Eg. type in "so" and it may suggest "sonic" if the word is in the index. The regex is there to tell the fst crate used which next characters it should expect. Unfortunately an ANY Regex match with the dot Regex notation is slow (I am not the author of the Regex module that fst implements). I found out that implementing Unicode ranges in the Regex generated for the input word alphabet proves to be near zero-cost, so we should do it this way.

Already done for Latin, but we need to support all the world's scripts, so a "Regex router" needs to be added to auto-detect the input word script and use the proper Regex. Using a wide-range Unicode Regex is a no-go, as tested performances are as bad as with the Regex dot notation.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement Enhancement to an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant