FST lookup should support more scripts #64

valeriansaliou · 2019-03-19T15:28:16Z

Currently 'lookup_begins()' in the FST manager only implements the Latin unicode range in its Regex, for performance reasons. It has been found at scale that queries take 20% more time if we support a wider alphabet in the regex (not sure why!).

We should map all Unicode ranges per script and use the first letter of the suggest word to build the regex that matches the provided word alphabet.

LOOKUP_REGEX_RANGE_LATIN will have siblings: LOOKUP_REGEX_RANGE_CYRILLIC, etc.

Use the following range database: http://kourge.net/projects/regexp-unicode-block

The text was updated successfully, but these errors were encountered:

valeriansaliou · 2019-03-19T15:30:31Z

See code:

sonic/src/store/fst.rs

Line 458 in 911c914

pub fn lookup_begins(&self, word: &str) -> Result<FSTStream<Regex>, ()> {

valeriansaliou · 2019-03-23T18:17:02Z

Here's how I would do it code-wise:

Take input word in; detect the script of the first letter of the word (if it's Latin, Mandarin, etc.; if unknown then fallback on Latin)
Depending on the script (use this Enum: https://docs.rs/whatlang/0.7.0/whatlang/enum.Script.html), route to the proper Regex with proper Unicode range for FST lookup as per http://kourge.net/projects/regexp-unicode-block
Should be it!

valeriansaliou · 2019-03-23T18:20:48Z

For starters: the FST is a kind of graph that stores all words contained in the index; and that is handy to suggest / auto-complete an input word, or correct it for typo. In this issue, we're just looking at the incomplete word auto-complete system. Eg. type in "so" and it may suggest "sonic" if the word is in the index. The regex is there to tell the fst crate used which next characters it should expect. Unfortunately an ANY Regex match with the dot Regex notation is slow (I am not the author of the Regex module that fst implements). I found out that implementing Unicode ranges in the Regex generated for the input word alphabet proves to be near zero-cost, so we should do it this way.

Already done for Latin, but we need to support all the world's scripts, so a "Regex router" needs to be added to auto-detect the input word script and use the proper Regex. Using a wide-range Unicode Regex is a no-go, as tested performances are as bad as with the Regex dot notation.

valeriansaliou self-assigned this Mar 19, 2019

valeriansaliou added the enhancement Enhancement to an existing feature label Mar 19, 2019

valeriansaliou added this to the v1.1.1 milestone Mar 23, 2019

valeriansaliou modified the milestones: v1.1.1, v1.1.2 Mar 23, 2019

valeriansaliou closed this as completed in 7a1f184 Mar 29, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FST lookup should support more scripts #64

FST lookup should support more scripts #64

valeriansaliou commented Mar 19, 2019 •

edited

Loading

valeriansaliou commented Mar 19, 2019

valeriansaliou commented Mar 23, 2019 •

edited

Loading

valeriansaliou commented Mar 23, 2019

FST lookup should support more scripts #64

FST lookup should support more scripts #64

Comments

valeriansaliou commented Mar 19, 2019 • edited Loading

valeriansaliou commented Mar 19, 2019

valeriansaliou commented Mar 23, 2019 • edited Loading

valeriansaliou commented Mar 23, 2019

valeriansaliou commented Mar 19, 2019 •

edited

Loading

valeriansaliou commented Mar 23, 2019 •

edited

Loading