Currently `lookup_begins()` in the FST manager only implements the Latin Unicode range in its regex, for performance reasons. At scale, queries were found to take 20% more time when the regex supports a wider alphabet (not sure why!).
We should map all Unicode ranges per script, and use the first letter of the word to suggest for to build a regex that matches that word's alphabet.
LOOKUP_REGEX_RANGE_LATIN will have siblings: LOOKUP_REGEX_RANGE_CYRILLIC, etc.
For starters: the FST is a kind of graph that stores all words contained in the index. It is handy for suggesting / auto-completing an input word, or correcting its typos. In this issue, we're only looking at the auto-complete system for incomplete words. E.g. type in "so" and it may suggest "sonic" if the word is in the index. The regex is there to tell the fst crate which next characters it should expect. Unfortunately, an ANY match using the regex dot notation is slow (I am not the author of the regex module that fst implements). I found out that using Unicode ranges in the regex generated for the input word's alphabet is near zero-cost, so we should do it this way.
This is already done for Latin, but we need to support all the world's scripts, so a "regex router" needs to be added to auto-detect the input word's script and use the proper regex. Using a wide-range Unicode regex is a no-go, as its tested performance is as bad as with the regex dot notation.
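A minimal sketch of what such a "regex router" could look like, using only the standard library. Apart from `LOOKUP_REGEX_RANGE_LATIN` and `LOOKUP_REGEX_RANGE_CYRILLIC`, every name here (`LOOKUP_REGEX_RANGE_GREEK`, `lookup_regex_range`, `escape_regex`, `lookup_begins_regex`) is hypothetical, and the range bounds are illustrative values taken from the standard Unicode blocks, not from the existing code. It also assumes the regex syntax used with the fst crate accepts `\x{...}` escapes and that the final pattern has the shape "escaped word, then zero or more characters from the script range".

```rust
// Hypothetical per-script character classes. The bounds follow the Unicode
// blocks (Latin: Basic Latin to Latin Extended-B; Cyrillic: Cyrillic plus
// Cyrillic Supplement; Greek: Greek and Coptic). They are assumptions for
// illustration, not the values used by the project.
const LOOKUP_REGEX_RANGE_LATIN: &str = "[\\x{0000}-\\x{024F}]";
const LOOKUP_REGEX_RANGE_CYRILLIC: &str = "[\\x{0400}-\\x{052F}]";
const LOOKUP_REGEX_RANGE_GREEK: &str = "[\\x{0370}-\\x{03FF}]";

/// Routes on the first character of the word and returns the character
/// class for its script, or `None` if the word is empty or the script is
/// not mapped yet.
fn lookup_regex_range(word: &str) -> Option<&'static str> {
    let first = word.chars().next()?;

    match first as u32 {
        0x0000..=0x024F => Some(LOOKUP_REGEX_RANGE_LATIN),
        0x0370..=0x03FF => Some(LOOKUP_REGEX_RANGE_GREEK),
        0x0400..=0x052F => Some(LOOKUP_REGEX_RANGE_CYRILLIC),
        _ => None,
    }
}

/// Backslash-escapes common regex metacharacters so that the input word
/// is matched literally.
fn escape_regex(word: &str) -> String {
    let mut escaped = String::with_capacity(word.len());

    for character in word.chars() {
        if "\\.+*?()|[]{}^$".contains(character) {
            escaped.push('\\');
        }

        escaped.push(character);
    }

    escaped
}

/// Builds the "begins with" pattern: the literal word followed by any
/// number of characters from the same script range.
fn lookup_begins_regex(word: &str) -> Option<String> {
    let range = lookup_regex_range(word)?;

    Some(format!("{}{}*", escape_regex(word), range))
}
```

With these assumptions, `lookup_begins_regex("so")` yields `so[\x{0000}-\x{024F}]*`, while a word in an unmapped script returns `None`, so the caller can decide whether to skip suggestions or fall back to another range. Routing on the first letter only keeps the check cheap, at the cost of not handling mixed-script words.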
Use the following range database: http://kourge.net/projects/regexp-unicode-block