
Optimizations


N-grams

Front n-grams (token prefixes) are generated for each token, ranging from 3 to 6 characters in length.

def build_ngram_index(tokens, min_gram=3, max_gram=6):
    # (Function name is illustrative; the original snippet returns `terms`
    # from an unnamed scope.)
    # Map each front n-gram to the positions of the tokens that produce it.
    terms = {}
    for position, token in enumerate(tokens):
        # Take prefixes from min_gram up to max_gram characters,
        # capped at the token's own length.
        for window_length in range(min_gram, min(max_gram + 1, len(token) + 1)):
            gram = token[:window_length]
            if position not in terms.setdefault(gram, []):
                terms[gram].append(position)
    return terms
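
A quick sketch of the resulting index, assuming the build_ngram_index helper above:

>>> build_ngram_index(['hello', 'help'])
{'hel': [0, 1], 'hell': [0], 'hello': [0], 'help': [1]}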

Stop words

Queries are filtered using this set of stop words.

stopwords = {
        'a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by',
        'for', 'if', 'in', 'into', 'is', 'it',
        'no', 'not', 'of', 'on', 'or', 's', 'such',
        't', 'that', 'the', 'their', 'then', 'there', 'these',
        'they', 'this', 'to', 'was', 'will', 'with',
    }
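
A minimal sketch of how the filter might be applied (the tokens list here is illustrative):

tokens = [token for token in tokens if token not in stopwords]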

Punctuation

Queries are filtered using this regular expression, which matches punctuation marks.

import re

# A raw string keeps the regex escapes intact; \\ matches a literal backslash.
punctuation = re.compile(r'[~`!@#$%^&*()+={\[}\]|\\:;"\',<.>/?]')
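
Stripping punctuation from a query might then look like this (the query variable is illustrative):

query = punctuation.sub('', query)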

Stemming

Tokens are stemmed using NLTK's PorterStemmer to improve the quality of the search results.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
token = ps.stem(token)
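
Stemming collapses inflected forms to a common stem, so differently inflected query and document tokens can still match. For example, following the classic Porter examples:

ps.stem('connected')   # 'connect'
ps.stem('connecting')  # 'connect'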