Optimizations
Front n-grams (token prefixes) are generated for each token, with gram lengths ranging from 3 to 6.
```python
def front_ngrams(tokens, min_gram=3, max_gram=6):
    # Map each token prefix of length min_gram..max_gram to the positions
    # of the tokens that produce it. Defaults follow the 3-to-6 gram
    # lengths described above.
    terms = {}
    for position, token in enumerate(tokens):
        for window_length in range(min_gram, min(max_gram + 1, len(token) + 1)):
            gram = token[:window_length]
            terms.setdefault(gram, [])
            if position not in terms[gram]:
                terms[gram].append(position)
    return terms
```
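As a quick illustration (the sample tokens and the `front_ngrams` wrapper name are for demonstration, not from the original code):

```python
front_ngrams(['optimization', 'search'])
# => {'opt': [0], 'opti': [0], 'optim': [0], 'optimi': [0],
#     'sea': [1], 'sear': [1], 'searc': [1], 'search': [1]}
```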
Queries are filtered using this set of stop words.
```python
stopwords = set([
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by',
    'for', 'if', 'in', 'into', 'is', 'it',
    'no', 'not', 'of', 'on', 'or', 's', 'such',
    't', 'that', 'the', 'their', 'then', 'there', 'these',
    'they', 'this', 'to', 'was', 'will', 'with'
])
```
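A minimal sketch of applying the filter to a tokenized query (the sample tokens are illustrative):

```python
query_tokens = ['the', 'rain', 'in', 'spain']
filtered = [t for t in query_tokens if t not in stopwords]
# filtered == ['rain', 'spain']
```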
Punctuation is stripped from queries using this regular expression.
```python
import re

# Raw string, so the \\ in the character class matches a literal backslash.
punctuation = re.compile(r'[~`!@#$%^&*()+={\[}\]|\\:;"\',<.>/?]')
```
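For example, the compiled pattern can remove punctuation with `sub` (the call and sample query are illustrative; the page only shows the pattern itself):

```python
punctuation.sub('', 'c++ and .net?')
# => 'c and net'
```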
Tokens are stemmed using nltk's PorterStemmer class to improve the quality of the search results.
```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Applied to each token during indexing and querying.
token = ps.stem(token)
```
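For instance, with some illustrative tokens:

```python
[ps.stem(t) for t in ['running', 'searches', 'stemmed']]
# => ['run', 'search', 'stem']
```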