Add nmslib indexer #2417
Merged
Changes from 9 commits
fcca8f8 Add nmslib indexer (masa3141)
d8f8f85 use knnQueryBatch instead of knnQuery (masa3141)
acff166 install nmslib into CI only when python version is over 3.0 (masa3141)
69a7057 Merge remote-tracking branch 'upstream/develop' into feature/nmslib (mpenkov)
f8ea652 use smart_open's open (masa3141)
42f0192 use pickle.load instead of pickle.loads (masa3141)
7e2d07f improve doc string and add tutorial (masa3141)
60b2b92 Tweak docstring in nmslib.py (mpenkov)
300710e remove trailing whitespace (mpenkov)
9b4417a improve docstring and initializer (masa3141)
47dd709 clarify comment about implementation detail in nmslib.py (mpenkov)
ec7df96 fix white space (mpenkov)
7b21538 Update install.ps1 (mpenkov)
fbec445 Update install.ps1 (mpenkov)
a386f67 Update install.ps1 (mpenkov)
85b8d28 Create pip.sh (mpenkov)
e89b8e2 Update tox.ini (mpenkov)
32e6e51 Update pip.sh (mpenkov)
2227fcf Revert appveyor-related commits. (mpenkov)
c5bc0df change to use underscores instead of camel case (masa3141)
09abb6d Merge remote-tracking branch 'upstream/develop' into feature/nmslib (mpenkov)
a7f26a9 clean up tox.ini (mpenkov)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,225 @@ | ||
# -*- coding: utf-8 -*- | ||
# | ||
# Copyright (C) 2019 Radim Rehurek <me@radimrehurek.com> | ||
# Copyright (C) 2019 Masahiro Kazama <kazama.masa@gmail.com> | ||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
||
""" | ||
Intro | ||
----- | ||
|
||
This module integrates Nmslib with :class:`~gensim.models.word2vec.Word2Vec`, | ||
:class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText` and | ||
:class:`~gensim.models.keyedvectors.KeyedVectors`. | ||
To use nmslib, instantiate a :class:`~gensim.similarities.nmslib.NmslibIndexer` class | ||
and pass the instance as the indexer parameter to your model's most_similar method | ||
(e.g. :py:func:`~gensim.models.doc2vec.most_similar`). | ||
|
||
Example usage | ||
------------- | ||
|
||
.. sourcecode:: pycon | ||
|
||
>>> from gensim.similarities.nmslib import NmslibIndexer | ||
>>> from gensim.models import Word2Vec | ||
>>> | ||
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']] | ||
>>> model = Word2Vec(sentences, min_count=1, seed=1) | ||
>>> | ||
>>> indexer = NmslibIndexer(model) | ||
>>> model.most_similar("cat", topn=2, indexer=indexer) | ||
[('cat', 1.0), ('meow', 0.5595494508743286)] | ||
|
||
Load and save example | ||
--------------------- | ||
|
||
.. sourcecode:: pycon | ||
|
||
>>> from gensim.similarities.nmslib import NmslibIndexer | ||
>>> from gensim.models import Word2Vec | ||
>>> from tempfile import mkstemp | ||
>>> | ||
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']] | ||
>>> model = Word2Vec(sentences, min_count=1, seed=1, iter=10) | ||
>>> | ||
>>> indexer = NmslibIndexer(model) | ||
>>> _, temp_fn = mkstemp() | ||
>>> indexer.save(temp_fn) | ||
>>> | ||
>>> new_indexer = NmslibIndexer.load(temp_fn) | ||
>>> model.most_similar("cat", topn=2, indexer=new_indexer) | ||
[('cat', 1.0), ('meow', 0.5595494508743286)] | ||
|
||
What is Nmslib | ||
-------------- | ||
|
||
Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit | ||
for evaluation of similarity search methods. The core-library does not have any third-party dependencies. | ||
More information about Nmslib: `github repository <https://github.com/nmslib/nmslib>`_. | ||
|
||
Why use Nmslib? | ||
--------------- | ||
|
||
Gensim's current implementation finds the k nearest neighbors in a vector space by brute force, | ||
so its complexity is linear in the number of indexed documents, although with extremely low constant factors. | ||
The retrieved results are exact, which is overkill in many applications: | ||
approximate results retrieved in sub-linear time may be enough. | ||
Nmslib can find approximate nearest neighbors much faster. | ||
Compared to annoy, nmslib has more parameters to control the build and query time and accuracy. | ||
Depending on those parameters, nmslib can achieve faster and more accurate nearest neighbor search than annoy. | ||
""" | ||
|
||
from smart_open import open | ||
try: | ||
import cPickle as _pickle | ||
except ImportError: | ||
import pickle as _pickle | ||
|
||
from gensim.models.doc2vec import Doc2Vec | ||
from gensim.models.word2vec import Word2Vec | ||
from gensim.models.fasttext import FastText | ||
from gensim.models import KeyedVectors | ||
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors | ||
try: | ||
import nmslib | ||
except ImportError: | ||
raise ImportError( | ||
"Nmslib has not been installed. If you wish to use the nmslib indexer, please run `pip install nmslib`." | ||
) | ||
|
||
|
||
class NmslibIndexer(object): | ||
"""This class allows the use of `Nmslib <https://github.com/nmslib/nmslib>`_ as an indexer for the `most_similar` method | ||
from :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.doc2vec.Doc2Vec`, | ||
:class:`~gensim.models.fasttext.FastText` and :class:`~gensim.models.keyedvectors.Word2VecKeyedVectors` classes. | ||
|
||
""" | ||
|
||
def __init__(self, model=None, index_params=None, query_time_params=None): | ||
""" | ||
Parameters | ||
---------- | ||
model : :class:`~gensim.models.base_any2vec.BaseWordEmbeddingsModel`, optional | ||
Model, that will be used as source for index. | ||
If the model is None, index and labels are not initialized. | ||
In that case please load or init the index and labels by yourself. | ||
|
||
index_params : dict, optional | ||
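Parameters for building the Nmslib index, passed to `createIndex`. If None, defaults are used. | ||
query_time_params : dict, optional | ||
Query-time parameters for the Nmslib index, passed to `setQueryTimeParams`. If None, defaults are used. | ||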
|
||
""" | ||
if index_params is None: | ||
index_params = {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100, 'post': 0} | ||
if query_time_params is None: | ||
query_time_params = {'efSearch': 100} | ||
|
||
self.index = None | ||
self.labels = None | ||
self.model = model | ||
self.index_params = index_params | ||
self.query_time_params = query_time_params | ||
|
||
if model is not None: | ||
if isinstance(self.model, Doc2Vec): | ||
self._build_from_doc2vec() | ||
elif isinstance(self.model, (Word2Vec, FastText)): | ||
self._build_from_word2vec() | ||
elif isinstance(self.model, (WordEmbeddingsKeyedVectors, KeyedVectors)): | ||
self._build_from_keyedvectors() | ||
else: | ||
raise ValueError("model must be a Word2Vec, Doc2Vec, FastText or KeyedVectors instance") | ||
|
||
def save(self, fname, protocol=2): | ||
"""Save this NmslibIndexer instance to a file. | ||
|
||
Parameters | ||
---------- | ||
fname : str | ||
Path to the output file, | ||
will produce 2 files: `fname` - the Nmslib index and `fname`.d - the pickled parameters and labels. | ||
protocol : int, optional | ||
Protocol for pickle. | ||
|
||
Notes | ||
----- | ||
This method saves **only** the index (**the model isn't preserved**). | ||
|
||
""" | ||
fname_dict = fname + '.d' | ||
self.index.saveIndex(fname) | ||
d = {'index_params': self.index_params, 'query_time_params': self.query_time_params, 'labels': self.labels} | ||
with open(fname_dict, 'wb') as fout: | ||
_pickle.dump(d, fout, protocol=protocol) | ||
|
||
@classmethod | ||
def load(cls, fname): | ||
"""Load a NmslibIndexer instance from a file. | ||
|
||
Parameters | ||
---------- | ||
fname : str | ||
Path to dump with NmslibIndexer. | ||
|
||
""" | ||
fname_dict = fname + '.d' | ||
with open(fname_dict, 'rb') as f: | ||
d = _pickle.load(f) | ||
index_params = d['index_params'] | ||
query_time_params = d['query_time_params'] | ||
nmslib_instance = cls(index_params=index_params, query_time_params=query_time_params) | ||
index = nmslib.init() | ||
index.loadIndex(fname) | ||
nmslib_instance.index = index | ||
nmslib_instance.labels = d['labels'] | ||
return nmslib_instance | ||
|
||
def _build_from_word2vec(self): | ||
"""Build an Nmslib index using word vectors from a Word2Vec model.""" | ||
|
||
self.model.init_sims() | ||
return self._build_from_model(self.model.wv.vectors_norm, self.model.wv.index2word) | ||
|
||
|
||
def _build_from_doc2vec(self): | ||
"""Build an Nmslib index using document vectors from a Doc2Vec model.""" | ||
|
||
docvecs = self.model.docvecs | ||
docvecs.init_sims() | ||
labels = [docvecs.index_to_doctag(i) for i in range(0, docvecs.count)] | ||
return self._build_from_model(docvecs.vectors_docs_norm, labels) | ||
|
||
def _build_from_keyedvectors(self): | ||
"""Build an Nmslib index using word vectors from a KeyedVectors model.""" | ||
|
||
self.model.init_sims() | ||
return self._build_from_model(self.model.vectors_norm, self.model.index2word) | ||
|
||
def _build_from_model(self, vectors, labels): | ||
index = nmslib.init() | ||
index.addDataPointBatch(vectors) | ||
|
||
index.createIndex(self.index_params, print_progress=True) | ||
nmslib.setQueryTimeParams(index, self.query_time_params) | ||
|
||
self.index = index | ||
self.labels = labels | ||
|
||
def most_similar(self, vector, num_neighbors): | ||
"""Find the approximate `num_neighbors` most similar items. | ||
|
||
Parameters | ||
---------- | ||
vector : numpy.array | ||
Vector for word/document. | ||
num_neighbors : int | ||
Number of most similar items | ||
|
||
Returns | ||
------- | ||
list of (str, float) | ||
List of most similar items in format [(`item`, `similarity`), ... ], where `similarity` is `1 - distance / 2` for the distance reported by Nmslib. | ||
|
||
""" | ||
ids, distances = self.index.knnQueryBatch(vector.reshape(1, -1), k=num_neighbors)[0] | ||
|
||
return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))] |
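The final line of `most_similar` in the diff converts each distance returned by Nmslib into a score with `1 - distance / 2`. A minimal pure-Python sketch of that arithmetic follows. It assumes the index uses a cosine distance of the form `1 - cos(u, v)`, which ranges over [0, 2], so the score lands in [0, 1]. The helper names (`cosine_distance`, `distance_to_score`, `exact_most_similar`) are hypothetical and not part of this PR; the brute-force search is only a reference for checking scores, not the approximate search Nmslib performs.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); assumed here to match the distance Nmslib reports."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_to_score(distance):
    """Same arithmetic as the diff: map a distance in [0, 2] to a score in [0, 1]."""
    return 1.0 - distance / 2.0


def exact_most_similar(vectors, labels, query, num_neighbors):
    """Brute-force reference: score every vector, return the top `num_neighbors`."""
    scored = [(label, distance_to_score(cosine_distance(vec, query)))
              for label, vec in zip(labels, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_neighbors]
```

Under this assumption, identical directions score 1.0, orthogonal vectors score 0.5, and opposite directions score 0.0, so the returned value is a rescaled similarity rather than a distance.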
What happens if model is None? It may be worth including an example showing this use case, if it is valid.
If the model is None, the index and labels are not initialized. In that case, the user should load or initialize the index and labels themselves. I added this information to the docstring.
The load function also relies on this: it constructs the instance with model=None and then attaches the loaded index and labels.
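To illustrate the `model=None` path discussed above, here is a minimal sketch of the same save/load pattern that runs without nmslib or gensim installed. It is an assumption-laden stand-in, not the PR's code: `FakeIndex` and `SketchIndexer` are hypothetical names, and `FakeIndex` simply writes vectors as JSON where the real code calls Nmslib's `saveIndex` and `loadIndex`. What it mirrors from the diff is the layout (index in `fname`, pickled parameters and labels in `fname + '.d'`) and the fact that `load` builds an empty instance first and fills it in afterwards.

```python
import json
import os
import pickle
import tempfile


class FakeIndex:
    """Hypothetical stand-in for an Nmslib index; stores raw vectors as JSON."""

    def __init__(self):
        self.vectors = []

    def saveIndex(self, fname):
        with open(fname, "w") as fout:
            json.dump(self.vectors, fout)

    def loadIndex(self, fname):
        with open(fname) as fin:
            self.vectors = json.load(fin)


class SketchIndexer:
    """Hypothetical indexer mirroring the constructor/save/load shape in the diff."""

    def __init__(self, vectors=None, labels=None, index_params=None):
        self.index_params = index_params if index_params is not None else {"M": 100}
        self.index = None
        self.labels = None
        # With no source data (the model=None case), index and labels stay unset.
        if vectors is not None:
            self.index = FakeIndex()
            self.index.vectors = [list(v) for v in vectors]
            self.labels = list(labels)

    def save(self, fname, protocol=2):
        # Two files: the index itself, plus pickled metadata next to it.
        self.index.saveIndex(fname)
        meta = {"index_params": self.index_params, "labels": self.labels}
        with open(fname + ".d", "wb") as fout:
            pickle.dump(meta, fout, protocol=protocol)

    @classmethod
    def load(cls, fname):
        with open(fname + ".d", "rb") as fin:
            meta = pickle.load(fin)
        # Build an empty instance, then attach the loaded index and labels.
        instance = cls(index_params=meta["index_params"])
        instance.index = FakeIndex()
        instance.index.loadIndex(fname)
        instance.labels = meta["labels"]
        return instance
```

A round trip would save to a temporary path, confirm that both `path` and `path + '.d'` exist, and check that the restored labels match the originals. An instance created with no data is only usable once `index` and `labels` are set, which is the responsibility the docstring places on the caller.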