fastText fixes in 3.7 break compatibility with old models #2341
Comments
'Full' fastText models (not KeyedVectors objects) trained in older Gensim versions can be loaded and worked with. There is even a warning message in the logs about the hash function being buggy. Unfortunately, this message itself is buggy and fails to show properly:
Hello @akutuzov, thanks for the fast report 👍 Regarding the "full" model message: a fix is already in #2339.
Update: @akutuzov, I reproduced the KeyedVectors problem (no additional info needed from you).
Reproduce backward compatibility bug
Full trace
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-340e13f11fe0> in <module>()
2
3 m = KeyedVectors.load("ft_kv.model")
----> 4 m.most_similar("human") # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"
/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
541 mean.append(weight * word)
542 else:
--> 543 mean.append(weight * self.word_vec(word, use_norm=True))
544 if word in self.vocab:
545 all_words.add(self.vocab[word].index)
/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
2057
2058 """
-> 2059 hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
2060
2061 if word in self.vocab:
AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'
Preliminary variant of the fix:
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..881aaf18 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -1974,6 +1974,14 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         self.num_ngram_vectors = 0
         self.compatible_hash = compatible_hash
+    @classmethod
+    def load(cls, fname_or_handle, **kwargs):
+        model = super(WordEmbeddingsKeyedVectors, cls).load(fname_or_handle, **kwargs)
+        if not hasattr(model, 'compatible_hash'):
+            model.compatible_hash = False
+
+        return model
+
     @property
     @deprecated("Attribute will be removed in 4.0.0, use self.vectors_vocab instead")
     def syn0_vocab(self):
@@ -2012,7 +2020,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if word in self.vocab:
             return True
         else:
-            hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+            hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
             char_ngrams = _compute_ngrams(word, self.min_n, self.max_n)
             return any(hash_fn(ng) % self.bucket in self.hash2index for ng in char_ngrams)
@@ -2056,7 +2064,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
             If word and all ngrams not in vocabulary.
         """
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
         if word in self.vocab:
             return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
@@ -2237,7 +2245,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
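The two ideas in the patch above (backfilling the attribute at load time, and falling back with getattr at the call site) can be illustrated outside gensim. This is only a minimal sketch: StandInVectors, from_saved_state and uses_fixed_hash are hypothetical names invented for the illustration, not gensim code. It assumes, as is the case for ordinary pickled objects, that loading restores the saved attributes without running __init__.

```python
class StandInVectors(object):
    """Hypothetical stand-in for a FastTextKeyedVectors-like class (not gensim code)."""

    def __init__(self, bucket, compatible_hash=True):
        self.bucket = bucket
        self.compatible_hash = compatible_hash

    @classmethod
    def from_saved_state(cls, state):
        # Unpickling restores the saved __dict__ without calling __init__,
        # so attributes introduced after the model was saved are absent.
        model = cls.__new__(cls)
        model.__dict__.update(state)
        # Load-time backfill, mirroring the load() override in the patch.
        if not hasattr(model, "compatible_hash"):
            model.compatible_hash = False
        return model

    def uses_fixed_hash(self):
        # Call-site fallback, mirroring the getattr(...) lines in the patch.
        return getattr(self, "compatible_hash", False)


# State saved by an "old" version has no compatible_hash entry.
old_model = StandInVectors.from_saved_state({"bucket": 2000000})
new_model = StandInVectors.from_saved_state({"bucket": 2000000, "compatible_hash": True})
print(old_model.uses_fixed_hash())  # False
print(new_model.uses_fixed_hash())  # True
```

Either mechanism alone avoids the AttributeError. The load-time backfill keeps the check in one place, while the getattr fallback also covers objects that reach the method without going through load().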
Fixed partially in #2339 (see #2341 (comment)).
Recent fixes to Gensim's fastText implementation introduced in #2313 are great. Unfortunately, they also break compatibility with fastText models trained by older Gensim versions, if the models are stored as a KeyedVectors() object. Such a model can be loaded, but as soon as you try to do anything useful with it (most_similar(), etc.), it fails, because the compatible_hash attribute is missing.
If this attribute is added manually after loading, everything works fine.
Steps/Code/Corpus to Reproduce
Expected Results
The compatible_hash attribute is automatically assigned the False value on load, and the model works as before.
Actual Results
Versions
Linux-4.15.0-43-generic-x86_64-with-LinuxMint-18.3-sylvia
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.7.0
FAST_VERSION 1