fastText fixes in 3.7 break compatibility with old models #2341

Closed
akutuzov opened this issue Jan 19, 2019 · 6 comments · Fixed by #2339 or #2349

Comments

@akutuzov
Contributor

akutuzov commented Jan 19, 2019

The recent fixes to Gensim's fastText implementation introduced in #2313 are great. Unfortunately, they also break compatibility with fastText models trained by older Gensim versions if the models are stored as a KeyedVectors() object. Such a model can still be loaded, but as soon as you try to do anything useful with it (most_similar(), etc.), it fails because the compatible_hash attribute is missing.
If this attribute is set manually after loading, everything works fine.
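
The manual workaround mentioned above could look roughly like the sketch below. The helper name ensure_compatible_hash is hypothetical (it is not part of Gensim), and the default of False is an assumption taken from the expected behaviour described in this report: older models should fall back to the old hash function.

```python
def ensure_compatible_hash(kv, default=False):
    """Set kv.compatible_hash only when the loaded object lacks it.

    `kv` is any loaded vectors object; the helper name and the default
    value are illustrative assumptions, not Gensim API.
    """
    if not hasattr(kv, "compatible_hash"):
        kv.compatible_hash = default
    return kv


# Intended use with Gensim 3.7.0 (not executed here):
# model = ensure_compatible_hash(gensim.models.KeyedVectors.load(path))
# model.most_similar(positive=word)
```

Objects that already carry the attribute are returned untouched, so the helper should be safe to apply to models saved by 3.7.0 as well.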

Steps/Code/Corpus to Reproduce

import gensim

model = gensim.models.KeyedVectors.load(ANY_KEYED_VECTORS_FASTTEXT_MODEL)
model.most_similar(positive=ANY_WORD)

Expected Results

The compatible_hash attribute is automatically assigned the False value on load, and the model works as before.

Actual Results

/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   2057 
   2058         """
-> 2059         hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
   2060 
   2061         if word in self.vocab:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'

Versions

Linux-4.15.0-43-generic-x86_64-with-LinuxMint-18.3-sylvia
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.7.0
FAST_VERSION 1

@akutuzov
Contributor Author

@mpenkov
@menshikh-iv

@akutuzov
Contributor Author

akutuzov commented Jan 19, 2019

'Full' fastText models (not KeyedVectors objects) trained in older Gensim versions can be loaded and worked with. There is even a warning message in the logs about the hash function being buggy. Unfortunately, this message itself is buggy and fails to show properly:

2019-01-19 20:36:08,303 : INFO : loaded test_fasttext.model
--- Logging error ---
Traceback (most recent call last):
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 986, in emit
    msg = self.format(record)
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 836, in format
    return fmt.format(record)
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 573, in format
    record.message = record.getMessage()
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "test.py", line 6, in <module>
    model = gensim.models.fasttext.FastText.load('test_fasttext.model')
  File "/projects/ltg/python3/lib/python3.5/site-packages/gensim/models/fasttext.py", line 845, in load
    "The model will continue to work, but consider training it "                                                                                                                             
Message: 'This older model was trained with a buggy hash function.  '                                                                                                                        
Arguments: ('The model will continue to work, but consider training it from scratch.',)     
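
The "Message:" and "Arguments:" lines above suggest that the second string literal was passed to the logger as a format argument instead of being joined to the first one, presumably because of a stray comma. The sketch below reproduces that failure mode with the standard logging module only. The render helper is hypothetical, and this illustrates the likely cause rather than the actual Gensim source.

```python
import logging

PART_1 = "This older model was trained with a buggy hash function.  "
PART_2 = "The model will continue to work, but consider training it from scratch."


def render(msg, *args):
    """Format msg with args the way a logging handler would, without emitting it."""
    record = logging.LogRecord("demo", logging.WARNING, "demo.py", 0, msg, args, None)
    return record.getMessage()


# Stray comma: PART_2 becomes a %-format argument, but PART_1 has no placeholder.
try:
    render(PART_1, PART_2)
except TypeError as err:
    print("formatting failed:", err)

# Joining the parts (adjacent literals or +) yields one message and no arguments.
print(render(PART_1 + PART_2))
```

Because handlers catch formatting errors and print "--- Logging error ---" instead of raising, a mistake like this does not stop the program, which matches the model still loading in the log above.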

@menshikh-iv
Contributor

menshikh-iv commented Jan 20, 2019

Hello @akutuzov,

thanks for the fast report 👍

About the "full" model message: the fix is already in #2339.
About KeyedVectors: can you share an ANY_KEYED_VECTORS_FASTTEXT_MODEL that reproduces the problem described in #2341 (comment)? We'll fix it ASAP in that case. I'm not sure, but 3.7.1 may appear in the next 2 weeks.

@menshikh-iv menshikh-iv added bug Issue described a bug fasttext Issues related to the FastText model labels Jan 20, 2019
@menshikh-iv
Contributor

Update: @akutuzov, I reproduced the KV problem (no additional info needed from you).

Reproducing the backward compatibility bug

  1. Train FT & save KV in gensim==3.6.0

    from gensim.test.utils import common_texts
    from gensim.models import FastText
    
    m = FastText(common_texts, min_count=0)
    m.wv.save("ft_kv.model")

    Produced file (gzipped afterwards for uploading to GitHub): ft_kv.model.gz

  2. Load KV in gensim==3.7.0 and use it

    from gensim.models.keyedvectors import FastTextKeyedVectors, KeyedVectors
    
    m = KeyedVectors.load("ft_kv.model")
    m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"
    
    m = FastTextKeyedVectors.load("ft_kv.model")
    m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"

Full trace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-340e13f11fe0> in <module>()
      2 
      3 m = KeyedVectors.load("ft_kv.model")
----> 4 m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"

/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    541                 mean.append(weight * word)
    542             else:
--> 543                 mean.append(weight * self.word_vec(word, use_norm=True))
    544                 if word in self.vocab:
    545                     all_words.add(self.vocab[word].index)

/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
   2057 
   2058         """
-> 2059         hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
   2060 
   2061         if word in self.vocab:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'
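
Why the attribute goes missing can be shown without Gensim. Assuming the saved object is restored through pickle, default unpickling rebuilds the instance from its saved attribute dictionary and never calls __init__, so attributes introduced in a newer class version are simply absent. The class and the restore helper below are hypothetical stand-ins that mimic this behaviour.

```python
class FastTextVectors:
    """Stand-in for the newer class version; __init__ now sets compatible_hash."""

    def __init__(self, compatible_hash=True):
        self.bucket = 100
        self.compatible_hash = compatible_hash

    def hash_kind(self):
        return "fixed" if self.compatible_hash else "broken"


def restore(cls, saved_state):
    """Rebuild an instance the way default unpickling does: __init__ is skipped."""
    obj = cls.__new__(cls)
    obj.__dict__.update(saved_state)
    return obj


# State as an older version would have saved it: no compatible_hash entry.
old_state = {"bucket": 100}
model = restore(FastTextVectors, old_state)

try:
    model.hash_kind()
except AttributeError as err:
    print(err)
```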

@menshikh-iv
Contributor

Preliminary variant of the fix
CC @mpenkov

diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..881aaf18 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -1974,6 +1974,14 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         self.num_ngram_vectors = 0
         self.compatible_hash = compatible_hash
 
+    @classmethod
+    def load(cls, fname_or_handle, **kwargs):
+        model = super(WordEmbeddingsKeyedVectors, cls).load(fname_or_handle, **kwargs)
+        if not hasattr(model, 'compatible_hash'):
+            model.compatible_hash = False
+
+        return model
+
     @property
     @deprecated("Attribute will be removed in 4.0.0, use self.vectors_vocab instead")
     def syn0_vocab(self):
@@ -2012,7 +2020,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if word in self.vocab:
             return True
         else:
-            hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+            hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
             char_ngrams = _compute_ngrams(word, self.min_n, self.max_n)
             return any(hash_fn(ng) % self.bucket in self.hash2index for ng in char_ngrams)
 
@@ -2056,7 +2064,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
             If word and all ngrams not in vocabulary.
 
         """
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
 
         if word in self.vocab:
             return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
@@ -2237,7 +2245,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return
 
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
 
         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
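
The diff above combines two safeguards: backfilling the attribute at load time, and reading it defensively with getattr. A minimal, Gensim-free sketch of that pattern follows. BaseVectors, PatchedVectors, and the dict-based load are hypothetical simplifications; the real classes load from a file.

```python
class BaseVectors:
    @classmethod
    def load(cls, saved_state):
        """Hypothetical loader: rebuilds an instance from a saved attribute dict."""
        obj = cls.__new__(cls)
        obj.__dict__.update(saved_state)
        return obj


class PatchedVectors(BaseVectors):
    def __init__(self, compatible_hash=True):
        self.compatible_hash = compatible_hash

    @classmethod
    def load(cls, saved_state):
        model = super().load(saved_state)
        # Safeguard 1: older saves lack the attribute, so default to the old hash.
        if not hasattr(model, "compatible_hash"):
            model.compatible_hash = False
        return model

    def hash_kind(self):
        # Safeguard 2: still works if the object was built without load().
        return "fixed" if getattr(self, "compatible_hash", False) else "broken"


old_model = PatchedVectors.load({"bucket": 100})
print(old_model.hash_kind())
print(PatchedVectors().hash_kind())
```

Either safeguard alone would avoid the AttributeError. Having both covers objects that reach the class through a different loading path.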

@menshikh-iv
Contributor

Partially fixed (#2341 (comment)) in #2339.
Waiting on #2340 for the full fix.
