Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977 (#2012)
Plan:
We need to cover as many situations as possible, because this kind of problem is already starting to cause trouble.
gensim/test/test_word2vec.py
Outdated
saved_models_dir = datapath('old_w2v_models')
for old_version in old_versions:
    model = word2vec.Word2Vec.load(os.path.join(saved_models_dir, 'w2v_{}.mdl'.format(old_version)))
    self.assertTrue(len(model.wv.vocab) == 3)
Add a `most_similar` check and update a model (do the same for d2v).
gensim/models/deprecated/doc2vec.py
Outdated
    new_model.docvecs.max_rawint = old_model.docvecs.__dict__.get('max_rawint')
    new_model.docvecs.offset2doctag = old_model.docvecs.__dict__.get('offset2doctag')
else:
    new_model.docvecs.max_rawint = len(old_model.docvecs.index2doctag) if old_model.docvecs.index2doctag else old_model.docvecs.count
    new_model.docvecs.max_rawint = \
Magic: definitely deserves a comment.
gensim/test/test_doc2vec.py
Outdated
doc0_inferred = model.infer_vector(list(DocsLeeCorpus())[0].words)
sims_to_infer = model.docvecs.most_similar([doc0_inferred], topn=len(model.docvecs))
self.assertTrue(sims_to_infer)
Can you add `save` + `load` + `infer_vector` here (to be 100% sure that the model persists correctly)? Make sure you use the /tmp directory; check `gensim.test.utils`, where you'll find the needed functions (same for w2v).
Also, please try updating the model (as for w2v).
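The round trip requested here (save to a temporary location, load back, then run inference again) might look roughly like the sketch below. `TinyModel` and `roundtrip` are hypothetical stand-ins written only for illustration; they are not Gensim classes, and a real test would use the actual Doc2Vec model together with the temporary-file helpers from `gensim.test.utils`.

```python
import json
import os
import tempfile


class TinyModel:
    """Hypothetical stand-in for a trained model; not a Gensim class."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def infer_vector(self, words):
        # Deterministic toy "inference": sum the weights of the known words.
        return [sum(self.weights.get(w, 0.0) for w in words)]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.weights, fh)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as fh:
            return cls(json.load(fh))


def roundtrip(model):
    """Save the model inside a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.json")
        model.save(path)
        return TinyModel.load(path)


original = TinyModel({"human": 0.5, "computer": 0.25})
reloaded = roundtrip(original)
words = ["human", "computer", "unknown"]

# The reloaded model should behave exactly like the one that was saved.
assert reloaded.infer_vector(words) == original.infer_vector(words)
```

Writing into a temporary directory keeps the test from leaving files in the repository's test data folder.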
gensim/test/test_doc2vec.py
Outdated
'3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0'
]

saved_models_dir = datapath('old_d2v_models')
Better: `datapath('old_d2v_models/d2v_{}.mdl')`, and format later.
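A minimal sketch of this suggestion: build the path template once, then fill in the version inside the loop. The `datapath` function below is a hypothetical stand-in with a made-up base directory so that the snippet is self-contained; the version list is only the tail of the list visible in the diff.

```python
import os

# Hypothetical base directory, used only so the sketch runs on its own.
TEST_DATA_DIR = os.path.join("gensim", "test", "test_data")


def datapath(fname):
    """Stand-in for a test-data helper: join a file name onto the data dir."""
    return os.path.join(TEST_DATA_DIR, fname)


# Versions visible in the reviewed hunk; the full list in the test is longer.
old_versions = ['3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0']

# Resolve the template once, keeping the '{}' placeholder in the file name.
template = datapath('old_d2v_models/d2v_{}.mdl')

# Fill in the version later, one path per saved model.
model_paths = [template.format(version) for version in old_versions]
```

This avoids the separate `saved_models_dir` variable and the repeated `os.path.join` call inside the loop.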
…g old Gensim versions. Fix piskvorky#2000, piskvorky#1977 (piskvorky#2012)
* adds default values for attributes
* ignore values for attributes that do not exist
* adds unit test
* fixes default values for missing attributes for older gensim models
* adds unit test cases for loading really old gensim models
* adds test cases for loading all old models
* adds more tests post loading
* handles loading d2v models saved using version <=0.12.2
* fix `max_rawint` value and PEP8 errors
* adds saving and loading back tests
* adds comments and fixes `max_rawint`
* fix PEP8
This PR addresses #2000 and #1977. The issues were caused by a few attributes (like `min_alpha_yet_reached` and `running_training_loss`) that are missing in models saved with really old Gensim versions. I have added tests that load a word2vec model saved using Gensim version 0.12.0 and a doc2vec model saved using Gensim 0.13.0. The tests also check online training and a similarity search after loading these old models.
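The general idea of supplying defaults for attributes that an old saved model lacks can be sketched as follows. This is not the code from the PR: `backfill_missing_attributes`, `ASSUMED_DEFAULTS`, and the default values themselves are assumptions chosen for illustration, and a `SimpleNamespace` stands in for a model loaded from an old file.

```python
from types import SimpleNamespace

# Illustrative defaults only; the values Gensim actually assigns may differ.
ASSUMED_DEFAULTS = {
    "min_alpha_yet_reached": 0.025,
    "running_training_loss": 0.0,
}


def backfill_missing_attributes(model, defaults):
    """Set each default only if the loaded object lacks that attribute.

    Returns the names that were added, which is handy for logging.
    """
    added = []
    for name, value in defaults.items():
        if not hasattr(model, name):
            setattr(model, name, value)
            added.append(name)
    return added


# Stand-in for an old model: it has one of the attributes but not the other.
old_model = SimpleNamespace(alpha=0.025, running_training_loss=1.5)
added = backfill_missing_attributes(old_model, ASSUMED_DEFAULTS)

# Existing values are left untouched; only the missing attribute is filled in.
assert added == ["min_alpha_yet_reached"]
assert old_model.running_training_loss == 1.5
```

Checking with `hasattr` before assigning means values stored in the old file always take precedence over the defaults.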