Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977 #2012

Merged 13 commits on Apr 12, 2018


manneshiva
Contributor

This PR addresses #2000 and #1977. Both issues were caused by a few attributes (such as `min_alpha_yet_reached` and `running_training_loss`) that are missing from models saved with very old Gensim versions. I have added tests that load a word2vec model saved with Gensim 0.12.0 and a doc2vec model saved with Gensim 0.13.0. The tests also check online training and a similarity search after loading these old models.
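
The general pattern behind this kind of fix can be sketched as follows: after the saved state is restored, any attribute the old version never stored gets a default. This is a minimal illustration and not Gensim's actual loading code. `ToyModel`, `backfill_defaults`, `load_with_defaults`, and the default values are assumptions made for the example; only the two attribute names come from the description above.

```python
import pickle


class ToyModel:
    """Stand-in for a model class; not a Gensim class."""


# Placeholder defaults chosen for this example only (assumption).
DEFAULTS = {
    "min_alpha_yet_reached": 1.0,
    "running_training_loss": 0.0,
}


def backfill_defaults(obj, defaults):
    """Set each attribute from `defaults` that `obj` does not already have."""
    for name, value in defaults.items():
        if not hasattr(obj, name):
            setattr(obj, name, value)
    return obj


def load_with_defaults(data):
    """Restore a model from pickled state, then backfill missing attributes."""
    state = pickle.loads(data)
    obj = ToyModel.__new__(ToyModel)
    obj.__dict__.update(state)
    return backfill_defaults(obj, DEFAULTS)


# Simulate an "old" save: the state predates both attributes.
old_bytes = pickle.dumps({"vocab": ["a", "b", "c"]})
model = load_with_defaults(old_bytes)

# A newer save already carries one of the attributes; it must not be overwritten.
new_bytes = pickle.dumps({"vocab": ["a"], "running_training_loss": 3.5})
newer_model = load_with_defaults(new_bytes)
```

Backfilling only when `hasattr` fails keeps values that newer saves already contain, so the same loader works for both old and current files.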

@menshikh-iv
Contributor

menshikh-iv commented Apr 2, 2018

Plan:

  • Add toy models for each version (before 3.4.0), and add tests for them too.
  • Fix any additional errors that come up.

We need to cover as many situations as possible, because this kind of problem is already starting to cause trouble.

saved_models_dir = datapath('old_w2v_models')
for old_version in old_versions:
    model = word2vec.Word2Vec.load(os.path.join(saved_models_dir, 'w2v_{}.mdl'.format(old_version)))
    self.assertTrue(len(model.wv.vocab) == 3)
Contributor

Add a `most_similar` check and a model update (same for d2v).

new_model.docvecs.max_rawint = old_model.docvecs.__dict__.get('max_rawint')
new_model.docvecs.offset2doctag = old_model.docvecs.__dict__.get('offset2doctag')
else:
new_model.docvecs.max_rawint = len(old_model.docvecs.index2doctag) if old_model.docvecs.index2doctag else old_model.docvecs.count
new_model.docvecs.max_rawint = \
Owner

Magic: definitely deserves a comment.
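
One way to document the fallback is to pull it into a small commented helper. The sketch below mirrors only the conditional expression quoted in the `else` branch above. It is not the merged code, which later commits in this PR changed. `fallback_max_rawint` and the `SimpleNamespace` stand-ins are hypothetical.

```python
from types import SimpleNamespace


def fallback_max_rawint(old_docvecs):
    """Derive a `max_rawint` value for an old model that never stored one.

    Mirrors the quoted expression: use the number of entries in
    `index2doctag` when that list is non-empty, otherwise fall back
    to `count`.
    """
    if old_docvecs.index2doctag:
        return len(old_docvecs.index2doctag)
    return old_docvecs.count


# Hypothetical stand-ins for the `docvecs` object of an old model.
with_tags = SimpleNamespace(index2doctag=["doc_a", "doc_b"], count=2)
no_tags = SimpleNamespace(index2doctag=[], count=5)
```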

doc0_inferred = model.infer_vector(list(DocsLeeCorpus())[0].words)
sims_to_infer = model.docvecs.most_similar([doc0_inferred], topn=len(model.docvecs))
self.assertTrue(sims_to_infer)

Contributor

@menshikh-iv menshikh-iv Apr 9, 2018

Can you add save + load + `infer_vector` here (to be 100% sure that the model persists correctly)? Make sure you use a /tmp directory; check `gensim.test.utils`, where you'll find the needed functions (same for w2v).

Also, please try to update the model (as for w2v).
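
The requested round-trip check could look roughly like the sketch below. It uses only the standard library so that it runs without Gensim: `ToyModel` and its `save`, `load`, and `infer` methods are hypothetical stand-ins for the real model and `infer_vector`, and `tempfile.TemporaryDirectory` stands in for the temp-file helpers in `gensim.test.utils`.

```python
import json
import os
import tempfile


class ToyModel:
    """Hypothetical stand-in for a model with save/load/inference."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.weights, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def infer(self, words):
        """Toy 'inference': sum the weights of the known words."""
        return sum(self.weights.get(w, 0.0) for w in words)


def roundtrip(model, filename="toy.mdl"):
    """Save `model` into a fresh temp directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        model.save(path)
        return type(model).load(path)


original = ToyModel({"human": 0.5, "computer": 1.5})
reloaded = roundtrip(original)
words = ["human", "computer", "unknown"]
```

Comparing `reloaded.infer(words)` with `original.infer(words)` then confirms that nothing needed for inference was lost on disk, and the temp directory is cleaned up automatically.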

'3.0.0', '3.1.0', '3.2.0', '3.3.0', '3.4.0'
]

saved_models_dir = datapath('old_d2v_models')
Contributor

@menshikh-iv menshikh-iv Apr 9, 2018

Better: use `datapath('old_d2v_models/d2v_{}.mdl')` and format it later.
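
A sketch of that suggestion: resolve the path template once, then fill in the version inside the loop. `toy_datapath` and its `data_dir` argument are hypothetical stand-ins for Gensim's `datapath` helper, and the version list contains only the entries visible in the excerpt above.

```python
import os


def toy_datapath(fname, data_dir="test_data"):
    """Hypothetical stand-in: join a file name onto a test-data directory."""
    return os.path.join(data_dir, fname)


old_versions = ["3.0.0", "3.1.0", "3.2.0", "3.3.0", "3.4.0"]

# Resolve the template once; '{}' is filled in per version below.
path_template = toy_datapath("old_d2v_models/d2v_{}.mdl")
model_paths = [path_template.format(version) for version in old_versions]
```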

@menshikh-iv changed the title from "Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions." to "Fixes issues while loading word2vec and doc2vec models saved using old Gensim versions. Fix #2000, #1977" on Apr 10, 2018
@menshikh-iv merged commit 2024be9 into piskvorky:develop on Apr 12, 2018
darindf pushed a commit to darindf/gensim that referenced this pull request Apr 23, 2018
…g old Gensim versions. Fix piskvorky#2000, piskvorky#1977 (piskvorky#2012)

* adds default values for attributes

* ignore values for attributes that do not exist

* adds unit test

* fixes default values for missing attributes for older gensim models

* adds unit test cases for loading really old gensim models

* adds test cases for loading all old models

* adds more tests post loading

* handles loading d2v models saved using version <=0.12.2

* fix `max_rawint` value and PEP8 errors

* adds saving and loading back tests

* adds comments and fixes `max_rawint`

* fix PEP8