
intersect_word2vec_format lock-factor is not triggered when the data is not in the binary format #2918

Closed
lukaszbrzozowski opened this issue Aug 17, 2020 · 5 comments


@lukaszbrzozowski

Problem description

The method intersect_word2vec_format fails to update the lock-factor (lockf) when the data is not in binary format.

Steps/code/corpus to reproduce

I base my issue on reading the source code of intersect_word2vec_format: https://tedboy.github.io/nlps/_modules/gensim/models/word2vec.html#Word2Vec.intersect_word2vec_format
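For reference, here is a minimal usage sketch of the method under discussion, assuming the gensim 3.x-era API (the file name and corpus are placeholders):

from gensim.models import Word2Vec

corpus = [["alpha", "beta"], ["beta", "gamma"]]  # toy corpus
model = Word2Vec(size=100, min_count=1)          # "size" was renamed "vector_size" in gensim 4.x
model.build_vocab(corpus)

# Merge vectors from an external word2vec-format file for overlapping words;
# lockf=0.0 is intended to freeze those vectors during subsequent training.
model.intersect_word2vec_format("external_vectors.txt", binary=False, lockf=0.0)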

In the if binary: ... else: ... statement, the lock-factor is set only in the "if" branch:

if word in self.vocab:
    overlap_count += 1
    self.syn0[self.vocab[word].index] = weights
    self.syn0_lockf[self.vocab[word].index] = lockf  # lock-factor: 0.0 stops further changes

However, when the data is not stored in binary format, i.e. when the default binary=False is passed, the lock-factor is never set:

if word in self.vocab:
    overlap_count += 1
    self.syn0[self.vocab[word].index] = weights
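    # note: no corresponding syn0_lockf assignment, so the lockf argument has no effect here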

Versions

I base this issue solely on the source code at the link above; I have not checked it against a specific installed version.

@gojomo (Collaborator) commented Aug 17, 2020

A link to cut-and-paste source code hosted elsewhere isn't very relevant to the current project source code. Does the problem exist in either the current release or the current develop branch?

@lukaszbrzozowski (Author)

That's my bad; the issue has already been resolved in the current release.

@gojomo (Collaborator) commented Aug 18, 2020

Glad to hear that. FYI, .intersect_word2vec_format() is still best considered an experimental method whose usefulness is tentative, and it isn't necessarily kept in sync with other changes/refactorings. So there could be things broken/inconsistent about it, and ideally it would be replaced by better-thought-out methods of mixing vectors from multiple sources (including during the initialization of a model). For now, though, it's left in place as something that might be helpful for a few adventurous users, or as a model for their own source code.

@lukaszbrzozowski (Author)

Thank you kindly for the answer. I must admit I find it unusual that changing the weights of a Word2Vec model is rather difficult. While I do not have a particular interest in NLP, the Word2Vec model is also used in representation learning on graphs, as in the DeepWalk, Node2Vec, and HARP methods. The issue arose when I was implementing HARP, which requires initializing the model with specific weights. I hope that in the future the process of weight initialization becomes more intuitive, if possible.

@gojomo (Collaborator) commented Aug 18, 2020

I'm not familiar with the HARP method, but I would highlight that because you can directly tamper with any part of the model, especially after the build_vocab() step, any sort of custom initialization/adjustment is possible. (One could also feed a synthesized corpus, different from the training data, to the .build_vocab() step to achieve whatever vocabulary/frequency info is preferred, before training proceeds with a different, vocabulary-compatible corpus.)
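A minimal sketch of that kind of direct adjustment, assuming the gensim 3.x attribute layout (wv.vectors, wv.vocab; gensim 4.x renames these, e.g. to wv.key_to_index):

import numpy as np
from gensim.models import Word2Vec

vocab_corpus = [["a", "b", "c"], ["b", "c", "d"]]    # synthesized corpus controlling vocabulary/frequencies
train_corpus = [["a", "b"], ["c", "d"], ["b", "c"]]  # actual, vocabulary-compatible training data

model = Word2Vec(size=8, min_count=1)
model.build_vocab(vocab_corpus)  # allocates the vocabulary and weight arrays

# Directly overwrite the input vectors of chosen words before training.
pretrained = {"b": np.ones(8, dtype=np.float32)}  # stand-in for externally computed vectors
for word, vec in pretrained.items():
    if word in model.wv.vocab:
        model.wv.vectors[model.wv.vocab[word].index] = vec

model.train(train_corpus, total_examples=len(train_corpus), epochs=model.epochs)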

Shortcuts & helper methods for less-common or research techniques, different from the classic word2vec steps, could certainly be added, but they require a knowledgeable champion or contributor to arrive in a clean & maintainable way. (When gensim has collected wishlist requests for certain features, but then had rookie/non-practitioner/temporary contributors implement them, it's often been a functionality & maintenance disaster.)
