
intersect_word2vec_format lock-factor is not triggered when the data is not in the binary format #2918

Closed
lukaszbrzozowski opened this issue Aug 17, 2020 · 5 comments


@lukaszbrzozowski

Problem description

The method intersect_word2vec_format fails to update the lock-factor (lockf) when the data is not in binary format.

Steps/code/corpus to reproduce

I base my issue on reading the source code of intersect_word2vec_format: https://tedboy.github.io/nlps/_modules/gensim/models/word2vec.html#Word2Vec.intersect_word2vec_format
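For reference, here is a minimal usage sketch of the method under discussion, assuming the gensim 3.x-era API (the file name and corpus are placeholders):

from gensim.models import Word2Vec

corpus = [["alpha", "beta"], ["beta", "gamma"]]  # toy corpus
model = Word2Vec(size=100, min_count=1)          # "size" was renamed "vector_size" in gensim 4.x
model.build_vocab(corpus)

# Merge vectors from an external word2vec-format file for overlapping words;
# lockf=0.0 is intended to freeze those vectors during subsequent training.
model.intersect_word2vec_format("external_vectors.txt", binary=False, lockf=0.0)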

In the if binary: ... else: ... statement, the lock-factor is set only in the "if" branch:

if word in self.vocab:
    overlap_count += 1
    self.syn0[self.vocab[word].index] = weights
    self.syn0_lockf[self.vocab[word].index] = lockf  # lock-factor: 0.0 stops further changes

However, when the data is not stored in binary format, i.e. when the default binary=False is passed, the lock-factor is never set:

if word in self.vocab:
    overlap_count += 1
    self.syn0[self.vocab[word].index] = weights
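    # note: no corresponding syn0_lockf assignment, so the lockf argument has no effect here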

Versions

I base this issue solely on the source code at the link above; I have not checked it against a specific installed version.

@gojomo (Collaborator) commented Aug 17, 2020

A link to cut-and-paste source code hosted elsewhere isn't very relevant to the current project source code. Does the problem exist in either the current release or the current develop branch?

@lukaszbrzozowski (Author)

That's my bad; the issue has already been resolved in the current release.

@gojomo (Collaborator) commented Aug 18, 2020

Glad to hear that. FYI, .intersect_word2vec_format() is still best considered an experimental method whose usefulness is tentative, and it isn't necessarily kept in sync with other changes/refactorings. So there could be things broken/inconsistent about it, and ideally it would be replaced by better-thought-out methods of mixing vectors from multiple sources (including during the initialization of a model). For now, though, it's left in place as something that might be helpful for a few adventurous users, or as a model for their own source code.

@lukaszbrzozowski (Author)

Thank you kindly for the answer. I must admit I find it unusual that changing the weights of a Word2Vec model is rather difficult. While I do not have a particular interest in NLP, the Word2Vec model is also used in representation learning on graphs, as in the DeepWalk, Node2Vec, and HARP methods. The issue arose when I was implementing HARP, which requires initializing the model with specific weights. I hope that in the future the process of weight initialization becomes more intuitive, if possible.

@gojomo (Collaborator) commented Aug 18, 2020

I'm not familiar with the HARP method, but I would highlight that because you can directly tamper with any part of the model, especially after the build_vocab() step, any sort of custom initialization/adjustment is possible. (One could also feed a synthesized corpus, different from the training data, to the .build_vocab() step to achieve whatever vocabulary/frequency info is preferred, before training proceeds with a different, vocabulary-compatible corpus.)
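A minimal sketch of that kind of direct adjustment, assuming the gensim 3.x attribute layout (wv.vectors, wv.vocab; gensim 4.x renames these, e.g. to wv.key_to_index):

import numpy as np
from gensim.models import Word2Vec

vocab_corpus = [["a", "b", "c"], ["b", "c", "d"]]    # synthesized corpus controlling vocabulary/frequencies
train_corpus = [["a", "b"], ["c", "d"], ["b", "c"]]  # actual, vocabulary-compatible training data

model = Word2Vec(size=8, min_count=1)
model.build_vocab(vocab_corpus)  # allocates the vocabulary and weight arrays

# Directly overwrite the input vectors of chosen words before training.
pretrained = {"b": np.ones(8, dtype=np.float32)}  # stand-in for externally computed vectors
for word, vec in pretrained.items():
    if word in model.wv.vocab:
        model.wv.vectors[model.wv.vocab[word].index] = vec

model.train(train_corpus, total_examples=len(train_corpus), epochs=model.epochs)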

Shortcuts & helper methods for less-common or research techniques, different from the classic word2vec steps, could certainly be added, but they require a knowledgeable champion or contributor to arrive in a clean & maintainable way. (When gensim has collected wishlist requests for certain features, but then had rookie/non-practitioner/temporary contributors implement them, it's often been a functionality & maintenance disaster.)
