Word2vec update before train error message. Fix #1162 #1205
This is a more helpful error than before. It'd be best to catch the user error even earlier, closer to where it happens: for example, as soon as (Separate but related: the prior existing code seems a bit off in how a pre-vocabulary model has a
gensim/models/word2vec.py
    # Raise an error if an online update is run before initial training on a corpus
    if not len(self.wv.syn0):
        raise RuntimeError("You can do an online update of vocabulary on a pre-trained model. " \
Please start the message by stating the cause of the warning: "The model has not yet been trained."
Note that the problem isn't necessarily that no training has happened, just that no initial/prior vocabulary discovery has happened (and the update=True case assumes that it has). So perhaps instead: "Cannot update vocabulary of model which has no prior vocabulary."
I have updated the error message to:
"You cannot do an online vocabulary-update of a model which has no prior vocabulary. First build the vocabulary of your model with a corpus before doing an online update."
@gojomo, it does make sense to flag the error earlier rather than waiting until update_weights is called. I made a few changes to another file locally (note that this change is not yet reflected on GitHub; it exists only on my machine) and tested the following code change.
    def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):
        """
        Build vocabulary from a sequence of sentences (can be a once-only generator stream).
        Each sentence must be a list of unicode strings.
        """
        if update:
            # An online update needs an existing vocabulary to extend.
            if not len(self.wv.vocab):
                raise RuntimeError("You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
                                   "First build the vocabulary of your model with a corpus "
                                   "before doing an online update.")
        self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
        self.finalize_vocab(update=update)  # build tables & arrays
It works well with the cases that I mentioned earlier. I also tested this:
import gensim
from nltk.corpus import brown, movie_reviews, treebank
b_sents = brown.sents()
b = gensim.models.Word2Vec(b_sents)
b.build_vocab(b_sents, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=True)
b.train(b_sents)
Executing this does not display any error message.
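For readers without gensim or the Brown corpus at hand, the fail-fast check discussed above can be sketched in isolation. This is only an illustration: `TinyModel` and its `vocab` dict are hypothetical stand-ins, not gensim's `Word2Vec` or its internals, and only the error message text is taken from this thread.

```python
class TinyModel:
    """Hypothetical stand-in for a model with a vocabulary; not gensim's Word2Vec."""

    def __init__(self):
        self.vocab = {}  # word -> count; stays empty until build_vocab() has run once

    def build_vocab(self, sentences, update=False):
        # Fail fast: an update only makes sense on top of an existing vocabulary.
        if update and not self.vocab:
            raise RuntimeError(
                "You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
                "First build the vocabulary of your model with a corpus before doing an online update."
            )
        if not update:
            self.vocab = {}  # a fresh build discards any earlier vocabulary
        for sentence in sentences:
            for word in sentence:
                self.vocab[word] = self.vocab.get(word, 0) + 1


corpus = [["hello", "world"], ["hello", "again"]]

# Case 1: update requested on a model with no vocabulary, so the guard fires.
fresh = TinyModel()
try:
    fresh.build_vocab(corpus, update=True)
    raised = False
except RuntimeError:
    raised = True

# Case 2: initial build first, then the update goes through.
trained = TinyModel()
trained.build_vocab(corpus)
trained.build_vocab([["new", "words"]], update=True)
```

Placing the check at the top of the vocabulary-building step means the caller sees the cause at the call that triggered it, rather than a lower-level failure later on.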
Thanks for the improvement!
Using Python 2.7.12 on Mac OS X version 10.11.6
Used Sublime Text Build 3126 to make changes
Replicated the error by running the following
Error displayed:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Made changes in the file gensim\models\word2vec.py, in the function update_weights. Added the following code, which checks whether the model weights have been initialized (lines 1072-1076).
Post-change testing:
I tested using the following code:
ValueError
and