
Gensim's word2vec has a loss of 0 from epoch 1? #2920

Closed
LusKrew opened this issue Aug 20, 2020 · 1 comment


LusKrew commented Aug 20, 2020

I am using the Word2Vec module of the Gensim library to train a word embedding; the dataset is 400k sentences with 100k unique words (it's not English).

I'm using this code to monitor and calculate the loss:


import gensim
from gensim.models.callbacks import CallbackAny2Vec


class MonitorCallback(CallbackAny2Vec):
    def __init__(self, test_words):
        self._test_words = test_words

    def on_epoch_end(self, model):
        print("Model loss:", model.get_latest_training_loss())  # print loss
        for word in self._test_words:  # show how the monitored words' neighbors change
            print(model.wv.most_similar(word))


monitor = MonitorCallback(["MyWord"])  # monitor with demo words

w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, callbacks=[monitor])

w2v_model.build_vocab(tokenized_corpus)

words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

print("[*] Training...")

w2v_model.train(tokenized_corpus, total_examples=len(tokenized_corpus), epochs=W2V_EPOCH)



The problem is that from epoch 1 the loss is 0, and the vectors of the monitored words don't change at all!

[*] Training...
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0

So what is the problem here? Is this normal? The tokenized corpus is a list of lists, something like tokenized_corpus[0] = ["word1", "word2", ...].

I googled, and it seems some older versions of Gensim had problems calculating the loss, but those reports are from almost a year ago, so shouldn't it be fixed by now?

I tried the code provided in the answer to this question as well, but the loss is still 0:

https://stackoverflow.com/questions/52038651/loss-does-not-decrease-during-training-word2vec-gensim

gojomo (Collaborator) commented Aug 20, 2020

You haven't used the compute_loss=True argument to the Word2Vec initialization to enable loss-tallying at all, per docs at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
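
For reference, a minimal sketch of what that looks like, reusing the reporter's variable names (the size= argument above implies the gensim 3.x API; in gensim 4.x that parameter became vector_size; compute_loss is accepted by both the constructor and train()):

import gensim

# compute_loss=True turns on the internal loss tally; without it,
# get_latest_training_loss() always returns 0.0.
w2v_model = gensim.models.word2vec.Word2Vec(
    size=W2V_SIZE,            # 'vector_size' in gensim 4.x
    window=W2V_WINDOW,
    min_count=W2V_MIN_COUNT,
    compute_loss=True,
    callbacks=[monitor],
)
w2v_model.build_vocab(tokenized_corpus)
w2v_model.train(
    tokenized_corpus,
    total_examples=len(tokenized_corpus),
    epochs=W2V_EPOCH,
    compute_loss=True,        # train() accepts the flag as well
)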

After you do that, you may encounter other bugs with the current loss-tracking, which you can read about in detail via the open issues: https://github.com/RaRe-Technologies/gensim/issues?q=is%3Aissue+is%3Aopen+loss+in%3Atitle+
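
One commonly reported quirk among those issues is that the tally is a running total that accumulates across epochs within a single train() call rather than resetting each epoch, so a per-epoch figure has to be computed as a difference. A minimal sketch, assuming that cumulative behavior (LossLogger is a hypothetical name for illustration, not a gensim class):

from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    """Print the loss accrued during each epoch, assuming
    get_latest_training_loss() returns a cumulative running total."""
    def __init__(self):
        self.epoch = 0
        self.previous_cumulative_loss = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        print("Epoch", self.epoch, "loss:", cumulative - self.previous_cumulative_loss)
        self.previous_cumulative_loss = cumulative
        self.epoch += 1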

Unless/until you're sure your concern is a bug, questions are better handled via Stack Overflow (where I also answered your question) or the project discussion list, to reserve this issue-tracker for bugs & feature requests.
