doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs #2260

Closed
timbicker opened this issue Nov 6, 2018 · 11 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@timbicker

timbicker commented Nov 6, 2018

Description

I am training a doc2vec model on a large corpus and need to monitor it during training to collect more detailed statistics for my supervisor.
The example below reproduces the problem; it is a slightly modified version of the Doc2Vec tutorial on the Lee dataset. The recommendations returned by most_similar do not improve as training proceeds.

Steps/Code/Corpus to Reproduce

import gensim
import os
import smart_open
import gensim.models.callbacks


# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'


def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])


train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

results_new = {i: None for i, doc in enumerate(train_corpus)}
results_old = results_new.copy()


class TrainProgressEvaluation(gensim.models.callbacks.CallbackAny2Vec):

    def __init__(self, test_set, results_new, results_old):
        self.test_set = test_set
        self.results_new = results_new
        self.results_old = results_old
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"epoch {self.epoch} end")

    def on_batch_begin(self, model):
        for num, sample in enumerate(self.test_set):
            recs = model.docvecs.most_similar(num)
            # On the first call self.results_new[num] is still None, so fall back to recs
            self.results_old[num] = self.results_new[num] or recs
            self.results_new[num] = recs
            # Report every rank whose (tag, similarity) pair differs from the previous call
            for (old_tag, old_sim), (new_tag, new_sim) in zip(self.results_old[num], recs):
                if old_tag != new_tag or old_sim != new_sim:
                    print(f"Sample {num} has changed.")
                    print(f"Old tag {old_tag}. New tag {new_tag}")
                    print(f"Old distance {old_sim}. New distance {new_sim}")


model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, workers=4)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs,
            callbacks=(TrainProgressEvaluation(train_corpus, results_new, results_old),))

Expected Results

I expect to see many improvements in either recommendation or distance.

Actual Results

Console output with four workers:
It surprises me that only the first sample in train_corpus receives any updates, and I don't understand why.

Sample 0 has changed.
/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
Old tag 116. New tag 30
  if np.issubdtype(vec.dtype, np.int):
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674Sample 0 has changed.
Old tag 30. New tag 116
Old distance 0.4648822546005249. New distance 0.5557072162628174
Sample 0 has changed.
Old tag 224. New tag 42
Old distance 0.3621359169483185. New distance 0.48946425318717957
Sample 0 has changed.
Old tag 96. New tag 51
Old distance 0.31921446323394775. New distance 0.4082771837711334
Sample 0 has changed.
Old tag 77. New tag 90
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Old distance 0.3184990882873535. New distance 0.3731566369533539
Sample 0 has changed.
Old tag 45. New tag 128
Old distance 0.30474674701690674. New distance 0.34601616859436035
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Sample 0 has changed.
Sample 0 has changed.
Old tag 111. New tag 76
Old tag 111. New tag 76
Old distance 0.280322402715683. New distance 0.32334667444229126
Old distance 0.280322402715683. New distance 0.32334667444229126
Sample 0 has changed.
Old tag 221. New tag 49

Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 221. New tag 49
Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Sample 0 has changed.
Old tag 116. New tag 30
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674
Sample 0 has changed.
Old tag 234. New tag 46
Old distance 0.3441653251647949. New distance 0.3005654215812683
Sample 0 has changed.
Old tag 76. New tag 111
Old distance 0.32334667444229126. New distance 0.280322402715683
Sample 0 has changed.
Old tag 49. New tag 221
Old distance 0.27320006489753723. New distance 0.2779023051261902
Sample 0 has changed.
Old tag 4. New tag 52
Old distance 0.27205419540405273. New distance 0.27472415566444397
Sample 0 has changed.
Old tag 149. New tag 205
Old distance 0.2699446976184845. New distance 0.26930660009384155
epoch 1 end
epoch 1 end
epoch 2 end
epoch 3 end
epoch 4 end
....

When I run the model in a debugger, no changes are reported at all:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
....

I also tried it with only one worker:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
epoch 3 end
....

What is happening here, and how can I see during training how my doc2vec model improves? It is also not possible to see the training error for doc2vec (#999).
Further experimenting reveals that docvecs.vectors_docs is, of course, updated between successive batch callbacks, but most_similar always returns the same suggestions.

Versions

Darwin-17.5.0-x86_64-i386-64bit
Python 3.7.0 (default, Jun 29 2018, 20:13:13)
[Clang 9.1.0 (clang-902.0.39.2)]
NumPy 1.15.0
SciPy 1.1.0
gensim 3.5.0
FAST_VERSION 0

@timbicker
Author

It turns out

model.docvecs.vectors_docs_norm = None
model.docvecs.init_sims()

has to be called before each call to model.docvecs.most_similar.
Then the program works as expected.
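
For anyone stuck on gensim 3.x, the workaround above can be wrapped in a small helper. This is only a sketch: the function name most_similar_fresh is made up for illustration, and it assumes the gensim 3.x attribute names quoted in this thread (model.docvecs, vectors_docs_norm, init_sims).

```python
def most_similar_fresh(model, doc_id, topn=10):
    """Query neighbours mid-training on gensim 3.x without the stale cache.

    Hypothetical helper; relies on the 3.x attribute names from this thread.
    """
    # Drop the cached unit-length vectors so init_sims() recomputes them
    # from the current, partially trained vectors_docs.
    model.docvecs.vectors_docs_norm = None
    model.docvecs.init_sims()
    return model.docvecs.most_similar(doc_id, topn=topn)
```

Inside the callback from the report, `recs = most_similar_fresh(model, num)` would replace the direct most_similar call. Renormalising every vector on every batch is expensive on a large corpus, so doing this once per epoch is probably more practical.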

@piskvorky
Owner

piskvorky commented Nov 7, 2018

If that's the case, then that's definitely a bug!

Are you saying you have to call init_sims() before each call of most_similar? If that's so, please reopen this ticket.

@timbicker
Author

timbicker commented Nov 9, 2018

Well, yes and no.
I looked at it more thoroughly: most_similar() uses vectors_docs_norm, which is computed by init_sims(). most_similar() does call init_sims(), but vectors_docs_norm is only recalculated if it is None. So in order to use most_similar() on newly trained vectors, one has to manually set vectors_docs_norm to None. So yes, to me this looks like a bug, or at least unexpected behavior. I would like to fix it, if that is fine with you.
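
The caching pattern described here can be reproduced in a few lines of plain Python. The class below is a toy stand-in, not gensim's code: ToyDocvecs, vectors and vectors_norm are invented names that loosely mirror the real attributes.

```python
import math


class ToyDocvecs:
    """Toy stand-in for the caching pattern described above (not gensim code)."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]  # raw vectors, updated by training
        self.vectors_norm = None                   # cached unit-length copies

    def init_sims(self):
        # Recompute only when no cache exists, which is the behaviour at issue.
        if self.vectors_norm is None:
            self.vectors_norm = [
                [x / math.sqrt(sum(y * y for y in v)) for x in v]
                for v in self.vectors
            ]

    def most_similar(self, idx):
        # Return the index of the nearest other vector by cosine similarity.
        self.init_sims()
        query = self.vectors_norm[idx]
        scores = [
            (i, sum(a * b for a, b in zip(query, v)))
            for i, v in enumerate(self.vectors_norm)
            if i != idx
        ]
        return max(scores, key=lambda pair: pair[1])[0]


dv = ToyDocvecs([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
before = dv.most_similar(0)   # doc 1 is nearest to doc 0

dv.vectors[0] = [0.0, 1.0]    # "training" moves doc 0 onto doc 2
stale = dv.most_similar(0)    # still answered from the old cache

dv.vectors_norm = None        # manual invalidation
fresh = dv.most_similar(0)    # now reflects the updated vector
```

Here `stale` equals `before` even though the raw vector changed, and only the manual reset makes the query pick up doc 2.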

@timbicker timbicker reopened this Nov 9, 2018
@gojomo
Collaborator

gojomo commented Nov 12, 2018

train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)
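
A minimal sketch of that idea, again with a toy class rather than gensim's actual implementation (ToyModel, its update format and its attribute names are all hypothetical):

```python
class ToyModel:
    """Toy model illustrating cache invalidation inside train() (not gensim code)."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]
        self.vectors_norm = None

    def init_sims(self):
        # Build the unit-length copies only if they are missing.
        if self.vectors_norm is None:
            self.vectors_norm = []
            for v in self.vectors:
                length = sum(x * x for x in v) ** 0.5
                self.vectors_norm.append([x / length for x in v])

    def train(self, updates):
        # Hypothetical update step: overwrite raw vectors by index.
        for idx, new_vector in updates.items():
            self.vectors[idx] = list(new_vector)
        # The suggested fix: drop the normed copies so the next
        # similarity query rebuilds them from the updated vectors.
        self.vectors_norm = None

    def similarity(self, i, j):
        self.init_sims()
        return sum(a * b for a, b in zip(self.vectors_norm[i], self.vectors_norm[j]))


model = ToyModel([[1.0, 0.0], [0.0, 1.0]])
sim_before = model.similarity(0, 1)   # orthogonal vectors
model.train({1: [2.0, 0.0]})          # doc 1 now points the same way as doc 0
sim_after = model.similarity(0, 1)    # recomputed without any manual reset
```

With the reset placed in train(), callers never touch the cache directly. Note that a reset at the end of train() alone would not help a callback that queries similarities while train() is still running, which is the situation in the original report.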

@timbicker
Author

> train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)

This is an excellent idea imo. I implemented it in this way.

@menshikh-iv added the bug and difficulty easy labels Dec 13, 2018
@dnabanita7

Is this closed? I would like to work on it, @menshikh-iv.

@menshikh-iv
Contributor

@naba7 see status on top

@timbicker
Author

timbicker commented Jan 20, 2019

Sorry for my recent absence. I pushed new changes to the PR branch, but the PR is still closed. I hope it is reopened in the next few days so we can finish working on it.
@naba7 feel free to participate in the PR if there is anything left to do.
Thanks for your support.

@menshikh-iv
Contributor

@timbicker done, see #2273

@gojomo
Collaborator

gojomo commented Nov 8, 2019

Also an issue for FastText: #2260

@gojomo changed the title from "doc2vec example model does not improve" to "doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs" Nov 8, 2019
@gojomo
Collaborator

gojomo commented Sep 17, 2021

I believe this issue is moot given changes that eliminated so much normed-vector caching in Gensim-4.0.
