docvecs.most_similar() does not return cosine similarity = 1 for same document vector #2915
Comments
Check the FAQ, #12.
Hi Radim, thank you very much for your quick response and for providing gensim! I am aware of FAQ #12, but the difference is too big:
`docvecs.most_similar()` returns: `[('doc1', 0.8511...)]`
Comparing re-inferred vectors for the same document returns: 0.98. Why is there such a big difference in cosine similarities between the methods?
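As an aside, the similarity score that `most_similar()` reports is plain cosine similarity, so the gap between the two numbers can be checked by hand. A minimal sketch, using made-up stand-in vectors (the names `stored` and `inferred` and their values are hypothetical, not taken from the issue):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

stored = [0.9, 0.1, 0.4]     # stand-in for the vector learned in bulk training
inferred = [0.8, 0.2, 0.5]   # stand-in for a vector from infer_vector()

print(cosine_similarity(stored, stored))    # identical vectors -> ~1.0
print(cosine_similarity(stored, inferred))  # close, but below 1.0
```

Only a vector compared against itself (or an exact copy) yields 1.0; any re-inferred vector lands somewhere below that.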
FAQ #12 explains that the training uses different parameters than the inference.
That your re-inferred vectors are close to each other is good, but if you think the difference from the vector in the model is "too big", maybe there were other problems with the adequacy of the data/training/parameters (about which you provide no details) in the first bulk training. Note, especially, that while training's N epochs occur on a model which is, for many of those epochs, less than half trained, all N inference epochs happen on a model that's already fully trained. So something immediately worth trying, to reduce the effect, could be more initial training epochs. There are other tips in the project discussion mailing list archives – https://groups.google.com/forum/#!forum/gensim – or providing more details about your setup (data size/quality/type, parameters, goals, etc.) may generate more tips specific to your situation.
If you call the most_similar() method with an inferred vector, a top-n list of the most similar documents, including cosine similarities, is returned.
If the document was also present in the training set, the top returned document should be that same document.
Surprisingly, the cosine similarity for the same document is around 0.86, but never 1.
Is this a bug? Or is there an explanation for this behaviour?