similarity_matrix: suspicious column indexing logic when tfidf model supplied #1960
Comments
Thank you for the report. So far, this does not seem to be a bug to me. Several things to note:
Even the documentation specifies that “[…] The rows of the term similarity matrix will be build in an increasing order of importance of terms […]”. The order in which columns are processed is not affected.
It is entirely possible that I am misunderstanding the purpose of passing a tfidf model to this function. Let me offer a test case (this time in context) that highlights what I believe its purpose to be.
My understanding is that a document vector, in this case, is a 3-dimensional vector whose elements are ordered according to the term order ['ab', 'abc', 'bcd']. Any document vector in the tfidf-transformed corpus will be represented by a 3-dimensional vector of tfidf coefficients that take their place with respect to the new tfidf-weighted term order ['abc', 'bcd', 'ab']. With respect to the default ordering I would expect:
With respect to the tfidf ordering I would expect:
The current function returns the same matrix both times. Won't this cause term mismatches when computing soft cosine similarity with tfidf-transformed documents?
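On the question of term mismatches, here is a minimal sketch of generic tf-idf weighting on sparse `(term_id, value)` vectors. It is not gensim's implementation: `tfidf_weight`, `id2term`, and the idf values are made up for illustration, with the weights chosen so that the importance order is 'abc', 'bcd', 'ab'. It suggests why no mismatch should arise: weighting rescales each value but leaves the term id alone, so a matrix indexed by those ids still lines up.

```python
# Hypothetical dictionary: term id -> term (ids follow the default term order).
id2term = {0: 'ab', 1: 'abc', 2: 'bcd'}

# Made-up idf weights, chosen so that 'abc' > 'bcd' > 'ab' in importance.
idf = {0: 0.5, 1: 2.0, 2: 1.0}


def tfidf_weight(bow, idf):
    """Rescale a sparse bag-of-words vector [(term_id, count), ...] by idf.

    Only the values change; every term keeps its original id.
    """
    return [(term_id, count * idf[term_id]) for term_id, count in bow]


bow = [(0, 1), (1, 2)]             # 'ab' once, 'abc' twice
weighted = tfidf_weight(bow, idf)  # same ids, new values

# Sorting ids by importance yields an order in which to visit terms,
# not a renumbering of the ids themselves.
importance_order = sorted(idf, key=idf.get, reverse=True)

print(weighted)
print([id2term[i] for i in importance_order])
```

Under these assumptions, 'ab' is still term 0 and 'abc' is still term 1 after weighting, so both document vectors and matrix cells can keep being addressed by the original ids.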
Thank you for your patience. I see now that I had a misconception about the gensim tfidf transformation reordering features. I am not sure where that originated, but I am very glad to have it corrected! Just to make sure I understand: the tfidf parameter here is not meant to restructure the output in any way, and is supplied only as a means to improve performance?
That is correct. When the matrix is empty, we can take a row and just fill all columns corresponding to the closest terms in the embedding space. When the matrix is already half-filled, some columns in a row (that do not necessarily correspond to the closest terms) will already be pre-filled (due to the symmetric The role of the
However, the documentation should be reworded, since the rows are processed in decreasing, not increasing, order of importance. I will make a commit that clarifies the documentation and closes this issue.
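The effect of the row order described in this thread can be sketched with a simplified stand-in, written in pure Python. This is an assumption-laden illustration, not the code in `keyedvectors.py`: `build_matrix` and `jaccard` are hypothetical names, the similarity is a toy character-set overlap instead of an embedding similarity, and the per-row limit is modelled loosely on the idea of a cap on nonzero entries. Cells are always addressed by dictionary index, and the row order only decides which rows get first pick when the cap binds.

```python
def jaccard(a, b):
    """Toy stand-in similarity: overlap of the two terms' character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def build_matrix(terms, row_order, nonzero_limit):
    """Fill a symmetric similarity matrix, visiting rows in `row_order`.

    `terms[i]` is the term with dictionary index i. Each row may hold at
    most `nonzero_limit` off-diagonal entries.
    """
    n = len(terms)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    nonzero = [1] * n            # the diagonal already uses one slot per row
    remaining = list(row_order)  # rows that have not been processed yet
    for i in row_order:
        remaining.remove(i)
        # Offer the most similar unprocessed terms first.
        candidates = sorted(
            remaining, key=lambda j: jaccard(terms[i], terms[j]), reverse=True
        )
        for j in candidates:
            if nonzero[i] > nonzero_limit:
                break            # row i is full
            if nonzero[j] > nonzero_limit:
                continue         # mirrored row j is full
            matrix[i][j] = matrix[j][i] = jaccard(terms[i], terms[j])
            nonzero[i] += 1
            nonzero[j] += 1
    return matrix


terms = ['ab', 'abc', 'bcd']  # dictionary indices 0, 1, 2
default_order = [0, 1, 2]
tfidf_order = [1, 2, 0]       # 'abc', 'bcd', 'ab'

# With a generous limit, the row order does not change any cell.
full_default = build_matrix(terms, default_order, nonzero_limit=10)
full_tfidf = build_matrix(terms, tfidf_order, nonzero_limit=10)
print(full_default == full_tfidf)

# With a tight limit, the order decides which pairs are kept
# ([2, 1, 0] is a further hypothetical order, used only for contrast).
tight_a = build_matrix(terms, [0, 1, 2], nonzero_limit=1)
tight_b = build_matrix(terms, [2, 1, 0], nonzero_limit=1)
print(tight_a == tight_b)
```

In this sketch the cell `[0][1]` always holds the similarity of 'ab' and 'abc', whatever order the rows were visited in. Only under the tight limit do the two runs differ, and then in which cells are filled, not in what a given cell means.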
Description
I expected the similarity_matrix method of gensim.models.keyedvectors.WordEmbeddingsKeyedVectors to respect the reordering of terms dictated by a supplied tfidf model. This does not seem to be the case.
Steps/Code/Corpus to Reproduce
This will be an unconventional bug report because I did not find this bug "in the wild". In adapting the code for another purpose, I noticed some logic which might be problematic. I will
Original Context:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L516
A red flag for me was that while we consider the "word_indices" mapping for the row index, we are never considering it for the column index. Even in the case of rows, notice that we are filling "matrix" at the end without regard for the reordering that was supposed to have occurred.
A Correction in Levenshtein Context:
Test Case
Consider the mini corpus:
However, since 'abc' appears in every document, its tfidf coefficient will be zero. The ordering with respect to the tfidf weight will be: 'lab', 'bad', 'abc'. The default Levenshtein similarity scores are fairly symmetric, so there are some reorderings for which the matrix will be the same. However, this example is contrived so that the tfidf reordering will break the symmetry of the original Levenshtein similarity scores:
We see that the revised logic correctly returns the reordered scores. The original logic returned the first matrix when the tfidf model was supplied to the function. A test case for the similarity_matrix function would need to be constructed so that a reordering of terms would lead to a definitive reordering of the relevant scores in the resulting matrix.
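The claim that a term appearing in every document gets a zero weight can be checked with a short idf calculation. The corpus below is hypothetical, not the one from the original test case, and is chosen only so that 'abc' occurs in every document. The formula is the common log2(N / df) form of idf, used here as an assumption about the weighting in play.

```python
import math

# Hypothetical three-document corpus; 'abc' occurs in every document.
corpus = [
    ['abc', 'lab'],
    ['abc', 'bad'],
    ['abc', 'bad'],
]


def idf_weights(corpus):
    """Return {term: log2(n_docs / doc_freq)} for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}


idfs = idf_weights(corpus)

# Terms in decreasing order of importance.
by_importance = sorted(idfs, key=idfs.get, reverse=True)

print(idfs)
print(by_importance)
```

With this corpus, 'abc' has a document frequency equal to the corpus size, so its idf is log2(1) = 0 and it sorts last, behind 'lab' and 'bad'.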
Conclusion
While I have not taken the time to prove definitively that there is a bug in the original context, a unit test should be written to cover the case of a supplied tfidf reordering the terms in similarity_matrix. This is a high-risk piece of logic. If there is a bug, and a supplied tfidf isn't being incorporated properly, eventually SoftCosineSimilarity will return nonsense scores when fed documents transformed by the same tfidf model.
Relevant Code
similarity_matrix:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L440
existing unit tests:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L30
See also related: #1961
Versions
Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:46:44) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.14.1')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)