Mutable vector returned by KeyedVectors.word_vector #1651
Comments
What's wrong with providing a mutable vector? It's very rare to even try to mutate the result, so adding the overhead of a copy, on every access, seems a bad tradeoff to me, and returning a mutable view on the actual backing array the right default behavior. |
@gojomo I was approached by someone from a company who ran into this bug at work. To me, it's very strange that changing a returned vector also changes the model. About performance: yes, it will be slightly slower, but I don't think it will affect performance much; look at my small benchmark.
Preparation
import gensim
import numpy as np
from random import sample
model = gensim.models.KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
rand_words = sample(model.index2word, k=100000)
Benchmark
Without any changes
%%timeit
for w in rand_words:
    model[w]
With explicit copy
%%timeit
for w in rand_words:
    np.array(model[w])
If you think this is too slow, OK, maybe we could change https://github.com/RaRe-Technologies/gensim/blob/269028975e0db48e37e01edfb54e66018db7b61b/gensim/models/keyedvectors.py#L598 to return np.array(self.word_vec(words)) instead of word_vec (two lines later there is a very similar thing)? |
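A rough way to reproduce the comparison above without downloading the vectors is sketched below. It assumes a random NumPy matrix and a plain dict as stand-ins for the model (both hypothetical, not gensim's internals), so the timings only indicate the relative cost of a view lookup versus a copy.

```python
import timeit

import numpy as np

# Hypothetical stand-in for the model: a random matrix plus a word -> row map.
rng = np.random.default_rng(0)
vectors = rng.random((10_000, 300), dtype=np.float32)
index = {f"w{i}": i for i in range(len(vectors))}
words = list(index)


def lookup_view():
    # Basic row indexing returns a view on the backing array (no copy).
    for w in words:
        vectors[index[w]]


def lookup_copy():
    # np.array(...) copies the row on every access.
    for w in words:
        np.array(vectors[index[w]])


t_view = timeit.timeit(lookup_view, number=10)
t_copy = timeit.timeit(lookup_copy, number=10)
print(f"view: {t_view:.4f}s  copy: {t_copy:.4f}s")
```

The absolute numbers depend on the machine and on the vector dimensionality, so they should be measured rather than assumed.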
I still don't think this is a bug; it's reasonable behavior. It doesn't strike me as 'strange' and those familiar with numpy slot-accessing will expect it. If some particular uncommon usage needs a local mutable copy, they should create it themselves, in their own code. Doing the copy by default is slower, uses more memory. It's unneeded by most uses/users. Since it's always worked the other way, the current behavior may be relied upon by older code. At most, the fact it's a reference could be further noted in the doc-comment. |
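The behavior under discussion can be illustrated with plain NumPy. This is a minimal sketch in which a small array stands in for the model's backing matrix (an assumption for illustration only): a row obtained by basic indexing is a view, while np.array(...) and np.vstack(...) produce independent copies.

```python
import numpy as np

# Stand-in for the model's backing matrix (illustrative only).
vectors = np.arange(1, 7, dtype=np.float32).reshape(2, 3)

row = vectors[0]          # basic indexing: a view on the backing array
row *= 2.0                # in-place scaling also changes vectors[0]
print(vectors[0])         # the first row is now doubled

safe = np.array(vectors[1])   # explicit copy made by the caller
safe *= 2.0                   # vectors[1] is left untouched

stacked = np.vstack([vectors[1]])  # vstack builds a new array

print(np.shares_memory(row, vectors))      # view shares memory
print(np.shares_memory(safe, vectors))     # copy does not
print(np.shares_memory(stacked, vectors))  # vstack result does not
```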
I'm also -1 on this change. Without a compelling reason, making extra copies would be unexpected, less flexible, and is better left for user-land. In general, I'd prefer the reason/use-case to come directly from end users. It makes verifying the intent or problem context or its solution clearer, in an open discussion. |
I hit this bug/reasonable behavior while normalizing the word vectors of a sentence with tf_idf weights. Here is sample code:
I'm new to Python (and to programming in general), so I didn't expect this to fail. |
@asegrenev Thanks for the concrete example of a place where this caused a problem! My suggestion would be to include a warning about this mutability in the doc-comments for both "Note that this method returns a reference to a row inside the source model – a numpy view – and thus mutating this vector (as if scaling or normalizing the return value in-place) will mutate the original model's row. If you instead need an independent, locally-mutable vector, use Writing that reminded me of yet another possible approach: we could |
@gojomo Interesting idea, I didn't know you could do that. @asegrenev for your particular case, I'd recommend a more "Pythonic" construction, which will also avoid the mutation issue automatically: sent_vec = np.sum(model[word] * tfidf[word] for word in sent.split()) You may have to handle the edge case where there are no words ( |
@piskvorky Thanks for the recommendations. The case where sent.split() contains nothing is already handled while preprocessing sentences. PS: Before contacting @menshikh-iv we had already found the problem, so this was just something like a "bug report" :) |
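The weighted-sum construction suggested above can be sketched as follows. The dicts `model` and `tfidf` and the helper `sentence_vector` are hypothetical stand-ins, not gensim objects. The sketch passes a list rather than a generator to np.sum, since recent NumPy versions warn about generator input, and it handles the empty-sentence case explicitly.

```python
import numpy as np

# Hypothetical stand-ins: word -> vector and word -> tf-idf weight.
model = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 2.0])}
tfidf = {"cat": 0.5, "sat": 2.0}


def sentence_vector(sent, dim=2):
    words = sent.split()
    if not words:
        # Edge case: no words, return a zero vector of the right size.
        return np.zeros(dim)
    # model[w] * tfidf[w] allocates a new array each time,
    # so the stored vectors are never modified in place.
    return np.sum([model[w] * tfidf[w] for w in words], axis=0)


sent_vec = sentence_vector("cat sat")
print(sent_vec)
```

Because each product is a fresh array, no explicit copy is needed and the stored vectors stay intact.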
Got it -- we welcome all discussions about what problems our users are facing. It helps us assess the prevalence of some issue over time, even if no action is immediately taken. @menshikh-iv To me, that |
Updated |
Resolved in #1662. |
Description
TODO: change commented example
Steps/Code/Corpus to Reproduce
The method KeyedVectors.word_vector returns a "mutable" vector if we call model['anywords'] (model[['anywords']] works correctly because vstack makes a copy).
Simple example
Expected Results
assert passed
Actual Results
assert failed
What needs to be fixed
Add arr.setflags(write=False) in word_vec
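The effect of the proposed setflags call can be shown with plain NumPy. This is a minimal sketch: the `word_vec` function below is a hypothetical stand-in operating on a small array, not gensim's implementation. Marking the returned view read-only makes in-place mutation raise an error, while the backing array itself stays writeable.

```python
import numpy as np

# Stand-in for the model's backing matrix (illustrative only).
vectors = np.ones((2, 3), dtype=np.float32)


def word_vec(i):
    # Hypothetical lookup: return a read-only view of row i.
    arr = vectors[i]
    arr.setflags(write=False)
    return arr


v = word_vec(0)
try:
    v *= 2.0          # in-place mutation of a read-only array
    mutated = True
except ValueError:
    mutated = False   # NumPy refuses to write through the view

# The backing array is still writeable, and the view reflects its changes.
vectors[0] += 1.0
print(mutated, v)
```

This avoids the per-access copy, at the cost of breaking any existing code that relies on writing through the returned vector.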