Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Mutable vector returned by KeyedVectors.word_vector #1651

Closed
menshikh-iv opened this issue Oct 25, 2017 · 12 comments
Closed

Mutable vector returned by KeyedVectors.word_vector #1651

menshikh-iv opened this issue Oct 25, 2017 · 12 comments
Labels
difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Oct 25, 2017

Description

TODO: change commented example

Steps/Code/Corpus to Reproduce

Method KeyedVectors.word_vector returned "mutable" vector if we call model['anywords'] (for model[['anywords']] works correctly because vstack make a copy.

Simple example

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/model')
vector = model['word_from_model']
assert ~np.allclose(vector, np.zeros(vector.shape))  # check that our vector isn't zero
vector *= 0.

assert ~np.allclose(model['word_from_model'], np.zeros(vector.shape))  # failed, because now model['word_from_model'] is zero vector

Expected Results

assert passed

Actual Results

assert failed

What needs to fix

Add arr.setflags(write=False) in word_vec

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Oct 25, 2017
@gojomo
Copy link
Collaborator

gojomo commented Oct 25, 2017

What's wrong with providing a mutable vector? It's very rare to even try to mutate the result, so adding the overhead of a copy, on every access, seems a bad tradeoff to me, and returning a mutable view on the actual backing array the right default behavior.

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Oct 25, 2017

@gojomo I was approached by a person from one company, they at work caught this bug. As for me, it's very strange if you change a vector - your model changed too.

About performance - yes, it will be slightly slower, but I don't think that it will greatly affect performance, look at my small benchmark.

Preparation

import gensim
import numpy as np
from random import sample

model = gensim.models.KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
rand_words = sample(model.index2word, k=100000)

Benchmark

Without any changes

%%timeit

for w in rand_words:
    model[w]

14.4 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With explicit copy

%%timeit

for w in rand_words:
    np.array(model[w])

21.9 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If you think that this is so slow, ok, maybe we'll change https://github.com/RaRe-Technologies/gensim/blob/269028975e0db48e37e01edfb54e66018db7b61b/gensim/models/keyedvectors.py#L598
to

return np.array(self.word_vec(words))

instead word_vec (because two lines after - very similar thing)?
https://github.com/RaRe-Technologies/gensim/blob/269028975e0db48e37e01edfb54e66018db7b61b/gensim/models/keyedvectors.py#L600

@gojomo
Copy link
Collaborator

gojomo commented Oct 25, 2017

I still don't think this is a bug; it's reasonable behavior. It doesn't strike me as 'strange' and those familiar with numpy slot-accessing will expect it.

If some particular uncommon usage needs a local mutable copy, they should create it themselves, in their own code.

Doing the copy by default is slower, uses more memory. It's unneeded by most uses/users. Since it's always worked the other way, the current behavior may be relied upon by older code.

At most, the fact it's a reference could be further noted in the doc-comment.

@piskvorky
Copy link
Owner

piskvorky commented Oct 25, 2017

I'm also -1 on this change. Without a compelling reason, making extra copies would be unexpected, less flexible, and is better left for user-land.

In general, I'd prefer the reason/use-case to come directly from end users. It makes verifying the intent or problem context or its solution clearer, in an open discussion.

@asegrenev
Copy link

asegrenev commented Oct 26, 2017

Caught this bug/reasonable behavior while normalizing word vectors in sentence on tf_idf. Here is sample code:

sent_vec = np.zeros(100)
for word in sent.split():
      word_vec = model[word] 
      word_vec *= tfidf[word]    
      sent_vec += word_vec     

I'm new in python (and programming at all) - didn't expect to fail on this.
Thanks :)

@gojomo
Copy link
Collaborator

gojomo commented Oct 26, 2017

@asegrenev Thanks for the concrete example of a place where this caused a problem!

My suggestion would be to include a warning about this mutability in the doc-comments for both __getitem__() and word_vec(), and an example of how to work around. For example:

"Note that this method returns a reference to a row inside the source model – a numpy view – and thus mutating this vector (as if scaling or normalizing the return value in-place) will mutate the original model's row. If you instead need an independent, locally-mutable vector, use np.array() to copy the return value. For example: vec = np.array(model.word_vec('apple')."

Writing that reminded me of yet another possible approach: we could setflags(write=false) on the returned-array, to make sure users intend any mutation. Any users relying on the old behavior would get a loud error after the change rather than a quiet breakage; any users wanting the old behavior could toggle the flag back to true themselves.

@piskvorky
Copy link
Owner

piskvorky commented Oct 27, 2017

@gojomo Interesting idea, I didn't know you could do that.

@asegrenev for your particular case, I'd recommend a more "Pythonic" construction, which will also avoid the mutation issue automatically:

sent_vec = np.sum(model[word] * tfidf[word] for word in sent.split())

You may have to handle the edge case where there are no words (sent.split() contains nothing) separately though, depending on what you do with sent_vec later.

@asegrenev
Copy link

asegrenev commented Oct 27, 2017

@piskvorky Thanks for recommends. The case "sent.split() contains nothing" is already handled while prepocessing sentences.

PS: Before contacting @menshikh-iv we've already found the problem and that was just something like "bug report" :)

@piskvorky
Copy link
Owner

piskvorky commented Oct 27, 2017

Got it -- we welcome all discussions about what problems our users are facing. It helps us assess the prevalence of some issue over time, even if no action is immediately taken.

@menshikh-iv To me, that setflags(write=false) that Gordon suggested sounds like the ideal solution, it combines the best of both worlds (assuming it works and has no negative side effects, like making a copy anyway; I haven't checked).

@menshikh-iv
Copy link
Contributor Author

setflags looks good for this situation (and allows to avoid this problem in future).

@menshikh-iv
Copy link
Contributor Author

Updated What needs to fix section in head message (added info about setflags)

@piskvorky piskvorky removed the bug Issue described a bug label Oct 27, 2017
@menshikh-iv
Copy link
Contributor Author

Resolved in #1662.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
Development

No branches or pull requests

4 participants