Mutable vector returned by KeyedVectors.word_vector #1651

menshikh-iv · 2017-10-25T14:18:55Z

Description

TODO: change commented example

Steps/Code/Corpus to Reproduce

Method KeyedVectors.word_vector returned "mutable" vector if we call model['anywords'] (for model[['anywords']] works correctly because vstack make a copy.

Simple example

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/model')
vector = model['word_from_model']
assert ~np.allclose(vector, np.zeros(vector.shape))  # check that our vector isn't zero
vector *= 0.

assert ~np.allclose(model['word_from_model'], np.zeros(vector.shape))  # failed, because now model['word_from_model'] is zero vector

Expected Results

assert passed

Actual Results

assert failed

What needs to fix

Add arr.setflags(write=False) in word_vec

The text was updated successfully, but these errors were encountered:

gojomo · 2017-10-25T15:50:54Z

What's wrong with providing a mutable vector? It's very rare to even try to mutate the result, so adding the overhead of a copy, on every access, seems a bad tradeoff to me, and returning a mutable view on the actual backing array the right default behavior.

menshikh-iv · 2017-10-25T16:11:52Z

@gojomo I was approached by a person from one company, they at work caught this bug. As for me, it's very strange if you change a vector - your model changed too.

About performance - yes, it will be slightly slower, but I don't think that it will greatly affect performance, look at my small benchmark.

Preparation

import gensim
import numpy as np
from random import sample

model = gensim.models.KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
rand_words = sample(model.index2word, k=100000)

Benchmark

Without any changes

%%timeit

for w in rand_words:
    model[w]

14.4 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With explicit copy

%%timeit

for w in rand_words:
    np.array(model[w])

21.9 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If you think that this is so slow, ok, maybe we'll change https://github.com/RaRe-Technologies/gensim/blob/269028975e0db48e37e01edfb54e66018db7b61b/gensim/models/keyedvectors.py#L598
to

return np.array(self.word_vec(words))

instead word_vec (because two lines after - very similar thing)?
https://github.com/RaRe-Technologies/gensim/blob/269028975e0db48e37e01edfb54e66018db7b61b/gensim/models/keyedvectors.py#L600

gojomo · 2017-10-25T19:54:22Z

I still don't think this is a bug; it's reasonable behavior. It doesn't strike me as 'strange' and those familiar with numpy slot-accessing will expect it.

If some particular uncommon usage needs a local mutable copy, they should create it themselves, in their own code.

Doing the copy by default is slower, uses more memory. It's unneeded by most uses/users. Since it's always worked the other way, the current behavior may be relied upon by older code.

At most, the fact it's a reference could be further noted in the doc-comment.

piskvorky · 2017-10-25T20:05:09Z

I'm also -1 on this change. Without a compelling reason, making extra copies would be unexpected, less flexible, and is better left for user-land.

In general, I'd prefer the reason/use-case to come directly from end users. It makes verifying the intent or problem context or its solution clearer, in an open discussion.

asegrenev · 2017-10-26T15:42:57Z

Caught this bug/reasonable behavior while normalizing word vectors in sentence on tf_idf. Here is sample code:

sent_vec = np.zeros(100)
for word in sent.split():
      word_vec = model[word] 
      word_vec *= tfidf[word]    
      sent_vec += word_vec

I'm new in python (and programming at all) - didn't expect to fail on this.
Thanks :)

gojomo · 2017-10-26T20:01:47Z

@asegrenev Thanks for the concrete example of a place where this caused a problem!

My suggestion would be to include a warning about this mutability in the doc-comments for both __getitem__() and word_vec(), and an example of how to work around. For example:

"Note that this method returns a reference to a row inside the source model – a numpy view – and thus mutating this vector (as if scaling or normalizing the return value in-place) will mutate the original model's row. If you instead need an independent, locally-mutable vector, use np.array() to copy the return value. For example: vec = np.array(model.word_vec('apple')."

Writing that reminded me of yet another possible approach: we could setflags(write=false) on the returned-array, to make sure users intend any mutation. Any users relying on the old behavior would get a loud error after the change rather than a quiet breakage; any users wanting the old behavior could toggle the flag back to true themselves.

piskvorky · 2017-10-27T10:04:19Z

@gojomo Interesting idea, I didn't know you could do that.

@asegrenev for your particular case, I'd recommend a more "Pythonic" construction, which will also avoid the mutation issue automatically:

sent_vec = np.sum(model[word] * tfidf[word] for word in sent.split())

You may have to handle the edge case where there are no words (sent.split() contains nothing) separately though, depending on what you do with sent_vec later.

asegrenev · 2017-10-27T10:25:33Z

@piskvorky Thanks for recommends. The case "sent.split() contains nothing" is already handled while prepocessing sentences.

PS: Before contacting @menshikh-iv we've already found the problem and that was just something like "bug report" :)

piskvorky · 2017-10-27T11:28:44Z

Got it -- we welcome all discussions about what problems our users are facing. It helps us assess the prevalence of some issue over time, even if no action is immediately taken.

@menshikh-iv To me, that setflags(write=false) that Gordon suggested sounds like the ideal solution, it combines the best of both worlds (assuming it works and has no negative side effects, like making a copy anyway; I haven't checked).

menshikh-iv · 2017-10-27T12:04:25Z

setflags looks good for this situation (and allows to avoid this problem in future).

menshikh-iv · 2017-10-27T12:27:14Z

Updated What needs to fix section in head message (added info about setflags)

menshikh-iv · 2017-10-30T04:45:36Z

Resolved in #1662.

menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Oct 25, 2017

piskvorky removed the bug Issue described a bug label Oct 27, 2017

menshikh-iv closed this as completed in 772dae8 Oct 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mutable vector returned by KeyedVectors.word_vector #1651

Mutable vector returned by KeyedVectors.word_vector #1651

menshikh-iv commented Oct 25, 2017 •

edited

Loading

gojomo commented Oct 25, 2017

menshikh-iv commented Oct 25, 2017 •

edited

Loading

gojomo commented Oct 25, 2017

piskvorky commented Oct 25, 2017 •

edited

Loading

asegrenev commented Oct 26, 2017 •

edited

Loading

gojomo commented Oct 26, 2017 •

edited

Loading

piskvorky commented Oct 27, 2017 •

edited

Loading

asegrenev commented Oct 27, 2017 •

edited

Loading

piskvorky commented Oct 27, 2017 •

edited

Loading

menshikh-iv commented Oct 27, 2017

menshikh-iv commented Oct 27, 2017

menshikh-iv commented Oct 30, 2017

Mutable vector returned by KeyedVectors.word_vector #1651

Mutable vector returned by KeyedVectors.word_vector #1651

Comments

menshikh-iv commented Oct 25, 2017 • edited Loading

Description

Steps/Code/Corpus to Reproduce

Expected Results

Actual Results

What needs to fix

gojomo commented Oct 25, 2017

menshikh-iv commented Oct 25, 2017 • edited Loading

Preparation

Benchmark

gojomo commented Oct 25, 2017

piskvorky commented Oct 25, 2017 • edited Loading

asegrenev commented Oct 26, 2017 • edited Loading

gojomo commented Oct 26, 2017 • edited Loading

piskvorky commented Oct 27, 2017 • edited Loading

asegrenev commented Oct 27, 2017 • edited Loading

piskvorky commented Oct 27, 2017 • edited Loading

menshikh-iv commented Oct 27, 2017

menshikh-iv commented Oct 27, 2017

menshikh-iv commented Oct 30, 2017

menshikh-iv commented Oct 25, 2017 •

edited

Loading

menshikh-iv commented Oct 25, 2017 •

edited

Loading

piskvorky commented Oct 25, 2017 •

edited

Loading

asegrenev commented Oct 26, 2017 •

edited

Loading

gojomo commented Oct 26, 2017 •

edited

Loading

piskvorky commented Oct 27, 2017 •

edited

Loading

asegrenev commented Oct 27, 2017 •

edited

Loading

piskvorky commented Oct 27, 2017 •

edited

Loading