
WARNING : supplied example count did not equal expected count #801

Closed
DavidNemeskey opened this issue Jul 23, 2016 · 16 comments
Labels
difficulty easy Easy issue: required small fix documentation Current issue related to documentation

Comments

@DavidNemeskey
Copy link
Contributor

DavidNemeskey commented Jul 23, 2016

I tried to learn a word2vec embedding with gensim:

model = gensim.models.Word2Vec(size=300, window=5, min_count=1, workers=4, iter=10, sg=0)
model.build_vocab(sentences)
model.train(sentences)

With logging switched on, I can see that the training stops after processing 10% of the corpus, and then I get this:

2016-07-23 09:26:25,201 : INFO : collected 197546 word types from a corpus of 686363594 raw words and 35463442 sentences
2016-07-23 09:26:25,697 : INFO : min_count=1 retains 197546 unique words (drops 0)
2016-07-23 09:26:25,697 : INFO : min_count leaves 686363594 word corpus (100% of original 686363594)
2016-07-23 09:26:25,962 : INFO : deleting the raw counts dictionary of 197546 items
2016-07-23 09:26:25,966 : INFO : sample=0.001 downsamples 35 most-common words
2016-07-23 09:26:25,967 : INFO : downsampling leaves estimated 437707717 word corpus (63.8% of prior 686363594)
2016-07-23 09:26:25,967 : INFO : estimated required memory for 197546 words and 300 dimensions: 572883400 bytes
2016-07-23 09:26:26,437 : INFO : resetting layer weights
...
...
...
2016-07-23 09:39:42,895 : INFO : PROGRESS: at 9.99% examples, 868278 words/s, in_qsize 8, out_qsize 0
2016-07-23 09:39:43,578 : INFO : worker thread finished; awaiting finish of 3 more threads
2016-07-23 09:39:43,579 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-07-23 09:39:43,584 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-07-23 09:39:43,589 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-07-23 09:39:43,589 : INFO : training on 686363594 raw words (437701650 effective words) took 504.0s, 868387 effective words/s
2016-07-23 09:39:43,589 : WARNING : supplied example count (35463442) did not equal expected count (354634420)

Why does this happen? I found nothing in the documentation that hints at gensim not processing the whole corpus.

@gojomo
Copy link
Collaborator

gojomo commented Jul 23, 2016

Your sentences argument needs to be an iterable object, which can be iterated over multiple times – not merely an iterator that is exhausted after one pass. For example, the following code should print the same count each time:

print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))

In particular, if you're trying to use a generator, make sure you're passing in the function that returns an iterator, not the single iterator returned from a single call.
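A minimal sketch of that distinction (the class and function names here are illustrative, not part of gensim's API):

```python
class RestartableCorpus:
    """An iterable: each call to iter() starts a fresh pass over the data."""
    def __init__(self, sentences):
        self.sentences = sentences  # e.g. a list, or a file path to reopen

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence

def corpus_gen(sentences):
    """A generator function: each *call* returns a single-use iterator."""
    for sentence in sentences:
        yield sentence

data = [["first", "sentence"], ["second", "sentence"]]

gen = corpus_gen(data)
print(sum(1 for _ in gen), sum(1 for _ in gen))        # 2 0 -- exhausted after one pass

corpus = RestartableCorpus(data)
print(sum(1 for _ in corpus), sum(1 for _ in corpus))  # 2 2 -- restartable
```

Passing `gen` to `Word2Vec` would reproduce the behavior in this issue: the vocabulary scan consumes the iterator, and training then sees no data (or only the first pass of it).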

More helpful background is in the blog post: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

@DavidNemeskey
Copy link
Contributor Author

From this page I gathered that the corpus is iterated twice -- once for build_vocab() and once for train(). Now I see that "The second+ passes train the neural model." -- so I guess this really must be the problem.

@gojomo
Copy link
Collaborator

gojomo commented Aug 7, 2016

Yes, it's typical to do multiple training passes over a corpus – unless it is already gigantic. The original word2vec.c tool defaults to 5 training passes over a supplied input file; and the current version of gensim Word2Vec also defaults to 5 training passes over the supplied iterable corpus object.

The blog post is still a little unclear, given how atypical a single training pass is in practice. @piskvorky – can the blog post be tightened a bit further? I would suggest (a) changing that 'second+' to something clearer like 'second and subsequent'; (b) deemphasize the iter=1 case, perhaps by putting it in a different-color DIV; (c) include a link to your other "Data Streaming in Python: generators, iterators, iterables" post.

@DavidNemeskey
Copy link
Contributor Author

I reopened the issue because I believe @gojomo's suggestions make sense.

@piskvorky
Copy link
Owner

@gojomo How about now?

@gojomo
Copy link
Collaborator

gojomo commented Aug 8, 2016

Better! But I'd prefer to either eliminate or move-to-bottom the 'advanced users' stuff.

The overwhelmingly-common case seems to be (1) less-advanced-user; with (2) small-dataset; and (3) potential confusion about iterators-vs-iterables. In that case, the important thing to emphasize is that the corpus be multiply-iterable, and any other details around that point are just 'attractive hazards'.

@piskvorky
Copy link
Owner

piskvorky commented Aug 9, 2016

I don't think they are overwhelmingly common -- just the most vocal, for obvious reasons.

@tmylk tmylk added documentation Current issue related to documentation difficulty easy Easy issue: required small fix labels Oct 5, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 5, 2016

Looking for volunteer to add these improvements to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

@PNR-1
Copy link

PNR-1 commented Oct 8, 2016

Hey, I'm looking into the code, and I guess the best approach would be to just throw an error whenever sentences is not iterable?

@gojomo
Copy link
Collaborator

gojomo commented Oct 8, 2016

@Doppler010 What specific test would you propose, and could it distinguish between a single-use iterator and something that is re-iterable?

@PNR-1
Copy link

PNR-1 commented Oct 9, 2016

@gojomo - I was thinking something along the lines of
'iterator' if obj is iter(obj) else 'iterable'
We can add this try/except check in addition to the documentation changes.
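The `obj is iter(obj)` expression works because calling `iter()` on an iterator returns that same object, while calling it on a container returns a new iterator each time. A hedged sketch of the check as a helper (the function name is illustrative, not gensim code):

```python
def corpus_kind(obj):
    """Classify obj as a single-use 'iterator' or a re-iterable 'iterable'.

    Note: this consumes nothing from obj, but it does create one
    throwaway iterator when obj is a container.
    """
    try:
        return 'iterator' if obj is iter(obj) else 'iterable'
    except TypeError:
        return 'not iterable at all'

print(corpus_kind([["a", "b"]]))         # iterable
print(corpus_kind(iter([["a", "b"]])))   # iterator
print(corpus_kind(x for x in range(3)))  # iterator (generators are single-use)
print(corpus_kind(42))                   # not iterable at all
```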

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

@Doppler010 - That's an interesting test! I'd prefer not to create a throwaway iterator just as a test, but perhaps this could be combined with the iteration-start that needs to happen anyway, generating a warning when that's not a different object (and thus the source is likely not a repeatably-iterable object). We'll still want the warning about mismatched counts as well – that will also catch places where the user has varied the corpus since build_vocab(), or otherwise not provided an accurate expected size (which is necessary for proper alpha decay scheduling and accurate progress logging).
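That combined idea might look roughly like this. This is only a sketch of the suggestion, not gensim's actual implementation: `iter()` has to be called anyway to begin a training pass, so the identity check costs nothing extra.

```python
import warnings

def start_training_pass(corpus):
    """Begin iterating over corpus, warning if it looks single-use.

    Hypothetical helper illustrating the suggestion above; gensim's
    real handling of this case may differ in detail.
    """
    it = iter(corpus)  # this call must happen anyway to start the pass
    if it is corpus:
        warnings.warn(
            "corpus appears to be a single-use iterator; "
            "later training passes will see no data")
    return it

it = start_training_pass([["a"], ["b"]])  # no warning: a list is re-iterable
```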

@PNR-1
Copy link

PNR-1 commented Oct 9, 2016

@gojomo - Can you please point me to the location of the iteration-start in word2vec.py? I'm not able to figure it out.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

@pamdla
Copy link

pamdla commented Jul 5, 2020

Has this warning been fixed yet? And how?

@gojomo
Copy link
Collaborator

gojomo commented Jul 6, 2020

@lampda - this warning typically means you've done something wrong: not supplying the expected number of texts. So any fix would be in your code; use the checks above to confirm your corpus iterable is correct. If you have other questions, the project discussion list is more appropriate: https://groups.google.com/forum/#!forum/gensim
