
WARNING : supplied example count did not equal expected count #801

Closed
DavidNemeskey opened this issue Jul 23, 2016 · 16 comments
Labels
difficulty easy Easy issue: required small fix documentation Current issue related to documentation

Comments

@DavidNemeskey
Copy link
Contributor

DavidNemeskey commented Jul 23, 2016

I tried to learn a word2vec embedding with gensim:

model = gensim.models.Word2Vec(size=300, window=5, min_count=1, workers=4, iter=10, sg=0)
model.build_vocab(sentences)
model.train(sentences)

With logging switched on, I can see that the training stops after processing 10% of the corpus, and then I get this:

2016-07-23 09:26:25,201 : INFO : collected 197546 word types from a corpus of 686363594 raw words and 35463442 sentences
2016-07-23 09:26:25,697 : INFO : min_count=1 retains 197546 unique words (drops 0)
2016-07-23 09:26:25,697 : INFO : min_count leaves 686363594 word corpus (100% of original 686363594)
2016-07-23 09:26:25,962 : INFO : deleting the raw counts dictionary of 197546 items
2016-07-23 09:26:25,966 : INFO : sample=0.001 downsamples 35 most-common words
2016-07-23 09:26:25,967 : INFO : downsampling leaves estimated 437707717 word corpus (63.8% of prior 686363594)
2016-07-23 09:26:25,967 : INFO : estimated required memory for 197546 words and 300 dimensions: 572883400 bytes
2016-07-23 09:26:26,437 : INFO : resetting layer weights
...
...
...
2016-07-23 09:39:42,895 : INFO : PROGRESS: at 9.99% examples, 868278 words/s, in_qsize 8, out_qsize 0
2016-07-23 09:39:43,578 : INFO : worker thread finished; awaiting finish of 3 more threads
2016-07-23 09:39:43,579 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-07-23 09:39:43,584 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-07-23 09:39:43,589 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-07-23 09:39:43,589 : INFO : training on 686363594 raw words (437701650 effective words) took 504.0s, 868387 effective words/s
2016-07-23 09:39:43,589 : WARNING : supplied example count (35463442) did not equal expected count (354634420)

Why does this happen? I found nothing in the documentation that hints at gensim not processing the whole corpus.

@gojomo
Copy link
Collaborator

gojomo commented Jul 23, 2016

Your sentences argument needs to be an iterable object, which can be iterated over multiple times – not merely an iterator that is exhausted after one pass. For example, the following code should print the same count each time:

print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))

In particular, if you're trying to use a generator, make sure you're passing in the function that returns an iterator, not the single iterator returned from a single call.
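A minimal sketch of that distinction (the class and function names here are illustrative, not part of gensim's API):

```python
class RestartableCorpus:
    """An iterable: each call to iter() starts a fresh pass over the data."""
    def __init__(self, sentences):
        self.sentences = sentences  # e.g. a list, or a file path to reopen

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence

def corpus_gen(sentences):
    """A generator function: each *call* returns a single-use iterator."""
    for sentence in sentences:
        yield sentence

data = [["first", "sentence"], ["second", "sentence"]]

gen = corpus_gen(data)
print(sum(1 for _ in gen), sum(1 for _ in gen))        # 2 0 -- exhausted after one pass

corpus = RestartableCorpus(data)
print(sum(1 for _ in corpus), sum(1 for _ in corpus))  # 2 2 -- restartable
```

Passing `gen` to `Word2Vec` would reproduce the behavior in this issue: the vocabulary scan consumes the iterator, and training then sees no data (or only the first pass of it).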

More helpful background is in the blog post: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

@DavidNemeskey
Copy link
Contributor Author

From this page I gathered that the corpus is iterated twice -- once for build_vocab() and once for train(). Now I see that "The second+ passes train the neural model." -- so I guess this really must be the problem.

@gojomo
Copy link
Collaborator

gojomo commented Aug 7, 2016

Yes, it's typical to do multiple training passes over a corpus – unless it is already gigantic. The original word2vec.c tool defaults to 5 training passes over a supplied input file; and the current version of gensim Word2Vec also defaults to 5 training passes over the supplied iterable corpus object.

The blog post is still a little unclear, given how atypical a single training pass is in practice. @piskvorky – can the blog post be tightened a bit further? I would suggest (a) changing that 'second+' to something clearer like 'second and subsequent'; (b) deemphasize the iter=1 case, perhaps by putting it in a different-color DIV; (c) include a link to your other "Data Streaming in Python: generators, iterators, iterables" post.

@DavidNemeskey
Copy link
Contributor Author

I reopened the issue because I believe @gojomo's suggestions make sense.

@piskvorky
Copy link
Owner

@gojomo How about now?

@gojomo
Copy link
Collaborator

gojomo commented Aug 8, 2016

Better! But I'd prefer to either eliminate or move-to-bottom the 'advanced users' stuff.

The overwhelmingly-common case seems to be (1) less-advanced-user; with (2) small-dataset; and (3) potential confusion about iterators-vs-iterables. In that case, the important thing to emphasize is that the corpus be multiply-iterable, and any other details around that point are just 'attractive hazards'.

@piskvorky
Copy link
Owner

piskvorky commented Aug 9, 2016

I don't think they are overwhelmingly common -- just the most vocal, for obvious reasons.

@tmylk tmylk added documentation Current issue related to documentation difficulty easy Easy issue: required small fix labels Oct 5, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 5, 2016

Looking for volunteer to add these improvements to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

@PNR-1
Copy link

PNR-1 commented Oct 8, 2016

Hey, I'm looking into the code, and I guess the best approach would be to just throw an error whenever sentences is not iterable?

@gojomo
Copy link
Collaborator

gojomo commented Oct 8, 2016

@Doppler010 What specific test would you propose, and could it distinguish between a single-use iterator and something that is re-iterable?

@PNR-1
Copy link

PNR-1 commented Oct 9, 2016

@gojomo - I was thinking something along the lines of
'iterator' if obj is iter(obj) else 'iterable'
We can add this try/except check in addition to the documentation changes.
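The `obj is iter(obj)` expression works because calling `iter()` on an iterator returns that same object, while calling it on a container returns a new iterator each time. A hedged sketch of the check as a helper (the function name is illustrative, not gensim code):

```python
def corpus_kind(obj):
    """Classify obj as a single-use 'iterator' or a re-iterable 'iterable'.

    Note: this consumes nothing from obj, but it does create one
    throwaway iterator when obj is a container.
    """
    try:
        return 'iterator' if obj is iter(obj) else 'iterable'
    except TypeError:
        return 'not iterable at all'

print(corpus_kind([["a", "b"]]))         # iterable
print(corpus_kind(iter([["a", "b"]])))   # iterator
print(corpus_kind(x for x in range(3)))  # iterator (generators are single-use)
print(corpus_kind(42))                   # not iterable at all
```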

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

@Doppler010 - That's an interesting test! I'd prefer not to create a throwaway iterator just as a test, but perhaps this could be combined with the iteration-start that needs to happen anyway, generating a warning when that's not a different object (and thus the source is likely not a repeatably-iterable object). We'll still want the warning about mismatched counts as well – that will also catch places where the user has varied the corpus since build_vocab(), or otherwise not provided an accurate expected size (which is necessary for proper alpha decay scheduling and accurate progress logging).
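That combined idea might look roughly like this. This is only a sketch of the suggestion, not gensim's actual implementation: `iter()` has to be called anyway to begin a training pass, so the identity check costs nothing extra.

```python
import warnings

def start_training_pass(corpus):
    """Begin iterating over corpus, warning if it looks single-use.

    Hypothetical helper illustrating the suggestion above; gensim's
    real handling of this case may differ in detail.
    """
    it = iter(corpus)  # this call must happen anyway to start the pass
    if it is corpus:
        warnings.warn(
            "corpus appears to be a single-use iterator; "
            "later training passes will see no data")
    return it

it = start_training_pass([["a"], ["b"]])  # no warning: a list is re-iterable
```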

@PNR-1
Copy link

PNR-1 commented Oct 9, 2016

@gojomo - Can you please point me to the location of the iteration-start in word2vec.py? I'm not able to figure it out.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

@pamdla
Copy link

pamdla commented Jul 5, 2020

Has this warning been fixed yet? And how?

@gojomo
Copy link
Collaborator

gojomo commented Jul 6, 2020

@lampda - this warning typically means you've done something wrong: not supplying the expected number of texts. So any fix would be in your code; use the checks above to confirm your corpus iterable is correct. If you have other questions, the project discussion list is more appropriate: https://groups.google.com/forum/#!forum/gensim
