WARNING: supplied example count did not equal expected count #801
In particular, if you're trying to use a generator, make sure you're passing in the function that returns an iterator, not the single iterator returned from a single call. More helpful background is in the blog post: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
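The distinction above can be shown with a short sketch (class and variable names here are illustrative, not gensim's own): a generator is exhausted after one pass, while a class whose `__iter__` returns a fresh iterator can be scanned repeatedly.

```python
# A generator is single-use; an iterable class is restartable.

def sentences_gen():
    """Yields tokenized sentences -- but only once."""
    for line in ["first sentence", "second sentence"]:
        yield line.split()

class SentenceCorpus:
    """Restartable: each call to __iter__ returns a fresh iterator."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

gen = sentences_gen()
print(sum(1 for _ in gen))  # 2 -- the first pass consumes it
print(sum(1 for _ in gen))  # 0 -- exhausted; a second pass sees nothing

corpus = SentenceCorpus(["first sentence", "second sentence"])
print(sum(1 for _ in corpus))  # 2
print(sum(1 for _ in corpus))  # 2 -- restartable, safe for multi-pass training
```

This is why passing the *function* (or a restartable object) works where passing the single returned iterator does not.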
From this page I gathered that the corpus is iterated twice -- once for building the vocabulary and once for training.
Yes, it's typical to do multiple training passes over a corpus – unless it is already gigantic. The original word2vec.c tool defaults to 5 training passes over a supplied input file, and the current version of gensim Word2Vec also defaults to 5 training passes over the supplied iterable corpus object. The blog post is still a little unclear, given how atypical a single training pass is in practice. @piskvorky – can the blog post be tightened a bit further? I would suggest (a) changing that 'second+' to something clearer like 'second and subsequent'; (b) de-emphasizing the single-pass case.
I reopened the issue because I believe @gojomo's suggestions make sense.

@gojomo How about now?

Better! But I'd prefer to either eliminate or move-to-bottom the 'advanced users' stuff. The overwhelmingly common case seems to be (1) a less-advanced user, with (2) a small dataset, and (3) potential confusion about iterators vs. iterables. In that case, the important thing to emphasize is that the corpus be multiply-iterable; any other details around that point are just 'attractive hazards'.
I don't think they are overwhelmingly common -- just the most vocal, for obvious reasons.
Looking for a volunteer to add these improvements to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
Hey, I'm looking into the code, and I guess the best approach would be to just throw an error whenever sentences is not iterable?
@Doppler010 What specific test would you propose, and could it distinguish between a single-use iterator and something that is re-iterable?
@gojomo - I was thinking something along the lines of …
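One possible check along these lines (a sketch only, not gensim's actual code; the function name is hypothetical): for a single-use iterator, `iter(x)` returns `x` itself, so calling `iter()` twice and comparing the results distinguishes it from a re-iterable object without consuming any elements.

```python
import warnings

def check_restartable(sentences):
    """Warn if `sentences` looks like a single-use iterator.

    A re-iterable object (list, restartable corpus class) returns a new
    iterator object on each iter() call; a generator returns itself.
    """
    if iter(sentences) is iter(sentences):
        warnings.warn(
            "sentences appears to be a single-use iterator; "
            "multi-pass training needs a re-iterable corpus"
        )
        return False
    return True

check_restartable([["a", "b"], ["c", "d"]])   # True: lists are re-iterable
check_restartable(s for s in [["a", "b"]])    # False: generators are single-use
```

Note that `iter()` on a generator does not consume any items, so this probe is side-effect-free, unlike creating and partially draining a throwaway iterator.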
@Doppler010 - That's an interesting test! I'd prefer not to create a throwaway iterator just as a test, but perhaps this could be combined with the iteration-start that needs to happen anyway, generating a warning when that's not a different object (and thus the source is likely not a repeatably-iterable object). We'll still want the warning about mismatched counts as well – that will also catch places where the user has varied the corpus since …
@gojomo - Can you please point me to the location of the iteration-start in word2vec.py? I'm not able to figure it out.
@Doppler010 word2vec.py sets it up at https://github.com/RaRe-Technologies/gensim/blob/192792688b1e7439cf10076648ff499f557142f9/gensim/models/word2vec.py#L784 though since it uses …
Has this warning been fixed yet? And how?
@lampda - this warning typically means you've done something wrong: not supplying the expected number of texts. So any fix would be in your code; run the checks above to verify your corpus iterable is correct. If you have other questions, the project discussion list is more appropriate: https://groups.google.com/forum/#!forum/gensim
I tried to learn a word2vec embedding with gensim:
With logging switched on, I can see that the training stops after processing 10% of the corpus, and then I get this:
Why does this happen? I found nothing in the documentation that hints at gensim not processing the whole corpus.
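Given the thread's diagnosis, a common way to hit this warning is passing a generator as the corpus: the vocabulary-building scan consumes it, so the subsequent training passes see fewer examples than expected. A minimal reproduction of the count mismatch (with hypothetical toy data, not the reporter's actual code):

```python
# A generator corpus is drained by the first scan, so later passes
# see a different (smaller) example count -- the root of the warning.
texts = [["some", "words"]] * 10

gen = (t for t in texts)
vocab_scan_count = sum(1 for _ in gen)   # first pass sees all 10 examples
training_count = sum(1 for _ in gen)     # second pass sees 0 examples

print(vocab_scan_count, training_count)  # 10 0 -- counts don't match
```

Replacing the generator with a list, or a class whose `__iter__` returns a fresh iterator, makes every pass see the full corpus and the warning goes away.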