[MRG] intensify cbow+hs tests; bulk testing method #2930
Conversation
Looks good, thanks. What's the risk of the more focused parameters (smaller window, more epochs) hiding an actual problem?
There are some potentially inherent tradeoffs between the goals: test-is-sensitive, test-is-fast, & test-avoids-false-alarms. And the tradeoffs are hard to quantify; the tiny 300-document test corpus makes them all worse. These changes should still catch any complete breakdown of the listed mode, but might be less likely to catch a slight degradation. But the frustrating occasional random failures have mostly been ignored as "that flaky test again", and the highest source of variance has been the thinness of the test data. So whether the old, more fragile test was actually providing any net problem-detection value is unclear.

Still, if there's any particular false-alarm rate you're willing to tolerate – say, 1 in 10,000? – we could theoretically spend the effort to tune each test to hit close to that under current conditions, on a particular machine... then perhaps notice when that rate shoots up. (But it's seemed to me that the CI machines sometimes show far more variance in their results than what I could see on otherwise same-OS/same-libraries local tests... so maybe we'd just find ourselves in a deep investigation of configuration-specific performance mysteries, including idiosyncrasies of the high-tenancy CI SaaS environment.)

A mechanism for the tests to consult a history of runtimes & near-miss tolerances could be interesting, so that our testing would show us if, for example, some algorithm suddenly (or gradually) started running 25% slower, or 25% worse on some substantive evaluation, etc. But there are many other more basic & potentially higher-priority improvements to gensim testing that could also be considered...
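(An illustrative aside, not code from this PR: the sensitivity-vs-false-alarm tradeoff above can be made concrete by repeating a stochastic check and only treating the test as failed when most repetitions fail. If a single healthy run trips the assertion with probability p, requiring a majority of n independent runs to fail shrinks the false-alarm rate binomially. The helper and numbers below are hypothetical.)

```python
import random
import unittest


def failure_rate(check, n_runs=10):
    """Run a stochastic `check` callable `n_runs` times; return the fraction of
    runs that raised AssertionError. (Hypothetical helper, for illustration only.)"""
    failures = 0
    for _ in range(n_runs):
        try:
            check()
        except AssertionError:
            failures += 1
    return failures / n_runs


class FlakyCheckExample(unittest.TestCase):
    def test_majority_of_repeats(self):
        def noisy_check():
            # Stand-in for a real stochastic assertion (e.g. a model-quality check
            # after training on a tiny corpus): it falsely fails ~10% of the time
            # even when nothing is broken.
            assert random.random() > 0.10

        # One run alone would false-alarm ~1 time in 10; requiring more than half of
        # 10 independent runs to fail drops the false-alarm rate to roughly 1.5e-4,
        # at the cost of ~10x runtime and some lost sensitivity to slight regressions.
        self.assertLessEqual(failure_rate(noisy_check, n_runs=10), 0.5)


if __name__ == '__main__':
    unittest.main()
```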
That's a great list. This item jumps out:
as both high-value and relatively easy to implement (I think). What dataset candidates are you thinking of? Some version of text8/text9? A subset of Wikipedia?
If I had a strong idea of something ready-to-drop-in, I'd probably have already tried it! In some text-indexing experiments a while back I'd stripped Wikipedia articles down to their 'intro' sections (just the paragraph or two before the TOC), and doing that again might yield a good general-use corpus, even if trimmed down to the "top N" (most-visited or most-inlinked) few thousand intros.
Yes, Wikimedia already publishes a dump of "top N most impactful / popular articles", with N=10000 IIRC. So that might be the best option. I opened a separate ticket so this doesn't get lost: #2932.
Ran a lot of tests to make our random-influenced tests (mainly `test_cbow_hs`) less unstable: using a different word-pair (for all tests), and more epochs & a larger starting alpha (for `test_cbow_hs`). Should help with the occasional test failures.

Left in place, for future use, a utility method for running other test methods many times to check for rare failures.
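For readers skimming this PR, here is a rough sketch of what such a bulk-running utility could look like; the function name, logging, and the example class/method names in the usage comment are my own illustration, not necessarily what this PR actually adds.

```python
import logging

logger = logging.getLogger(__name__)


def run_many(test_case_class, method_name, n_runs=100):
    """Run a single unittest method `n_runs` times and return its observed failure rate.

    Hypothetical sketch for estimating how often a randomness-sensitive test fails
    on a given machine; not necessarily the utility added by this PR.
    """
    failures = 0
    for i in range(n_runs):
        case = test_case_class(method_name)   # a TestCase instance wraps one method
        result = case.defaultTestResult()
        case.run(result)
        if not result.wasSuccessful():
            failures += 1
            logger.warning("run %d of %s failed", i, method_name)
    logger.info("%s: %d of %d runs failed", method_name, failures, n_runs)
    return failures / n_runs


# Example usage (class/module names assumed for illustration):
#   from gensim.test.test_word2vec import TestWord2VecModel
#   rate = run_many(TestWord2VecModel, "test_cbow_hs", n_runs=200)
```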