
[MRG] intensify cbow+hs tests; bulk testing method #2930

Merged 2 commits on Sep 2, 2020

Conversation

@gojomo (Collaborator) commented on Sep 2, 2020

Ran a lot of tests to make our random-influenced tests (mainly test_cbow_hs) less unstable: using a different word pair (for all tests), and more epochs & a larger starting alpha (for test_cbow_hs).

Should help with the occasional test failures.

Also left in place, for future use, a utility method for running other test methods many times to check for rare failures.
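
For reference, a minimal sketch of the idea behind such a bulk-run helper (the name `run_many`, the reporting format, and the example class/method names are illustrative, not the exact utility left in the test module):

```python
import unittest


def run_many(test_case_class, method_name, times=100):
    """Run a single unittest method repeatedly & report how often it fails.

    Illustrative sketch only -- not the exact helper committed in this PR.
    """
    failures = 0
    for _ in range(times):
        suite = unittest.TestSuite([test_case_class(method_name)])
        result = unittest.TestResult()
        suite.run(result)
        if result.failures or result.errors:
            failures += 1
    print("%s: %d failures in %d runs (%.2f%%)"
          % (method_name, failures, times, 100.0 * failures / times))
    return failures


# e.g. (hypothetical): run_many(TestWord2VecModel, 'test_cbow_hs', times=1000)
```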

@gojomo requested review from piskvorky and mpenkov on September 2, 2020, 03:37

@piskvorky (Owner) left a comment

Looks good, thanks. What's the risk of the more focused parameters (smaller window, more epochs) hiding an actual problem?

@gojomo (Collaborator, Author) commented on Sep 2, 2020

There are some potentially inherent tradeoffs between the goals: test-is-sensitive, test-is-fast, & test-avoids-false-alarms. And the tradeoffs are hard to quantify.

The tiny 300-document lee_background.cor dataset we're using is not really enough to truly & reliably exercise Word2Vec-style algorithms. And the existing tests are a mix of, on the one hand, shallow checks (just testing that modes don't error, or that outputs fit an expected data shape, rather than that performance consistently meets expectations), and on the other hand, highly redundant ones (many tests of the same superficial operations).

These changes should still catch any complete breakdown of the listed mode, but might be less likely to catch a slight degradation. But the frustrating occasional random failures have mostly been ignored as "that flaky test again", and the biggest source of variance has been the thinness of the test data. So it's unclear whether the old, more fragile test was actually providing any net problem-detection value.

Still, if there's any particular false-alarm rate you're willing to tolerate – say, 1 in 10,000? – we could theoretically spend the effort to tune each test to hit close to that under current conditions, on a particular machine... then perhaps notice when that rate shoots up. (But, it's seemed to me that the CI machines sometimes show far more variance in their results than what I could see on otherwise same-OS/same-libraries local tests... so maybe we'd just find ourselves in a deep investigation of configuration-specific-performance mysteries, including idiosyncrasies of the high-tenancy CI SaaS environment.)

A mechanism for the tests to consult a history of runtimes & near-miss-tolerances could be interesting, so that our testing would show us if, for example, some algorithm suddenly (or gradually) was running 25% slower, or 25% worse on some substantive evaluation, etc.
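
As a rough illustration of what I mean (the file format, 25% threshold & mean-based comparison below are all hypothetical, not an existing gensim mechanism):

```python
import json
import os


def check_against_history(name, value, history_path="test_history.json",
                          tolerance=0.25, keep=50):
    """Append `value` (a runtime, or an error score where lower is better)
    to a per-test history & return True if it regressed past `tolerance`.

    Hypothetical sketch only.
    """
    history = {}
    if os.path.exists(history_path):
        with open(history_path) as f:
            history = json.load(f)
    past = history.get(name, [])
    regressed = False
    if past:
        baseline = sum(past) / len(past)
        # e.g. flag if 25% slower (or 25% worse) than the historical average
        regressed = value > baseline * (1.0 + tolerance)
    history[name] = (past + [value])[-keep:]
    with open(history_path, "w") as f:
        json.dump(history, f)
    return regressed
```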

But there are many other more basic & potentially higher-priority improvements to gensim testing that could also be considered:

  • harmonizing test-method style to snake_case, rather than the camelCase leaked in from unittest's JUnit origins
  • eliminating excess warnings/output, unless a test is specifically trying to exercise a warning - the clutter in output makes test results harder to review
  • using better data - a default Word2Vec epochs=5 train over a dataset that's 12x as large would be no slower, and much more substantive/valuable, than epochs=60 over a toy dataset
  • inventorying, better naming/documenting, & committing code-to-create any static test files that other test methods need - no more missing-version-numbers, generically-named, unclear-provenance "old_model.bin" or "test_data.txt" files
  • discarding outdated test methods (esp. loading of ancient models)
  • removing fully-redundant tests that only do a subset of what multiple other fuller-cycle tests are doing
  • ensuring long-running tests are earning their runtime with substantive checks of functionality - no reason to run a 1-minute-long intense training if the only check at the end is "is the return value the right shape?"
  • checking test coverage, to find any major methods/modes not being tested at all, & adding tests for them
  • moving to more than one suite, so that long/sensitive/subtle-performance-regression tests (which might take hours or even days to run, and be most interesting in the lead-up to official releases) can be separate from the immediate/automatic tests (ideally completing in 5-10m) that run on every commit - see the rough sketch of one way to do that split, just below
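
For that last item, one possible shape of the split (purely illustrative, using a pytest marker & an opt-in flag that don't exist in gensim today):

```python
# conftest.py -- illustrative sketch only; gensim has no such marker/flag today
import pytest


def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False,
                     help="also run long/sensitive performance-regression tests")


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: long-running / release-time performance tests")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return  # explicit opt-in: run everything
    skip_slow = pytest.mark.skip(reason="needs --run-slow")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Per-commit CI would then run plain `pytest` (fast suite only), while a pre-release job runs `pytest --run-slow`.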

@gojomo merged commit 030e650 into piskvorky:develop on Sep 2, 2020
@piskvorky (Owner) commented

That's a great list. This item jumps out:

  • using better data

as both high-value and relatively easy to implement (I think).

What dataset candidates are you thinking of? Some version of text8/text9? A subset of Wikipedia?

@gojomo (Collaborator, Author) commented on Sep 2, 2020

If I had a strong idea of something ready-to-drop-in, I'd probably have already tried it! One of the nice things about lee_background is that it works, to an extent, for both word- and doc-vector testing/demos – so that'd be nice to retain. (IMO, text8/text9's strip-all-punctuation-and-even-linebreaks preprocessing makes it suboptimal even for word2vec training – except as a demo that even messy text can create useful word-vectors.)

In some text-indexing experiments a while back, I'd stripped Wikipedia articles down to their 'intro' sections (just the paragraph or two before the TOC), and doing that again might yield a good general-use corpus, even if trimmed down to the "top N" (most-visited or most-inlinked) few thousand intros.
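
Roughly, assuming plain wikitext input, the stripping is just a cut at the first section heading (the regex & helper name here are only for illustration):

```python
import re

# a line like "== History ==" marks the first section heading after the intro
_FIRST_HEADING = re.compile(r"^==[^=].*?==\s*$", flags=re.MULTILINE)


def intro_of(wikitext):
    """Return only the lead section of one Wikipedia article's wikitext."""
    match = _FIRST_HEADING.search(wikitext)
    return wikitext[:match.start()].strip() if match else wikitext.strip()
```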

@piskvorky (Owner) commented on Sep 3, 2020

Yes, Wikimedia already publishes a dump of "top N most impactful / popular articles", with N=10000 IIRC.
And we already have an article section parser (also used in your much beloved gensim-data).

So that might be the best option. I opened a separate ticket so this doesn't get lost: #2932.
