
[MRG] intensify cbow+hs tests; bulk testing method #2930

Merged 2 commits on Sep 2, 2020

Conversation

@gojomo (Collaborator) commented on Sep 2, 2020

Ran a lot of tests to make our random-influenced tests (mainly test_cbow_hs) less unstable: using a different word pair (for all tests), and more epochs & a larger starting alpha (for test_cbow_hs).

Should help with the occasional test failures.

Also left in place, for future use, a utility method for running other test methods many times to check for rare failures.
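
For reference, a minimal sketch of the idea behind such a bulk-run helper (the name `run_many`, the reporting format, and the example class/method names are illustrative, not the exact utility left in the test module):

```python
import unittest


def run_many(test_case_class, method_name, times=100):
    """Run a single unittest method repeatedly & report how often it fails.

    Illustrative sketch only -- not the exact helper committed in this PR.
    """
    failures = 0
    for _ in range(times):
        suite = unittest.TestSuite([test_case_class(method_name)])
        result = unittest.TestResult()
        suite.run(result)
        if result.failures or result.errors:
            failures += 1
    print("%s: %d failures in %d runs (%.2f%%)"
          % (method_name, failures, times, 100.0 * failures / times))
    return failures


# e.g. (hypothetical): run_many(TestWord2VecModel, 'test_cbow_hs', times=1000)
```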

@gojomo requested review from piskvorky and mpenkov on September 2, 2020, 03:37

@piskvorky (Owner) left a comment

Looks good, thanks. What's the risk of the more focused parameters (smaller window, more epochs) hiding an actual problem?

@gojomo (Collaborator, Author) commented on Sep 2, 2020

There are some potentially inherent tradeoffs between the goals: test-is-sensitive, test-is-fast, & test-avoids-false-alarms. And the tradeoffs are hard to quantify.

The tiny 300-document lee_background.cor dataset we're using is not really enough to truly & reliably exercise Word2Vec-style algorithms. And the existing tests are a mix of, on the one hand, shallow checks (just testing that modes don't error, or that outputs fit an expected data shape, rather than that performance consistently meets expectations), and on the other hand, highly redundant ones (many tests of the same superficial operations).

These changes should still catch any complete breakdown of the listed mode, but might be less likely to catch a slight degradation. But the frustrating occasional random failures have mostly been ignored as "that flaky test again", and the biggest source of variance has been the thinness of the test data. So it's unclear whether the old, more fragile test was actually providing any net problem-detection value.

Still, if there's any particular false-alarm rate you're willing to tolerate – say, 1 in 10,000? – we could theoretically spend the effort to tune each test to hit close to that under current conditions, on a particular machine... then perhaps notice when that rate shoots up. (But, it's seemed to me that the CI machines sometimes show far more variance in their results than what I could see on otherwise same-OS/same-libraries local tests... so maybe we'd just find ourselves in a deep investigation of configuration-specific-performance mysteries, including idiosyncrasies of the high-tenancy CI SaaS environment.)

A mechanism for the tests to consult a history of runtimes & near-miss-tolerances could be interesting, so that our testing would show us if, for example, some algorithm suddenly (or gradually) was running 25% slower, or 25% worse on some substantive evaluation, etc.
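
As a rough illustration of what I mean (the file format, 25% threshold & mean-based comparison below are all hypothetical, not an existing gensim mechanism):

```python
import json
import os


def check_against_history(name, value, history_path="test_history.json",
                          tolerance=0.25, keep=50):
    """Append `value` (a runtime, or an error score where lower is better)
    to a per-test history & return True if it regressed past `tolerance`.

    Hypothetical sketch only.
    """
    history = {}
    if os.path.exists(history_path):
        with open(history_path) as f:
            history = json.load(f)
    past = history.get(name, [])
    regressed = False
    if past:
        baseline = sum(past) / len(past)
        # e.g. flag if 25% slower (or 25% worse) than the historical average
        regressed = value > baseline * (1.0 + tolerance)
    history[name] = (past + [value])[-keep:]
    with open(history_path, "w") as f:
        json.dump(history, f)
    return regressed
```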

But there are many other more basic & potentially higher-priority improvements to gensim testing that could also be considered:

  • harmonizing test-method style to snake_case, rather than the camelCase leaked in from unittest's JUnit origins
  • eliminating excess warnings/output, unless a test is specifically trying to exercise a warning - the clutter in output makes test results harder to review
  • using better data - a default Word2Vec epochs=5 train over a dataset that's 12x as large would be no slower, and much more substantive/valuable, than epochs=60 over a toy dataset
  • inventorying, better naming/documenting, & committing code-to-create any static test files that other test methods need - no more missing-version-numbers, generically-named, unclear-provenance "old_model.bin" or "test_data.txt" files
  • discarding outdated test methods (esp. loading of ancient models)
  • removing fully-redundant tests that only do a subset of what multiple other fuller-cycle tests are doing
  • ensuring long-running tests are earning their runtime with substantive checks of functionality - no reason to run a 1-minute-long intense training if the only check at the end is "is the return value the right shape?"
  • checking test coverage, to find any major methods/modes not being tested at all, & adding tests for them
  • moving to more than one suite, so that long/sensitive/subtle-performance-regression tests (which might take hours or even days to run, and be most interesting in the lead-up to official releases) can be separate from the immediate/automatic tests (ideally completing in 5-10m) that run on every commit - see the rough sketch of one way to do that split, just below
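
For that last item, one possible shape of the split (purely illustrative, using a pytest marker & an opt-in flag that don't exist in gensim today):

```python
# conftest.py -- illustrative sketch only; gensim has no such marker/flag today
import pytest


def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False,
                     help="also run long/sensitive performance-regression tests")


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: long-running / release-time performance tests")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return  # explicit opt-in: run everything
    skip_slow = pytest.mark.skip(reason="needs --run-slow")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Per-commit CI would then run plain `pytest` (fast suite only), while a pre-release job runs `pytest --run-slow`.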

@gojomo merged commit 030e650 into piskvorky:develop on Sep 2, 2020
@piskvorky (Owner) commented

That's a great list. This item jumps out:

  • using better data

as both high-value and relatively easy to implement (I think).

What dataset candidates are you thinking of? Some version of text8/text9? A subset of Wikipedia?

@gojomo (Collaborator, Author) commented on Sep 2, 2020

If I had a strong idea of something ready-to-drop-in, I'd probably have already tried it! One of the nice things about lee_background is that it works, to an extent, for both word- and doc-vector testing/demos – so that'd be nice to retain. (IMO, text8/text9's strip-all-punctuation-and-even-linebreaks preprocessing makes it suboptimal even for word2vec training – except as a demo that even messy text can create useful word-vectors.)

In some text-indexing experiments a while back, I'd stripped Wikipedia articles down to their 'intro' sections (just the paragraph or two before the TOC), and doing that again might yield a good general-use corpus, even if trimmed down to the "top N" (most-visited or most-inlinked) few thousand intros.
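
Roughly, assuming plain wikitext input, the stripping is just a cut at the first section heading (the regex & helper name here are only for illustration):

```python
import re

# a line like "== History ==" marks the first section heading after the intro
_FIRST_HEADING = re.compile(r"^==[^=].*?==\s*$", flags=re.MULTILINE)


def intro_of(wikitext):
    """Return only the lead section of one Wikipedia article's wikitext."""
    match = _FIRST_HEADING.search(wikitext)
    return wikitext[:match.start()].strip() if match else wikitext.strip()
```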

@piskvorky (Owner) commented on Sep 3, 2020

Yes, Wikimedia already publishes a dump of "top N most impactful / popular articles", with N=10000 IIRC.
And we already have an article section parser (also used in your much beloved gensim-data).

So that might be the best option. I opened a separate ticket so this doesn't get lost: #2932.
