We use two datasets for training an annotation classifier:
- Source dataset - unambiguous annotations from corpora
- Target dataset - ambiguous annotations labeled by a Mechanical Turk majority vote with a 75% agreement threshold (data)
Note: the Mechanical Turk dataset is larger than the one used in results.org, so the results are not immediately comparable.
This classifier first trains a MultinomialNB on the source dataset, then uses partial_fit
to train it further on the target dataset with a higher sample_weight
(see source and docs). In the experiment, for 1000 simulations:
- Split the ambiguous annotation dataset into train (2/3) and validation (1/3) sets
- Validate the initial transfer classifier (trained on unambiguous annotations only)
- Train it on the whole train set of target data
- Validate it again
- Compute increase in agreement
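The transfer scheme described above can be sketched as follows. This is a minimal illustration, not the document's actual implementation: the toy count matrices, class labels, and the 1/1000 weight ratio are all placeholders.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Toy bag-of-words count matrices standing in for the real feature vectors.
X_source = rng.integers(0, 5, size=(200, 30))
y_source = rng.integers(0, 2, size=200)
X_target = rng.integers(0, 5, size=(30, 30))
y_target = rng.integers(0, 2, size=30)

clf = MultinomialNB()

# Source pass: unambiguous annotations, weight 1 per sample.
clf.partial_fit(X_source, y_source, classes=np.unique(y_source),
                sample_weight=np.ones(len(y_source)))

# Target pass: ambiguous MTurk annotations with a much higher weight,
# e.g. the 1/1000 source/target ratio from the table below.
clf.partial_fit(X_target, y_target,
                sample_weight=np.full(len(y_target), 1000.0))

preds = clf.predict(X_target[:5])
```

The key point is that `partial_fit` accumulates weighted counts, so the target pass shifts the model toward the ambiguous data in proportion to the weight ratio.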
For features, the FullContextBagOfWordsLeftRightCutoff(9) vectorizer was used.
Then the average increase in agreement is measured.
Source weight/Target weight | Source = EMEA | Source = Medline |
---|---|---|
1/10 | 0% | 1% |
1/100 | 1% | 1% |
1/500 | 5% | 0% |
1/1000 | 5% | 0% |
1/5000 | 3% | 1% |
1/10000 | -1% | -1% |
1/50000 | -3% | -1% |
Results seem to have high variability, so they are rounded to whole percents.
Active learning strategy: pick the instance from the pool with the lowest predicted class probability, i.e. least-confidence sampling (using Naive Bayes).
The classifier was first trained on unambiguous annotations (the Medline or EMEA corpora). Then, 100 times:
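The selection rule can be sketched as a small helper; the data here is synthetic and the function name is illustrative, not from the original code:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(1)
X_train = rng.integers(0, 5, size=(100, 20))
y_train = rng.integers(0, 2, size=100)
X_pool = rng.integers(0, 5, size=(50, 20))

clf = MultinomialNB().fit(X_train, y_train)

def pick_least_confident(clf, X_pool):
    """Index of the pool instance whose top class probability is lowest."""
    proba = clf.predict_proba(X_pool)
    return int(np.argmin(proba.max(axis=1)))

i = pick_least_confident(clf, X_pool)
```

The chosen instance is the one the current model is least sure about, which is what makes it the most informative next label to request.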
- The MTurk annotation set was split into a train set and a validation set
- The classifier was trained on the train set, one example at a time (passively or actively), with a weight of 1000
- For every iteration accuracy on validation set was measured
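One run of the passive variant of this loop might look like the sketch below. It is a simplification under stated assumptions: synthetic data stands in for the MTurk set, and the classifier here starts from scratch rather than from a source-pretrained model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=(60, 20))
y = rng.integers(0, 2, size=60)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33,
                                            random_state=0)

clf = MultinomialNB()
clf.partial_fit(X_tr[:1], y_tr[:1], classes=np.array([0, 1]))  # bootstrap

# Feed train examples one at a time with a high weight, recording
# validation accuracy after each step (the learning curve).
accuracies = []
for i in range(1, len(y_tr)):
    clf.partial_fit(X_tr[i:i + 1], y_tr[i:i + 1],
                    sample_weight=np.array([1000.0]))
    accuracies.append(clf.score(X_val, y_val))
```

In the active variant, the next example would be chosen by least-confidence sampling from the remaining pool instead of taken in order.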
As the previous results show, accuracy for Medline is high enough that training on MTurk data alone only overfits the classifier.
As a result of re-evaluation, a new feature set was shown to be optimal for Medline. The active/passive learning graphs for it:
Weight 1000
Weight 100
Weight 10
For the purpose of learning the optimal balance between source and target datasets, we train the classifier on fractions of the source dataset. This is implemented not by taking the first N%
of the dataset, but by skipping datapoints, to avoid possible ordering bias. Here we use the WeightedPartialFitPassiveTransferClassifier2
classifier (with the features optimal for Medline) trained on Medline.
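The fraction-by-skipping idea can be sketched as a small helper. This is a hypothetical function, not the document's implementation: it keeps every k-th datapoint so the sample is spread evenly across the dataset rather than concentrated at the start.

```python
def take_fraction(items, fraction):
    """Return roughly `fraction` of `items`, spread evenly by stride.

    Taking items[:n] would bias the sample toward whatever ordering the
    dataset happens to have; striding avoids that.
    """
    step = max(1, round(1 / fraction))
    return items[::step]

sample = take_fraction(list(range(100)), 0.1)  # every 10th item
```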
We measure differences between accuracy before and after training on MTurk data over 100 random train/test splits of MTurk data.
Target Weight | Average gain |
---|---|
10 | -0.005 |
50 | -0.0069 |
100 | -0.01 |
500 | -0.0057 |
1000 | -0.01 |
Target Weight | Average gain |
---|---|
10 | 0.001 |
50 | -0.001 |
100 | 0.0039 |
500 | 0.0094 |
1000 | -0.0008 |
Target Weight | Average gain |
---|---|
10 | 0.006 |
50 | 0.0308 |
100 | 0.034 |
500 | 0.02 |
1000 | 0.018 |
Example learning curve for target weight 10
Although the quality increases, it does not reach the quality attained by training on the full Medline dataset.
Weight 100
The resulting quality is comparable to that of the full-dataset Medline classifier, given that we used only 1% of the dataset for training.
See here.