All the code for reproducing the baseline models is included in `ocnli/`. To run these experiments, we suggest creating a conda environment and setting it up as follows:
```bash
conda create -n ocnli python=3.6.7
conda activate ocnli
pip install -r requirements.txt
```
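To sanity-check the install (assuming `requirements.txt` pins the same TensorFlow 1.12 stack as the Docker image described next):

```bash
# Should import cleanly and print the pinned version (assumed 1.12.x)
python -c "import tensorflow as tf; print(tf.__version__)"
```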
Alternatively, Docker can be used to reproduce our environment (see `Dockerfile`; note that this image uses Python 3.5 given its reliance on `tensorflow/tensorflow:1.12.0-gpu-py3`). Most of our experiments were run using Docker through the Beaker collaboration tool.
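As a minimal sketch, the image can be built and run as follows (the `ocnli` tag is our arbitrary choice, and the GPU flag depends on your Docker/NVIDIA setup):

```bash
# Build the image from the repository root (tag name is arbitrary)
docker build -t ocnli .

# Start an interactive container; older nvidia-docker setups use --runtime=nvidia instead
docker run --gpus all -it ocnli
```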
We used some of the baselines from the original MNLI paper. A repurposed version of the original MNLI code is included in `ocnli/mnli_code`.
Below is how to train a model (the inline `##` comments are annotations; remove them before running):

```bash
python -m ocnli.mnli_code.train_snli \
  cbow \ ## type of model {cbow,bilstm,esim}
  baseline_model \ ## name of model
  --keep_rate "0.9" \
  --alpha "0.0" \
  --emb_train \
  --datapath data \ ## location of data
  --embed_file_name sgns.merge.char \ ## see link below
  --wdir /path/to/output/directory \ ## where to dump results
  --train_name ocnli/train.json \
  --dev_name ocnli/dev.json \
  --test_name ocnli/test.json \
  --override
```
where `cbow` can be replaced with `bilstm` or `esim` to switch between model types.
Chinese word/character embeddings (given above as `sgns.merge.char`) are used in place of the original GloVe embeddings and are available from here; our exact embeddings are hosted on Google.
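A hypothetical layout sketch, assuming the training script resolves `--embed_file_name` relative to `--datapath` (an assumption based on the flags above):

```bash
# Place the downloaded embedding file under the data directory
mkdir -p data
cp /path/to/downloaded/sgns.merge.char data/
```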
The code for the Transformer baselines is adapted from the CLUE repository. For example, training a BERT or RoBERTa model can be done in the following way (again, remove the inline `##` comments before running):
```bash
python -m ocnli.{bert,roberta_wwm_large_ext}.run_classifier \
  --task_name=cmnli \
  --do_train=true \
  --do_eval=true \
  --data_dir=/path/to/data \
  --vocab_file=/path/to/model/vocab \
  --bert_config_file=/path/to/model/bert_config.json \
  --init_checkpoint=/path/to/model/bert_model.ckpt \ ## weights, see below
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/path/to/output/directory \
  --keep_checkpoint_max=1 \
  --save_checkpoints_steps=2500
```
Results are reported with the following hyper-parameters: RoBERTa: lr=2e-5, batch_size=32, epochs=3.0; BERT: lr=2e-5, batch_size=32, epochs=3.0 (we generally found models to be stable across different settings). Note: results using this code might vary slightly from published numbers due to random initialization; see the Results section below.
Additional switches (not in the original CLUE code): `--partial_input` (to run the hypothesis-only baselines) and `--max_input` (for running the learning curve experiments); see the sketch below.
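For example, a hypothesis-only run might look like the following (a minimal sketch; the `=true` boolean format mirrors the other flags above but is an assumption, so check the script's flag definitions):

```bash
# Same invocation as the training command above, plus --partial_input
python -m ocnli.bert.run_classifier \
  --task_name=cmnli \
  --do_train=true \
  --do_eval=true \
  --data_dir=/path/to/data \
  --vocab_file=/path/to/model/vocab \
  --bert_config_file=/path/to/model/bert_config.json \
  --init_checkpoint=/path/to/model/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/path/to/output/directory \
  --partial_input=true ## hypothesis-only baseline (value format assumed)
```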
Currently used pre-trained weights (see additional information, models, and code on the CLUE GitHub):

| model | link |
|---|---|
| roberta_wwm_large_ext (weights) | original-link |
| bert (weights) | original-link |
Evaluation can then be done via:

```bash
python -m ocnli.{bert,roberta_wwm_large_ext}.run_classifier \
  --task_name=cmnli \
  --do_arbitrary="yes" \ ## switch indicating an arbitrary evaluation file
  --data_dir=/path/to/specific/eval/jsonl/file \ ## exact file to evaluate
  --vocab_file=/path/to/model/vocab \
  --bert_config_file=/path/to/model/config \
  --init_checkpoint=/path/to/model/ \
  --max_seq_length=128 \
  --output_dir=/path/to/output/directory \
  --model_dir /path/to/specific/checkpoint ## pointer to model + checkpoint
```
To ensure that everything is set up correctly, below is an example run on the OCNLI dev set that uses one RoBERTa checkpoint that can be downloaded from here:
```bash
python -m ocnli.roberta_wwm_large_ext.run_classifier \
  --task_name cmnli \
  --do_arbitrary yes \
  --data_dir path/to/ocnli/dev.json \ ## change here to evaluate different files
  --vocab_file /path/to/pretrained/roberta/above/vocab.txt \
  --bert_config_file /path/to/pretrained/roberta/above/bert_config.json \
  --max_seq_length 128 \
  --output_dir _runs/ex_roberta_run \
  --model_dir /path/to/checkpoint/above/model.ckpt-4728 \
  --eval_batch_size 1
```
This will generate a file `metrics.json` that should look as follows (where `evaluation_accuracy` is the resulting score):
```json
{
  "evaluation_accuracy": 0.7942373156547546,
  "evaluation_loss": 0.7230983376547546
}
```
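To pull the score out programmatically (assuming `metrics.json` lands in the `--output_dir` used in the example above):

```bash
# Print just the evaluation accuracy from the generated metrics file
python -c "import json; print(json.load(open('_runs/ex_roberta_run/metrics.json'))['evaluation_accuracy'])"
```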