This repository contains a high-level implementation of ConCoRD, the system proposed in the EMNLP 2022 paper Enhancing Self-Consistency and Performance of Pretrained Language Models with NLI, as well as the steps to reproduce the results in the paper. See the project website for an overview of what ConCoRD does and how it works.
ConCoRD doesn't perform training or fine-tuning, so it doesn't use any training data. However, it does have hyperparameters, so we provide the small validation sets used for hyperparameter tuning in addition to the datasets used for evaluation in this Google Drive folder.
ConCoRD uses off-the-shelf NLI models to perform relation detection and ultimately enhance model self-consistency & accuracy. We use the following NLI models, all from the wonderful HuggingFace library:
The BeliefBank dataset contains:
- calibration_facts.json: dev set for hyperparameter tuning
- silver_facts.json: test set
- constraints_v2.json: golden constraints for evaluating consistency
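As a quick sanity check after downloading the data, the three files can be loaded and counted. This is a minimal sketch, not part of the repository: it assumes all three files sit in the same directory as calibration_facts.json (the directory name is taken from the preprocessing command in this README) and that each file holds a JSON object or array at the top level.

```python
import json
from pathlib import Path

# Directory taken from the preprocessing command in this README; that the
# other two files live alongside calibration_facts.json is an assumption.
DATA_DIR = Path("data/cbqa/beliefbank-data-sep2021")

# File names and roles as listed in this README.
BELIEFBANK_FILES = {
    "calibration_facts.json": "dev set for hyperparameter tuning",
    "silver_facts.json": "test set",
    "constraints_v2.json": "golden constraints for evaluating consistency",
}


def summarize(data_dir: Path = DATA_DIR) -> dict:
    """Map each file name to its number of top-level entries (None if missing)."""
    summary = {}
    for name in BELIEFBANK_FILES:
        path = data_dir / name
        if not path.exists():
            summary[name] = None
            continue
        with path.open() as f:
            # Assumes the top-level JSON value is an object or an array.
            summary[name] = len(json.load(f))
    return summary


if __name__ == "__main__":
    for name, count in summarize().items():
        print(f"{name} ({BELIEFBANK_FILES[name]}): {count}")
```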
QA models evaluated include:
Since ConCoRD does not modify the QA or NLI models, for efficiency we cache the inference results from the models on
BeliefBank data. The following sections walk through our full pipeline for generating results, but we have
also uploaded our cached inference results to the Drive folder if you would like to directly experiment with those instead.
All file paths are given relative to the top-level nli-consistency/ directory.
Preprocess calibration and silver facts by using pre-written templates to create question and answer pairs.
python cbqa/preprocess.py -f data/cbqa/beliefbank-data-sep2021/calibration_facts.json -o {output file path}
Repeat for silver facts.
Cached file paths: data/cbqa/calibration_facts_preprocessed_by_entity.json, data/cbqa/silver_facts_preprocessed_by_entity.json
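To illustrate what template-based preprocessing looks like, here is a minimal sketch. It is not the code in cbqa/preprocess.py: the templates, the `facts_to_qa` helper, and the assumed fact layout (`{entity: {"Relation,object": "yes" | "no"}}`) are all hypothetical and only show the general idea of turning a fact into a question and answer pair.

```python
# Hypothetical templates keyed by relation name; the templates actually used
# by cbqa/preprocess.py may differ.
TEMPLATES = {
    "IsA": "Is a {entity} a {obj}?",
    "HasA": "Does a {entity} have a {obj}?",
    "CapableOf": "Can a {entity} {obj}?",
}


def facts_to_qa(facts: dict) -> list:
    """Turn {entity: {"Relation,object": "yes"/"no"}} into QA pair records."""
    pairs = []
    for entity, relations in facts.items():
        for key, label in relations.items():
            relation, obj = key.split(",", 1)
            template = TEMPLATES.get(relation)
            if template is None:
                continue  # no template for this relation in the sketch
            pairs.append({
                "entity": entity,
                "question": template.format(entity=entity, obj=obj),
                "answer": label,
            })
    return pairs


if __name__ == "__main__":
    example = {"poodle": {"IsA,dog": "yes", "HasA,wing": "no"}}
    for pair in facts_to_qa(example):
        print(pair["question"], "->", pair["answer"])
```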
For each of Macaw large and Macaw 3B, generate a cache of QA results for both the preprocessed calibration facts and the preprocessed silver facts.
For example, for Macaw large and calibration facts:
python -m cbqa.qa_score_dataset -m allenai/macaw-large -f data/cbqa/calibration_facts_preprocessed_by_entity.json -o {output file path}
Cached QA results are under data/cbqa/qa-cache
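Since the same command is run for every model and fact file, the four invocations can be generated rather than typed by hand. The sketch below only builds and prints the command lines. The model id `allenai/macaw-3b` and the per-model output directory layout are assumptions, extrapolated from the Macaw large paths used elsewhere in this README.

```python
import itertools
import shlex

# The first id appears in this README; the second is an assumed model id.
QA_MODELS = ["allenai/macaw-large", "allenai/macaw-3b"]
SPLITS = ["calibration", "silver"]


def qa_commands(out_root: str = "data/cbqa/qa-cache") -> list:
    """Build one cbqa.qa_score_dataset command per (model, split) pair."""
    commands = []
    for model, split in itertools.product(QA_MODELS, SPLITS):
        model_dir = model.split("/")[-1]  # e.g. "macaw-large"
        commands.append([
            "python", "-m", "cbqa.qa_score_dataset",
            "-m", model,
            "-f", f"data/cbqa/{split}_facts_preprocessed_by_entity.json",
            # Output layout mirrors the cached Macaw large paths (assumption).
            "-o", f"{out_root}/{model_dir}/{split}-facts-qa-scored.json",
        ])
    return commands


if __name__ == "__main__":
    for cmd in qa_commands():
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be passed to `subprocess.run` directly.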
For each of the NLI models, run NLI inference between each question-answer pair.
For example, for RoBERTa large ANLI:
python -m cbqa.nli_score_dataset -m ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli -f data/cbqa/qa-cache/macaw-large/calibration-facts-qa-scored.json -o {output file path}
Cached NLI results are under data/cbqa/nli-cache
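For readers unfamiliar with NLI outputs, the sketch below shows how three raw class scores can be turned into a detected relation and a confidence. It is illustrative only: the label order is an assumption that should be checked against the model's own configuration, and the format actually stored by cbqa.nli_score_dataset may differ.

```python
import math

# Assumed label order; verify against the id2label mapping of the NLI model.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect_relation(logits: list) -> tuple:
    """Return (label, probability) for the highest-scoring NLI class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    label, confidence = detect_relation([0.1, 0.2, 3.0])
    print(label, round(confidence, 3))
```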
Use the cached QA and NLI results on the calibration facts to tune the hyperparameters of the MaxSAT solver with hyperopt. Each QA-NLI model combination, along with the QA-oracle combination, is evaluated. Results are stored in files under cbqa/tuned_hparams, where you can also find our original runs.
For example, to tune Macaw large with RoBERTa large ANLI:
python -m cbqa.main -m hparam -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/calibration-facts-qa-scored.json -nli ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli --nli_scores_cached_path data/cbqa/nli-cache/roberta-large-snli/nli-scored-calibration-facts.csv
Let's put it all together.
Evaluate each QA-NLI model combination using tuned hyperparameters on the silver facts.
For example, to evaluate Macaw large with RoBERTa large ANLI:
python -m cbqa.main -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/silver-facts-qa-scored.json -nli ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli --nli_scores_cached_path data/cbqa/nli-cache/roberta-large-snli/nli-scored-silver-facts.csv -v
The -v flag enables verbose output that allows you to see exactly which beliefs were flipped or untouched.
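To make "flipped" beliefs concrete, here is a toy brute-force version of the kind of weighted consistency problem a MaxSAT solver handles. It is a simplified sketch under stated assumptions, not the objective or encoding used in cbqa.main: the scoring rule, the `solve` and `flipped` helpers, and the way `beta` (trade-off between belief and relation weights) and `lam` (minimum relation confidence) enter are all illustrative choices.

```python
from itertools import product


def solve(beliefs, relations, beta=0.5, lam=0.5):
    """Pick truth values maximizing a toy belief-plus-relation score.

    beliefs:   list of (original_answer: bool, confidence: float)
    relations: list of (i, j, kind, confidence), kind is "entailment"
               (belief i implies belief j) or "contradiction" (not both true)
    """
    kept = [r for r in relations if r[3] >= lam]  # drop low-confidence relations
    best_score, best_assignment = float("-inf"), None
    for assignment in product([True, False], repeat=len(beliefs)):
        score = 0.0
        for value, (answer, conf) in zip(assignment, beliefs):
            # Keeping the QA answer earns conf; flipping it earns 1 - conf.
            score += (1 - beta) * (conf if value == answer else 1 - conf)
        for i, j, kind, conf in kept:
            if kind == "entailment":
                satisfied = (not assignment[i]) or assignment[j]
            else:
                satisfied = not (assignment[i] and assignment[j])
            if satisfied:
                score += beta * conf
        if score > best_score:
            best_score, best_assignment = score, assignment
    return list(best_assignment)


def flipped(beliefs, assignment):
    """Indices of beliefs whose final value differs from the QA answer."""
    return [i for i, (value, (answer, _)) in enumerate(zip(assignment, beliefs))
            if value != answer]


if __name__ == "__main__":
    # Belief 0: "a poodle is a dog" (yes, 0.9); belief 1: "a poodle is a mammal" (no, 0.55).
    beliefs = [(True, 0.9), (False, 0.55)]
    relations = [(0, 1, "entailment", 0.95)]
    result = solve(beliefs, relations)
    print(result, "flipped:", flipped(beliefs, result))
```

In this example the confident entailment outweighs the weakly held second belief, so the second answer is flipped while the first is left untouched. Brute force is only practical for a handful of beliefs, which is why a dedicated MaxSAT solver is used at scale.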
The oracle (golden constraints as NLI relations) can be run on Macaw large as follows:
python -m cbqa.main -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/silver-facts-qa-scored.json --oracle -v
By using the cached QA and NLI results we included under data/cbqa, you can reproduce the numbers we report in Tables 1, 6, and 9 in the paper.
Add either --ablation_keep_relation contradiction or --ablation_keep_relation entailment when running python -m cbqa.main (as shown above) for hyperparameter tuning and inference.
Our results on the best NLI model for CBQA (RoBERTa ANLI) are reported in Table 5.
Pass --disable_ec when running python -m cbqa.main (as shown above) for hyperparameter tuning and inference.
Our results on the best NLI model for CBQA (RoBERTa ANLI) are reported in Table 8.
This experiment evaluates ConCoRD on related questions from ConVQA about images from the Visual Genome.
- The questions for hyperparameter optimization (referred to as 'train') are sampled from the file L-ConVQA
- The questions for final evaluation (referred to as 'test') are sampled from the file L-ConVQA Test
QA models evaluated include:
- Learning Cross-Modality Encoder Representations from Transformers (LXMERT)
- Vision-and-Language Transformer (ViLT)
Parameters (e.g., paths to datasets, CPU/GPU) are set by variables within each notebook. Please make sure that all paths are set correctly for your environment in the sections marked as:
### INSTRUCTION FOR USERS : INDICATE APPROPRIATE PATH
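Before running a long notebook, it can help to verify the configured paths up front. The helper below is a hypothetical convenience, not something the notebooks contain, and the variable names in the example are placeholders for whatever each notebook defines.

```python
from pathlib import Path


def missing_paths(paths: dict) -> list:
    """Return the names of configured paths that do not exist on disk."""
    return [name for name, value in paths.items()
            if not Path(value).expanduser().exists()]


if __name__ == "__main__":
    # Placeholder names and values; substitute the variables set in the notebook.
    configured = {
        "IMAGE_DIR": "~/data/visual_genome/images",
        "QUESTION_FILE": "~/data/convqa/questions.json",
    }
    problems = missing_paths(configured)
    if problems:
        print("Please fix these paths before running:", ", ".join(problems))
    else:
        print("All configured paths exist.")
```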
For LXMERT, in lieu of the tokenizer provided by HuggingFace, we use the token-to-text mapping from the LXMERT GitHub repository.
The QA conversion model that we use has a checkpoint available as a cached model, and the cached data listed throughout this section are available on-line as well.
Use the notebook vg-data-selection.ipynb to sample images and questions from ConVQA for the 'train' set.
QA inference is then performed in the following notebooks:
lxmert-run-train-10000im-3pred-40token-1seed_predictions.ipynb
lxmert-test-3pred-40token-1seed_predictions.ipynb
vilt-run-train-10000im-3pred-40token-1seed_predictions.ipynb
vilt-test-3pred-40token-1seed_predictions.ipynb
Cached data from the data sampling and QA inference is available on Google Drive:
vg-data-2022-06-13-16:03:37-n=10000-seed=1.txt
lxmert-run-train-10000im-3pred-40token-1seed_predictions_nli.json
lxmert-test-3pred-40token-1seed_predictions_nli.json
vilt-run-train-10000im-3pred-40token-1seed_predictions_nli.json
vilt-test-3pred-40token-1seed_predictions_nli.json
Evaluate the train/test set with various NLI models
Within the first_run directory, evaluate using the ANLI model with:
- Two .ipynb notebooks for the train set:
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) lxmert SAVE_NLI.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vilt SAVE_NLI.ipynb
- Two .py files for the test set:
Within the second_run directory, evaluate using the MNLI and XXLARGE models with:
NLI_Save_10000_images_num_answers=2_not_redundant=True_repeated_comparisons=False_vqa_lxmert-models-mnli-xxlarge.py
- See its use in vqa_lxmert_models_run.sh
- See its use in
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa lxmert-test-mnli-EMERGENCY.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa lxmert-test-xxlarge-EMERGENCY.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa vilt-test-mnli-EMERGENCY.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa vilt-test-xxlarge-EMERGENCY.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa vilt-val-mnli-EMERGENCY.ipynb
20220525 NLI Save (10000 images; num_answers = 2; num_choices = 2; not_redundant = True; repeated_comparisons = False, group_count = num_choices) vqa vilt-val-xxlarge-EMERGENCY.ipynb
Cached data from NLI inference is available on Google Drive:
lxmert-run-train-10000im-3pred-40token-1seed_predictions_nli-xxlarge.json
lxmert-run-train-10000im-3pred-40token-1seed_predictions_nli-mnli.json
lxmert-run-train-10000im-3pred-40token-1seed_predictions_nli.json
lxmert-test-3pred-40token-1seed_predictions_nli-xxlarge.json
lxmert-test-3pred-40token-1seed_predictions_nli-mnli.json
lxmert-test-3pred-40token-1seed_predictions_nli.json
vilt-run-train-10000im-3pred-40token-1seed_predictions_nli-xxlarge.json
vilt-run-train-10000im-3pred-40token-1seed_predictions_nli-mnli.json
vilt-run-train-10000im-3pred-40token-1seed_predictions_nli.json
vilt-test-3pred-40token-1seed_predictions_nli-xxlarge.json
vilt-test-3pred-40token-1seed_predictions_nli-mnli.json
vilt-test-3pred-40token-1seed_predictions_nli.json
Tune the hyperparameters on the train set, searching for the optimal NLI model, use of entailment correction and λ and β values
The main file that optimizes for the hyperparameters: visual_tune_mod.py
- The use of the file and its flags is outlined in visual_tune_table_6.sh
- -f is for source of answers & nli outputs
- -o is the trial outputs
- -t is the number of trials
- -w indicates use of entailment correction
Here is an example of the use of visual_tune_mod.py:
python3 visual_tune_mod.py -f vilt-run-train-10000im-3pred-40token-1seed_predictions_nli-mnli.json -o vilt-table6-mnli-nwe.trials -t 100 > vilt-table6-mnli-nwe.log
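When tuning several QA and NLI combinations, the command above can be assembled programmatically. The following sketch builds the same invocation from the flags listed in this section and redirects standard output to a log file. The helper names are hypothetical, and treating -w as a switch that takes no value is an assumption based on the flag description.

```python
import shlex
import subprocess


def tune_command(nli_json, trials_out, n_trials=100, entailment_correction=False):
    """Build the visual_tune_mod.py command line as a list of arguments."""
    cmd = ["python3", "visual_tune_mod.py",
           "-f", nli_json,        # source of answers and NLI outputs
           "-o", trials_out,      # trial outputs
           "-t", str(n_trials)]   # number of trials
    if entailment_correction:
        cmd.append("-w")  # assumed to be a value-less switch
    return cmd


def run_tuning(cmd, log_path):
    """Run the command, writing its standard output to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=True)


if __name__ == "__main__":
    cmd = tune_command(
        "vilt-run-train-10000im-3pred-40token-1seed_predictions_nli-mnli.json",
        "vilt-table6-mnli-nwe.trials",
    )
    print(shlex.join(cmd))
    # run_tuning(cmd, "vilt-table6-mnli-nwe.log")  # uncomment to launch the search
```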
Optimal hyperparameters were manually noted and used for the next (final) step
Evaluate on the test set using the hyperparameters determined from step 3
The first main cell in the notebook 20221019 vqa solve only test with opt_with timeout counter_with ablation and perfect consistency.ipynb contains the function for the final evaluation on the test set based on given hyperparameters.
The subsequent four cells contain outputs for the main results in section 4.3
In the semantic_filtering directory:
mkdir hyperparam_search
mkdir eval_results
export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --mode=base --split={test, val} --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR
These should give you the baseline results reported in Section 4.3.
Our intermediate data files are stored under the name cache_{val, test}_0.8_4_t5-{small, large, 3b}.jsonl. Here, 0.8 and 4 correspond to the temperature and the number of responses we asked the QA model to generate, respectively.
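The naming pattern expands to six files, one per split and model size. The sketch below spells them out and shows a generic way to read a .jsonl file. The helper names are hypothetical, and nothing is assumed about the fields inside each record.

```python
import json
from itertools import product

SPLITS = ("val", "test")
MODEL_SIZES = ("small", "large", "3b")


def cache_name(split, size, temperature=0.8, num_responses=4):
    """File name following the cache_{split}_{temp}_{n}_t5-{size}.jsonl pattern."""
    return f"cache_{split}_{temperature}_{num_responses}_t5-{size}.jsonl"


def all_cache_names():
    """Every combination of split and model size."""
    return [cache_name(split, size) for split, size in product(SPLITS, MODEL_SIZES)]


def read_jsonl(path):
    """Load one JSON record per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    for name in all_cache_names():
        print(name)
```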
To obtain the oracle results (upper bound of our results), run the following:
export CACHE_ID=<path to the intermediate data file of choice>
export RESULT_FILE=<filename for your result file>
python3 eval_answers.py --cache_id=$CACHE_ID --result_file=$RESULT_FILE
export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --mode=gold --split={test, val} --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR
Add the flag entail_only for entailment-only results, and contradiction_only for contradiction-only results to the commands above.
export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR
The hyperparameter search might take 3 hours or longer depending on the amount of compute available. The results will be printed, or you can find the results stored in the hyperparam_search directory.
If ConCoRD is useful for your own research, you can cite our work with the following BibTeX entry:
@inproceedings{mitchell2022enhancing,
title={Enhancing Self-Consistency and Performance of
Pretrained Language Models with NLI},
author={Mitchell, Eric and Noh, Joseph J. and Li, Siyan and
Armstrong, William S. and Agarwal, Ananth and
Liu, Patrick and Finn, Chelsea and Manning, Christopher D.},
booktitle={Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing (EMNLP)},
url={https://ericmitchell.ai/concord.pdf},
year={2022},
publisher={Association for Computational Linguistics}
}