We propose a generative evaluation of the VisDial dataset which computes established NLP metrics (CIDEr, METEOR, FastText-L2, FastText-CS, BERT-L2, BERT-CS) between a generated answer and a set of reference answers. We use a simple Canonical Correlation Analysis (CCA) [Hotelling, 1936; Kettenring, 1971] based approach to construct these answer reference sets at scale across the whole dataset.
See our paper and our previous analysis (code here), in which we highlight the flaws of the existing rank-based evaluation of VisDial by applying CCA between question and answer embeddings and achieving near state-of-the-art results.
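To give a flavour of the approach, here is a minimal sketch (not the repository's implementation) of fitting CCA between question and answer embeddings with scikit-learn; the embedding dimensionality and component count below are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
Q = rng.normal(size=(1000, 300))  # question embeddings (e.g. averaged FastText vectors)
A = rng.normal(size=(1000, 300))  # embeddings of their corresponding answers

cca = CCA(n_components=16)
cca.fit(Q, A)                   # learn maximally correlated projections of Q and A
Q_c, A_c = cca.transform(Q, A)  # project both views into the shared CCA space
```

Correlations computed in this shared space are then used by the clustering methods described below (e.g. `S`, which is based on the standard deviation of correlations) to group candidate answers into reference sets.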
All required packages are contained in `geneval_visdial.yml`. You can install them in an Anaconda environment as:

```
conda env create -f geneval_visdial.yml
source activate geneval_visdial
```

This repository uses PyTorch 1.0.1, Python 3.6 and CUDA 8.0. You can change these specifications in `geneval_visdial.yml`.
Your generations will need to be saved as a `.json` file in the following format:

```json
[{"image_id": 185565, "round_id": 0, "generations": ["yes", "maybe", "yes, I think so"]},
 {"image_id": 185565, "round_id": 1, "generations": ["several", "I can see two", "2, I think", "not sure"]},
 {"image_id": 185565, "round_id": 2, "generations": ["yes", "yes, it is"]},
 ...,
 {"image_id": 185565, "round_id": 9, "generations": ["no", "no, can't see"]},
 {"image_id": 284024, "round_id": 0, "generations": ["one"]},
 ... ]
```
`image_id` should correspond exactly to the VisDial v1.0 image IDs. `round_id` is 0-indexed and corresponds to the dialogue round (0 to 9). `generations` should contain a list of strings with no `<START>` or `<END>` tokens. Note that the number of generations can vary per image/round, but in our experiments we fix it to 1, 5, 10, or 15 generations per entry. In the case of a single generation, each entry should still be a list (i.e. `"generations": ["yes"]`).
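For reference, a minimal sketch of writing such a file with Python's standard `json` module (the entries below are illustrative, taken from the format above):

```python
import json

# One entry per (image_id, round_id) pair, each with a list of generated answers.
entries = [
    {"image_id": 185565, "round_id": 0, "generations": ["yes", "maybe", "yes, I think so"]},
    {"image_id": 284024, "round_id": 0, "generations": ["one"]},
]

with open("gens.json", "w") as f:
    json.dump(entries, f)
```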
See `gens.json` as an example. These answers have been generated for the full VisDial validation set using CCA-AQ-G (k=1); see Table 4 (right) in the paper. The code to generate these answers can be found in the CCA-visualdialogue repository.
Download `refs_S_full_val.json` and save it in the `densevisdial` directory. These are the answer reference sets for the entire VisDial validation set, automatically generated using the `S` (\Sigma) clustering method. This method yields the best overlap with the human-annotated reference sets, and we use it for all generative evaluation metrics reported in the paper.
Answer reference sets generated using the clustering methods `S`, `M` and `G`, as well as the human-annotated reference sets (`H`), can be downloaded for the VisDial train and validation sets here:
| C | Train | Val | Description |
|---|---|---|---|
| S | `refs_S_full_train.json`, `refs_S_human_train.json` | `refs_S_full_val.json`, `refs_S_human_val.json` | \Sigma clustering (based on standard deviation of correlations) |
| M | `refs_M_full_train.json`, `refs_M_human_train.json` | `refs_M_full_val.json`, `refs_M_human_val.json` | Meanshift clustering |
| G | `refs_G_full_train.json`, `refs_G_human_train.json` | `refs_G_full_val.json`, `refs_G_human_val.json` | Agglomerative clustering (n=5) |
| H | `refs_H_human_train.json` | `refs_H_human_val.json` | Human-annotated reference sets (relevance scores > 0) |
See How to generate answer reference sets below to build your own reference sets using one of the prescribed methods.
The evaluation script uses the `bert-as-a-service` client/server package. Download the pre-trained BERT-Base, Uncased model and save it in `<bert_model_dir>`.
Then start the `bert-as-a-service` server in a separate shell:

```
bert-serving-start -model_dir <bert_model_dir>/uncased_L-12_H-768_A-12 -num_worker 2 \
    -max_seq_len 16 -pooling_strategy CLS_TOKEN -pooling_layer -1
```
`-num_worker` controls the number of GPUs (or CPU cores, if the `-cpu` flag is added) to use.
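As a quick sanity check that the server is up (optional; the evaluation script talks to the server itself), you can encode a couple of answers with the client:

```python
from bert_serving.client import BertClient

# Connects to the bert-serving server started above (default: localhost).
bc = BertClient()
vecs = bc.encode(["yes", "no, I don't think so"])
print(vecs.shape)  # (2, 768) for BERT-Base
```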
The evaluation script also uses pre-trained FastText word vectors. Download and unzip the English `bin+text` FastText model pre-trained on Wikipedia, and save the `wiki.en.bin` file as `<fasttext_model_dir>/fasttext.wiki.en.bin`.
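To verify that the FastText model loads correctly (optional; the evaluation script loads it internally), a quick check with the `fasttext` Python bindings:

```python
import fasttext

# Load the pre-trained Wikipedia model saved above (path is a placeholder).
model = fasttext.load_model("<fasttext_model_dir>/fasttext.wiki.en.bin")

# Embed a candidate answer; FastText-L2 and FastText-CS compare such vectors.
vec = model.get_sentence_vector("yes, I think so")
print(vec.shape)  # (300,) for the wiki.en model
```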
The evaluation will compute CIDEr (n-grams 1 to 4), METEOR, BERT-L2 (L2 distance), BERT-CS (cosine similarity), FastText-L2 and FastText-CS between each generation and its corresponding set of reference answers.
```
python evaluate.py --generations gens.json --references densevisdial/refs_S_full_val.json \
    --fast_text_model <fasttext_model_dir>/fasttext.wiki.en.bin
```
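For intuition, here is a minimal sketch of how the embedding-based scores for a single generation can be computed against its reference set; this is not the repository's exact implementation, and the aggregation over references (best match) is an assumption:

```python
import numpy as np

def embedding_scores(gen_vec: np.ndarray, ref_vecs: np.ndarray):
    """Score one generation embedding against reference embeddings [num_refs, dim]."""
    # Cosine similarity to each reference (higher is better), keep the best match.
    cos = ref_vecs @ gen_vec / (
        np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(gen_vec) + 1e-12)
    # L2 distance to each reference (lower is better), keep the closest.
    l2 = np.linalg.norm(ref_vecs - gen_vec, axis=1)
    return cos.max(), l2.min()
```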
You can generate the answer reference sets yourself using the clustering methods `S`, `M`, and `G`.
Download and unzip the dialogue `.json` files from:
Download the `.json` files with human-annotated scores:
- `visdial_1.0_val_dense_annotations.json`
- `visdial_1.0_train_dense_sample.json` (you will need to rename this to `visdial_1.0_train_dense_annotations.json`)
Save all of these `.json` files to `<dataset_root>/1.0/`.
Download and unzip the FastText English `bin+text` model pre-trained on Wikipedia. Save the `wiki.en.bin` file as `<fasttext_model_dir>/fasttext.wiki.en.bin`.
The `compute_clusters.py` script will automatically load and pre-process the data. This may take 10-15 minutes. If you prefer, you can download the pre-processed features directly to `<dataset_root>/1.0/`:
- Train QAs: `train_processed_S16_D10_woTrue_whsTrue.zip`
- Validation QAs: `val_processed_S16_D10_woTrue_whsTrue.zip`
- Pre-processed vocabulary: `vocab_visdial_v1.0_train.pt`
- Pre-processed word vectors: `fasttext.wiki.en.bin_vocab_vecs.pt`
You can now generate clusters using:

```
source activate geneval_visdial
python clusters/compute_clusters.py --dataset_root <dataset_root> \
    --fast_text_model <fasttext_model_dir>/fasttext.wiki.en.bin \
    --gpu 1 \
    --cca QA_human_trainval \
    --eval_set full \
    --cluster_method S
```
This will compute clusters on the full VisDial dataset (both train and validation sets) using the `S` clustering method, and save the clusters in `./results` as `refs_S_full_train.json` and `refs_S_full_val.json`. If you want to compute clusters for only the subset of VisDial with human-annotated reference sets, use `--eval_set human`.
The `--cca` flag specifies the data used to train the CCA model:
- `QA_human_train` trains on all answers with human-annotated relevance scores > 0, and their corresponding questions, in the VisDial train set.
- `QA_human_trainval` trains on all answers with human-annotated relevance scores > 0, and their corresponding questions, in the VisDial trainval (train + validation) set.
- `QA_full_train` trains on all ground-truth answers and their corresponding questions in the VisDial train set.
- `QA_full_trainval` trains on all ground-truth answers and their corresponding questions in the VisDial trainval set.
We use these differently depending on the evaluation set.
For `--eval_set human`:
- Table 4 (left), Table 6: we use `--cca QA_human_train --cluster_method H` to compute the human-annotated reference sets. We report overlap and embedding metrics between generated answers and these sets on the validation subset, \mathcal{H}_v.
- Table 1, Table 8 ((A_gt, \tilde{A}) rows): we use `--cca QA_human_train` (CCA-QA*) and `--cca QA_full_train` (CCA-QA) to compute the overlap of the automatic reference sets (`--cluster_method {M,S,G}`) with the human-annotated reference sets (`H`).
For `--eval_set full`:
- Table 4 (right) and Table 7: we use `--cca QA_human_trainval --cluster_method S` to compute the automatic reference sets. We report overlap and embedding metrics between generated answers and these sets on the full validation set.
```
@article{massiceti2020revised,
  title={A Revised Generative Evaluation of Visual Dialogue},
  author={Massiceti, Daniela and Kulharia, Viveka and Dokania, Puneet K and Siddharth, N and Torr, Philip HS},
  journal={arXiv preprint arXiv:2004.09272},
  year={2020}
}
```