This repository contains the dataset and code for the EMNLP 2020 paper HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data, the first large-scale multi-hop question answering dataset over heterogeneous data combining tables and text. The full dataset contains over 70K question-answer pairs based on 13,000 tables, and each table is on average linked to 44 passages; more details at https://hybridqa.github.io/.
The questions are annotated to require aggregating information from both the table and its hyperlinked text passages, which poses challenges to existing homogeneous text-based or KB-based models.

Requirements:
- huggingface transformers 2.6.0
- pytorch 1.4.0
- tensorboardX
- tqdm
Have fun interacting with the dataset: https://hybridqa.github.io/explore.html
The released data contains the following files:
train/dev/test.json: the original files, all annotated by humans.
train/dev.traced.json: files generated by trace_answer.py to locate the answer span in the given evidence.
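As a quick sanity check, the released files can be inspected with the standard json module. A minimal sketch (assuming the files live under released_data/, like the dev_reference.json used below, and that each file is a JSON list of example dicts):

```python
import json

# Load the human-annotated training split.
with open('released_data/train.json') as f:
    train = json.load(f)

print(len(train), 'training examples')
# Print the fields of one example without assuming a fixed schema.
print(sorted(train[0].keys()))
```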
First of all, download all the tables and passages into your current folder:
git clone https://github.com/wenhuchen/WikiTables-WithLinks
Then, you can either preprocess the data on your own,
python preprocessing.py
or use our preprocessed version from Amazon S3
wget https://hybridqa.s3-us-west-2.amazonaws.com/preprocessed_data.zip
unzip preprocessed_data.zip
Then download the trained models:
wget https://hybridqa.s3-us-west-2.amazonaws.com/models.zip
unzip models.zip
This will download and extract the model checkpoint folders stage1/, stage2/, and stage3/.
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --stage1_model stage1/2020_10_03_22_47_34/checkpoint-epoch2 --stage2_model stage2/2020_10_03_22_50_31/checkpoint-epoch2/ --do_lower_case --predict_file preprocessed_data/dev_inputs.json --do_eval --option stage12 --model_name_or_path bert-large-uncased
This command generates an intermediate result file, predictions.intermediate.json.
CUDA_VISIBLE_DEVICES=0 python train_stage3.py --model_name_or_path stage3/2020_10_03_22_51_12/checkpoint-epoch3/ --do_stage3 --do_lower_case --predict_file predictions.intermediate.json --per_gpu_train_batch_size 12 --max_seq_length 384 --doc_stride 128 --threads 8
This command generates the prediction file predictions.json.
python evaluate_script.py predictions.json released_data/dev_reference.json
The training command for stage1 using BERT-base-uncased is as follows:
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_train --train_file preprocessed_data/stage1_training_data.json --learning_rate 2e-6 --option stage1 --num_train_epochs 3.0
Or the training command for stage1 using BERT-large-uncased is as follows:
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --model_name_or_path bert-large-uncased --do_train --train_file preprocessed_data/stage1_training_data.json --learning_rate 2e-6 --option stage1 --num_train_epochs 3.0
The training command for stage2 is as follows:
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_train --train_file preprocessed_data/stage2_training_data.json --learning_rate 5e-6 --option stage2 --num_train_epochs 3.0
Or use BERT-base-cased/BERT-large-uncased as above.
The training command for stage3 is as follows:
CUDA_VISIBLE_DEVICES=0 python train_stage3.py --do_train --do_lower_case --train_file preprocessed_data/stage3_training_data.json --per_gpu_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 4.0 --max_seq_length 384 --doc_stride 128 --threads 8
Or use BERT-base-cased/BERT-large-uncased as above.
The model selection command for stage1 and stage2 is as follows:
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_eval --option stage1 --output_dir stage1/[OWN_PATH]/ --predict_file preprocessed_data/stage1_dev_data.json
The evaluation command for stage1 and stage2 is as follows (replace the stage1_model and stage2_model paths with your own):
CUDA_VISIBLE_DEVICES=0 python train_stage12.py --stage1_model stage1/[OWN_PATH] --stage2_model stage2/[OWN_PATH] --do_lower_case --predict_file preprocessed_data/dev_inputs.json --do_eval --option stage12
The output will be saved to predictions.intermediate.json, which contains the answers for all non-hyperlinked cells; for the hyperlinked cells, we need the MRC model in stage3 to extract the span.
The evaluation command for stage3 is as follows (replace model_name_or_path with your own):
CUDA_VISIBLE_DEVICES=0 python train_stage3.py --model_name_or_path stage3/[OWN_PATH] --do_stage3 --do_lower_case --predict_file predictions.intermediate.json --per_gpu_train_batch_size 12 --max_seq_length 384 --doc_stride 128 --threads 8
The output is finally saved to predictions.json, which can be used to calculate EM/F1 against the reference file:
python evaluate_script.py predictions.json released_data/dev_reference.json
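For reference, EM and F1 are the usual SQuAD-style answer metrics: exact string match after normalization, and token-level overlap F1. A minimal sketch of how they are typically computed (the actual evaluate_script.py may normalize answers differently):

```python
from collections import Counter

def normalize(text):
    # Lower-case and collapse whitespace; the real script may also strip punctuation/articles.
    return ' '.join(text.lower().split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```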
We host a CodaLab challenge, the HybridQA Competition; you should submit your results there to obtain your test scores. The submission file should be named "test_answers.json" and then zipped. The required format of the submission file is as follows:
[
  {
    "question_id": xxxxx,
    "pred": XXX
  },
  {
    "question_id": xxxxx,
    "pred": XXX
  }
]
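A minimal sketch for packaging a submission, assuming your predictions are already in a question_id-to-answer dict (my_predictions and submission.zip below are illustrative names):

```python
import json
import zipfile

# Assumed to map each question_id to its predicted answer string.
my_predictions = {"xxxxx": "XXX"}

# Convert to the required list-of-dicts format.
submission = [{"question_id": qid, "pred": ans} for qid, ans in my_predictions.items()]

# The competition expects a file named test_answers.json inside a zip archive.
with open('test_answers.json', 'w') as f:
    json.dump(submission, f)

with zipfile.ZipFile('submission.zip', 'w') as zf:
    zf.write('test_answers.json')
```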
The reported scores are EM and F1.
Model | Organization | Reference | Dev-EM | Dev-F1 | Test-EM | Test-F1 |
---|---|---|---|---|---|---|
S3HQA | CASIA | Lei et al. (2023) | 68.4 | 75.3 | 67.9 | 75.5 |
MAFiD | JBNU & NAVER | Lee et al. (2023) | 66.2 | 74.1 | 65.4 | 73.6 |
TACR | TIT & NTU | Wu et al. (2023) | 64.5 | 71.6 | 66.2 | 70.2 |
UL-20B | Google | Tay et al. (2022) | - | - | 61.0 | - |
MITQA | IBM & IIT | Kumar et al. (2021) | 65.5 | 72.7 | 64.3 | 71.9 |
RHGN | SEU | Yang et al. (2022) | 62.8 | 70.4 | 60.6 | 68.1 |
POINTR + MATE | Google | Eisenschlos et al. (2021) | 63.3 | 70.8 | 62.7 | 70.0 |
POINTR + TAPAS | Google | Eisenschlos et al. (2021) | 63.4 | 71.0 | 62.8 | 70.2 |
MuGER2 | JD AI | Wang et al. (2022) | 57.1 | 67.3 | 56.3 | 66.2 |
DocHopper | CMU | Sun et al. (2021) | 47.7 | 55.0 | 46.3 | 53.3 |
HYBRIDER | UCSB | Chen et al. (2020) | 43.5 | 50.6 | 42.2 | 49.9 |
HYBRIDER-Large | UCSB | Chen et al. (2020) | 44.0 | 50.7 | 43.8 | 50.6 |
Unsupervised-QG | NUS & UCSB | Pan et al. (2020) | 25.7 | 30.5 | - | - |
If you find this project useful, please use the following format to cite the paper:
@article{chen2020hybridqa,
title={HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data},
author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
journal={Findings of EMNLP 2020},
year={2020}
}
If you have any questions about the dataset or code, feel free to open a GitHub issue or send me an email. Thanks!