This is the official implementation of "Dual-decoder transformer network for answer grounding in visual question answering" (DDTN), which introduces a simple Transformer-based framework for answer grounding in visual question answering.
2022.8.6 We created this project for our paper. Thanks for your attention.
2023.4.16 Our paper "Dual-decoder transformer network for answer grounding in visual question answering" has been accepted by Pattern Recognition Letters. The code will be released shortly.
2023.5.21 The code is available.
conda create -n DDTN python=3.7
conda activate DDTN
git clone https://github.com/zlj63501/DDTN.git
cd DDTN
pip install -r requirements.txt
cd mmf
pip install --editable .
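A quick way to confirm the editable install succeeded is to import the core packages. This is a minimal sanity check, assuming PyTorch was pulled in via requirements.txt; adjust if your setup differs:

```python
# Minimal environment check (assumes PyTorch was installed via
# requirements.txt; adapt to your setup if needed).
import torch
import mmf  # should resolve to the editable install from the mmf/ directory

print("mmf imported from:", mmf.__file__)
print("CUDA available:", torch.cuda.is_available())
```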
- Download the VizWiz-VQA-Grounding images from VizWiz.
- Download the annotations and weights from Google Drive.
- Extract the image grid and region features according to the VinVL repository (a loading sketch follows the directory layout below).
The data structure should look like the following:
DDTN
|-- data
|   |-- annotations
|   |   |-- train.npy
|   |   |-- val.npy
|   |   |-- test.npy
|   |   |-- answers.txt
|   |   |-- vocabulary_100K.txt
|   |-- weights
|   |   |-- resnet_head.pth
|   |-- features
|       |-- train
|       |   |-- VizWiz_train_00000000.npz
|       |   |-- ...
|       |-- val
|       |   |-- VizWiz_val_00000001.npz
|       |   |-- ...
|       |-- test
|           |-- VizWiz_test_00000002.npz
|           |-- ...
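As a quick check that the files above are in place, the annotations and extracted features can be opened with NumPy. This is a hedged sketch; the array names inside each file depend on the released annotations and the VinVL extraction script, so inspect and adapt as needed:

```python
import numpy as np

# Load one annotation split and one feature file to verify the layout.
# The contents of these files are assumptions; inspect them and adapt
# to what the release actually contains.
annotations = np.load("data/annotations/train.npy", allow_pickle=True)
print("train annotations:", len(annotations))

features = np.load("data/features/train/VizWiz_train_00000000.npz")
print("arrays in feature file:", features.files)
for name in features.files:
    print(name, features[name].shape)
```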
We train DDTN to perform grounding and answering at the instance level on a single Titan X GPU with 12 GB of memory. The following script performs the training:
python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=train_val dataset=vizwiz model=ddtn
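If a smaller GPU is all that is available, MMF accepts dotted key=value overrides on the command line, so the batch size can be lowered without editing the config. The exact key below assumes the standard MMF training namespace; check projects/ddtn/configs/defaults.yaml for the value this project uses:

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=train_val dataset=vizwiz model=ddtn training.batch_size=16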
To evaluate a trained model on the validation set, resume from a saved checkpoint:

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=val dataset=vizwiz model=ddtn checkpoint.resume_file=save/models/xxx.ckpt
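Before resuming, it can be useful to peek inside a saved checkpoint to confirm it loads. This is a minimal sketch using plain PyTorch; MMF checkpoints are ordinary torch-serialized dicts, and the printed top-level keys are whatever the trainer saved:

```python
import torch

# Inspect a saved checkpoint before passing it to checkpoint.resume_file.
# Replace the path with your actual .ckpt file; the dict layout is an
# assumption based on standard PyTorch/MMF checkpoints.
ckpt = torch.load("save/models/xxx.ckpt", map_location="cpu")
print("top-level keys:", list(ckpt.keys()))
```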
If you find our work useful, please consider citing:

@article{ZHU202353,
  title   = {Dual-decoder transformer network for answer grounding in visual question answering},
  journal = {Pattern Recognition Letters},
  volume  = {171},
  pages   = {53--60},
  year    = {2023},
  issn    = {0167-8655},
  doi     = {10.1016/j.patrec.2023.04.003},
  url     = {https://www.sciencedirect.com/science/article/pii/S0167865523001046},
  author  = {Liangjun Zhu and Li Peng and Weinan Zhou and Jielong Yang}
}
Our code is built upon the open-source MMF, mmdetection, SeqTR, and VinVL libraries.