
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022


Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

This is the official implementation of Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Introduction

Our framework for visual grounding takes the features from the two modalities as input. The visual-linguistic verification module and the language-guided context encoder establish discriminative features for the referred object. The multi-stage cross-modal decoder then iteratively mulls over all the visual and linguistic features to identify and localize the object.

Visualization

For different input images and texts, we visualize the verification scores, the iterative attention maps of the multi-stage decoder, and the final localization results.

Model Zoo

The trained models are available on Google Drive.

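| Backbone | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u | ReferItGame test | Flickr30k test |
|----------|-------------|---------------|---------------|--------------|----------------|----------------|----------------|----------------|-----------------|------------------|----------------|
| R50      | 84.53       | 87.69         | 79.22         | 73.60        | 78.37          | 64.53          | 72.53          | 74.90          | 73.88           | 71.60            | 79.18          |
| R101     | 84.77       | 87.24         | 80.49         | 74.19        | 78.93          | 65.17          | 72.98          | 76.04          | 74.18           | 71.98            | 79.84          |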

Installation

  1. Clone the repository.

    git clone https://github.com/yangli18/VLTVG

  2. Install PyTorch 1.5+ and torchvision 0.6+.

    conda install -c pytorch pytorch torchvision

  3. Install the other dependencies.

    pip install -r requirements.txt
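
After installation, it can be useful to confirm that the installed versions satisfy the minimums stated above (PyTorch 1.5+, torchvision 0.6+). The following is a minimal sketch of such a check; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical and not part of this repository.

```python
import importlib
import re

# Minimum versions taken from the installation steps above.
MIN_VERSIONS = {"torch": "1.5", "torchvision": "0.6"}


def parse_version(version):
    """Return (major, minor) from a version string such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", str(version))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(installed, minimum):
    """True if the installed (major, minor) is at least the required one."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report, per package, whether it is installed and recent enough."""
    report = {}
    for name, minimum in MIN_VERSIONS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(module.__version__, minimum) else f"needs >= {minimum}"
        report[name] = f"{module.__version__} ({status})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Only the major and minor components are compared, so local build suffixes such as `+cu117` are ignored.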

Preparation

Please refer to get_started.md for instructions on preparing the datasets and pretrained checkpoints.

Training

The following is an example of model training on the RefCOCOg dataset.

    python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --config configs/VLTVG_R50_gref.py

We train the model on 4 GPUs with a total batch size of 64 for 90 epochs. The model and training hyper-parameters are defined in the configuration file VLTVG_R50_gref.py. Configuration files for the other datasets are provided in the configs/ folder.
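
The batch-size arithmetic behind this setup can be sketched as follows. The sketch assumes the total batch is split evenly across GPUs, as is usual for distributed data-parallel training. The helper `per_gpu_batch_size` is hypothetical and not part of this repository. Whether the batch size in the configuration file is a per-GPU or a total value is not stated here, so check the config before changing the number of GPUs.

```python
def per_gpu_batch_size(total_batch_size, num_gpus):
    """Split a total batch size evenly across GPUs (assumed even split)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // num_gpus


# The setup described above: total batch size 64 on 4 GPUs.
print(per_gpu_batch_size(64, 4))  # 16 samples per GPU

# Keeping the total at 64 on 2 GPUs would need 32 samples per GPU.
print(per_gpu_batch_size(64, 2))  # 32 samples per GPU
```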

Evaluation

Run the following command to evaluate the trained model on a single GPU:

    python test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val

Or evaluate the trained model with 4 GPUs:

    python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val
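
When evaluating several checkpoints or splits, the two commands above can be generated from one helper. The following is a minimal sketch that only assembles the command line. It uses only the flags shown above, and the function name `build_eval_command` is hypothetical. The only split value confirmed here is `val`; any other split name is an assumption to be checked against the repository.

```python
import shlex


def build_eval_command(config, checkpoint, test_split, batch_size_test=16, nproc_per_node=1):
    """Assemble the evaluation command as a list of arguments."""
    command = ["python"]
    if nproc_per_node > 1:
        # Multi-GPU evaluation goes through the distributed launcher.
        command += [
            "-m", "torch.distributed.launch",
            f"--nproc_per_node={nproc_per_node}",
            "--use_env",
        ]
    command += [
        "test.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--batch_size_test", str(batch_size_test),
        "--test_split", test_split,
    ]
    return command


if __name__ == "__main__":
    for gpus in (1, 4):
        cmd = build_eval_command(
            "configs/VLTVG_R50_gref.py", "VLTVG_R50_gref.pth", "val", nproc_per_node=gpus
        )
        print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to launch the evaluation from the repository root.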

Citation

If you find our code useful, please cite our paper.

@inproceedings{yang2022vgvl,
  title={Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning},
  author={Yang, Li and Xu, Yan and Yuan, Chunfeng and Liu, Wei and Li, Bing and Hu, Weiming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Acknowledgement

Part of our code is based on the previous works DETR and ReSC.
