PureT

Implementation of "End-to-End Transformer Based Model for Image Captioning" [PDF/AAAI] [PDF/arXiv] [AAAI 2022]

For the Chinese documentation, please refer to README_CN.md.

Figure: overall model architecture.

Requirements (Our Main Environment)

  • Python 3.7.4
  • PyTorch 1.5.1
  • TorchVision 0.6.0
  • coco-caption
  • numpy
  • tqdm

Preparation

1. coco-caption preparation

As described in the coco-caption README.md, you first need to download the Stanford CoreNLP 3.6.0 code and models used by SPICE. To do this, run:

cd coco_caption
bash get_stanford_models.sh

2. Data preparation

The files needed for training and evaluation are stored in the mscoco folder, which is organized as follows:

mscoco/
|--feature/
    |--coco2014/
       |--train2014/
       |--val2014/
       |--test2014/
       |--annotations/
|--misc/
|--sent/
|--txt/

where the mscoco/feature/coco2014 folder contains the raw images and annotation files of the MSCOCO 2014 dataset. You can download the other files from Google Drive or Baidu Netdisk (extraction code: hryh).

NOTE: To speed up training, you can also pre-extract image features of MSCOCO 2014 with Swin-Transformer (or another backbone) and save them as ***.npz files under mscoco/feature; refer to coco_dataset.py and data_loader.py for how features are read and prepared. In this case, you also need to make a small modification to pure_transformer.py (remove the backbone module); see the sketch below.
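
As a rough illustration (not the repository's exact code), saving per-image features as .npz files and reading them back in a Dataset might look like the following sketch. The file naming, the "features" array key, and the PrecomputedFeatureDataset class are assumptions for illustration only.

import os
import numpy as np
import torch
from torch.utils.data import Dataset

def save_features(image_id, features, out_dir='mscoco/feature'):
    # features: e.g. a (num_regions, dim) float32 array produced by your backbone
    np.savez_compressed(os.path.join(out_dir, f'{image_id}.npz'), features=features)

class PrecomputedFeatureDataset(Dataset):
    # Loads pre-extracted .npz features instead of raw images, so the backbone
    # can be removed from the captioning model.
    def __init__(self, image_ids, feat_dir='mscoco/feature'):
        self.image_ids = image_ids
        self.feat_dir = feat_dir

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        feats = np.load(os.path.join(self.feat_dir, f'{image_id}.npz'))['features']
        return image_id, torch.from_numpy(feats).float()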

Training

Note: our repository is mainly based on JDAI-CV/image-captioning, and we directly reused their config.yml files, so our configs contain many unused parameters (to be cleaned up later).

1. Training under XE loss

Download the pre-trained backbone model (Swin-Transformer) from Google Drive or Baidu Netdisk (extraction code: hryh) and save it in the root directory.

Before training, you may need to check and modify the parameters in the config.yml and train.sh files. Then run the script:

# for XE training
bash experiments_PureT/PureT_XE/train.sh
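
For reference, XE training maximizes the likelihood of the ground-truth captions with a word-level cross-entropy loss. A minimal, repo-agnostic sketch (the tensor shapes and padding index are assumptions, not the repository's exact settings):

import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    # logits: (batch, seq_len, vocab_size), targets: (batch, seq_len) token ids
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,  # padded positions do not contribute to the loss
    )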

2. Training using SCST (self-critical sequence training)

Copy the model pre-trained under XE loss into the experiments_PureT/PureT_SCST/snapshot/ folder and modify the config.yml and train.sh files. Then run the script:

# for SCST training
bash experiments_PureT/PureT_SCST/train.sh
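
For reference, SCST (Rennie et al., 2017) optimizes a sequence-level reward, typically CIDEr, using the greedily decoded caption as a baseline. A minimal, repo-agnostic sketch of the objective, where the rewards, log-probabilities, and mask are assumed to come from your own decoding code:

import torch

def scst_loss(sample_logprobs, sample_mask, sample_reward, greedy_reward):
    # sample_logprobs: (batch, seq_len) log-probs of the sampled tokens
    # sample_mask:     (batch, seq_len) 1 for real tokens, 0 for padding
    # sample_reward / greedy_reward: (batch,) e.g. CIDEr score per caption
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    loss = -(advantage * sample_logprobs * sample_mask).sum() / sample_mask.sum()
    return loss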

Evaluation

You can download the pre-trained model from Google Drive or Baidu Netdisk (extraction code: hryh).

CUDA_VISIBLE_DEVICES=0 python main_test.py --folder experiments_PureT/PureT_SCST/ --resume 27
BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
82.1    67.3    52.0    40.9    30.2    60.1     138.2  24.2
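
The scores above are computed with the bundled coco-caption toolkit. For reference, a minimal standalone evaluation with the original coco-caption API might look like the sketch below; the file paths and result-file name are placeholders, and the module layout of the bundled copy may differ slightly.

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth annotations and a result file of the form
# [{"image_id": ..., "caption": "..."}, ...]
coco = COCO('mscoco/feature/coco2014/annotations/captions_val2014.json')
coco_res = coco.loadRes('result.json')

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params['image_id'] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()
for metric, score in coco_eval.eval.items():
    print(metric, score)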

Reference

If you find this repo useful, please consider citing (no obligation at all):

@inproceedings{wangyiyu2022PureT,
  author       = {Yiyu Wang and
                  Jungang Xu and
                  Yingfei Sun},
  title        = {End-to-End Transformer Based Model for Image Captioning},
  booktitle    = {Proceedings of the AAAI Conference on Artificial Intelligence},
  pages        = {2585--2594},
  publisher    = {{AAAI} Press},
  year         = {2022},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/20160}, 
  doi          = {10.1609/aaai.v36i3.20160},
}

Acknowledgements

This repository is based on JDAI-CV/image-captioning, ruotianluo/self-critical.pytorch and microsoft/Swin-Transformer.
