This repository contains the code (in TensorFlow) and models for the image-to-text and text-to-image retrieval tasks of this CVPR 2017 paper:

Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim.
"Dual Attention Networks for Multimodal Reasoning and Matching."
In Proc. CVPR 2017
Thanks to instructions from the author (Hyeonseob Nam), I was able to reproduce the numbers reported in the paper on Flickr30k:
| Method | Image-to-Text R@1 | R@5 | R@10 | MR | Text-to-Image R@1 | R@5 | R@10 | MR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAN (paper) | 55.0 | 81.8 | 89.0 | 1 | 39.4 | 69.2 | 79.1 | 2 |
| This implementation | 54.4 | 82.4 | 89.9 | 1.0 | 39.8 | 71.4 | 80.9 | 2 |

(R@K: recall at K; MR: median rank.)
- Python 2.7; TensorFlow >= 1.4.0; tqdm and nltk (for preprocessing)
- Flickr30k Images and Text
- Dataset splits from here (the same split as used by m-RNN).
- Pretrained ResNet-152 model from Tensorpack
- Extract ResNet features
$ python resnet-extractor/extract.py flickr30k_images/ ImageNet-ResNet152.npz resnet-152 --batch_size 20 --resize 448 --depth 152
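To sanity-check the extraction step, here is a minimal sketch. It assumes extract.py writes one .npy feature file per image under resnet-152/ with the (14, 14, 2048) shape passed later as --feat_dim; the file naming and format are assumptions, not confirmed from the extractor's code:

```python
import os
import numpy as np

# Hypothetical spot check: assumes one .npy feature file per image in
# resnet-152/, each holding a (14, 14, 2048) ResNet-152 conv feature map
# (matching the --feat_dim 14,14,2048 flag used for training below).
feat_dir = "resnet-152/"
for fname in sorted(os.listdir(feat_dir))[:5]:  # inspect a few files
    feat = np.load(os.path.join(feat_dir, fname))
    assert feat.shape == (14, 14, 2048), "unexpected shape %s in %s" % (feat.shape, fname)
    print("%s %s %s" % (fname, feat.shape, feat.dtype))
```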
- Preprocess
$ python prepro_flickr30k.py splits/ results_20130124.token prepro --noword2vec --noimgfeat
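For reference, preprocessing for this kind of model boils down to tokenizing the captions and thresholding the vocabulary. A rough sketch of that idea using nltk (not prepro_flickr30k.py's actual code; the captions here are made up):

```python
from collections import Counter
import nltk  # may require a one-time nltk.download('punkt')

# Illustrative sketch only: tokenize each caption, then keep words that
# appear at least --word_count_thres times.
captions = [
    "A man in a red shirt is climbing a rock face .",
    "Two dogs are playing in the snow .",
]
counts = Counter()
for cap in captions:
    counts.update(nltk.word_tokenize(cap.lower()))

word_count_thres = 1  # same value as the --word_count_thres flag below
vocab = sorted(w for w, c in counts.items() if c >= word_count_thres)
print("vocab size: %d" % len(vocab))
```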
- Training
I use a slightly different training schedule from the paper: batch size 256, learning rate 0.1 and dropout keep probability 0.5 for the first 60 epochs, then learning rate 0.05 and keep probability 0.8 for the remaining epochs. I use Adadelta as the optimizer. Training takes up to 9GB of GPU memory and about 50 hours with SSDs.
(There are other options (--use_char, --concat, etc.) that I haven't tried with hard negative mining yet.)
$ python main.py prepro models dan --no_wordvec --word_emb_size 512 --num_hops 2 --word_count_thres 1 --sent_size_thres 200 --word_size_thres 20 --hidden_size 512 --keep_prob 0.5 --margin 100 --num_epochs 60 --save_period 1000 --batch_size 256 --clip_gradient_norm 0.1 --init_lr 0.1 --wd 0.0005 --featpath resnet-152/ --feat_dim 14,14,2048 --hn_num 32 --is_train
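For intuition on the --margin and --hn_num flags: matching models of this kind are typically trained with a bidirectional max-margin ranking loss, keeping only the hardest negatives in each batch. A minimal numpy sketch of that general technique (the actual TensorFlow loss in main.py may differ in details):

```python
import numpy as np

# Bidirectional max-margin ranking loss with hard negative mining (sketch).
def ranking_loss(sim, margin=100.0, hn_num=32):
    """sim: (B, B) similarity matrix; sim[i, i] is the matched pair's score."""
    B = sim.shape[0]
    pos = np.diag(sim)                    # scores of the matched pairs
    mask = np.eye(B, dtype=bool)

    # image-to-text: penalize captions scoring within `margin` of the match
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_i2t[mask] = 0.0
    # text-to-image: same, per caption
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    cost_t2i[mask] = 0.0

    # keep only the hn_num hardest negatives per image / per caption
    k = min(hn_num, B - 1)
    hard_i2t = -np.sort(-cost_i2t, axis=1)[:, :k]
    hard_t2i = -np.sort(-cost_t2i, axis=0)[:k, :]
    return hard_i2t.mean() + hard_t2i.mean()

sim = np.random.randn(256, 256) * 10  # e.g. one batch of size 256
print(ranking_loss(sim))
```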
- Testing with the model

You can download my model and put it in models/00/dan/best/ to run testing directly. Also put shared.p in models/00/dan/
$ python main.py prepro models dan --no_wordvec --word_emb_size 512 --num_hops 2 --word_count_thres 1 --sent_size_thres 200 --word_size_thres 20 --hidden_size 512 --keep_prob 0.5 --margin 100 --num_epochs 60 --save_period 1000 --batch_size 256 --clip_gradient_norm 0.1 --init_lr 0.1 --wd 0.0005 --featpath resnet-152/ --feat_dim 14,14,2048 --hn_num 32 --is_test --load_best
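The R@K and MR numbers in the table above come from ranking the gallery by similarity for each query. A small sketch of how such metrics can be computed from a score matrix, assuming for simplicity one matching caption per image (Flickr30k actually has five captions per image, so the real evaluation ranks a larger caption gallery):

```python
import numpy as np

# Retrieval metrics sketch: recall@K and median rank from a similarity matrix.
def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (N, N), sim[i, j] = score of query i vs. candidate j; i == j matches."""
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first per query
    # rank of the true match for each query (0 = retrieved first)
    ranks = np.where(order == np.arange(sim.shape[0])[:, None])[1]
    recalls = {k: float(np.mean(ranks < k)) for k in ks}
    med_rank = float(np.median(ranks)) + 1  # 1-indexed, as MR in the table
    return recalls, med_rank

sim = np.random.randn(1000, 1000)  # e.g. a 1000-image test split
print(recall_at_k(sim))
```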