Name: Each feature file is named [File ID].npy, where [File ID] corresponds to the image's file ID in Flickr30K Entities.
Proposal generation: We use Selective Search to generate proposals for each image in Flickr30K Entities, and Edge Box to generate proposals for each image in the Referit Game dataset. We keep the top 100 proposals for each image.
Feature extractor: We use a Faster-RCNN network pre-trained on PASCAL VOC 2012 for Flickr30K Entities and one pre-trained on ImageNet for Referit Game. Before extracting visual features, we fine-tune the two Faster-RCNN networks on their respective datasets. The visual features for each image in these two datasets are stored as a 100 x 4096 matrix. Each row is the visual feature (fc7 layer of Faster-RCNN) of one proposal bounding box.
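As a quick sanity check after downloading, one feature file can be loaded and its shape compared against the 100 x 4096 layout described above. This is a minimal sketch, not part of the released code: the helper names `feature_path` and `load_features`, and the assumption that all .npy files sit directly inside one directory, are illustrative only.

```python
import os

import numpy as np

NUM_PROPOSALS = 100  # top proposals kept per image, as described above
FEATURE_DIM = 4096   # dimensionality of the fc7 layer


def feature_path(feature_dir, file_id):
    """Return '<feature_dir>/<file_id>.npy' (flat directory layout is an assumption)."""
    return os.path.join(feature_dir, "{}.npy".format(file_id))


def load_features(feature_dir, file_id):
    """Load one image's feature matrix and check that it is 100 x 4096."""
    feats = np.load(feature_path(feature_dir, file_id))
    if feats.shape != (NUM_PROPOSALS, FEATURE_DIM):
        raise ValueError(
            "unexpected shape {} for file ID {}".format(feats.shape, file_id)
        )
    return feats
```

Row i of the returned matrix is then the fc7 feature of the i-th proposal for that image, so it can be paired with row i of the corresponding proposal file.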
Proposals generated by Selective Search: link (Google Drive, zip file of 27MB, 126MB after unzipping). Note: Proposals generated by Selective Search are in the form [ymin, xmin, ymax, xmax].
Fine-tuned visual features for Flickr30K Entities: link (Google Drive, zip file of 18GB, 98GB after unzipping)
Proposals generated by Edge Box: link (Google Drive, zip file of 19MB, 82MB after unzipping).
Fine-tuned visual features for Referit Game: link (Google Drive, zip file of 12GB, 62GB after unzipping)
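Because the Selective Search proposals are stored as [ymin, xmin, ymax, xmax], code that expects the more common [xmin, ymin, xmax, ymax] order needs to reorder the coordinates first. The sketch below shows one way to do that; the function names are hypothetical, and it makes no assumption about the format of the Edge Box proposals, which is not stated here.

```python
def yxyx_to_xyxy(box):
    """Reorder one box from [ymin, xmin, ymax, xmax] to [xmin, ymin, xmax, ymax]."""
    ymin, xmin, ymax, xmax = box
    return [xmin, ymin, xmax, ymax]


def convert_proposals(boxes):
    """Apply the reordering to every proposal of an image."""
    return [yxyx_to_xyxy(box) for box in boxes]
```

For example, `yxyx_to_xyxy([10, 20, 30, 40])` gives `[20, 10, 40, 30]`.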