This repo provides code and pretrained models for the paper "A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval", which has been accepted for presentation at the 30th ACM International Conference on Multimedia (ACM MM).
Requirements: python 3, allennlp 2.8.0, h5py 3.6.0, pandas 1.3.5, spacy 2.3.5, torch 1.7.0 (also tested with 1.8), numpy 1.23.1, tensorboard, tqdm
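As a quick sanity check of the environment, the pinned versions above can be compared with what is installed. The following is only a sketch: the helper names `installed_versions` and `find_mismatches` are hypothetical and not part of this repo, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata  # standard library, Python 3.8+

# Pinned versions copied from the requirements line above.
PINNED = {
    "allennlp": "2.8.0",
    "h5py": "3.6.0",
    "pandas": "1.3.5",
    "spacy": "2.3.5",
    "torch": "1.7.0",
    "numpy": "1.23.1",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pinned, installed):
    """Return {package: (expected, found)} for every package whose version differs."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    report = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (expected, found) in report.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is a strict string equality, so it will also flag torch 1.8 (which is listed as tested) and builds with a local suffix such as a CUDA tag; treat the report as a hint rather than a hard failure.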
# clone the repository
cd FSMMDA_VideoRetrieval
export PYTHONPATH=$(pwd):${PYTHONPATH}
- Features:
  - TBN EPIC-Kitchens-100 features from JPoSE's repo.
  - S3D YouCook2 features from the VALUE benchmark.
- Additional:
  - pre-extracted annotations for EPIC-Kitchens-100 and YouCook2
  - split folders for EPIC-Kitchens-100 and YouCook2
  - GloVe checkpoints for EPIC-Kitchens-100 and YouCook2
To launch a training run, first select a configuration file (e.g. prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py) and execute the following:
python t2vretrieval/driver/configs/prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py .
This will return a folder name (where config, models, logs, etc. will be saved). Let that folder be $resdir. Then, execute the following to start the training:
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --is_train --load_video_first --resume_file glove_checkpoint_path
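When scripting several runs, the two steps can be chained from Python. The sketch below only assembles the training command shown above; `parse_resdir` and `build_train_command` are hypothetical helpers, and taking the last non-empty line of the prepare script's output as the folder name is an assumption about how that script reports it.

```python
import shlex


def parse_resdir(prepare_stdout):
    """Pick the results folder from the prepare script's output.

    Assumption: the folder name is the last non-empty line that was printed.
    """
    lines = [line.strip() for line in prepare_stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("the prepare script printed nothing")
    return lines[-1]


def build_train_command(resdir, glove_checkpoint_path):
    """Assemble the training command with the flags used in this README."""
    return [
        "python", "t2vretrieval/driver/multilevel_match.py",
        f"{resdir}/model.json", f"{resdir}/path.json",
        "--is_train", "--load_video_first",
        "--resume_file", glove_checkpoint_path,
    ]


if __name__ == "__main__":
    # Both values below are placeholders, not real paths.
    cmd = build_train_command("path/to/resdir", "glove_checkpoint_path")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.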
Config files are used to define details of the model and of the paths containing the annotations, features, etc. By running a "prepare_*" script, a folder containing two .json files is created.
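To see what a "prepare_*" script actually wrote, the two files can be loaded and their top-level keys listed. This is a minimal sketch: it assumes only the file names model.json and path.json used in the commands of this README, makes no assumption about the keys inside them, and `load_run_configs` and `summarize` are hypothetical names.

```python
import json
from pathlib import Path


def load_run_configs(resdir):
    """Load model.json and path.json from a results folder."""
    resdir = Path(resdir)
    with open(resdir / "model.json") as f:
        model_cfg = json.load(f)
    with open(resdir / "path.json") as f:
        path_cfg = json.load(f)
    return model_cfg, path_cfg


def summarize(cfg):
    """Return the sorted top-level keys of a config, or [] if it is not a JSON object."""
    return sorted(cfg) if isinstance(cfg, dict) else []
```

For example, `summarize(load_run_configs(resdir)[1])` lists the entries of path.json, which is a quick way to check that annotation and feature paths point where expected before starting a long run.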
- HGR baseline: prepare_mlmatch_configs_EK100_TBN_baseline.py
- Coarse-grained video selection with variable lambda (λ~β(1, 1)): prepare_mlmatch_configs_EK100_TBN_augmented_Vid_coarse
- Fine-grained video selection with fixed lambda (λ=0.5): prepare_mlmatch_configs_EK100_TBN_augmented_fixLambda.py
- Video augmentation by noise addition (Dong et al.): prepare_mlmatch_configs_EK100_TBN_augmented_thrPos_VidNoise.py
- Text augmentation by synonym replacement: prepare_mlmatch_configs_EK100_TBN_augmented_Txt.py
- Video augmentation by the proposed feature-space technique: prepare_mlmatch_configs_EK100_TBN_augmented_Vid.py
- Text augmentation by the proposed feature-space technique: prepare_mlmatch_configs_EK100_TBN_augmented_TxtLate.py
- Augmentation by the proposed feature-space multi-modal technique: prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py
- Cooperation of the proposed FSMMDA with RAN: prepare_mlmatch_configs_EK100_TBN_augmented_thrPos_VidTxtLate.py
- Cooperation of the proposed FSMMDA with RANP: prepare_mlmatch_configs_EK100_TBN_augmented_thrPos_HP_VidTxtLate.py
- HGR baseline on YouCook2: prepare_mlmatch_configs_YC2-S3D.py
- Augmentation by the proposed feature-space multi-modal technique on YouCook2: prepare_mlmatch_configs_YC2_augVidTxt-S3D.py
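The "variable lambda" and "fixed lambda" configurations differ in how the mixing coefficient λ is chosen. The sketch below illustrates the difference under the assumption that two feature vectors are combined as a mixup-style convex combination, λ·a + (1−λ)·b. It is not the code used in this repo, `sample_lambda` and `mix_features` are hypothetical names, and plain Python lists stand in for feature tensors.

```python
import random


def sample_lambda(fixed=None, alpha=1.0, beta=1.0, rng=random):
    """Return the mixing coefficient.

    fixed=0.5 mimics the "fixed lambda" setting; fixed=None draws
    lambda from Beta(alpha, beta), i.e. the "variable lambda" setting.
    """
    if fixed is not None:
        return fixed
    return rng.betavariate(alpha, beta)


def mix_features(feat_a, feat_b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]


if __name__ == "__main__":
    a, b = [0.0, 2.0], [4.0, 6.0]
    print(mix_features(a, b, sample_lambda(fixed=0.5)))  # fixed lambda
    print(mix_features(a, b, sample_lambda()))           # lambda ~ Beta(1, 1)
```

Note that Beta(1, 1) is the uniform distribution on [0, 1], so the variable setting covers everything from "almost only a" to "almost only b", whereas λ=0.5 always weights the two samples equally.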
To automatically check for the best checkpoint (after a training run):
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst
To resume one of the checkpoints provided:
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst --resume_file checkpoint.th
For instance, by unzipping the archive for the augmented HGR on EPIC-Kitchens-100, the following folder is obtained:
results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/
Therefore, the evaluation can be done by running the following:
python t2vretrieval/driver/multilevel_match.py \
results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/model.json \
results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/path.json \
--eval_set tst \
--resume_file results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/model/epoch.42.th
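Since the results folder name is long and repeated three times, it can be convenient to build the evaluation command programmatically. This sketch assumes the layout shown above (model.json, path.json and checkpoints named model/epoch.<N>.th inside the results folder); `list_checkpoints` and `build_eval_command` are hypothetical helpers, not part of the repo.

```python
import re
from pathlib import Path


def list_checkpoints(resdir):
    """Return [(epoch, path)] for files named epoch.<N>.th in resdir/model, sorted by epoch."""
    pattern = re.compile(r"epoch\.(\d+)\.th$")
    found = []
    for path in Path(resdir, "model").glob("epoch.*.th"):
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda item: item[0])


def build_eval_command(resdir, checkpoint, eval_set="tst"):
    """Assemble the evaluation command with the flags used in this README."""
    resdir = Path(resdir)
    return [
        "python", "t2vretrieval/driver/multilevel_match.py",
        str(resdir / "model.json"), str(resdir / "path.json"),
        "--eval_set", eval_set,
        "--resume_file", str(checkpoint),
    ]
```

For instance, `build_eval_command(resdir, resdir + "/model/epoch.42.th")` reproduces the command above, and `list_checkpoints(resdir)` shows which epochs are available when a different checkpoint should be evaluated.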
On EPIC-Kitchens-100:
- Baseline model (HGR): (35.9 nDCG, 39.5 mAP)
- Augmented HGR with the proposed FSMMDA: thr=0.15 (59.3 nDCG, 47.1 mAP)
On YouCook2:
- Baseline model: (49.9 nDCG, 44.6 mAP)
- With the proposed FSMMDA: (51.0 nDCG, 44.7 mAP)
We thank the authors of Chen et al. (CVPR, 2020) (github), Wray et al. (ICCV, 2019) (github), Wray et al. (CVPR, 2021) (github), Falcon et al. (ICIAP, 2022) (github) for the release of their codebases.
If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:
@article{falcon2022fsmmda,
title={A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval},
author={Falcon, Alex and Serra, Giuseppe and Lanz, Oswald},
journal={ACM MM},
year={2022}
}
MIT License