MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.
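As a rough intuition for the two architectural changes named above, below is a simplified, single-head sketch of pooling attention with a residual pooling connection, assuming PyTorch and a 2D token grid. The class name `PoolingAttention` and all details are illustrative only; this is not the reference implementation, and the decomposed relative positional bias and the multiscale stage structure are omitted for brevity.

```python
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Single-head pooling attention over an H x W token grid (conceptual sketch)."""

    def __init__(self, dim, q_stride=2, kv_stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise convolutions pool (downsample) the token grids of Q and K/V.
        self.pool_q = nn.Conv2d(dim, dim, 3, stride=q_stride, padding=1, groups=dim)
        self.pool_kv = nn.Conv2d(dim, dim, 3, stride=kv_stride, padding=1, groups=dim)
        self.scale = dim ** -0.5

    @staticmethod
    def _pool(x, pool, hw):
        # (B, N, C) tokens -> pooled tokens and the new grid size.
        b, n, c = x.shape
        h, w = hw
        x = pool(x.transpose(1, 2).reshape(b, c, h, w))
        new_hw = (x.shape[-2], x.shape[-1])
        return x.flatten(2).transpose(1, 2), new_hw

    def forward(self, x, hw):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, q_hw = self._pool(q, self.pool_q, hw)   # pooled (downsampled) query
        k, _ = self._pool(k, self.pool_kv, hw)
        v, _ = self._pool(v, self.pool_kv, hw)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        out = out + q                              # residual pooling connection
        return self.proj(out), q_hw


# Toy usage: a 14x14 token grid with 96 channels is reduced to 7x7.
tokens = torch.randn(2, 14 * 14, 96)
block = PoolingAttention(96)
out, hw = block(tokens, (14, 14))
print(out.shape, hw)  # torch.Size([2, 49, 96]) (7, 7)
```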
- Models with * in `Inference results` are ported from the SlowFast repo and tested on our data, while models in `Training results` are trained in MMAction2 on our data.
- The values in the `reference` columns are copied from the paper, and the values in the `reference*` columns are results obtained with the SlowFast repo but trained on our data.
- The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format 'video_id, num_frames, label_index') and the label map are also available.
- The MaskFeat fine-tuning experiment is based on the pretrained model from MMSelfSup, and the corresponding reference result is based on the pretrained model from SlowFast.
- Due to the different versions of Kinetics-400, our training results differ from those in the paper.
- Due to the high training cost, we currently only provide MViT-small training results; we do not guarantee the training accuracy of the other config files and welcome you to contribute your reproduction results.
- We use `repeat augment` in the MViT training configs following SlowFast. Repeat augment applies the data augmentation pipeline multiple times to each sampled video, which improves the generalization of the model and relieves the I/O pressure of loading videos (see the sketch below this list). Note that the actual batch size is `num_repeats` times the `batch_size` set in `train_dataloader`.
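The following is a minimal, framework-agnostic sketch of the repeat-augment idea in plain PyTorch. `RepeatAugWrapper`, `augment`, and `flatten_repeats` are hypothetical names used for illustration only; this is not the actual MMAction2 implementation, it only shows why the effective batch size becomes `batch_size * num_repeats`.

```python
import torch
from torch.utils.data import DataLoader, Dataset


def augment(clip):
    # Stand-in for the real augmentation pipeline (random crop, flip, RandAugment, ...).
    return clip + 0.01 * torch.randn_like(clip)


class RepeatAugWrapper(Dataset):
    """Wraps a clip dataset and returns `num_repeats` augmented views per video."""

    def __init__(self, dataset, num_repeats=2):
        self.dataset = dataset
        self.num_repeats = num_repeats

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        clip = self.dataset[idx]  # the video is decoded only once
        return [augment(clip) for _ in range(self.num_repeats)]


def flatten_repeats(batch):
    # Collate: flatten the per-video repeats into one large training batch.
    return torch.stack([view for views in batch for view in views])


class FakeClips(Dataset):
    """Dummy dataset standing in for decoded video clips (C x T x H x W)."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.randn(3, 16, 224, 224)


loader = DataLoader(RepeatAugWrapper(FakeClips(), num_repeats=2),
                    batch_size=4, collate_fn=flatten_repeats)
print(next(iter(loader)).shape)  # torch.Size([8, 3, 16, 224, 224]): 4 videos x 2 repeats
```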
Kinetics-400 inference results:

frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
---|---|---|---|---|---|---|---|---|---|---|---|---|
16x4x1 | 224x224 | MViTv2-S* | From scratch | 81.1 | 94.7 | 81.0 | 94.6 | 5 clips x 1 crop | 64G | 34.5M | config | ckpt |
32x3x1 | 224x224 | MViTv2-B* | From scratch | 82.6 | 95.8 | 82.9 | 95.7 | 5 clips x 1 crop | 225G | 51.2M | config | ckpt |
40x3x1 | 312x312 | MViTv2-L* | From scratch | 85.4 | 96.2 | 86.1 | 97.0 | 5 clips x 3 crop | 2828G | 213M | config | ckpt |
Something-Something V2 inference results:

frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uniform 16 | 224x224 | MViTv2-S* | K400 | 68.1 | 91.0 | 68.2 | 91.4 | 1 clip x 3 crop | 64G | 34.4M | config | ckpt |
uniform 32 | 224x224 | MViTv2-B* | K400 | 70.8 | 92.7 | 70.5 | 92.7 | 1 clip x 3 crop | 225G | 51.1M | config | ckpt |
uniform 40 | 312x312 | MViTv2-L* | IN21K + K400 | 73.2 | 94.0 | 73.3 | 94.0 | 1 clip x 3 crop | 2828G | 213M | config | ckpt |
Kinetics-400 training results:

frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference* top1 acc | reference* top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16x4x1 | 224x224 | MViTv2-S | From scratch | 80.6 | 94.7 | 80.8 | 94.6 | 5 clips x 1 crop | 64G | 34.5M | config | ckpt | log |
16x4x1 | 224x224 | MViTv2-S | K400 MaskFeat | 81.8 | 95.2 | 81.5 | 94.9 | 10 clips x 1 crop | 71G | 36.4M | config | ckpt | log |
The corresponding result without repeat augment is as follows:
frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference* top1 acc | reference* top5 acc | testing protocol | FLOPs | params |
---|---|---|---|---|---|---|---|---|---|---|
16x4x1 | 224x224 | MViTv2-S | From scratch | 79.4 | 93.9 | 80.8 | 94.6 | 5 clips x 1 crop | 64G | 34.5M |
Something-Something V2 training results:

frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uniform 16 | 224x224 | MViTv2-S | K400 | 68.2 | 91.3 | 68.2 | 91.4 | 1 clip x 3 crop | 64G | 34.4M | config | ckpt | log |
For more details on data preparation, you can refer to the preparation guides for the corresponding datasets.
You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the MViT model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
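To inspect the dumped predictions, a minimal sketch is shown below; it assumes `result.pkl` is a standard pickled Python object (the exact structure of each entry depends on the codebase version, so print one entry to check).

```python
import pickle

# Load the predictions written by `--dump result.pkl`.
with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(len(results))  # typically one entry per test video
print(results[0])    # inspect the fields of the first prediction
```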
For more details, you can refer to the Test part in the Training and Test Tutorial.
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}