Click on "Open in Colab" to open a sample notebook.
Click here to read the problem-statement.
Click here to read the final report.
Click here to read the final presentation.
We present our work on model extraction attacks against video classification models such as Video-Swin-T and MoViNet-A2-Base on video action-recognition tasks, carried out for Inter-IIT Tech Meet 10.0. We use several video student models, namely r3d, r(2+1)d and c3d, and also test classic techniques such as PRADA, MAZE and DFME.
Deep learning models are used in a wide range of applications today, and companies monetize them as services offered to end users over the web. In this setting, stealing the knowledge stored in a trained model is an attractive proposition for competitors. A ‘clone’ model can be trained on the victim model’s predictions to bring it close to the ‘victim’ model, and can then be used for monetary gain or to mount further attacks that improve the clone’s performance.

Our solution to model extraction attacks on video classification models is based on knowledge distillation, in which a student model learns by minimizing the difference between the teacher’s output logits and its own. In a model extraction attack, the clone model being trained plays the role of the student and the queried victim model plays the role of the teacher. However, model extraction cannot be treated directly as a distillation problem because (a) we do not have access to the teacher model's architecture, so backpropagation through it is not possible, and (b) we only have access to the output logits of the victim model.
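To make constraints (a) and (b) concrete, here is a minimal sketch of one extraction step in plain Python. It is not the repository's training code: `victim_api`, the tiny linear clone, the toy queries and the learning rate are all hypothetical stand-ins. The point it illustrates is that the KL-divergence gradient with respect to the clone's logits only needs the victim's outputs, so nothing is ever backpropagated through the victim.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q) for two strictly positive probability vectors
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def victim_api(x):
    # Hypothetical black box: the attacker only ever sees the returned scores.
    hidden_w = [[0.9, -0.4], [-0.3, 0.8], [0.1, 0.2]]
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in hidden_w])

def clone_logits(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def extraction_step(weights, x, lr=0.5):
    target = victim_api(x)                      # one query to the victim
    probs = softmax(clone_logits(weights, x))
    loss = kl(target, probs)
    # d KL / d logit_k = probs_k - target_k, and d logit_k / d w_kj = x_j,
    # so the update uses only the victim's outputs, never its internals.
    for k, row in enumerate(weights):
        for j in range(len(row)):
            row[j] -= lr * (probs[k] - target[k]) * x[j]
    return loss

weights = [[0.0, 0.0] for _ in range(3)]        # toy linear clone, 3 classes
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy query set
losses = []
for epoch in range(200):
    losses.append(sum(extraction_step(weights, x) for x in queries))
```

In the actual experiments the clone is a video network and the queries are video clips, but the flow of information is the same: query, compare output distributions, update only the clone.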
Generating videos by stacking affine-transformed images
Augmentations on the balanced 5% Kinetics dataset
Using other action datasets and models
HMDB51 | UCF101 |
---|---|
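As a rough illustration of the first idea listed above (generating videos by stacking affine-transformed images), the sketch below turns one still image into a short clip by applying a progressively larger rotation to each frame. The function names, the choice of rotation as the affine transform, the frame count and the angle range are all assumptions made for illustration; the repository's own script may use different transforms and a tensor library instead of nested lists.

```python
import math

def affine_warp(image, matrix):
    """Warp a 2D image (list of rows) with a 2x3 affine matrix that maps
    output pixel coordinates to source coordinates (nearest neighbour)."""
    h, w = len(image), len(image[0])
    (a, b, tx), (c, d, ty) = matrix
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = round(a * x + b * y + tx)
            sy = round(c * x + d * y + ty)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]   # pixels outside the source stay 0
    return out

def rotation_about_centre(angle_rad, h, w):
    """2x3 affine matrix for a rotation about the image centre."""
    cx, cy = (w - 1) / 2, (h - 1) / 2
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return ((cos_a, -sin_a, cx - cos_a * cx + sin_a * cy),
            (sin_a, cos_a, cy - sin_a * cx - cos_a * cy))

def image_to_video(image, num_frames=8, max_angle_deg=20.0):
    """Stack progressively rotated copies of one image into a frame list."""
    h, w = len(image), len(image[0])
    frames = []
    for t in range(num_frames):
        angle = math.radians(max_angle_deg * t / max(num_frames - 1, 1))
        frames.append(affine_warp(image, rotation_about_centre(angle, h, w)))
    return frames

# Toy 16x16 gradient image standing in for a real photograph.
image = [[(x + y) % 256 for x in range(16)] for y in range(16)]
video = image_to_video(image, num_frames=8)
```

The resulting `video` is a list of frames with shape (frames, height, width); a real pipeline would add a channel dimension and batch several such clips before querying the victim.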
- project_extraction
  - black_box
    - video_swin_blackbox
      - attack_video_swin.py
      - Video-Swin-Transformer
      - experiment_file_1
      - experiment_file_2
  - grey_box
    - video_swin_transformer_dependencies
    - ...
    - experiment_file_1
    - experiment_file_2
    - ...
  - eval_teacher.py
  - dependencies.sh
  - eval.only.py
  - dataset_folder_1
  - dataset_folder_2
  - ...
  - swin_weights
  - extracted_weights_1
  - extracted_weights_2
  - ...
- Clone the repository
git clone https://github.com/dsgiitr/BOSCH-MODEL-EXTRACTION-ATTACK-FOR-VIDEO-CLASSIFICATION project_extraction
- Set the environment
cd project_extraction
bash dependencies.sh
export PYTHONOPTIMIZE='1'
- Download and unzip the dataset in project_extraction
- Download the teacher-weights in project_extraction
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window877_kinetics400_1k.pth
- (Optional) Download the extracted weights
- To run black_box code
cd black_box
python3 code.py
- To run grey_box code
cd grey_box
python3 code.py
We have extracted 5% of balanced Kinetics-400 and Kinetics-600 training data and have uploaded it here.
Datasets such as kinetics training data (5%), kinetics validation data, ucf101 and hmdb51 can be downloaded from here.
Trained weights can be found here
- swin_to_r21_kd.py : Final model used for the submission. In addition to all the earlier improvements, this file computes the loss not only between the student model's outputs and the victim model's outputs but also against the true labels for examples that belong to the relevant Kinetics dataset. The loss against the true label (the "true_loss" of an example) is also given a higher weight to emphasize it.
  - Configuration :
    - Teacher : Swin-T
    - Student : r(2+1)d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Reduce on plateau
    - Dataset : UCF101, HMDB51, kinetics_5percent
- movinet_to_r21.py : Final model used for the submission. In addition to all the earlier improvements, this file computes the loss not only between the student model's outputs and the victim model's outputs but also against the true labels for examples that belong to the relevant Kinetics dataset. The loss against the true label (the "true_loss" of an example) is also given a higher weight to emphasize it.
  - Configuration :
    - Teacher : Movinet
    - Student : r(2+1)d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Reduce on plateau
    - Dataset : UCF101, HMDB51, kinetics_5percent
- eval_only.py : Used for validation against the k400/k600 validation dataset
  - Configuration
    - Dataset :
      - k400_validation (contains 10% class normalised kinetics 400 dataset)
      - k600_validation (contains 10% class normalised kinetics 600 dataset)
- swin_to_c3d.py : Uses c3d, a purely convolution-based architecture, as the student with vanilla settings
  - Configuration :
    - Teacher : Swin-T
    - Student : c3d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Reduce on plateau
    - FRAMES = 32
- swin_to_r21.py : Changed the student to R(2+1)D, a model pretrained on the IG65m dataset
  - Vanilla code for model extraction.
  - Configuration :
    - Teacher : Swin-T
    - Student : r(2+1)d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Reduce on plateau
- swin_to_r21_using_cosine.py : Changed the loss to cosine and included other datasets for better training
  - Configuration :
    - Teacher : Swin-T
    - Student : r(2+1)d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Cosine
- swin_to_r21_prada.py : Tried the techniques from the PRADA paper with vanilla settings
  - Configuration :
    - Teacher : Swin-T
    - Student : r(2+1)d
    - Loss : KLDiv
    - Optim : AdamW with LR: 0.00003 on linear layers and 0.000003 on core layers
    - Scheduler : Reduce on plateau
-
  - Used to validate the teacher. Evaluates the teacher on the 10% class-normalized Kinetics-400 dataset
  - Accepts both Swin-T and MoViNet
- attack_video_swin.py : Built upon MAZE. Adds an extra temporal dimension and uses a generator
  - Configuration :
    - Teacher : Swin-T
    - Student : r3d
    - Loss : KLDiv
    - Optim : Adam with LR: 0.001 for the model and 0.00001 for the generator
    - Scheduler : Cosine Annealing
- attack_video_movinet.py : Same as attack_video_swin.py, but with MoViNet as the teacher
  - Configuration :
    - Teacher : Movinet
    - Student : r3d
    - Loss : KLDiv
    - Optim : Adam with LR: 0.001 for the model and 0.00001 for the generator
    - Scheduler : Cosine Annealing
- stacked_images_with_swin.py : Applies affine transformations to images and stacks them to produce a video
  - Configuration :
    - Teacher : Swin
    - Student : r3d
    - Loss : KLDiv
    - Optim : Adam with LR: 0.001
    - Scheduler : Cosine Annealing
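The descriptions of swin_to_r21_kd.py and movinet_to_r21.py above mention combining the distillation loss with a more heavily weighted "true_loss" for examples that have Kinetics labels. A minimal sketch of that combination follows, in plain Python for a single example. The function names, the `true_weight` value of 2.0 and the use of cross-entropy for the true-label term are assumptions for illustration; the README does not state the actual weight or the exact form of the term.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def distillation_loss(student_logits, victim_logits):
    """KL(victim || student) between the two output distributions."""
    log_p = log_softmax(victim_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def true_loss(student_logits, label):
    """Cross-entropy of the student against the ground-truth class (assumed form)."""
    return -log_softmax(student_logits)[label]

def combined_loss(student_logits, victim_logits, label=None, true_weight=2.0):
    # Every example contributes the distillation term; only examples with a
    # known Kinetics label (label is not None) add the weighted true-label term.
    loss = distillation_loss(student_logits, victim_logits)
    if label is not None:
        loss += true_weight * true_loss(student_logits, label)
    return loss

student = [1.0, 0.2, -0.5]
victim = [2.0, 0.1, -1.0]
loss_unlabelled = combined_loss(student, victim)           # e.g. a UCF101/HMDB51 clip
loss_labelled = combined_loss(student, victim, label=0)    # e.g. a Kinetics clip
```

Passing `label=None` for clips from the auxiliary datasets keeps them in the distillation objective without assigning them a Kinetics class they do not have.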
The following experiments were performed with the Video Swin Transformer as the victim. The best-performing approaches were then extended to MoViNet as the victim, giving a common strategy for both models, as the problem statement asked.
The accuracies are calculated on the Kinetics400/600 validation dataset (true labels).
Technique | Top-5 Accuracy | Top-1 Accuracy |
---|---|---|
Augmented Kinetics with C3D | 27.5 | 8.4 |
Augmented Kinetics with R(2+1)D | 42.5 | 19.1 |
Concatenated dataset with R(2+1)D | 51.8 | 30.6 |
Combining PRADA approach with R(2+1)D | 34.2 | 12.67 |
Combining KD techniques | 54.8 | 31.4 |
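For reference, the Top-1 and Top-5 figures in tables like the one above are the percentage of validation clips whose true label is among the clone's k highest-scoring classes. The helper below is a small illustrative sketch of that metric; `top_k_accuracy` and the toy scores are hypothetical and not taken from the repository's evaluation script.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of examples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return 100.0 * hits / len(labels)

# Toy clone outputs for three clips over six classes.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.40, 0.20, 0.15, 0.10, 0.10, 0.05],
]
labels = [1, 2, 5]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
```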
The final results for Video Swin Transformer victim were obtained using augmentations, dataset concatenation and KD techniques. The final results for MoViNet-A2 Base were obtained using augmentations and dataset concatenation.
Victim | Clone | Top-5 Accuracy | Number of Queries |
---|---|---|---|
Video Swin Transformer | R(2+1)D | 54.8 | ~400K |
MoViNet-A2 Base | R(2+1)D | 50.4 | ~400K |
Technique | Top-5 Accuracy | Top-1 Accuracy |
---|---|---|
Random normal sampling with ResNet3D | 1.26 | 0.27 |
Training generator along with clone with ResNet3D | 2.69 | 0.41 |
Training conditional GAN independently with ResNet3D | 4.85 | 0.84 |
Stacking affine-transformed images with R(2+1)D | 1.22 | 0.30 |
The final experiment for the Video Swin Transformer victim used stacked affine-transformed images with R(2+1)D, but the results fell short of our expectations. Hence, we trained the first approach for longer for both victims.
Victim | Clone | Top-5 Accuracy | Number of Queries |
---|---|---|---|
Video Swin Transformer | R(2+1)D | 4.85 | ~1M |
MoViNet-A2 Base | R(2+1)D | 4.13 | ~1M |
- Apoorva Verma
- Harsh Kumar
- Himank Sehgal
- Kumar Devesh
- Pranjal Gulati
- Rohan Mallick
- Sahil Goyal
- Sarthak Gupta
@article{carreira2019short,
title={A short note on the kinetics-700 human action dataset},
author={Carreira, Joao and Noland, Eric and Hillier, Chloe and Zisserman, Andrew},
journal={arXiv preprint arXiv:1907.06987},
year={2019}
}
@article{liu2021video,
title={Video Swin Transformer},
author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
journal={arXiv preprint arXiv:2106.13230},
year={2021}
}
@article{kondratyuk2021movinets,
title={MoViNets: Mobile Video Networks for Efficient Video Recognition},
author={Kondratyuk, Dan and Yuan, Liangzhe and Li, Yandong and Zhang, Li and Brown, Matthew and Gong, Boqing},
journal={arXiv preprint arXiv:2103.11511},
year={2021}
}