We used the following scripts to extract features from the collected data.
The extractors are primarily video recognition models and multimodal understanding models.
Our features fall into two categories:
- Segment features
- Frame features
We have used the following models to extract segment features:
- SlowFast
- 3DResNet
- Omnivore
- X3D
- Imagebind
- Omnivore
- Imagebind
- TSM
- Depth - Imagebind
- Audio - Imagebind
- Text - (Video --> Lavila --> Imagebind)
- You can download the data using the downloader script
- Place the data in folders and update the corresponding paths in the scripts
- You can download all the extracted features used to train all our models from here
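
To illustrate the difference between the two feature categories, here is a minimal sketch of how a video could be indexed for each. It is not taken from the extraction scripts: the function names `segment_windows` and `frame_indices`, and the `segment_len`, `stride`, and `step` parameters, are hypothetical and chosen only for illustration. A segment feature summarizes a window of consecutive frames, while a frame feature is computed from a single frame.

```python
from typing import List, Tuple


def segment_windows(num_frames: int, segment_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame-index windows, one per segment feature.

    Hypothetical helper: the window length and stride are assumptions,
    not the settings used by the extraction scripts.
    """
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    # Only full windows are kept; a trailing partial window is dropped.
    return [
        (start, start + segment_len)
        for start in range(0, num_frames - segment_len + 1, stride)
    ]


def frame_indices(num_frames: int, step: int) -> List[int]:
    """Return the indices of frames that each get their own frame feature.

    Hypothetical helper: the sampling step is an assumption.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, num_frames, step))


if __name__ == "__main__":
    # A 10-frame clip split into non-overlapping 4-frame segments.
    print(segment_windows(num_frames=10, segment_len=4, stride=4))
    # The same clip sampled every 5th frame.
    print(frame_indices(num_frames=10, step=5))
```

In this sketch, each window returned by `segment_windows` would be passed as a clip to a segment-level model, and each index returned by `frame_indices` would be passed as a single image to a frame-level model.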