This repository is the official implementation of Dispider.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang
Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
CUHK, Shanghai AI Lab
- [2025/1/6] 🔥🔥🔥 We released the paper on arXiv!
- Release Inference Code
- Release Checkpoints
- Release Training Code
- Release Demo Video
Dispider enables real-time interactions with streaming videos, unlike traditional offline video LLMs that process the entire video before responding. It provides continuous, timely feedback in live scenarios.
Dispider separates perception, decision-making, and reaction into asynchronous modules that operate in parallel. This ensures continuous video processing and response generation without blocking, enabling timely interactions.
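For intuition only, here is a minimal, hypothetical sketch of this disentangled design using Python's asyncio: three coroutines run in parallel and exchange work through queues, so generating an answer never blocks the ongoing perception of the stream. All names and the trigger logic are illustrative and do not come from the Dispider codebase.

```python
# Conceptual sketch of disentangled perception / decision / reaction.
# Illustrative only; names and logic are NOT from the Dispider codebase.
import asyncio


async def fake_stream(num_frames=6, delay=0.1):
    """Simulate frames arriving in real time."""
    for i in range(num_frames):
        await asyncio.sleep(delay)
        yield f"frame_{i}"


async def perception(frames, clip_queue):
    """Continuously encode incoming frames into clip features."""
    async for frame in frames:
        clip_queue.put_nowait(f"features({frame})")   # placeholder encoder
    clip_queue.put_nowait(None)                       # end-of-stream marker


async def decision(clip_queue, response_queue):
    """Decide, clip by clip, whether a response is needed right now."""
    while True:
        clip = await clip_queue.get()
        if clip is None:
            response_queue.put_nowait(None)
            break
        if "frame_3" in clip:                         # toy trigger condition
            response_queue.put_nowait(clip)


async def reaction(response_queue):
    """Generate answers without ever blocking perception."""
    while True:
        clip = await response_queue.get()
        if clip is None:
            break
        print(f"[reaction] responding to {clip}")


async def main():
    clip_q, resp_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        perception(fake_stream(), clip_q),
        decision(clip_q, resp_q),
        reaction(resp_q),
    )


if __name__ == "__main__":
    asyncio.run(main())
```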
Dispider outperforms VideoLLM-online on StreamingBench and surpasses offline Video LLMs on benchmarks like EgoSchema, VideoMME, MLVU, and ETBench. It excels in temporal reasoning and handles diverse video lengths effectively.
Follow the steps below to set up the Dispider environment. We recommend using the specified version of each library to ensure reproducibility and optimal performance.
First, create a new Conda environment with Python 3.10 and activate it:
conda create -n dispider python=3.10 -y
conda activate dispider
Ensure that `pip` is up to date to avoid any installation issues:
pip install --upgrade pip
Ensure that CUDA 11.8 is installed on your system. You can download it from the official NVIDIA website. Follow the installation instructions provided there.
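You can confirm the installed toolkit version with:
nvcc --version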
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
pip install flash-attn==2.5.9.post1 transformers==4.41.2 deepspeed==0.9.5 accelerate==0.27.2 pydantic==1.10.13 timm==0.6.13
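After installation, an optional sanity check (assuming the versions pinned above) can confirm that PyTorch sees your GPU and that the key dependencies import cleanly:

```python
# Optional sanity check for the environment set up above.
import torch
import transformers
import flash_attn

print("torch:", torch.__version__)                 # expect 2.2.0
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # expect 4.41.2
print("flash-attn:", flash_attn.__version__)       # expect 2.5.9.post1
```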
First, download the checkpoints and place them in the corresponding folder.
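If the checkpoints are distributed through the Hugging Face Hub (an assumption; the repository id below is a placeholder, not an official link), `huggingface_hub` can fetch them in one call:

```python
# Hypothetical example: download checkpoints with huggingface_hub.
# "ORG/Dispider" is a placeholder repo id -- replace it with the
# actual checkpoint location referenced in this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ORG/Dispider")
print("Checkpoints downloaded to:", local_dir)
```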
To perform single-turn inference, execute the following script:
python inference.py --model_path YOUR_MODEL_PATH --video_path YOUR_VIDEO_PATH --prompt YOUR_PROMPT
By default, the prompt is inserted at the beginning of the streaming video, and the response is generated in a single turn.
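For example (the model path, video path, and prompt below are placeholders):
python inference.py --model_path checkpoints/dispider --video_path examples/demo.mp4 --prompt "Describe what is happening in the video right now."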
Update the `video_path` in `data/videomme_template.json` and adjust the corresponding argument in `videomme.sh`. Then execute the following command, which will utilize 8 GPUs to run the inference in parallel:
bash scripts/eval/videomme.sh
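If you prefer to rewrite the template programmatically, a small helper like the one below can update the paths in `data/videomme_template.json`. This is a sketch that assumes the template is a JSON list whose entries carry a `video_path` field; adjust it if the actual file layout differs.

```python
# Sketch: point the video_path entries in data/videomme_template.json
# at a local video directory. Assumes a JSON list of dicts with a
# "video_path" key; the video root below is a placeholder.
import json
from pathlib import Path

template = Path("data/videomme_template.json")
video_root = Path("/path/to/your/videomme/videos")

entries = json.loads(template.read_text())
for entry in entries:
    entry["video_path"] = str(video_root / Path(entry["video_path"]).name)

template.write_text(json.dumps(entries, indent=2))
print(f"Updated {len(entries)} entries in {template}")
```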
Shuangrui Ding: mark12ding@gmail.com
The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file.
This codebase is built upon LLaVA and leverages several open-source libraries. We extend our gratitude to the contributors and maintainers of these projects.
If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝.
@article{qian2025dispider,
title={Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction},
author={Qian, Rui and Ding, Shuangrui and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi},
journal={arXiv preprint arXiv:2501.03218},
year={2025}
}