This repository is the official implementation of Dispider.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang
Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
CUHK, Shanghai AI Lab
- [2025/1/6] 🔥🔥🔥 We released the paper on arXiv!
- Release Inference Code
- Release Checkpoints
- Release Training Code
- Release Demo Video
Dispider enables real-time interactions with streaming videos, unlike traditional offline video LLMs that process the entire video before responding. It provides continuous, timely feedback in live scenarios.
Dispider separates perception, decision-making, and reaction into asynchronous modules that operate in parallel. This ensures continuous video processing and response generation without blocking, enabling timely interactions.
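For intuition only, here is a minimal, hypothetical sketch of this disentangled design using Python's asyncio: three coroutines run in parallel and exchange work through queues, so generating an answer never blocks the ongoing perception of the stream. All names and the trigger logic are illustrative and do not come from the Dispider codebase.

```python
# Conceptual sketch of disentangled perception / decision / reaction.
# Illustrative only; names and logic are NOT from the Dispider codebase.
import asyncio


async def fake_stream(num_frames=6, delay=0.1):
    """Simulate frames arriving in real time."""
    for i in range(num_frames):
        await asyncio.sleep(delay)
        yield f"frame_{i}"


async def perception(frames, clip_queue):
    """Continuously encode incoming frames into clip features."""
    async for frame in frames:
        clip_queue.put_nowait(f"features({frame})")   # placeholder encoder
    clip_queue.put_nowait(None)                       # end-of-stream marker


async def decision(clip_queue, response_queue):
    """Decide, clip by clip, whether a response is needed right now."""
    while True:
        clip = await clip_queue.get()
        if clip is None:
            response_queue.put_nowait(None)
            break
        if "frame_3" in clip:                         # toy trigger condition
            response_queue.put_nowait(clip)


async def reaction(response_queue):
    """Generate answers without ever blocking perception."""
    while True:
        clip = await response_queue.get()
        if clip is None:
            break
        print(f"[reaction] responding to {clip}")


async def main():
    clip_q, resp_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        perception(fake_stream(), clip_q),
        decision(clip_q, resp_q),
        reaction(resp_q),
    )


if __name__ == "__main__":
    asyncio.run(main())
```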
Dispider outperforms VideoLLM-online on StreamingBench and surpasses offline Video LLMs on benchmarks like EgoSchema, VideoMME, MLVU, and ETBench. It excels in temporal reasoning and handles diverse video lengths effectively.
Follow the steps below to set up the Dispider environment. We recommend using the specified version of each library to ensure reproducibility and optimal performance.
First, create a new Conda environment with Python 3.10 and activate it:
conda create -n dispider python=3.10 -y
conda activate dispider
Ensure that `pip` is up to date to avoid any installation issues:
pip install --upgrade pip
Ensure that CUDA 11.8 is installed on your system. You can download it from the official NVIDIA website. Follow the installation instructions provided there.
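You can confirm the installed toolkit version with:
nvcc --version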
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
pip install flash-attn==2.5.9.post1 transformers==4.41.2 deepspeed==0.9.5 accelerate==0.27.2 pydantic==1.10.13 timm==0.6.13
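After installation, an optional sanity check (assuming the versions pinned above) can confirm that PyTorch sees your GPU and that the key dependencies import cleanly:

```python
# Optional sanity check for the environment set up above.
import torch
import transformers
import flash_attn

print("torch:", torch.__version__)                 # expect 2.2.0
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # expect 4.41.2
print("flash-attn:", flash_attn.__version__)       # expect 2.5.9.post1
```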
First, download the checkpoints and place them in the corresponding folder.
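If the checkpoints are distributed through the Hugging Face Hub (an assumption; the repository id below is a placeholder, not an official link), `huggingface_hub` can fetch them in one call:

```python
# Hypothetical example: download checkpoints with huggingface_hub.
# "ORG/Dispider" is a placeholder repo id -- replace it with the
# actual checkpoint location referenced in this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ORG/Dispider")
print("Checkpoints downloaded to:", local_dir)
```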
To perform single-turn inference, execute the following script:
python inference.py --model_path YOUR_MODEL_PATH --video_path YOUR_VIDEO_PATH --prompt YOUR_PROMPT
By default, the prompt is inserted at the beginning of the streaming video, and the response is generated in a single turn.
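For example (the model path, video path, and prompt below are placeholders):
python inference.py --model_path checkpoints/dispider --video_path examples/demo.mp4 --prompt "Describe what is happening in the video right now."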
Update the `video_path` in `data/videomme_template.json` and adjust the corresponding argument in `videomme.sh`. Then execute the following command, which will utilize 8 GPUs to run the inference in parallel:
bash scripts/eval/videomme.sh
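If you prefer to rewrite the template programmatically, a small helper like the one below can update the paths in `data/videomme_template.json`. This is a sketch that assumes the template is a JSON list whose entries carry a `video_path` field; adjust it if the actual file layout differs.

```python
# Sketch: point the video_path entries in data/videomme_template.json
# at a local video directory. Assumes a JSON list of dicts with a
# "video_path" key; the video root below is a placeholder.
import json
from pathlib import Path

template = Path("data/videomme_template.json")
video_root = Path("/path/to/your/videomme/videos")

entries = json.loads(template.read_text())
for entry in entries:
    entry["video_path"] = str(video_root / Path(entry["video_path"]).name)

template.write_text(json.dumps(entries, indent=2))
print(f"Updated {len(entries)} entries in {template}")
```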
Shuangrui Ding: mark12ding@gmail.com
The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file.
This codebase is built upon LLaVA and leverages several open-source libraries. We extend our gratitude to the contributors and maintainers of these projects.
If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝.
@article{qian2025dispider,
title={Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction},
author={Qian, Rui and Ding, Shuangrui and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi},
journal={arXiv preprint arXiv:2501.03218},
year={2025}
}