Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

🔥 News

  • 2025.02.17 🌟 We support training on Nvidia GPU with DeepSpeed and inference on Nvidia GPU with Transformers.
  • 2025.02.09 🌟 We support training and inference on Nvidia GPU with Megatron.
  • 2025.02.05 🌟 We release training code, training log, deployment code, and model weights, which support Ascend NPU with MindSpeed.
  • 2025.02.05 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.

✨ Highlights

  • Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens, and achieves state-of-the-art performance on Video-MME among models under 20B parameters.
  • Open Source. Long-VITA is trained entirely on open-source data, a mix of 17M publicly available samples.
  • Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.

📈 Experimental Results

  • Comparison of image understanding.


  • Comparison of video understanding.


  • Effectiveness of Logits-Masked LM Head.


🐍 Models

| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Hugging Face Weights |
| :--- | :---: | :---: | :---: | :--- | :--- | :--- |
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
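As a sanity check on the table above, dividing each training context length by its frame count shows that every variant budgets the same number of visual tokens per frame; the sketch below derives this figure (text tokens share the same context window, so it is an upper bound per frame, not an exact per-frame cost):

```python
# Implied visual-token budget per frame for each Long-VITA variant,
# computed as (training context length) / (training frames) from the table.
variants = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context_len, frames) in variants.items():
    tokens_per_frame = context_len // frames
    print(f"{name}: {tokens_per_frame} tokens/frame")
# Every variant works out to 256 tokens per frame.
```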

⭐ Training, Inference and Evaluation

We originally implemented Long-VITA on Ascend NPU with MindSpeed and have since adapted it to Nvidia GPU with Megatron and DeepSpeed.
