Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

🔥 News

  • 2025.02.17 🌟 We support training on Nvidia GPU with DeepSpeed and inference on Nvidia GPU with Transformers.
  • 2025.02.09 🌟 We support training and inference on Nvidia GPU with Megatron.
  • 2025.02.05 🌟 We release training code, training log, deployment code, and model weights, which support Ascend NPU with MindSpeed.
  • 2025.02.05 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.

✨ Highlights

  • Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens, and achieves state-of-the-art performance on Video-MME among models under 20B parameters.
  • Open Source. Long-VITA is trained entirely on open-source data, a mix of 17M publicly available samples.
  • Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.

📈 Experimental Results

  • Comparison of image understanding.


  • Comparison of video understanding.


  • Effectiveness of Logits-Masked LM Head.


🐍 Models

| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Hugging Face Weights |
| :--- | :---: | :---: | :---: | :--- | :--- | :--- |
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
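As a sanity check on the table above, dividing each training context length by its frame count shows that every variant budgets the same number of visual tokens per frame; the sketch below derives this figure (text tokens share the same context window, so it is an upper bound per frame, not an exact per-frame cost):

```python
# Implied visual-token budget per frame for each Long-VITA variant,
# computed as (training context length) / (training frames) from the table.
variants = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context_len, frames) in variants.items():
    tokens_per_frame = context_len // frames
    print(f"{name}: {tokens_per_frame} tokens/frame")
# Every variant works out to 256 tokens per frame.
```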

⭐ Training, Inference and Evaluation

We originally implemented Long-VITA on Ascend NPU with MindSpeed and have since adapted it to Nvidia GPU with Megatron and DeepSpeed.
