- **2025.02.17** 🌟 We support training on Nvidia GPU with DeepSpeed and inference on Nvidia GPU with Transformers.
- **2025.02.09** 🌟 We support training and inference on Nvidia GPU with Megatron.
- **2025.02.05** 🌟 We release the training code, training logs, deployment code, and model weights, which support Ascend NPU with MindSpeed.
- **2025.02.05** 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.
- Long Context. Long-VITA can process more than 4K frames or over 1M visual tokens, and achieves state-of-the-art performance on Video-MME among models under 20B parameters.
- Open Source. Long-VITA is trained only on open-source data: a mix of 17M publicly available samples.
- Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
- Comparison of image understanding.
- Comparison of video understanding.
- Effectiveness of Logits-Masked LM Head.
| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights |
|---|---|---|---|---|---|---|
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
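A back-of-the-envelope reading of the table above: dividing each variant's training context by its training frames suggests a constant per-frame visual token budget across all three configurations. This sketch only restates the table's arithmetic and ignores whatever share of the context is reserved for text tokens, so treat it as an estimate, not an official specification:

```python
# Per-frame visual token budget implied by the released configurations.
# (training context length, training frames) taken from the table above.
configs = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context, frames) in configs.items():
    # Integer division: tokens of context per sampled video frame.
    print(f"{name}: {context // frames} tokens per frame")
```

All three variants work out to the same 256 tokens per frame, which is consistent with scaling the frame count linearly with the context window.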
We originally implemented Long-VITA on Ascend NPU and have since adapted it to Nvidia GPU.