A Survey of Large Vision-Language Models
LLaVA: Visual Instruction Tuning
- Date: 2023.04
- Core contribution: Proposes a simple framework for training VLMs (Vision-Language Models) that reaches strong visual understanding with relatively little data. In addition, 158K high-quality, structured visual instruction-tuning samples are built from existing open-source datasets (see the architecture sketch after this entry).
- Code: Github
- Weight: Huggingface
- Demo: Huggingface
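The LLaVA recipe wires a frozen vision encoder to an LLM through a lightweight projection layer: stage 1 trains only the projector on image-caption pairs, and stage 2 fine-tunes the projector and LLM on the instruction data. Below is a minimal PyTorch sketch of that wiring, assuming a ViT-style encoder that returns patch features and a decoder-only LLM that accepts input embeddings; the class names, dimensions, and dummy modules are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn


class LlavaStyleVLM(nn.Module):
    """Frozen vision encoder -> trainable projector -> LLM (illustrative)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP ViT, kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patch features into LLM space
        self.llm = llm                                   # decoder-only language model
        for p in self.vision_encoder.parameters():
            p.requires_grad = False                      # stage 1 updates only the projector

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():
            patch_feats = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        # Visual tokens are prepended to the text embeddings and decoded jointly.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))


class DummyViT(nn.Module):
    """Stand-in encoder so the sketch runs without downloading real weights."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=14, stride=14)

    def forward(self, x):
        return self.patchify(x).flatten(2).transpose(1, 2)    # (B, N_patches, dim)


model = LlavaStyleVLM(DummyViT(), llm=nn.Linear(4096, 32000))
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16, 4096))
print(logits.shape)  # torch.Size([2, 272, 32000]): 256 visual + 16 text positions
```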
LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
- Date: 2023.10
- Core contribution: Scales up the pre-training data (558K) and fine-tuning data (665K), raises the input resolution of the vision encoder, and uses a larger LLM, providing further evidence for which factors matter most for improving VLM capability (summarized in the config sketch after this entry).
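The changes above amount to a few scaling knobs. The hedged sketch below collects them in one place; the field names are invented for illustration, and the encoder and LLM identifiers reflect the commonly reported LLaVA-1.5 configuration (CLIP ViT-L/14 at 336px input, Vicuna-13B) rather than details stated in this summary.

```python
from dataclasses import dataclass


@dataclass
class Llava15ScalingConfig:
    """Illustrative record of the LLaVA-1.5 scaling recipe described above."""
    pretrain_samples: int = 558_000      # image-caption alignment data
    finetune_samples: int = 665_000      # visual instruction-tuning mixture
    vision_encoder: str = "CLIP ViT-L/14, 336px input"  # higher resolution
    llm: str = "Vicuna-13B"              # larger language backbone


print(Llava15ScalingConfig())
```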
LLaVA 1.6 (LLaVA-NeXT): Improved reasoning, OCR, and world knowledge
- Date: 2024.01
- Core contribution: Curates higher-quality data subsets, raising the fine-tuning data to 760K; increases effective input resolution through dynamic image tiling, which splits a high-resolution image into several encoder-sized crops plus a global view (sketched below); and adopts larger LLMs for further gains.
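The tiling step can be pictured as follows. This is a hedged sketch assuming a fixed tile size and a pre-chosen grid; the released model's grid-selection heuristic, tile size, and padding behavior may differ. Each returned view is encoded separately, and the resulting visual tokens are concatenated before being handed to the LLM.

```python
from PIL import Image


def dynamic_tiles(image: Image.Image, tile: int = 336, grid: tuple = (2, 2)) -> list:
    """Split an image into grid tiles plus a low-res global view (illustrative)."""
    cols, rows = grid
    resized = image.resize((cols * tile, rows * tile))
    crops = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]
    # The downscaled global view keeps overall layout; the crops preserve detail.
    return [image.resize((tile, tile))] + crops


views = dynamic_tiles(Image.new("RGB", (1024, 768)), tile=336, grid=(3, 2))
print(len(views))  # 1 global view + 6 tiles = 7
```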
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- Date: 2023.11
- Core contribution: Demonstrates the impact of high-quality image captions during VLM pre-training. 100K high-quality captions are first generated with GPT-4V; the ShareCaptioner model is then trained on this data to produce captions of similar quality, and is finally used to scale the released dataset to 1.2 million entries (see the caption-scaling sketch at the end of this entry).
- Code: Github
- Weight: [ShareGPT4V-7B] [ShareCaptioner]
- Demo: Huggingface
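The last step of the pipeline, running the trained captioner over a large image pool, is sketched below. `caption_fn` is a placeholder for whatever captioning model is plugged in, and the JSONL record layout is an assumption for illustration, not the schema of the released 1.2M-entry dataset.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def build_caption_dataset(image_paths: Iterable[str],
                          caption_fn: Callable[[str], str],
                          out_file: str = "captions.jsonl") -> int:
    """Run a captioner over an image pool and store image/caption records."""
    count = 0
    with open(out_file, "w", encoding="utf-8") as f:
        for path in image_paths:
            record = {"image": str(Path(path)), "caption": caption_fn(path)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Dummy captioner so the sketch runs end to end; swap in a real model call.
written = build_caption_dataset(["imgs/cat.png"], lambda p: f"placeholder caption for {p}")
print(written, "captions written")
```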