Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model

FreedomIntelligence/ALLaVA

ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

⚡ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training LVLMs.⚡


📃 Paper • 🌐 Demo

🤗 ALLaVA-4V Dataset

🤗 ALLaVA-Phi3-mini-128k • 🤗 ALLaVA-StableLM2-1_6B • 🤗 ALLaVA-Phi2-2_7B

✨ Updates

  • [06/25/2024]: We release ALLaVA-Phi3-mini-128k, ALLaVA-StableLM2-1_6B, and ALLaVA-Phi2-2_7B, all of which can be loaded directly from their 🤗 repos.
  • [03/01/2024]: The huggingface repos of ALLaVA-3B-Longer (recommended) and ALLaVA-3B have been updated and now support loading models with the from_pretrained method.
  • [02/29/2024]: The huggingface repo of the ALLaVA-4V dataset and the download scripts have been updated.
  • [02/21/2024]: We are thrilled to release 1) 1.4M data for training LVLMs, 2) two versions of our ALLaVA-3B models, 3) inference code and 4) tech report.

📚 ALLaVA-4V Data

Generation Pipeline

[Figure: generation pipeline]
  • LAION

We leverage GPT-4V to generate captions and complex reasoning QA pairs. The prompt is here.

  • Vision-FLAN

We leverage GPT-4V to generate captions and detailed answers for the original instructions. The prompt is here.

  • Wizard

We regenerate the answers of Wizard_evol_instruct with GPT-4-Turbo.

Dataset Cards

All datasets can be found here. The naming structure is shown below:

ALLaVA-4V
├── ALLaVA-Caption-4V
│   ├── ALLaVA-Caption-LAION-4V
│   └── ALLaVA-Caption-VFLAN-4V
├── ALLaVA-Instruct-4V
│   ├── ALLaVA-Instruct-LAION-4V
│   └── ALLaVA-Instruct-VFLAN-4V
└── Evol-Instruct-GPT4-Turbo-143K

The folder structure of the huggingface dataset space:

ALLaVA-4V
├── allava_laion/
│   ├── ALLaVA-Caption-LAION-4V.json
│   ├── ALLaVA-Instruct-LAION-4V.json
│   └── images.zip
├── allava_vflan/
│   ├── ALLaVA-Caption-VFLAN-4V.json
│   └── ALLaVA-Instruct-VFLAN-4V.json
└── allava_text/
    └── Evol-Instruct-GPT4-Turbo-143K.json

We do NOT own the rights to any image contained within the "images.zip" file. We collated the images and uploaded this file at the request of the community to facilitate the data preparation process.

Detailed information on each subset is given below.

| Name | #Samples | Image Source | Instruction Source | Answer Source |
| --- | --- | --- | --- | --- |
| ALLaVA-Caption-LAION-4V* | 505,588 | LAION (web) | Handcrafted | GPT-4V |
| ALLaVA-Caption-VFLAN-4V** | 202,552 | Vision FLAN | Handcrafted | GPT-4V |
| ALLaVA-Instruct-LAION-4V* | 505,588 | LAION (web) | GPT-4V | GPT-4V |
| ALLaVA-Instruct-VFLAN-4V** | 203,065 | Vision FLAN | Vision FLAN*** | GPT-4V |
| Evol-Instruct-GPT4-Turbo-143K | 143,000 | / | Wizard | GPT-4-Turbo |

*ALLaVA-Caption-LAION-4V and ALLaVA-Instruct-LAION-4V share the same set of images.

**ALLaVA-Caption-VFLAN-4V and ALLaVA-Instruct-VFLAN-4V share the same set of images.

***In ALLaVA-Caption-VFLAN-4V and ALLaVA-Instruct-VFLAN-4V, we additionally include 30K samples distilled in the same manner as for LAION (i.e. distilling captions and complex reasoning QA pairs). These samples are mixed with the other data, whose instructions come from the original VFLAN.

Data Preparation

LAION

Specify the laion_root in ./download/download_laion.sh and run

sh ./download/download_laion.sh

VFLAN

Specify the vflan_root in ./download/download_vflan.sh and run

sh ./download/download_vflan.sh

Evol-Instruct-GPT4-Turbo-143K

Specify the text_root in ./download/download_text.sh and run

sh ./download/download_text.sh
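
After the three scripts finish, it can be useful to confirm that the annotation files landed where expected. The sketch below is a minimal check, assuming the scripts reproduce the folder layout of the huggingface dataset space listed above under a single root directory. The function name `find_missing_files` and the example root path are hypothetical, and only the JSON files are checked because the layout of the extracted images is not specified here.

```python
from pathlib import Path

# Expected annotation files, copied from the folder structure listed above.
# Assumption: the download scripts reproduce this layout under one root.
EXPECTED_FILES = {
    "allava_laion": [
        "ALLaVA-Caption-LAION-4V.json",
        "ALLaVA-Instruct-LAION-4V.json",
    ],
    "allava_vflan": [
        "ALLaVA-Caption-VFLAN-4V.json",
        "ALLaVA-Instruct-VFLAN-4V.json",
    ],
    "allava_text": ["Evol-Instruct-GPT4-Turbo-143K.json"],
}


def find_missing_files(data_root):
    """Return the relative paths of expected annotation files that are absent."""
    root = Path(data_root)
    missing = []
    for folder, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # "./ALLaVA-4V" is a hypothetical root; use the directory you configured.
    for rel_path in find_missing_files("./ALLaVA-4V"):
        print("missing:", rel_path)
```

An empty result means all five annotation files were found.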

Sample Demonstration

ALLaVA-Caption-LAION-4V
{
  "id": "allava_laion_cap_100277305",
  "image": "allava_laion/images/100277305.jpeg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease depict the image in words."
    },
    {
      "from": "gpt",
      "value": "The image displays a silver ring with a design that resembles a stylized leaf. The leaf design is split into two symmetrical halves, separated by a gap that runs along the ring's band. Each half of the leaf pattern is textured with detailed, vein-like striations that give it a realistic look. The edges of the leaf design are embellished with small, clear gemstones that could possibly be diamonds, adding a sparkle to the ring. The band of the ring is wide, supporting the large leaf motif, and it tapers slightly as it moves away from the leaf design. The silver metal has a high polish, reflecting light brightly and providing a contrast to the textured leaf surface. The craftsmanship suggests a mix of organic inspiration with a touch of elegance due to the gemstone accents."
    }
  ],
  "original_caption": "190952cz Pandora Tropical Palm Leaf Ring",
  "url": "https://theartofpandora.com/wp-content/uploads/2016/10/190952CZ-1-600x600.jpg"
}
ALLaVA-Caption-VFLAN-4V
{
  "id": "allava_vflan_cap_100000",
  "image": "allava_vflan/images/images_191task_1k/PACS+guitar_image_category_classification_830_n03272010_10795-5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nHow would you describe the image to someone who cannot see it?"
    },
    {
      "from": "gpt",
      "value": "The image shows a simplistic line drawing of an electric guitar. The guitar is oriented diagonally from the top left corner to the bottom right corner of the image. It has a long, narrow neck with tuning pegs at the top, resembling a stylized \"T\" shape. The body of the guitar is irregular, with two cutaways on either side of the neck, giving it an asymmetric appearance. The body is also characterized by sharp angles rather than curves, which is not typical for most guitars. There are four horizontal lines across the body, likely representing strings. The drawing is black and white, with no shading or color, and appears to be hand-drawn with a single continuous line without much detail. The background is plain and white."
    }
  ]
}
ALLaVA-Instruct-LAION-4V
{
  "id": "allava_laion_inst_100277305",
  "image": "allava_laion/images/100277305.jpeg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nHow does the inclusion of gemstones affect the overall aesthetic and possible value of the ring?"
    },
    {
      "from": "gpt",
      "value": "The inclusion of gemstones along the edges of the leaf design enhances the ring's aesthetic by adding a touch of brilliance and luxury. The small, clear gemstones catch the light, creating a sparkling effect that contrasts with the silver's polished surface and the leaf's textured details. This sparkling accent likely increases the ring's visual appeal, making it stand out more as an accessory. In terms of value, the presence of gemstones, which could be diamonds, would likely raise the ring's monetary worth. Gemstones, especially diamonds, are prized for their beauty and durability, and their addition to jewelry pieces often signifies a higher quality and class of the item, thereby increasing its market value."
    }
  ],
  "original_caption": "190952cz Pandora Tropical Palm Leaf Ring",
  "url": "https://theartofpandora.com/wp-content/uploads/2016/10/190952CZ-1-600x600.jpg"
}
ALLaVA-Instruct-VFLAN-4V
{
  "id": "allava_vflan_inst_9",
  "image": "allava_vflan/images/images_191task_1k/VQA-E+visual_question_answering_574_COCO_train2014_000000119456.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nYou are provided with an image and a question related to the image. Answer the question based on the information given in the image.\n\nIs bigfoot in the picture?"
    },
    {
      "from": "gpt",
      "value": "The creature in the image is not Bigfoot, also known as Sasquatch. Bigfoot is typically described in folklore and cryptozoology as a large, bipedal humanoid creature with ape-like features, standing upwards of 6 to 9 feet tall. The animal present in the photograph is quadrupedal, possesses a body structure similar to that of a pig, and has features such as long hair and tusks, which are characteristic of a wild boar. Additionally, there is no evidence of bipedal creatures or any human-like figures that would suggest the presence of Bigfoot. The setting and visual clues do not align with the commonly reported sightings of Bigfoot, which often describe it as a solitary figure standing upright among trees. Therefore, based on the visual information available in the image, the conclusion is that Bigfoot is not depicted here."
    }
  ]
}
Evol-Instruct-GPT4-Turbo-143K
{
  "idx": "heR0vZB",
  "conversations": [
    {
      "from": "human",
      "value": "As an online platform teacher named Aimee, you possess impeccable credentials which include a Bachelor of Science degree in Industrial and Labor Relations from Cornell University, expertise in the English language, and intermediate proficiency in both Chinese and Spanish. Additionally, your professional experience as a STEAM teacher at UN Women in Singapore has honed your skills in teaching children from the ages of 6-11 and working with students from all levels of education. Your exceptional teaching abilities in spoken English and pronunciation paired with your personal strengths of being informed, patient, and engaging make you an ideal teacher for students seeking to improve their English language skills. Can you provide a short, concise, and unique English self-introduction in bullet point form that would attract students to enroll in your course?"
    },
    {
      "from": "gpt",
      "value": "Sure, here are some bullet points for your English self-introduction:\n\n- Hi there! My name is Aimee and I'm thrilled to be your online English teacher!\n- With a Bachelor's degree from Cornell University and experience teaching STEAM at UN Women in Singapore, I know how to help students learn effectively.\n- I'm passionate about language and have intermediate proficiency in both Chinese and Spanish, so I understand the challenges of learning a new language.\n- In my classes, I focus on improving spoken English and pronunciation, as well as building vocabulary and grammar skills.\n- But learning English doesn't have to be boring! I'm an engaging and patient teacher who loves to have fun and make learning enjoyable.\n- So, whether you're a beginner or an advanced student, I'm confident that I can help you improve your English skills and achieve your goals. Let's get started!"
    }
  ]
}
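
The records above share one shape: an identifier (`id` for image subsets, `idx` for the text subset), an optional `image` path, and a `conversations` list that alternates `human` and `gpt` turns. The following sketch shows one way to read such records. It assumes each JSON file holds a top-level list of records, which the samples above do not confirm, and the helper names (`load_records`, `record_id`, `iter_turn_pairs`) are hypothetical.

```python
import json


def load_records(json_path):
    """Load one annotation file (assumed to contain a JSON list of records)."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def record_id(record):
    """Image subsets use "id"; the text-only subset uses "idx"."""
    return record.get("id", record.get("idx"))


def iter_turn_pairs(record):
    """Yield (question, answer) pairs from alternating human/gpt turns."""
    turns = record["conversations"]
    for human, gpt in zip(turns[0::2], turns[1::2]):
        if human["from"] != "human" or gpt["from"] != "gpt":
            raise ValueError(f"unexpected turn order in {record_id(record)}")
        yield human["value"], gpt["value"]


# The ALLaVA-Caption-LAION-4V sample from above, with the answer cut to its
# first sentence for brevity.
sample = {
    "id": "allava_laion_cap_100277305",
    "image": "allava_laion/images/100277305.jpeg",
    "conversations": [
        {"from": "human", "value": "<image>\nPlease depict the image in words."},
        {
            "from": "gpt",
            "value": "The image displays a silver ring with a design that "
            "resembles a stylized leaf.",
        },
    ],
}

for question, answer in iter_turn_pairs(sample):
    # Records without an "image" key (the text subset) report None here.
    print(record_id(sample), sample.get("image"), len(answer))
```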

Benchmark Result

Our models ALLaVA-Phi3-mini-128k, ALLaVA-StableLM2-1_6B and ALLaVA-Phi2-2_7B achieve competitive results on 17 benchmarks.

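| Models | Vicuna-80 | GQA | HallusionBench | MME-P | MMVP | TouchStone | TextVQA | MME-C | MathVista | MM-Vet | MMMU-val | SQA (img) | LLaVA (In-the-Wild) | MLLM-Bench | MMB-en | MMB-cn | SEEDBench (img, v1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Large VLMs | | | | | | | | | | | | | | | | | |
| BLIP-2 | - | - | - | - | - | - | - | - | - | 22.4 | 34.4 | - | - | 3.0* | - | - | 49.7 |
| InstructBLIP | - | 49.5 | - | - | - | - | - | - | - | 25.6 | - | - | 58.2 | - | 44.0 | - | - |
| Qwen-VL-Chat | - | 57.5 | - | 1487.6 | - | - | 61.5 | 360.7 | - | 31.1 | - | 68.2 | - | - | 60.6 | 56.7 | 65.4 |
| LLaVA-1.5-7B | 13.8* | 62.0 | 36.6* | 1504.4* | 24.7* | 594.9* | 58.2 | 324.6* | 25.0* | 31.1 | 35.1* | 66.8 | 65.4 | 23.0* | 64.3 | 58.3 | 66.1 |
| LLaVA-1.5-13B | 22.5 | 63.3 | 36.5* | 1531.3 | 38.0* | 617.7* | 61.3 | 295.4 | 28.3* | 35.4 | 34.4* | 71.6 | 72.5 | - | 67.7 | 63.6 | 68.2 |
| LVIS-7B | - | 62.6 | - | - | - | - | 58.7 | - | - | 31.5 | - | - | 67.0 | 29.0* | 66.2 | - | - |
| LVIS-13B | - | 63.6* | - | - | - | - | 62.5* | - | - | 37.4* | - | - | 71.3* | - | 68.0* | - | - |
| ShareGPT4V-7B | 13.8* | 63.3 | 36.0* | 1540.1* | 34.0* | 637.2* | 60.4 | 346.1* | 24.7* | 37.6 | 35.4* | 68.4* | 72.6 | 30.2* | 68.8 | 61.0* | 69.7 |
| ShareGPT4V-13B | 17.5* | 64.8 | 39.0* | 1576.1* | 35.3* | 648.7* | 62.2 | 309.3* | 28.8* | 43.1 | 35.6* | 70.0* | 79.9 | 35.5* | 71.2 | 61.7* | 70.8 |
| 4B-scale Lite VLMs | | | | | | | | | | | | | | | | | |
| MobileVLM-v2 | 5.0* | 61.1 | 30.8* | 1440.5 | 18.7* | 541.0* | 57.5 | 261.8* | 28.3* | 26.1* | 30.8* | 70.0 | 53.2* | 15.7* | 63.2 | 43.2* | 64.5* |
| Mipha-3B | 16.2* | 63.9 | 34.3* | 1488.9 | 32.0* | 619.0* | 56.6 | 285.0* | 27.8* | 33.5* | 35.8* | 70.9 | 64.7* | 23.1* | 69.7 | 42.9* | 71.2* |
| TinyLLaVA | 15.6* | 62.1 | 37.2* | 1465.5* | 33.3* | 663.5* | 60.3 | 281.1* | 30.3* | 37.5 | 38.4 | 73.0 | 70.8* | 29.8* | 69.7* | 42.8* | 70.4* |
| Ours | | | | | | | | | | | | | | | | | |
| ALLaVA-Phi2 | 49.4 | 48.8 | 24.8 | 1316.2 | 36.0 | 632.0 | 49.5 | 301.8 | 27.4 | 32.2 | 35.3 | 67.6 | 69.4 | 43.6 | 64.0 | 40.8 | 65.2 |
| ALLaVA-StableLM2 | 38.8 | 49.8 | 25.3 | 1311.7 | 34.0 | 655.2 | 51.7 | 257.9 | 27.7 | 31.7 | 33.3 | 64.7 | 72.0 | 39.3 | 64.6 | 49.8 | 65.7 |
| ALLaVA-Phi3 | 56.9 | 52.2 | 48.1 | 1382.3 | 32.7 | 667.8 | 53.0 | 347.1 | 32.9 | 37.8 | 41.1 | 64.0 | 68.5 | 54.8 | 68.1 | 55.3 | 69.0 |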

* denotes the results of our evaluation. Bold numbers are the best results among all 4B-scale LVLMs. Detailed information on each benchmark is given in Table 4 of our technical report.

🏭 Inference

All models can be loaded from HuggingFace using the .from_pretrained() method. Check out the example scripts for sample inputs and outputs.
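
As a rough illustration of the loading step, the sketch below builds a repo ID and passes it to `from_pretrained`. Several things here are assumptions rather than facts from this README: that the checkpoints live under the `FreedomIntelligence` organization on HuggingFace, that they load through `AutoModelForCausalLM`, and that `trust_remote_code=True` is required because the checkpoints ship custom model code. The helper names `repo_id` and `load_model` are hypothetical. For the actual prompting and generation calls, refer to the example scripts.

```python
# Assumption: the models are hosted under the project's organization name.
HF_ORG = "FreedomIntelligence"

# Model names as listed in this README.
MODEL_NAMES = (
    "ALLaVA-Phi3-mini-128k",
    "ALLaVA-StableLM2-1_6B",
    "ALLaVA-Phi2-2_7B",
)


def repo_id(model_name):
    """Build the (assumed) HuggingFace repo ID for a released model."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name}")
    return f"{HF_ORG}/{model_name}"


def load_model(model_name):
    """Download and load a checkpoint; requires the transformers package."""
    # Imported lazily so repo_id() stays usable without transformers installed.
    from transformers import AutoModelForCausalLM

    # Assumption: the checkpoints register custom code, hence trust_remote_code.
    return AutoModelForCausalLM.from_pretrained(
        repo_id(model_name), trust_remote_code=True
    )


if __name__ == "__main__":
    model = load_model("ALLaVA-Phi3-mini-128k")
    print(type(model).__name__)
```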

🏋️‍♂️ Training

Data

[Figure: training datasets]

ALLaVA uses 1.0M samples for pretraining (PT) and 1.5M samples for fine-tuning (FT).

Code

The training code is largely based on LLaVA. We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.

Hyperparameters

| Global Batch Size | ZeRO Stage | Optimizer | Max LR | Min LR | Scheduler | Weight decay |
| --- | --- | --- | --- | --- | --- | --- |
| 256 (PT) / 128 (FT) | 1 | AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 0 |

The LM backbone and the projector are trainable, while the vision encoder is kept frozen. Each module's trainability is the same in both stages.
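
To make the learning-rate settings concrete, the sketch below evaluates a cosine schedule with warm restarts between the listed Max LR and Min LR. It uses the standard cosine-annealing formula with equal-length cycles, which is what PyTorch's CosineAnnealingWarmRestarts computes when `T_mult=1`. The cycle length is not given in this README, so the value used here is purely hypothetical, as is the function name.

```python
import math

MAX_LR = 2e-5  # "Max LR" from the hyperparameter table
MIN_LR = 2e-6  # "Min LR" from the hyperparameter table


def cosine_warm_restart_lr(step, cycle_len, max_lr=MAX_LR, min_lr=MIN_LR):
    """Learning rate at `step` when the cosine cycle restarts every `cycle_len` steps."""
    if cycle_len <= 0:
        raise ValueError("cycle_len must be positive")
    t_cur = step % cycle_len  # position inside the current cycle
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + cosine)


if __name__ == "__main__":
    CYCLE_LEN = 1000  # hypothetical; the README does not state the restart period
    for step in (0, 250, 500, 999, 1000):
        print(step, f"{cosine_warm_restart_lr(step, CYCLE_LEN):.3e}")
```

The rate starts each cycle at Max LR, decays toward Min LR, and jumps back to Max LR at the restart.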

🙌 Contributors

Project Leader: Guiming Hardy Chen

Data: Shunian Chen, Junying Chen, Xiangbo Wu

Evaluation: Ruifei Zhang

Deployment: Xiangbo Wu, Zhiyi Zhang

Advising: Zhihong Chen, Benyou Wang

Others: Jianquan Li, Xiang Wan

📝 Citation

If you find our data useful, please consider citing our work! We are FreedomIntelligence from the Shenzhen Research Institute of Big Data and The Chinese University of Hong Kong, Shenzhen.

@misc{chen2024allava,
      title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model}, 
      author={Guiming Hardy Chen and Shunian Chen and Ruifei Zhang and Junying Chen and Xiangbo Wu and Zhiyi Zhang and Zhihong Chen and Jianquan Li and Xiang Wan and Benyou Wang},
      year={2024},
      eprint={2402.11684},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
