VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang1*
Yuxin Xie1*
Yufan Deng1*
Liming Liang1
Jinghan Ru1
Yuguo Yin1
Yuexian Zou 1
1 Peking University
- [In progress] We will explore a more powerful model architecture and use more data to support unified generation and understanding across modalities, including images, videos, and text, in the next version. Stay tuned! 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
- [2025-01-22] We release the datasets for training VARGPT (7B+2B). 🔥🔥
- [2025-01-21] We release the model and inference code of VARGPT (7B+2B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), and text-to-image generation. 🔥🔥
Below is a comparison among understanding-only, generation-only, and unified (understanding & generation) models. Image and Text indicate the representations from the corresponding input modalities. VARGPT models understanding and generation as two distinct paradigms within a unified model: it predicts the next token for visual understanding and the next scale for visual generation.
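For intuition only, the toy sketch below contrasts the two paradigms: next-token prediction emits one token per step over a flat sequence, while next-scale prediction emits an entire token map at each step, from coarse to fine. The tensors and the shared prediction head are dummy placeholders, not the actual VARGPT modules.
import torch

vocab_size, dim = 1000, 64
head = torch.nn.Linear(dim, vocab_size)        # toy prediction head (placeholder)

# (a) Visual understanding: next-token prediction over a flat 1D sequence.
hidden_states = torch.randn(1, 16, dim)        # [batch, seq_len, dim]
next_token = head(hidden_states[:, -1]).argmax(-1)   # one token predicted per step

# (b) Visual generation: next-scale prediction -- each step emits the whole token
#     map of the next, finer scale (e.g. 1x1 -> 2x2 -> 4x4), conditioned on coarser scales.
scales = [1, 2, 4]
context = torch.randn(1, 1, dim)               # running context over coarser scales
token_maps = []
for s in scales:
    step = context.mean(dim=1, keepdim=True).expand(1, s * s, dim)
    token_maps.append(head(step).argmax(-1).view(1, s, s))   # s*s tokens at once
    context = torch.cat([context, step], dim=1)

print(next_token.shape, [m.shape for m in token_maps])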
Below is an overview of VARGPT and the proposed unified three-stage instruction tuning. VARGPT is capable of handling image captioning, visual question answering, text-to-image generation, and mixed-modality generation.
The visual generation capabilities of VARGPT are currently constrained by its training data, which is derived from the ImageNet dataset (1.28M images). In the next version, we intend to substantially improve both the quality and quantity of the training data.
- Release the inference code.
- Release the code for evaluation.
- Release the model checkpoint.
- Release the datasets.
- Support stronger visual generation capabilities.
- Release the training code.
The VARGPT checkpoints can be found on Hugging Face:
The instruction-tuning data for training can be found on Hugging Face:
First, set up the environment:
pip3 install -r requirements.txt
The code structure is organized as follows:
VARGPT_code/
├── inference # Inference code for understanding and generation.
├── patching_utils # Patching utils for supporting VARGPT.
├── README.md
├── requirements.txt # Requirements for inference code.
├── understand_eval # Evaluation code for understanding.
└── vargpt_llava # Complete model architecture code.
Inference demo for Multimodal Understanding. You can run the following shell command:
python3 inference/understanding_vargpt.py
Or run the following Python code:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching
model_id = "VARGPT-family/VARGPT_LLaVA-v1"
prepare_vargpt_llava(model_id)
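# Load the model weights in float32 and move the model to GPU 0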
model = VARGPTLlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float32,
low_cpu_mem_usage=True,
).to(0)
patching(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)
# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Please explain the meme in detail."},
{"type": "image"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "./assets/llava_bench_demo.png"
print(prompt)
raw_image = Image.open(image_file)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float32)
output = model.generate(
**inputs,
max_new_tokens=2048,
do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
Inference demo for Text-to-Image Generation. You can run the following shell command:
python3 inference/generation_vargpt.py
Or run the following Python code:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching
model_id = "VARGPT-family/VARGPT_LLaVA-v1"
prepare_vargpt_llava(model_id)
model = VARGPTLlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float32,
low_cpu_mem_usage=True,
).to(0)
patching(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)
# some instruction examples:
# Please design a drawing of a butterfly on a flower.
# Please create a painting of a black weasel is standing in the grass.
# Can you generate a rendered photo of a rabbit sitting in the grass.
# I need a designed photo of a lighthouse is seen in the distance.
# Please create a rendered drawing of an old photo of an aircraft carrier in the water.
# Please produce a designed photo of a squirrel is standing in the snow.
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Please design a drawing of a butterfly on a flower."},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
inputs = processor(text=prompt, return_tensors='pt').to(0, torch.float32)
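# Path where the generated image will be saved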
model._IMAGE_GEN_PATH = "output.png"
output = model.generate(
**inputs,
max_new_tokens=2048,
do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
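The generated image is written to the path assigned to `model._IMAGE_GEN_PATH` ("output.png" in this demo). As a minimal follow-up sketch, assuming generation has produced a file at that path, you can inspect it with PIL:
from PIL import Image

img = Image.open("output.png")   # path set via model._IMAGE_GEN_PATH above
print(img.size)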
Following lmms-eval, you can install the evaluation package by running the following commands:
cd understand_eval
pip install -e .
To evaluate the model's multimodal understanding performance, you can run the following shell command:
python3 -m accelerate.commands.launch \
--num_processes=8 \
--main_process_port=39535 \
-m lmms_eval \
--model vargpt_llava \
--model_args pretrained="path/to/VARGPT_LLaVA-v1" \
--tasks mmmu \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava-hf_mmmu \
--output_path ./logs/
The following content provides detailed instructions for preparing the training data for VARGPT. The data preparation process involves downloading and processing various datasets for different stages of training.
- `stage1-pt`: Contains the 1.28M-sample pre-training instruction fine-tuning dataset for VARGPT.
- `stage2-sft`: Includes datasets for the second stage of VARGPT instruction fine-tuning:
  - `stage2-sft/llava_v1_5_mix665k`: Derived entirely from LLaVA-1.5 training data.
  - `stage2-sft/llava_onevision_508k`: Sampled from the LLaVA-OneVision dataset.
  - `stage2-sft/ImageNet-Instruct-5k`: Sampled from our `stage3-sft/ImageNet-Instruct-130k` dataset.
- `stage3-sft`: Contains datasets for the third stage of VARGPT instruction fine-tuning:
  - `stage3-sft/ImageNet-Instruct-130k`
  - `stage3-sft/ImageNet-Instruct-1270k`
To prepare the image data, download the following datasets:
- ImageNet: ImageNet
- COCO: train2017
- GQA: images
- OCR-VQA: images
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-Onevision Dataset: LLaVA-Onevision Dataset
For the `ImageNet-Instruct-class`, `ImageNet-Instruct-130k`, and `ImageNet-Instruct-1270k` datasets, use the following Python script to match image paths with ImageNet image UIDs:
from tqdm import tqdm
import glob
import os
import json
import pandas as pd
def build_image_mapping(parquet_folder_path):
    image_mapping = {}
    label_mapping = {}
    parquet_files = glob.glob(os.path.join(parquet_folder_path, "train-*.parquet"))
    for file_idx, parquet_file in enumerate(tqdm(parquet_files)):
        # glob already returns full paths, so read the parquet file directly
        df = pd.read_parquet(parquet_file)
        for idx in range(len(df)):
            img_path = df.iloc[idx]['image']['path']
            image_filename = f"image_{file_idx}_{idx}.jpg"
            # Use the first two underscore-separated fields of the original path as the image UID
            prefix = img_path.split('_')[0] + '_' + img_path.split('_')[1]
            image_mapping[prefix] = image_filename
            label_mapping[image_filename] = df.iloc[idx]['label']
    return image_mapping, label_mapping

image_mapping, label_mapping = build_image_mapping("ImageNet/data")
This script creates two mappings:
- `image_mapping`: Maps the original image prefixes to the new image filenames.
- `label_mapping`: Associates the new image filenames with their corresponding labels.
- Ensure all datasets are downloaded and stored in appropriate directories before running the script.
- The script assumes a specific structure for the ImageNet parquet files. Adjust the path and file naming convention if necessary.
- After running the script, you can use `image_mapping` and `label_mapping` for further data processing or model training, for example by persisting them to disk as sketched below.
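A minimal sketch for saving the two mappings (the output filenames are arbitrary and only for illustration):
import json

with open("imagenet_image_mapping.json", "w") as f:
    json.dump(image_mapping, f)
with open("imagenet_label_mapping.json", "w") as f:
    # Labels read from parquet may be numpy integers, so cast them to plain int for JSON.
    json.dump({k: int(v) for k, v in label_mapping.items()}, f)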
For any issues or further clarification, please refer to the project documentation or contact the maintainers.
To cite the paper and model, please use the BibTeX entry below:
@misc{zhuang2025vargptunifiedunderstandinggeneration,
title={VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model},
author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
year={2025},
eprint={2501.12327},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.12327},
}
This work is heavily based on LLaVA-1.5, VAR, LLaVA-NeXT, lmms-eval, Show-o, LLaMA-Factory, CLIP, and transformers-hf. Thanks to all the authors for their great work.