Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models


Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai

arXiv | Demo | Model Weight | Model Weight in Wisemodel


Mini-Monkey is a lightweight multimodal large language model (MLLM) that incorporates a plug-and-play method called the multi-scale adaptive cropping strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It not only demonstrates leading performance on a variety of general multimodal understanding tasks but also shows consistent improvements in document understanding. On OCRBench, Mini-Monkey achieves a score of 802, outperforming the 8B-parameter state-of-the-art model InternVL2-8B. In addition, our model and training strategy are efficient: Mini-Monkey can be trained with only eight RTX 3090 GPUs.
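To make the idea of multi-scale cropping more concrete, the sketch below picks two tiling grids for one image: a fine grid and a coarser grid whose cut lines fall in different places, so an object split by one grid is likely to stay whole in the other. This is an illustration only, not the Mini-Monkey implementation. The function names, the tile budgets, and the gcd-based rule for rejecting grids are all assumptions made for this example.

```python
from itertools import product
from math import gcd


def candidate_grids(min_tiles, max_tiles):
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    return [
        (c, r)
        for c, r in product(range(1, max_tiles + 1), repeat=2)
        if min_tiles <= c * r <= max_tiles
    ]


def closest_grid(width, height, grids):
    """Grid whose aspect ratio is closest to the image's; ties go to more tiles."""
    target = width / height
    return min(grids, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))


def multi_scale_grids(width, height, fine_budget=(4, 12), coarse_budget=(1, 4)):
    """Choose a fine grid and a coarse grid that share no interior cut line.

    Assumed rule (hypothetical): along one axis, grids with m and n tiles
    share an interior cut line exactly when gcd(m, n) > 1, so the coarse
    grid is restricted to sizes coprime with the fine grid on both axes.
    """
    fine = closest_grid(width, height, candidate_grids(*fine_budget))
    coarse_options = [
        g
        for g in candidate_grids(*coarse_budget)
        if gcd(g[0], fine[0]) == 1 and gcd(g[1], fine[1]) == 1
    ]
    coarse = closest_grid(width, height, coarse_options)
    return fine, coarse


def tile_boxes(width, height, grid):
    """Crop boxes (left, top, right, bottom) for each tile, in row-major order."""
    cols, rows = grid
    return [
        (c * width // cols, r * height // rows,
         (c + 1) * width // cols, (r + 1) * height // rows)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    fine, coarse = multi_scale_grids(800, 800)
    print("fine grid:", fine, "coarse grid:", coarse)
    print("coarse crops:", tile_boxes(800, 800, coarse))
```

The boxes returned by tile_boxes can be passed to any image library's crop call. The token compression step (SCM) is not covered by this sketch.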

TODO

  • Open-source the code, weights, and data
  • Support training using 3090 GPUs (24 GB video memory)
  • Mini-Monkey with different LLMs

Model Zoo

Mini-Monkey was trained using 8 3090 GPUs on a dataset

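| Model | #param | MME | RWQA | AI2D | CCB | SEED | HallB | POPE | MathVista | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Gemini | 35B | 2141.0 | - | - | - | - | - | - | 43.3 | - | - | - | - | - |
| LLaVA-NeXT | 35B | 2028.0 | - | 74.9 | 49.2 | 75.9 | 34.8 | 89.6 | 46.5 | - | - | - | - | - |
| InternVL 1.2 | 40B | 2175.4 | 67.5 | 79.0 | 59.2 | 75.6 | 47.6 | 88.0 | 47.7 | - | - | - | - | - |
| InternVL 1.5 | 26B | 2187.8 | 66.0 | 80.7 | 69.8 | 76.0 | 49.3 | 88.3 | 53.5 | 90.9 | 83.8 | 72.5 | 80.6 | 724 |
| DeepSeek-VL | 1.7B | 1531.6 | 49.7 | 51.5 | 37.6 | 43.7 | 27.6 | 85.9 | 29.4 | - | - | - | - | - |
| Mini-Gemini | 2.2B | 1653.0 | - | - | - | - | - | - | 29.4 | - | - | - | - | - |
| Bunny-StableLM-2 | 2B | 1602.9 | - | - | - | 58.8 | - | 85.9 | - | - | - | - | - | - |
| MiniCPM-V-2 | 2.8B | 1808.6 | 55.8 | 62.9 | 48.0 | - | 36.1 | 86.3 | 38.7 | 71.9 | 55.6 | - | 74.1 | 605 |
| InternVL 2 | 2B | 1876.8 | 57.3 | 74.1 | 74.7 | 70.9 | 37.9 | 85.2 | 46.3 | 86.9 | 76.2 | 58.9 | 73.4 | 784 |
| Mini-Monkey (ours) | 2B | 1881.9 | 57.5 | 74.7 | 75.5 | 71.3 | 38.7 | 86.7 | 47.3 | 87.4 | 76.5 | 60.1 | 75.7 | 802 |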

Environment

conda create -n minimonkey python=3.10
conda activate minimonkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey/project/mini_monkey
pip install -r requirements.txt

Install flash-attn==2.3.6:

pip install flash-attn==2.3.6 --no-build-isolation

Alternatively, you can compile from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
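
After either install route, it can help to confirm which flash-attn version the environment actually picked up. The snippet below is a minimal sketch using only the Python standard library. The helper names are hypothetical, and it assumes the package is registered under the distribution name "flash-attn", the name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_FLASH_ATTN = "2.3.6"  # version pinned in the install step above


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_flash_attn(expected=EXPECTED_FLASH_ATTN):
    """Describe whether the installed flash-attn matches the expected version."""
    found = installed_version("flash-attn")  # assumed distribution name
    if found is None:
        return "flash-attn is not installed"
    if found != expected:
        return f"flash-attn {found} is installed, but {expected} is expected"
    return f"flash-attn {found} is installed"


if __name__ == "__main__":
    print(check_flash_attn())
```

Run it inside the activated minimonkey environment. A mismatch usually means a different environment is active or another package pulled in a newer build.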

Evaluate

We use the VLMEvalKit repository for model evaluation. Replace minimonkey.py in VLMEvalKit with this file and use the Mini-Monkey weights.

Inference

We provide an example of inference code here.

Train

Prepare Training Datasets

Inspired by InternVL 1.2, we adopted LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN as training data. Most of the data remains consistent with InternVL 1.2.

First, download the annotation files and place them in the playground/opensource/ folder.

Second, download all the images we used.

Then, organize the data as follows in playground/data:

playground/
├── opensource
│   ├── ai2d_train_12k.jsonl
│   ├── chartqa_train_18k.jsonl
│   ├── docvqa_train_10k.jsonl
│   ├── dvqa_train_200k.jsonl
│   ├── geoqa+.jsonl
│   ├── llava_instruct_150k_zh.jsonl
│   └── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   ├── abc_images
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── synthdog-en
│   │   └── images
│   └── geoqa+
│       └── images
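
Before launching training, the layout above can be checked with a short script. This is a sketch, not part of the repository: the function name missing_paths is hypothetical, and the file and folder lists are copied from the tree above, so they would need updating if the dataset mix changes.

```python
from pathlib import Path

# Annotation files expected under playground/opensource (from the tree above).
ANNOTATION_FILES = [
    "ai2d_train_12k.jsonl",
    "chartqa_train_18k.jsonl",
    "docvqa_train_10k.jsonl",
    "dvqa_train_200k.jsonl",
    "geoqa+.jsonl",
    "llava_instruct_150k_zh.jsonl",
    "synthdog_en.jsonl",
]

# Image folders expected under playground/data (from the tree above).
IMAGE_DIRS = [
    "ai2d/abc_images", "ai2d/images",
    "chartqa/test", "chartqa/train", "chartqa/val",
    "coco/train2017",
    "docvqa/test", "docvqa/train", "docvqa/val",
    "dvqa/images",
    "llava/llava_pretrain/images",
    "synthdog-en/images",
    "geoqa+/images",
]


def missing_paths(root="playground"):
    """Return the expected files and folders that do not exist under `root`."""
    root = Path(root)
    expected = [root / "opensource" / name for name in ANNOTATION_FILES]
    expected += [root / "data" / folder for folder in IMAGE_DIRS]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing entries:")
        for entry in missing:
            print("  " + entry)
    else:
        print("Dataset layout looks complete.")
```

The check only confirms that paths exist. It does not verify that the annotation files or image folders are complete.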

Download the pretrained model from InternVL2-2B.

Execute the training code:

sh shell/minimonkey/minimonkey_finetune_full.sh

Citing Mini-Monkey

If you wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{huang2024mini,
  title={Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models},
  author={Huang, Mingxin and Liu, Yuliang and Liang, Dingkang and Jin, Lianwen and Bai, Xiang},
  journal={arXiv preprint arXiv:2408.02034},
  year={2024}
}

Copyright

We welcome suggestions to help us improve Mini-Monkey. For any queries, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please feel free to share it with us by email or by opening an issue.