- (2025.01.03) Add demo.
This repo contains the code for our paper, *MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation*.
Abstract: Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment constraint during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, combined with the mask generators from previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.
See installation instructions.
See Preparing Datasets for MaskCLIP++.
The pre-trained CLIP models can be downloaded automatically from Hugging Face.
The mask generators we currently support are listed below. If an entry is given in the path
column, you need to manually download the corresponding model weights to that relative path.
name | weights | path |
---|---|---|
Mask2Former (Swin-T) | url | output/ckpts/mask2former/coco/pan/maskformer2_swin_tiny_bs16_50ep_final_9fd0ae.pkl |
Mask2Former (Swin-L) | url | output/ckpts/mask2former/coco/pan/maskformer2_swin_large_IN21k_384_bs16_100ep_final_f07440.pkl |
FC-CLIP (ConvNeXt-B) | url(*) | output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-base.pth |
FC-CLIP (ConvNeXt-L) | url | output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-large.pth |
MAFTP-B | url | output/ckpts/maftp/maftp_b.pth |
MAFTP-L | url | output/ckpts/maftp/maftp_l.pth |
Except for the URL marked with an asterisk (*), all other URLs point to the original repositories.
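For example, a checkpoint listed above can be downloaded manually into its expected relative path as follows. The `<url>` is a placeholder for the link in the weights column; the MAFTP-B entry is used here only as an illustration:

```bash
# Create the expected directory and fetch the checkpoint into it.
# Replace <url> with the corresponding link from the weights column above.
mkdir -p output/ckpts/maftp
wget -O output/ckpts/maftp/maftp_b.pth <url>
```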
Using the mask generators from MAFTP, evaluated on 5 datasets (all numbers are mIoU).
config | ckpt | A-847 | PC-459 | A-150 | PC-59 | PAS-20 |
---|---|---|---|---|---|---|
clip-convnext-base | url | 14.5 | 18.7 | 35.4 | 59.1 | 95.8 |
eva-clip-vit-l-14-336 | url | 16.8 | 23.9 | 38.2 | 62.5 | 96.8 |
Using the mask generators from FC-CLIP, evaluated on ADE20K.
config | ckpt | mIoU | PQ | AP |
---|---|---|---|---|
clip-rn50x16 | url | 29.3 | 21.8 | 11.1 |
clip-convnext-base | url | 35.1 | 24.5 | 13.6 |
clip-convnext-large | url | 35.6 | 26.5 | 16.7 |
clip-convnext-xxlarge | url | 36.4 | 27.1 | 16.6 |
eva-clip-vit-b-16 | url | 33.8 | 24.4 | 13.2 |
eva-clip-vit-l-14-336 | url | 36.6 | 27.3 | 17.0 |
eva-clip-vit-g-14-plus | url | 36.8 | 27.7 | 17.1 |
source eval_all.sh
eval_ade150 $config $ckpt $ngpu $tag
# $ngpu is an integer representing the number of GPUs in use.
# $tag is the name of a run.
# Other options include: eval_ade847, eval_ctx459, eval_ctx59, eval_pc20
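As a concrete example, evaluating on ADE20K-150 with 2 GPUs might look like the sketch below; the config and checkpoint paths are illustrative placeholders, not files guaranteed to exist under these exact names:

```bash
# Illustrative invocation; substitute your own config and downloaded checkpoint.
source eval_all.sh
eval_ade150 configs/maskclippp/eval_convnext_base.yaml \
    output/ckpts/maskclippp/convnext_base.pth \
    2 eval_a150_convnext_base
```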
For base- and large-sized CLIP models, fine-tuning takes about 2-4 hours on 2x NVIDIA RTX 3090 (24 GB) GPUs.
python train_maskclippp.py \
--config-file $config \
--num-gpus $ngpu \
--dist-url "auto" \
--tag $tag \
WANDB.ENABLED True
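The trailing `WANDB.ENABLED True` appears to follow the detectron2-style pattern of appending `KEY VALUE` pairs to override config options on the command line. A concrete run might look like the sketch below; the config path and tag are illustrative placeholders:

```bash
# Illustrative example; replace the config path and tag with your own.
# Trailing KEY VALUE pairs override config options, e.g. to disable W&B logging:
python train_maskclippp.py \
    --config-file configs/maskclippp/maskclippp_convnext_base.yaml \
    --num-gpus 2 \
    --dist-url "auto" \
    --tag maskclippp_convnext_base_run1 \
    WANDB.ENABLED False
```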
@article{zeng2024maskclip++,
title={MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation},
author={Zeng, Quan-Sheng and Li, Yunheng and Zhou, Daquan and Li, Guanbin and Hou, Qibin and Cheng, Ming-Ming},
journal={arXiv preprint arXiv:2412.11464},
year={2024}
}
Thanks to the following open source code and models: