- (2025.01.03) Add demo.
This repo contains the code for our paper, *MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation*.
Abstract: Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment constraint during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, combined with the mask generators from previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.
See installation instructions.
See Preparing Datasets for MaskCLIP++.
The pre-trained CLIP models can be downloaded automatically from Hugging Face.
The mask generators we currently support are listed below. If an entry is given in the path
column, you need to manually download the corresponding model weights to that relative path.
name | weights | path |
---|---|---|
Mask2Former (Swin-T) | url | output/ckpts/mask2former/coco/pan/maskformer2_swin_tiny_bs16_50ep_final_9fd0ae.pkl |
Mask2Former (Swin-L) | url | output/ckpts/mask2former/coco/pan/maskformer2_swin_large_IN21k_384_bs16_100ep_final_f07440.pkl |
FC-CLIP (ConvNeXt-B) | url(*) | output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-base.pth |
FC-CLIP (ConvNeXt-L) | url | output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-large.pth |
MAFTP-B | url | output/ckpts/maftp/maftp_b.pth |
MAFTP-L | url | output/ckpts/maftp/maftp_l.pth |
Except for the URL marked with an asterisk (*), all other URLs point to the original repositories.
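For example, a checkpoint listed above can be downloaded manually into its expected relative path as follows. The `<url>` is a placeholder for the link in the weights column; the MAFTP-B entry is used here only as an illustration:

```bash
# Create the expected directory and fetch the checkpoint into it.
# Replace <url> with the corresponding link from the weights column above.
mkdir -p output/ckpts/maftp
wget -O output/ckpts/maftp/maftp_b.pth <url>
```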
Using the mask generators from MAFTP, evaluated on 5 datasets (all numbers are mIoU).
config | ckpt | A-847 | PC-459 | A-150 | PC-59 | PAS-20 |
---|---|---|---|---|---|---|
clip-convnext-base | url | 14.5 | 18.7 | 35.4 | 59.1 | 95.8 |
eva-clip-vit-l-14-336 | url | 16.8 | 23.9 | 38.2 | 62.5 | 96.8 |
Using the mask generators from FC-CLIP, evaluated on ADE20K.
config | ckpt | mIoU | PQ | AP |
---|---|---|---|---|
clip-rn50x16 | url | 29.3 | 21.8 | 11.1 |
clip-convnext-base | url | 35.1 | 24.5 | 13.6 |
clip-convnext-large | url | 35.6 | 26.5 | 16.7 |
clip-convnext-xxlarge | url | 36.4 | 27.1 | 16.6 |
eva-clip-vit-b-16 | url | 33.8 | 24.4 | 13.2 |
eva-clip-vit-l-14-336 | url | 36.6 | 27.3 | 17.0 |
eva-clip-vit-g-14-plus | url | 36.8 | 27.7 | 17.1 |
source eval_all.sh
eval_ade150 $config $ckpt $ngpu $tag
# $ngpu is an integer representing the number of GPUs in use.
# $tag is the name of a run.
# Other options include: eval_ade847, eval_ctx459, eval_ctx59, eval_pc20
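As a concrete example, evaluating on ADE20K-150 with 2 GPUs might look like the sketch below; the config and checkpoint paths are illustrative placeholders, not files guaranteed to exist under these exact names:

```bash
# Illustrative invocation; substitute your own config and downloaded checkpoint.
source eval_all.sh
eval_ade150 configs/maskclippp/eval_convnext_base.yaml \
    output/ckpts/maskclippp/convnext_base.pth \
    2 eval_a150_convnext_base
```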
For base- and large-sized CLIP models, fine-tuning takes about 2-4 hours on 2x NVIDIA RTX 3090 (24 GB) GPUs.
python train_maskclippp.py \
--config-file $config \
--num-gpus $ngpu \
--dist-url "auto" \
--tag $tag \
WANDB.ENABLED True
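The trailing `WANDB.ENABLED True` appears to follow the detectron2-style pattern of appending `KEY VALUE` pairs to override config options on the command line. A concrete run might look like the sketch below; the config path and tag are illustrative placeholders:

```bash
# Illustrative example; replace the config path and tag with your own.
# Trailing KEY VALUE pairs override config options, e.g. to disable W&B logging:
python train_maskclippp.py \
    --config-file configs/maskclippp/maskclippp_convnext_base.yaml \
    --num-gpus 2 \
    --dist-url "auto" \
    --tag maskclippp_convnext_base_run1 \
    WANDB.ENABLED False
```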
@article{zeng2024maskclip++,
title={MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation},
author={Zeng, Quan-Sheng and Li, Yunheng and Zhou, Daquan and Li, Guanbin and Hou, Qibin and Cheng, Ming-Ming},
journal={arXiv preprint arXiv:2412.11464},
year={2024}
}
Thanks to the following open source code and models: