
MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation

News

  • (2025.01.03) Add demo.

Introduction

This repo contains the code for our paper.

Abstract: Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on low-quality generated masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment constraint during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, combined with the mask generators from previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.

Figure: Simplified framework for MaskCLIP++.

Installation

See installation instructions.

Preparations

Datasets

See Preparing Datasets for MaskCLIP++.

Pretrained CLIP models

The pre-trained CLIP models are downloaded automatically from Hugging Face.
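
If you prefer to pre-fetch the weights (e.g., on a machine without internet access at training time), a minimal sketch using the `huggingface-cli` tool from `huggingface_hub` is shown below. The repo id is only an illustration; the actual model id is specified by the config you run.

```bash
# Optional: pre-fetch a CLIP checkpoint into the local Hugging Face cache.
# The repo id below is an example; use the model id named in your config.
pip install -U huggingface_hub
huggingface-cli download laion/CLIP-convnext_base_w-laion2B-s13B-b82K
```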

Mask generators

The supported mask generators are listed below. If a path is given for a model, you must download its weights manually and place them at that relative path (a download sketch follows the table).

| Name | Weights | Path |
|:--|:--|:--|
| Mask2Former (Swin-T) | url | `output/ckpts/mask2former/coco/pan/maskformer2_swin_tiny_bs16_50ep_final_9fd0ae.pkl` |
| Mask2Former (Swin-L) | url | `output/ckpts/mask2former/coco/pan/maskformer2_swin_large_IN21k_384_bs16_100ep_final_f07440.pkl` |
| FC-CLIP (ConvNeXt-B) | url(*) | `output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-base.pth` |
| FC-CLIP (ConvNeXt-L) | url | `output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-large.pth` |
| MAFTP-B | url | `output/ckpts/maftp/maftp_b.pth` |
| MAFTP-L | url | `output/ckpts/maftp/maftp_l.pth` |

Except for the URL marked with an asterisk (*), all URLs point to the models' original repositories.
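
As a minimal sketch of the manual download step (the URL variable is a placeholder; substitute the link from the weights column):

```bash
# Place a checkpoint at its expected relative path.
# $MAFTP_B_URL is a placeholder for the corresponding url in the table above.
mkdir -p output/ckpts/maftp
wget -O output/ckpts/maftp/maftp_b.pth "$MAFTP_B_URL"
```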

MaskCLIP++ models

(i) Fine-tuned on COCO-Stuff

These models use the mask generators from MAFTP and are evaluated on 5 datasets (mIoU).

| Config | Ckpt | A-847 | PC-459 | A-150 | PC-59 | PAS-20 |
|:--|:--|--:|--:|--:|--:|--:|
| clip-convnext-base | url | 14.5 | 18.7 | 35.4 | 59.1 | 95.8 |
| eva-clip-vit-l-14-336 | url | 16.8 | 23.9 | 38.2 | 62.5 | 96.8 |

(ii) Fine-tuned on COCO-Panoptic

These models use the mask generators from FC-CLIP and are evaluated on ADE20K.

| Config | Ckpt | mIoU | PQ | AP |
|:--|:--|--:|--:|--:|
| clip-rn50x16 | url | 29.3 | 21.8 | 11.1 |
| clip-convnext-base | url | 35.1 | 24.5 | 13.6 |
| clip-convnext-large | url | 35.6 | 26.5 | 16.7 |
| clip-convnext-xxlarge | url | 36.4 | 27.1 | 16.6 |
| eva-clip-vit-b-16 | url | 33.8 | 24.4 | 13.2 |
| eva-clip-vit-l-14-336 | url | 36.6 | 27.3 | 17.0 |
| eva-clip-vit-g-14-plus | url | 36.8 | 27.7 | 17.1 |

Usage

Demo

See the demo of MaskCLIP++.

Evaluation

```bash
source eval_all.sh
eval_ade150 $config $ckpt $ngpu $tag
# $ngpu is an integer representing the number of GPUs in use.
# $tag is the name of a run.
# Other options include: eval_ade847, eval_ctx459, eval_ctx59, eval_pc20
```
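
For example, a concrete run could look like the following (the config and checkpoint paths are hypothetical; use a pair from the tables above):

```bash
# Hypothetical paths; evaluate on A-150 (ADE20K) with 2 GPUs.
source eval_all.sh
eval_ade150 configs/example.yaml output/ckpts/example.pth 2 my_eval_run
```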

Fine-tuning

For base/large-sized CLIP models, fine-tuning takes about 2-4 hours on 2x NVIDIA RTX 3090 (24 GB) GPUs.

```bash
python train_maskclippp.py \
    --config-file $config \
    --num-gpus $ngpu \
    --dist-url "auto" \
    --tag $tag \
    WANDB.ENABLED True
```
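
The trailing `WANDB.ENABLED True` looks like a detectron2-style opts override appended after the flags, so other config keys can presumably be overridden the same way. A hypothetical variant that disables Weights & Biases logging:

```bash
# Hypothetical config path; WANDB.ENABLED False assumes the same opts-style
# override mechanism used by the command above.
python train_maskclippp.py \
    --config-file configs/example.yaml \
    --num-gpus 2 \
    --dist-url "auto" \
    --tag my_finetune_run \
    WANDB.ENABLED False
```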

Citing MaskCLIP++

```bibtex
@article{zeng2024maskclip++,
  title={MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation},
  author={Zeng, Quan-Sheng and Li, Yunheng and Zhou, Daquan and Li, Guanbin and Hou, Qibin and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2412.11464},
  year={2024}
}
```

Acknowledgement

Thanks to the following open-source code and models: Mask2Former, FC-CLIP, and MAFTP.
