This repository contains a PyTorch implementation of the paper Weakly Supervised 3D Open-vocabulary Segmentation. Our method segments 3D scenes using open-vocabulary text descriptions without requiring any segmentation annotations.
Tested on Ubuntu 20.04 + PyTorch 1.12.1
Install environment:
conda create -n 3dovs python=3.9
conda activate 3dovs
pip install torch torchvision
pip install ftfy regex tqdm scikit-image opencv-python configargparse lpips imageio-ffmpeg kornia tensorboard
pip install git+https://github.com/openai/CLIP.git
Please download the datasets from this link and put the datasets in ./data. You can put the datasets elsewhere if you modify the corresponding paths in the configs. The datasets are organized as
/data
| /scene0
| |--/images
| | |--00.png
| | |--01.png
| | ...
| |--/segmentations
| | |--classes.txt
| | |--/test_view0
| | | |--class0.png
| | | ...
| | |--/test_view1
| | | |--class0.png
| | | ...
| | ...
| |--poses_bounds.npy
| /scene1
| ...
where images contains the RGB images, segmentations contains the segmentation annotations for the test views, segmentations/classes.txt stores the classes' text descriptions, and poses_bounds.npy contains the camera poses generated by Colmap.
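Before training, it can help to confirm that a scene folder matches the layout above. The following is a minimal sketch, not part of this repository: the helper names missing_scene_entries and read_class_prompts are hypothetical, and the sketch assumes that classes.txt holds one text description per line.

```python
from pathlib import Path


def missing_scene_entries(scene_dir):
    """Return the expected entries that are absent from one scene folder."""
    scene = Path(scene_dir)
    expected = [
        scene / "images",
        scene / "segmentations",
        scene / "segmentations" / "classes.txt",
        scene / "poses_bounds.npy",
    ]
    return [p.relative_to(scene).as_posix() for p in expected if not p.exists()]


def read_class_prompts(scene_dir):
    """Read the class descriptions (assumed: one per non-empty line)."""
    path = Path(scene_dir) / "segmentations" / "classes.txt"
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]
```

For example, missing_scene_entries("data/scene0") returns an empty list when the scene folder is complete.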
We provide the checkpoints for the scenes in this link. You can then test the segmentation by:
bash scripts/test_segmentation.sh [CKPT_PATH] [CONFIG_FILE] [GPU_ID]
The config files are stored in configs, and each file is named configs/$scene_name.txt. The results will be saved in the checkpoint's path. More details can be found in scripts/test_segmentation.sh.
Training requires a hierarchy of CLIP features extracted from image patches. Replace $scene_name with the name of the scene you want to process, then extract the CLIP features by:
bash scripts/extract_clip_features.sh data/$scene_name/images clip_features/$scene_name [GPU_ID]
The extracted features will be saved in clip_features/$scene_name.
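To illustrate what a hierarchy of image patches can look like, the sketch below enumerates square sliding-window crops at several scales; each crop would then be encoded by the CLIP image encoder. This is an assumption-laden illustration, not the logic of scripts/extract_clip_features.sh: the function name patch_boxes, the patch fractions, and the stride ratio are all hypothetical.

```python
def patch_boxes(height, width, patch_fractions=(0.5, 0.25), stride_ratio=0.5):
    """Return {fraction: [(top, left, bottom, right), ...]} of square crops.

    Each fraction sets the crop side as a share of the shorter image side
    (hypothetical choice); crops slide with a stride of stride_ratio * side.
    """
    boxes = {}
    for frac in patch_fractions:
        size = max(1, int(min(height, width) * frac))
        stride = max(1, int(size * stride_ratio))
        tops = range(0, height - size + 1, stride)
        lefts = range(0, width - size + 1, stride)
        boxes[frac] = [
            (top, left, top + size, left + size) for top in tops for left in lefts
        ]
    return boxes
```

Smaller fractions produce more, finer patches, which is one reason the extracted features grow large.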
This step reconstructs a TensoRF for each scene. Please modify datadir and expname in configs/resonstruction.txt to specify the dataset path and the experiment name. By default, datadir is set to data/$scene_name and expname to $scene_name. You can then train the original TensoRF by:
bash scripts/reconstruction.sh [GPU_ID]
The reconstructed TensoRF will be saved in log/$scene_name.
We provide the training configs for our datasets under configs as $scene_name.txt. You can train the segmentation by:
bash scripts/segmentation.sh [CONFIG_FILE] [GPU_ID]
The trained model will be saved in log_seg/$scene_name. The training takes about 1h30min and consumes about 14GB GPU memory.
The high memory consumption is because the CLIP features are very large (512 channels). You can load fewer views' CLIP features by setting clip_input to 0.5 or a smaller value in the config file. Normally 0.5 is enough for good performance.
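A rough back-of-the-envelope estimate shows why loading fewer views helps. The sketch below is only illustrative: the function name clip_feature_gib is hypothetical, the feature-map size and view count are made-up inputs, 4-byte floats are assumed, and treating clip_input as the fraction of views loaded (rounded down) is an assumption about how the option behaves.

```python
def clip_feature_gib(num_views, height, width, channels=512,
                     bytes_per_value=4, clip_input=1.0):
    """Estimate memory (GiB) of dense CLIP feature maps for the loaded views."""
    # Assumption: clip_input is the fraction of views whose features are loaded.
    views_loaded = int(num_views * clip_input)
    total_bytes = views_loaded * height * width * channels * bytes_per_value
    return total_bytes / 2**30


# Hypothetical scene: 32 views with 64x64 feature maps.
full = clip_feature_gib(32, 64, 64)
half = clip_feature_gib(32, 64, 64, clip_input=0.5)
```

Under these assumptions the footprint scales linearly with clip_input, so halving it halves the memory taken by the CLIP features.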
To test whether your prompts are good, you can set test_prompt to a view number in the config file. You will then see the relevancy maps of each class for this view in clip_features/clip_relevancy_maps. Each relevancy map is named scale_class.png. You can then check whether the relevancy maps are good for each class. If not, you can modify the prompts in segmentations/classes.txt and test again. In our experiments, we find that specific descriptions that include the object's texture and color work better.
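As a rough picture of what a relevancy map measures, the sketch below scores every pixel feature against every class text embedding by cosine similarity and applies a softmax over the classes. It is a generic illustration, not the code this repository uses to write the relevancy maps: the function name relevancy_maps and the temperature value are assumptions, and the inputs stand in for CLIP image and text features.

```python
import numpy as np


def relevancy_maps(pixel_features, text_features, temperature=0.01):
    """Return per-class relevancy maps of shape (K, H, W).

    pixel_features: (H, W, C) image-side features.
    text_features:  (K, C) one embedding per class prompt.
    """
    f = pixel_features / np.linalg.norm(pixel_features, axis=-1, keepdims=True)
    t = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = f @ t.T / temperature               # cosine similarities, (H, W, K)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)     # softmax over classes
    return np.moveaxis(probs, -1, 0)
```

A good prompt yields a map that is high on its object and low elsewhere; a prompt whose map lights up on the wrong regions is a candidate for rewording.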
For custom scenes, you can generate the camera poses using Colmap following the recover camera poses section from this link.
If your custom data does not have annotated segmentation maps, you can set has_segmentation_maps to 0 in the config file.
Bad segmentation results may be due to poor geometry reconstruction, erroneous camera poses, or inaccurate text prompts. If none of these is the main cause, you can try adjusting dino_neg_weight in the config file.
Usually, if the segmentation results do not align well with the object boundaries, you can set dino_neg_weight to a value larger than 0.2, such as 0.22. If the segmentation is making mistakes, you can set dino_neg_weight to a value smaller than 0.2, such as 0.18. Since dino_neg_weight encourages the model to assign different labels when the DINO features are distant, a higher value encourages sharper boundaries but also makes the model more unstable.
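The effect of dino_neg_weight can be pictured with a simplified pairwise loss. The sketch below is not the loss implemented in this repository: the function name dino_pair_loss, the similarity threshold, and the exact form of the two terms are assumptions chosen only to show how a negative-pair weight trades off against a positive-pair term.

```python
import numpy as np


def dino_pair_loss(dino_feats, seg_probs, threshold=0.5, neg_weight=0.2):
    """Simplified pairwise loss over N sampled pixels.

    dino_feats: (N, D) DINO features; seg_probs: (N, K) class probabilities.
    Pairs with similar DINO features are pulled toward the same label;
    pairs with distant DINO features are pushed apart, scaled by neg_weight.
    """
    d = dino_feats / np.linalg.norm(dino_feats, axis=1, keepdims=True)
    dino_sim = d @ d.T                      # (N, N) cosine similarity
    s = seg_probs / np.linalg.norm(seg_probs, axis=1, keepdims=True)
    seg_sim = s @ s.T                       # label agreement, in [0, 1]
    pos = dino_sim > threshold              # hypothetical threshold
    neg = ~pos
    pos_term = ((1.0 - seg_sim) * pos).sum() / max(pos.sum(), 1)
    neg_term = (seg_sim * neg).sum() / max(neg.sum(), 1)
    return pos_term + neg_weight * neg_term
```

In this sketch, raising neg_weight increases the penalty for giving the same label to pixels with distant DINO features, which favors sharper boundaries but lets the negative term dominate more easily.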
- Currently we only support forward-facing scenes. The method can be extended to support unbounded 360 scenes using a coordinate transformation.
This repo is heavily based on TensoRF. We thank the authors for sharing their amazing work!
@article{liu2023weakly,
title={Weakly Supervised 3D Open-vocabulary Segmentation},
author={Liu, Kunhao and Zhan, Fangneng and Zhang, Jiahui and Xu, Muyu and Yu, Yingchen and Saddik, Abdulmotaleb El and Theobalt, Christian and Xing, Eric and Lu, Shijian},
journal={arXiv preprint arXiv:2305.14093},
year={2023}
}