We advise you to first create a virtual environment:

```bash
conda create -n tripletclip python=3.12
conda activate tripletclip
```
Then, install the required dependencies from the requirements.txt file:

```bash
pip install -r requirements.txt
```
To train the model with the configuration from the paper, run the following command:

```bash
python src/main.py \
    --model_name 'ViT-B-32' \
    --lr 0.00005 \
    --data_dir '/path/to/data/tar/files' \
    --epochs 30 \
    --train
```
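After training, the checkpoint can be used for zero-shot inference. The checkpoint format is not documented here, so the sketch below is a minimal example assuming the code builds on OpenCLIP (suggested by the `ViT-B-32` model name) and that saved checkpoints contain a `state_dict` key; the checkpoint and image paths are hypothetical.

```python
import torch
import open_clip
from PIL import Image

# Build the architecture used for training; no pretrained weights are loaded here.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Hypothetical checkpoint path; point this at wherever training saves weights.
ckpt = torch.load('checkpoints/tripletclip_vit_b_32.pt', map_location='cpu')
model.load_state_dict(ckpt.get('state_dict', ckpt))  # assumed checkpoint layout
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # hypothetical image
texts = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # similarity of the image to each caption
```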
The results below are from models trained on 1M image-text pairs of high-quality TripletCLIP data, after 10 epochs of training with a batch size of 1024 in bf16 precision.
| Model | SugarCrepe Add Att | SugarCrepe Add Obj | SugarCrepe Replace Att | SugarCrepe Replace Obj | SugarCrepe Replace Rel | SugarCrepe Swap Att | SugarCrepe Swap Obj | MSCOCO i2t R@5 | MSCOCO t2i R@5 | Flickr30k i2t R@5 | Flickr30k t2i R@5 | ImageNet-1k Acc@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 54.77 | 56.93 | 62.81 | 61.22 | 56.33 | 53.6 | 50.61 | 5.3 | 4.8 | 10.8 | 9.3 | 9.3 |
| NegCLIP | 64.88 | 57.17 | 70.55 | 65.92 | 68.136 | 56.756 | 57.142 | 3.5 | 3.7 | 9.3 | 7.2 | 6.52 |
| TripletCLIP | 67.196 | 64.646 | 75.127 | 73.184 | 70.55 | 60.06 | 60.408 | 13.04 | 13.23 | 24.09 | 26.7 | 22.82 |
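For reference, the retrieval columns report recall@5. Below is a minimal sketch of how i2t R@5 can be computed from precomputed, paired embeddings; `recall_at_k` is an illustrative helper, and it assumes one matching caption per image (MSCOCO-style evaluation with five captions per image needs a small extension).

```python
import torch

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int = 5) -> float:
    """Image-to-text recall@k, assuming row i of each tensor forms a matched pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = image_feats @ text_feats.T                    # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices                  # top-k text indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(-1)   # ground-truth index per row
    return (topk == targets).any(dim=-1).float().mean().item()
```

Swapping the two arguments gives the corresponding t2i R@5.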