An implementation of SimCLR with DistributedDataParallel (1 GPU : 1 process) in PyTorch.
This scales to a batch size of 4096 (suggested by the authors) using 64 GPUs, each with a batch size of 64, at a resolution of 224x224x3 in FP32 (see below for FP16 support).
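The batch-size arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming the global batch is split evenly across replicas; per_replica_batch_size is a hypothetical helper and not part of this repository.

```python
def per_replica_batch_size(global_batch_size: int, num_replicas: int) -> int:
    """Split a global batch evenly across replicas (hypothetical helper)."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas


if __name__ == "__main__":
    print(per_replica_batch_size(4096, 64))  # 64 images per GPU on 64 GPUs
    print(per_replica_batch_size(128, 2))    # 64 images per node on 2 nodes
```

Rejecting uneven splits is a design choice of this sketch; it keeps every replica on the same per-step workload.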
NOTE0: the single-replica command below will not produce SOTA results, but it is good for debugging. The authors use a batch size of 4096+ for SOTA.
NOTE1: Set up your GitHub SSH keys; an authentication error from the git clone most likely means they are missing.
> git clone --recursive git+ssh://git@github.com/jramapuram/SimCLR.git
# DATADIR is the location of imagenet or anything that works with imagefolder.
> ./docker/run.sh "python main.py --data-dir=$DATADIR \
--batch-size=64 \
--num-replicas=1 \
--epochs=100" 0 # add --debug-step to do a single minibatch
The bash script docker/run.sh pulls the appropriate docker container.
If you want to set up your own environment, use environment.yml (conda) in addition to requirements.txt (pip), or just take a look at the Dockerfile in docker/Dockerfile.
Set up your environment according to the slurm bash script (slurm/run.sh). Then:
> cd slurm && sbatch run.sh
- Start each replica worker pointing to the master using --distributed-master=.
- Set the total number of replicas appropriately using --num-replicas=.
- Set each node to have a unique --distributed-rank= ranging from [0, num_replicas).
- Ensure network connectivity between workers. You will get NCCL errors if there are resolution problems here.
- Profit.
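Before launching the workers, it can help to confirm that a child node can resolve and reach the master. The following is a minimal sketch using only the Python standard library; can_reach_master is a hypothetical helper, not part of this repository. It assumes the master process is already running and listening on the --distributed-port, and it only checks plain TCP reachability, so NCCL may still fail for other reasons (for example, a wrong network interface).

```python
import socket


def can_reach_master(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if host resolves and a TCP connection to host:port succeeds."""
    try:
        # create_connection performs name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections and timeouts.
        return False


if __name__ == "__main__":
    # Example values only: substitute your master's address and port.
    print(can_reach_master("127.0.0.1", 29301))
```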
For example, with a 2 node setup run the following on the master node:
# --batch-size=128 divides into 64 per node.
# --visdom-url and --visdom-port are optional; not providing them uses tensorboard.
# --num-replicas specifies the total available nodes, 2 in this example.
# --distributed-rank=0 marks this node as the master.
python main.py \
--epochs=100 \
--data-dir=<YOUR_DATA_DIR> \
--batch-size=128 \
--convert-to-sync-bn \
--visdom-url=http://MY_VISDOM_URL \
--visdom-port=8097 \
--num-replicas=2 \
--distributed-master=127.0.0.1 \
--distributed-port=29301 \
--distributed-rank=0 \
--uid=simclrv00_0
and the following on the child node:
export MASTER=<IP_ADDR_OF_MASTER_ABOVE>
# --batch-size=128 divides into 64 per node.
# --visdom-url and --visdom-port are optional; not providing them uses tensorboard.
# --num-replicas specifies the total available nodes, 2 in this example.
# --distributed-rank=1 identifies this child; increment it for extra nodes.
python main.py \
--epochs=100 \
--data-dir=<YOUR_DATA_DIR> \
--batch-size=128 \
--convert-to-sync-bn \
--visdom-url=http://MY_VISDOM_URL \
--visdom-port=8097 \
--num-replicas=2 \
--distributed-master=$MASTER \
--distributed-port=29301 \
--distributed-rank=1 \
--uid=simclrv00_0
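The master and child commands differ only in --distributed-master and --distributed-rank, so they can be generated rather than typed by hand. Below is a small sketch that builds the argument list for each node from the flags shown above. build_command is a hypothetical helper, and the IP address and data directory in the usage example are placeholders.

```python
import shlex
from typing import List


def build_command(rank: int, num_replicas: int, master_ip: str, data_dir: str,
                  batch_size: int = 128, port: int = 29301,
                  uid: str = "simclrv00_0") -> List[str]:
    """Build the main.py argument list for one node (hypothetical helper)."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must lie in [0, num_replicas)")
    # As in the example above, rank 0 (the master) points at itself.
    master = "127.0.0.1" if rank == 0 else master_ip
    return [
        "python", "main.py",
        "--epochs=100",
        f"--data-dir={data_dir}",
        f"--batch-size={batch_size}",
        "--convert-to-sync-bn",
        f"--num-replicas={num_replicas}",
        f"--distributed-master={master}",
        f"--distributed-port={port}",
        f"--distributed-rank={rank}",
        f"--uid={uid}",
    ]


if __name__ == "__main__":
    # Placeholder master IP and dataset path; replace with your own.
    for rank in range(2):
        cmd = build_command(rank, 2, "10.0.0.5", "/data/imagenet")
        print(" ".join(shlex.quote(part) for part in cmd))
```

The visdom flags are left out here, so runs launched this way would log to tensorboard.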
Grab ImageNet, do standard pre-processing and use --data-dir=${DATA_DIR}.
Note: this SimCLR implementation expects two PyTorch imagefolder locations, train and test, as opposed to the val directory produced by the standard pre-processing.
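A quick way to check the expected layout, and to expose an existing val directory under the name test, is sketched below. This assumes your preprocessing produced train and val folders inside the data directory; missing_splits and link_val_as_test are hypothetical helpers and not part of this repository.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def missing_splits(data_dir: str,
                   required: Sequence[str] = ("train", "test")) -> List[str]:
    """Return the required sub-directories that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_dir()]


def link_val_as_test(data_dir: str) -> None:
    """Create a relative symlink test -> val if val exists and test does not."""
    root = Path(data_dir)
    test_dir = root / "test"
    if (root / "val").is_dir() and not test_dir.exists():
        test_dir.symlink_to("val", target_is_directory=True)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real dataset.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "train").mkdir()
        (Path(tmp) / "val").mkdir()
        print(missing_splits(tmp))  # ['test']
        link_val_as_test(tmp)
        print(missing_splits(tmp))  # []
```

A symlink avoids copying the validation images; renaming val to test works equally well if nothing else depends on the old name.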
If you have GPUs that work well with FP16, you can try the --half flag.
This allows faster training with larger batch sizes (~95 with 12 GB of GPU memory).
If training doesn't work well, try changing the AMP optimization level.
Try increasing --workers-per-replica for dataloading or placing your dataset on a drive with higher IOPS.
Optionally, you can also try the NVIDIA DALI image loading backend by specifying --task=dali_multi_augment_image_folder. However, the DALI backend is missing the grayscale and gaussian blur augmentations, so model performance might be degraded.
This implementation supports tensorboard and visdom.
Omitting the --visdom-url and --visdom-port args defaults to tensorboard (which stores its logs in ./runs).
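The fallback rule can be illustrated with a short argparse sketch. This is not the repository's actual argument handling: pick_logging_backend is a hypothetical function, and treating a partially specified visdom configuration as tensorboard is an assumption of this example.

```python
import argparse
from typing import List


def pick_logging_backend(argv: List[str]) -> str:
    """Return 'visdom' only when both visdom flags are given, else 'tensorboard'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--visdom-url", type=str, default=None)
    parser.add_argument("--visdom-port", type=int, default=None)
    # Ignore every other flag; only the two visdom options matter here.
    args, _unknown = parser.parse_known_args(argv)
    if args.visdom_url is not None and args.visdom_port is not None:
        return "visdom"
    return "tensorboard"


if __name__ == "__main__":
    print(pick_logging_backend(["--epochs=100"]))  # tensorboard
    print(pick_logging_backend(
        ["--visdom-url=http://MY_VISDOM_URL", "--visdom-port=8097"]))  # visdom
```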
Cite the original authors for doing some great work:
@article{chen2020simple,
title={A Simple Framework for Contrastive Learning of Visual Representations},
author={Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
journal={arXiv preprint arXiv:2002.05709},
year={2020}
}
Like this replication? Buy me a beer.