Code to reproduce the experiments reported in this paper:
Jianyu Wang, Hao Liang, Gauri Joshi, "Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD," ICASSP 2020. (arXiv:2002.09539)
This repo contains the implementations of the following algorithms:
- Local SGD (Stich, ICLR 2019; Yu et al., AAAI 2019; Wang and Joshi, 2018)
- Overlap-Local-SGD (proposed in this paper)
- Elastic Averaging SGD (EASGD) (Zhang et al., NeurIPS 2015)
- CoCoD-SGD (Shen et al., IJCAI 2019)
- Blockwise Model-update Filtering (BMUF) (Chen and Huo, ICASSP 2016), also equivalent to SlowMo-Local SGD
Please cite this paper if you use this code for your research/projects.
The code runs on Python 3.5 with PyTorch 1.0.0 and torchvision 0.2.1. Non-blocking communication is implemented using the Python `threading` package.
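As a rough illustration of the idea (a minimal sketch, not the repo's implementation; the helper `average_in_background` is hypothetical), a background thread can average a snapshot of the model across workers while the main thread keeps computing:

```python
import copy
import threading

import torch.distributed as dist


def average_in_background(model):
    # Hypothetical helper: average a frozen snapshot of the model parameters
    # across workers in a background thread, while the main thread keeps
    # training on `model`. Assumes torch.distributed is already initialized.
    snapshot = copy.deepcopy(model)

    def _communicate():
        world_size = dist.get_world_size()
        for p in snapshot.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)  # sum across workers
            p.data.div_(world_size)                        # turn the sum into an average

    thread = threading.Thread(target=_communicate)
    thread.start()
    # The caller joins the thread later and mixes `snapshot` back into `model`.
    return snapshot, thread
```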
We implement all of the above-mentioned algorithms as subclasses of `torch.optim.Optimizer`. A typical usage is shown below:
```python
import distoptim

# Before training: define the optimizer.
# One can use: 1) LocalSGD (including BMUF); 2) OverlapLocalSGD;
#              3) EASGD; 4) CoCoDSGD
# tau is the number of local updates, i.e., the communication period
optimizer = distoptim.SELECTED_OPTIMIZER(tau)
...  # define model, criterion, logging, etc.

# Start training
for batch_id, (data, label) in enumerate(data_loader):
    # same as serial training
    output = model(data)             # forward
    loss = criterion(output, label)
    loss.backward()                  # backward
    optimizer.step()                 # gradient step
    optimizer.zero_grad()

    # additional line to average local models across workers;
    # communication happens after every tau iterations
    # (the optimizer has its own iteration counter inside)
    optimizer.average()
```
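For intuition, here is a minimal sketch of what a Local SGD-style `average()` step typically does, i.e., synchronize the model only every `tau` iterations (the class and attribute names below are hypothetical and do not reflect the repo's actual optimizer internals):

```python
import torch.distributed as dist


class ToyLocalAveraging:
    """Hypothetical illustration of periodic model averaging (Local SGD style)."""

    def __init__(self, model, tau):
        self.model = model
        self.tau = tau          # communication period (number of local updates)
        self.iteration = 0      # internal iteration counter

    def average(self):
        self.iteration += 1
        if self.iteration % self.tau != 0:
            return              # no communication between synchronization rounds
        world_size = dist.get_world_size()
        for p in self.model.parameters():
            # replace each local parameter with its average across workers
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data.div_(world_size)
```

Overlap-Local-SGD differs in that the communication is launched in a background thread (as in the earlier sketch) so that it overlaps with subsequent local updates, which is how the communication delay is hidden.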
In addition, one needs to initialize the process group as described in the PyTorch distributed documentation. In our private cluster, each machine has one GPU.
```python
# backend: gloo or nccl
# rank: 0, 1, 2, 3, ...
# size: total number of workers
# h0 is the hostname of worker 0; you need to change it for your cluster
torch.distributed.init_process_group(backend=args.backend,
                                     init_method='tcp://h0:22000',
                                     rank=args.rank,
                                     world_size=args.size)
```
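For completeness, here is a minimal sketch of how the `args` above could be populated on each worker (the argument parser below is hypothetical and is not the repo's actual training script):

```python
import argparse

import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--backend', type=str, default='nccl', help='gloo or nccl')
parser.add_argument('--rank', type=int, required=True, help='rank of this worker: 0, 1, 2, ...')
parser.add_argument('--size', type=int, required=True, help='total number of workers')
args = parser.parse_args()

# h0 is the hostname of worker 0; change it to match your cluster
dist.init_process_group(backend=args.backend,
                        init_method='tcp://h0:22000',
                        rank=args.rank,
                        world_size=args.size)
```

Each worker is then launched with its own rank, e.g., `--rank 0` on worker 0 through `--rank 3` on worker 3 when `--size 4`.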
The BibTeX entry for the paper is:

```
@article{wang2020overlap,
  title={Overlap Local-{SGD}: An Algorithmic Approach to Hide Communication Delays in Distributed {SGD}},
  author={Wang, Jianyu and Liang, Hao and Joshi, Gauri},
  journal={arXiv preprint arXiv:2002.09539},
  year={2020}
}
```