
Overlap-Local-SGD

Code to reproduce the experiments reported in this paper:

Jianyu Wang, Hao Liang, Gauri Joshi, "Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD," ICASSP 2020. (arXiv:2002.09539)

This repo contains implementations of the following algorithms:

- LocalSGD (including BMUF)
- OverlapLocalSGD
- EASGD
- CoCoDSGD

Please cite the paper above if you use this code for your research/projects.

Dependencies and Setup

The code runs on Python 3.5 with PyTorch 1.0.0 and torchvision 0.2.1. The non-blocking communication is implemented using the Python threading package.
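As a rough illustration of the overlap idea, here is a minimal sketch (not the repo's actual implementation; it assumes an already-initialized process group) of averaging a snapshot of parameters on a background thread while the main thread keeps computing:

import threading
import torch.distributed as dist

def average_in_background(tensors):
    # Launch a thread that averages `tensors` across all workers.
    def _communicate():
        for t in tensors:
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            t.div_(dist.get_world_size())
    thread = threading.Thread(target=_communicate)
    thread.start()  # communication now overlaps with local computation
    return thread

# Hypothetical usage: snapshot the parameters, average the snapshot in the
# background, keep taking local SGD steps on the live parameters, then
# join() before mixing the averaged snapshot back into the model.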

Training examples

We implement all of the above-mentioned algorithms as subclasses of torch.optim.Optimizer. Typical usage is shown below:

import distoptim

# Before training
# define the optimizer
# One can use: 1) LocalSGD (including BMUF); 2) OverlapLocalSGD; 
#              3) EASGD; 4) CoCoDSGD
# tau is the number of local updates / communication period
optimizer = distoptim.SELECTED_OPTIMIZER(tau)
... # define model, criterion, logging, etc.

# Start training
for batch_id, (data, label) in enumerate(data_loader):
	# same as serial training
	output = model(data) # forward
	loss = criterion(output, label)
	loss.backward() # backward
	optimizer.step() # gradient step
	optimizer.zero_grad()

	# additional line to average local models at workers
	# communication happens after every tau iterations
	# optimizer has its own iteration counter inside
	optimizer.average()
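To make the counter-and-period behavior concrete, here is a minimal sketch of a LocalSGD-style optimizer (an illustration under our own naming, not the repo's distoptim code): it counts calls to average() and all-reduces the parameters once every tau local updates.

import torch.distributed as dist
from torch.optim import SGD

class LocalSGDSketch(SGD):
    def __init__(self, params, lr, tau):
        super().__init__(params, lr=lr)
        self.tau = tau         # communication period (local updates per round)
        self.local_step = 0    # internal iteration counter

    def average(self):
        self.local_step += 1
        if self.local_step % self.tau != 0:
            return  # still in the local-update phase: no communication
        world_size = float(dist.get_world_size())
        for group in self.param_groups:
            for p in group['params']:
                # Average the model parameters across all workers.
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                p.data.div_(world_size)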

In addition, one needs to initialize the process group as described in the PyTorch distributed documentation. In our private cluster, each machine has one GPU.

# backend = gloo or nccl
# rank: 0,1,2,3,...
# size: number of workers
# h0 is the hostname of worker 0; change it to match your cluster
torch.distributed.init_process_group(backend=args.backend, 
                                     init_method='tcp://h0:22000', 
                                     rank=args.rank, 
                                     world_size=args.size)
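Each worker runs the same script with its own rank. A minimal sketch of the corresponding argument parsing (the flag names are illustrative and mirror args.backend, args.rank, and args.size above; they may not match the repo's exact CLI):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--backend', default='gloo', choices=['gloo', 'nccl'])
parser.add_argument('--rank', type=int, required=True)   # 0, 1, 2, ... unique per worker
parser.add_argument('--size', type=int, required=True)   # total number of workers
args = parser.parse_args()
# ... then call torch.distributed.init_process_group(...) as shown above.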

Citation

@article{wang2020overlap,
	title={Overlap Local-{SGD}: An Algorithmic Approach to Hide Communication Delays in Distributed {SGD}},
	author={Wang, Jianyu and Liang, Hao and Joshi, Gauri},
	journal={arXiv preprint arXiv:2002.09539},
	year={2020}
}
