Examples of distributed training on a runai server (using Horovod and DDP)
- Build the Horovod PyTorch docker image and push it to the image repository:
bash build_image_pytorch_hvd.sh
- Submit job to runai system:
runai submit-mpi horovod-pytorch-test \
--image warvito/dist-horovod:pytorch \
--always-pull-image \
--processes 2 \
--gpu 1 \
--project wds20
- Build the Horovod TensorFlow docker image and push it to the image repository:
bash build_image_tf_hvd.sh
- Submit job to runai system:
runai submit-mpi horovod-tf-test \
--image warvito/dist-horovod:tf \
--always-pull-image \
--processes 2 \
--gpu 1 \
--project wds20
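The two Horovod submissions above differ only in the job name and the image tag, so they can be generated from one template. The following is a minimal sketch, not part of this repository: the function name `build_submit_cmd` and its positional arguments are hypothetical, and it uses only the `runai submit-mpi` flags that already appear in this README. It prints the command rather than running it, so the result can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not shipped with these examples).
# Builds the `runai submit-mpi` command shown above for one framework tag.
# Only flags that already appear in this README are used.

build_submit_cmd() {
    local framework="$1"        # "pytorch" or "tf", matching the image tags above
    local processes="${2:-2}"   # value for --processes (default as in the README)
    local gpus="${3:-1}"        # value for --gpu (default as in the README)
    local project="${4:-wds20}" # value for --project (default as in the README)

    # Reject tags for which this README shows no image.
    case "$framework" in
        pytorch|tf) ;;
        *) echo "unknown framework: $framework" >&2; return 1 ;;
    esac

    # Emit the command on a single line; nothing is submitted here.
    printf 'runai submit-mpi horovod-%s-test --image warvito/dist-horovod:%s --always-pull-image --processes %s --gpu %s --project %s\n' \
        "$framework" "$framework" "$processes" "$gpus" "$project"
}

# Print both commands for review.
build_submit_cmd pytorch
build_submit_cmd tf
```

To submit for real, the printed line can be copied into a shell, or piped to `bash` once it has been checked. Passing extra arguments, for example `build_submit_cmd tf 4`, changes the `--processes` value while keeping the other defaults.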
- Build the docker image:
- Push to private repository:
- Submit job to runai system: