Init connect timeout when use torch.distributed.run #79388
Comments
Hi @kiukchung, do you mind taking a look? I was only able to repro the single-host case; I cannot run it successfully on AWS for the multi-host case. Maybe I am missing something here, and I can also learn how to set it up correctly.
This is most likely due to the internal hostname-matching method torchelastic calls when deciding whether the rendezvous host is the local machine. To get around this issue, you can use the fully qualified domain name of node0 as the host part of the rendezvous endpoint (`--rdzv_endpoint`).
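For example, a minimal sketch of looking up that FQDN programmatically; the hostname and port below are illustrative, not something torchelastic requires:

```python
# Minimal sketch: look up this node's fully qualified domain name so it can be
# used as the host part of --rdzv_endpoint on every node. Port is illustrative.
import socket

fqdn = socket.getfqdn()          # e.g. "node0.example.internal" when run on node0
endpoint = f"{fqdn}:29500"       # pass as --rdzv_endpoint=<fqdn>:29500
print(endpoint)
```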
Hi @kiukchung, I mistakenly replaced the domain server with the IP1 earlier; here is what I actually did:
1. First, I create a worker-0 Service with the worker-0 pod as its backend, plus a worker-1 pod, and replace the start command accordingly.
2. I run the command inside worker-0 and worker-1, and it fails.
3. I execute the ping command inside worker-0 and worker-1.
4. I replace the domain name (worker-0.system.svc.cluster.local) with the pod IP (10.140.0.110) and run again inside worker-0 and worker-1.
5. Since it is impossible for me to know the pod's IP before it is up, I would like to use the service domain name instead (a quick way to compare what the domain name and the local machine resolve to is sketched below). Considering our discussion in #76367: although the backend used by launch is static, not c10d, I am still a little worried.
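Here is the diagnostic sketch mentioned in step 5; the host name is the example from the steps above, and the comparison is only illustrative:

```python
# Diagnostic sketch: compare what the rendezvous host name resolves to with what
# this machine's own hostname resolves to. Host name is the example from above.
import socket

rdzv_host = "worker-0.system.svc.cluster.local"

# Addresses the rendezvous host resolves to (what ping uses).
_, _, rdzv_ips = socket.gethostbyname_ex(rdzv_host)

# Addresses this machine's own hostname resolves to.
_, _, local_ips = socket.gethostbyname_ex(socket.gethostname())

print("rendezvous host resolves to:", rdzv_ips)
print("local hostname resolves to: ", local_ips)
print("overlap:", set(rdzv_ips) & set(local_ips) or "none")
```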
@d4l3k could you help take a look at this one, and feel free to re-add the `oncall: distributed` tag if distributed help is needed.
Yeah, that's exactly the problem -- we don't do any DNS resolution when identifying whether the local host matches. We really should fall back to DNS resolution or correctly handle the search paths in /etc/resolv.conf. Sorry you ran into this issue; we should probably fix it in trunk, though that wouldn't be released until PyTorch 1.13, I believe. Your solution of using localhost on rank0 is exactly what we do in TorchX's Kubernetes scheduler to work around this problem. If you haven't already looked at TorchX, it's our recommended solution for running PT on Kubernetes, since we test that it works E2E on an actual K8s cluster.
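For reference, a rough sketch of that "localhost on rank0" workaround; the function name, arguments, and port are illustrative, not TorchX's actual code:

```python
# Rough sketch of the "localhost on rank 0" workaround: the node that hosts the
# rendezvous refers to itself as localhost, every other node uses the master's
# name. Function name, arguments, and port are illustrative.
def rendezvous_endpoint(node_rank: int, master_host: str, port: int = 29500) -> str:
    host = "localhost" if node_rank == 0 else master_host
    return f"{host}:{port}"

# e.g. node0 passes --rdzv_endpoint=localhost:29500,
#      node1 passes --rdzv_endpoint=node0.example.internal:29500
```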
FWIW, if you are using Kubernetes I'd also encourage you to use TorchX to launch DDP jobs onto the k8s cluster. See: https://pytorch.org/torchx/latest/schedulers/kubernetes.html. TorchX is a PyTorch job launcher that we've worked on to help users launch training jobs onto various types of schedulers, and we've figured out these sorts of kinks on the schedulers that we support.
It would be good to fix this in elastic as well -- we could probably make a call to https://docs.python.org/3/library/socket.html#socket.gethostbyaddr and check against the current IP list. PRs welcome :)
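One possible shape of such a check (a sketch only, not the actual torchelastic code; the helper name and details are made up):

```python
# Sketch of the suggested fix: resolve the candidate host and compare against the
# current machine's IP list. Helper name and details are illustrative only.
import socket

def host_matches_this_machine(host: str) -> bool:
    if host in ("localhost", socket.gethostname(), socket.getfqdn()):
        return True
    try:
        # IPs the candidate host resolves to.
        host_ips = set(socket.gethostbyname_ex(host)[2])
        # IPs this machine's own hostname resolves to, plus loopback.
        local_ips = set(socket.gethostbyname_ex(socket.gethostname())[2]) | {"127.0.0.1"}
    except socket.gaierror:
        return False
    # socket.gethostbyaddr could additionally map an IP back to a canonical
    # hostname when a reverse DNS record exists.
    return bool(host_ips & local_ips)
```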
Have you tried passing
I ran into the same error when running multi-node jobs on a SLURM cluster:
Traceback (most recent call last):
The same issue appears when I use torchrun:
`torchrun --nnodes=3 --nproc_per_node=8 --max_restarts=1 --rdzv_id=9876543210 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR} train.py`
or torchx run. I did not really understand the workaround that was proposed.
…th the current IP list (pytorch#90221)

Summary:
Pull Request resolved: pytorch#90221
Pull Request: pytorch#79388
Fix torch.distributed.run init connect timeout by comparing `host` with the current IP list.

Test Plan:
```
> buck2 test mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:utils_test -- --exact 'caffe2/test/distributed/elastic/rendezvous:utils_test - test_matches_machine_hostname_returns_true_if_ip_address_match_between_hosts (utils_test.UtilsTest)'
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Unit tests

Reviewed By: d4l3k

Differential Revision: D41373962

fbshipit-source-id: 7f138e2ef74b057f70271d32b605710bc5d287f6
…th the current IP list (#90221)

Summary:
Pull Request: #79388
Fix torch.distributed.run init connect timeout by comparing `host` with the current IP list.

Test Plan: unit tests

Differential Revision: D41373962

Pull Request resolved: #90221
Approved by: https://github.com/d4l3k
🐛 Describe the bug
TRAINING_SCRIPT.py
When I run this on both node0 and node1,
I get the error from both node0 and node1.
But when I change the run command
on node0 (using localhost instead of IP1)
on node1
it goes well.
the output of node0
the output of node1
Another strange thing is that when I use the deprecated module torch.distributed.launch, it goes well when I run on both node0 and node1,
as mentioned in #76367
Versions
Collecting environment information...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26
Python version: 3.7.5 (default, Apr 26 2022, 08:54:01) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-514.44.5.10.h193.x86_64-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 450.102.04
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501