Description
Hi, I am encountering an NCCL timeout error at the end of each epoch during training.
Here is the relevant part of the error message:
Epoch 6: 100%|██████████████████████████████████████████████████████████████████████████| 13988/13988 [2:18:37<00:00, 1.68it/s, loss=4.75, v_num=0Epoch 6, global step 49999: val_ssim_fid100_f1_total_mean reached 0.91746 (best 0.91746), saving model to "/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-24_00-30-46_train_lama-fourier_/models/epoch=6-step=49999.ckpt" as top 5
Epoch 7: 0%| | 0/13988 [00:00<?, ?it/s, loss=4.75, v_num=0]
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 100015, last enqueued NCCL work: 100022, last completed NCCL work: 100014.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f87cacd6897 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f87cbfb11b2 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f87cbfb5fd0 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f87cbfb731c in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f8817a68bf4 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f88199c4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f881978f133 in /lib/x86_64-linux-gnu/libc.so.6)
scripts_zsp/train_gaoping.sh: line 6: 5139 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python bin/train.py -cn lama-fourier location=gaoping data.batch_size=40 +trainer.kwargs.resume_from_checkpoint=/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-23_14-44-09_train_lama-fourier_/models/last.ckpt
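For reference, the Timeout(ms)=600000 in the watchdog message is the 10-minute default collective timeout of the NCCL backend, and the timeout fires right after rank 0 finishes saving the epoch-6 checkpoint, so the other rank is presumably stuck waiting in the ALLREDUCE. As a possible (unverified) mitigation, the collective timeout can be raised when the process group is created. Below is a minimal sketch assuming the process group is initialized directly with torch.distributed.init_process_group; in this repo the group is actually set up by PyTorch Lightning, so the timedelta would have to be passed through whatever DDP plugin/strategy the installed Lightning version exposes.

```python
# Sketch (unverified): give NCCL collectives more time before the watchdog
# aborts the job, e.g. while rank 0 is busy writing the top-5 checkpoints.
# Assumes the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT,
# RANK, WORLD_SIZE) are already set by the launcher.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),  # default in this log was 10 minutes (600000 ms)
)
```

If the real cause is rank 0 doing the checkpoint I/O alone while the other rank waits, a larger timeout would only hide the imbalance, so I would also like to understand why the epoch boundary takes more than 10 minutes in the first place.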