Hi, great work!
I ran into a problem while trying to train the model.
My environment is as follows.
Ubuntu 18.04
Python 3.7
CUDA 10.2
PyTorch 1.7.1, torchvision 0.8.2
2 x 2080 Ti
I tried to match your environment as closely as possible.
I ran the training code unmodified, using the following command:
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./scripts/run_experiments_coco.sh
Running the command above produced the following error:
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
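For anyone reproducing this: "Address already in use" is raised when the `TCPStore` cannot bind the master port, usually because another process (often a leftover training run) still holds it. Below is a minimal sketch, using only the Python standard library, of how one might check a port or ask the OS for an unused one before launching. The helper names `port_is_free` and `find_free_port` are hypothetical and not part of Meta-DETR or PyTorch, and how the chosen port reaches the launcher depends on `run_dist_launch.sh`, which is not shown here.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Print a candidate port; it still has to be passed to the launcher.
    print(find_free_port())
```

Note that the port is released again when the helper returns, so another process could in principle grab it before training starts. This only narrows the window, it does not remove it.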
I then tried setting the master port explicitly, but the following error occurred instead:
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
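NCCL's "unhandled system error" is a generic message that does not name the failing call. NCCL reads the `NCCL_DEBUG` environment variable, and setting it to `INFO` makes it log considerably more detail, which may help narrow this down. Below is a minimal sketch of preparing such an environment for a rerun. The helper name `nccl_debug_env` is hypothetical, and this only collects diagnostics; it is not a fix.

```python
import os
from typing import Dict, Optional


def nccl_debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with verbose NCCL logging enabled."""
    env = dict(os.environ if base is None else base)
    # NCCL_DEBUG is read by NCCL itself; INFO prints detailed diagnostics.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

The returned dictionary could be passed as `env=` to `subprocess.run` when relaunching the training command, or the same variable can simply be exported in the shell before running the launch script.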
Could you tell me how to modify the code to fix this?