
There was a problem trying to train the code. #68

Open
hanmoonje opened this issue May 2, 2023 · 0 comments
Hi, great work!

There was a problem trying to train the code.

My environment is as follows:

- Ubuntu 18.04
- Python 3.7
- CUDA 10.2
- PyTorch 1.7.1, torchvision 0.8.2
- 2 × 2080 Ti

I tried to match your environment as closely as possible.

I ran the training command unmodified:

```shell
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./scripts/run_experiments_coco.sh
```

It failed with the following error:

```
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
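For reference, "Address already in use" usually means another process (often a stale worker from a previous run) is still holding the rendezvous port. A minimal sketch of picking a guaranteed-free port and relaunching, assuming `run_dist_launch.sh` passes the standard `MASTER_PORT` environment variable through to torch.distributed's `env://` rendezvous (that pass-through is an assumption; the script may instead take a port argument):

```shell
# Pick a free TCP port: binding to port 0 lets the OS choose an unused one.
FREE_PORT=$(python -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "Using MASTER_PORT=${FREE_PORT}"

# Export it so torch.distributed's env:// rendezvous uses it, then relaunch:
export MASTER_PORT=${FREE_PORT}
# GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./scripts/run_experiments_coco.sh
```

It is also worth checking `ps aux | grep main.py` for leftover processes from an earlier crashed run and killing them before relaunching.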

I then tried changing the master port, but the following error occurred instead:

```
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
```
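"Unhandled system error" from NCCL hides the real cause, so the first step is usually to rerun with NCCL's debug logging enabled and read what it prints. A sketch, with the commented-out lines being common workarounds (not guaranteed fixes) when the logs point at peer-to-peer or InfiniBand transport problems on consumer GPUs such as the 2080 Ti:

```shell
# Make NCCL print detailed logs so the underlying system error is visible.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL   # optional: even more verbose output

# Possible workarounds, depending on what the INFO logs show:
# export NCCL_P2P_DISABLE=1    # disable GPU peer-to-peer transport
# export NCCL_IB_DISABLE=1     # disable InfiniBand transport

# Then relaunch the training command:
# GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./scripts/run_experiments_coco.sh
```

The `NCCL WARN`/`NCCL INFO` lines in the new output should name the failing subsystem, which narrows down whether this is a driver, shared-memory, or network issue.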

Could you let me know how to fix this?
