Can we train or test on a single GPU in the detection section? #24

Open
nestor0003 opened this issue Nov 19, 2021 · 1 comment

Comments

nestor0003 commented Nov 19, 2021

If we want to run testing for the detection task on a single GPU, can we just use a shell command like 'bash dist_test.sh configs/retinanet_alt_gvt_s_fpn_1x_coco_pvt_setting.py checkpoint_file 1 --eval mAP'?
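
(A hedged note, assuming this repo keeps MMDetection's standard test entry point: single-GPU evaluation usually does not need the distributed launcher at all, e.g. 'python test.py configs/retinanet_alt_gvt_s_fpn_1x_coco_pvt_setting.py checkpoint_file --eval bbox'; COCO-style configs report the bbox metric rather than VOC-style mAP. Running dist_test.sh with 1 GPU should also work in principle, and the traceback below fails while building the model, before any multi-GPU logic is reached.)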

Or do we need to change the learning rate and the number of workers?
I'm a beginner with the mmdet framework, so please help...
These are the error lines:

/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : ./test.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_o5bp99y9/none_u2fqutod
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_o5bp99y9/none_u2fqutod/attempt_0/0/error.json
loading annotations into memory...
Done (t=0.52s)
creating index...
index created!
Traceback (most recent call last):
File "./test.py", line 213, in
main()
File "./test.py", line 166, in main
model = build_detector(cfg.model, train_cfg=None, test_cfg=cfg.test_cfg)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 67, in build_detector
return build(cfg, DETECTORS, dict(train_cfg=train_cfg, test_cfg=test_cfg))
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 32, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmcv/utils/registry.py", line 171, in build_from_cfg
return obj_cls(**args)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/detectors/retinanet.py", line 16, in init
super(RetinaNet, self).init(backbone, neck, bbox_head, train_cfg,
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 25, in init
self.backbone = build_backbone(backbone)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 37, in build_backbone
return build(cfg, BACKBONES)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 32, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmcv/utils/registry.py", line 171, in build_from_cfg
return obj_cls(**args)
File "/home/user/project/Twins/detection/gvt.py", line 482, in init
super(alt_gvt_small, self).init(
File "/home/user/project/Twins/detection/gvt.py", line 419, in init
super(ALTGVT, self).init(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads,
File "/home/user/project/Twins/detection/gvt.py", line 408, in init
super(PCPVT, self).init(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads,
File "/home/user/project/Twins/detection/gvt.py", line 343, in init
super(CPVTV2, self).init(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads, mlp_ratios,
File "/home/user/project/Twins/detection/gvt.py", line 234, in init
_block = nn.ModuleList([block_cls(
File "/home/user/project/Twins/detection/gvt.py", line 234, in
_block = nn.ModuleList([block_cls(
File "/home/user/project/Twins/detection/gvt.py", line 164, in init
super(GroupBlock, self).init(dim, num_heads, mlp_ratio, qkv_bias, qk_scale, drop, attn_drop,
TypeError: init() takes from 3 to 10 positional arguments but 11 were given
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11449) of binary: /home/user/miniconda3/envs/twins/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_o5bp99y9/none_u2fqutod/attempt_1/0/error.json
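
A hedged diagnostic rather than a confirmed fix: the TypeError is raised while constructing GroupBlock during model building, so it is unrelated to the number of GPUs. If GroupBlock subclasses timm's vision_transformer Block (as the call chain suggests), newer timm releases removed the qk_scale argument from Block.__init__, which would explain 11 positional arguments against a signature that now accepts at most 10. Comparing the installed signature with the timm version pinned in this repo's requirements is a reasonable first check:

# Inspect the installed timm Block signature; if qk_scale is missing, the installed
# timm is newer than the one this code was written against.
import inspect
from timm.models.vision_transformer import Block
print(inspect.signature(Block.__init__))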

nestor0003 changed the title from "Can we train or test on 1 GPUs in detection sections?" to "Can we train or test on a single GPU in the detection section?" on Nov 19, 2021
cxxgtxy (Collaborator) commented Nov 19, 2021

We suggest using at least 4 GPUs to train.
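
If training on fewer GPUs anyway, here is a minimal config sketch for scaling the schedule down, assuming the repo follows MMDetection's config inheritance and the PVT-style 8-GPU x 2-images AdamW setting; the file name and base learning rate below are illustrative assumptions, not taken from this repo:

# Hypothetical override config, e.g. configs/retinanet_alt_gvt_s_fpn_1x_coco_single_gpu.py
_base_ = './retinanet_alt_gvt_s_fpn_1x_coco_pvt_setting.py'

# samples_per_gpu and workers_per_gpu are per process, so they can usually stay as-is;
# with 1 GPU the effective batch size drops 8x versus an 8-GPU default, and the
# linear scaling rule suggests dividing the learning rate by the same factor.
data = dict(samples_per_gpu=2, workers_per_gpu=2)
optimizer = dict(lr=0.0001 / 8)  # assumes the base config uses AdamW with lr=1e-4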

We still have some intern offers. If you are interested, please send your CV to chuxiangxiang@meituan.com. Thanks
