
CUDA error: out of memory when synchronizing between processes on refexp at the evaluation stage #41

ShoufaChen opened this issue Sep 22, 2021 · 9 comments

Comments

@ShoufaChen

ShoufaChen commented Sep 22, 2021

Hi,

Thanks for your great work.

I hit an OOM error at the evaluation stage after the first epoch of pretraining. The log is:

Test: Total time: 0:01:42 (0.2476 s / it)
Averaged stats: loss: 113.5333 (96.1866)  loss_bbox: 0.5625 (0.5173)  loss_bbox_0: 0.6505 (0.5977)  loss_bbox_1: 0.5762 (0.5255)  loss_bbox_2: 0.5703 (0.5273)  loss_bbox_3: 0.5712 (0.5109)  loss_bbox_4: 0.5651 (0.5138)  loss_ce: 11.5826 (9.1250)  loss_ce_0: 11.4480 (9.4460)  loss_ce_1: 11.7980 (9.5058)  loss_ce_2: 11.8104 (9.4749)  loss_ce_3: 11.6550 (9.2512)  loss_ce_4: 11.5774 (9.0949)  loss_contrastive_align: 6.1482 (5.6187)  loss_contrastive_align_0: 6.1950 (5.8909)  loss_contrastive_align_1: 6.1946 (5.7864)  loss_contrastive_align_2: 6.1133 (5.7674)  loss_contrastive_align_3: 6.1261 (5.6713)  loss_contrastive_align_4: 6.0199 (5.5644)  loss_giou: 0.4890 (0.4578)  loss_giou_0: 0.5642 (0.5090)  loss_giou_1: 0.5024 (0.4579)  loss_giou_2: 0.4965 (0.4619)  loss_giou_3: 0.5086 (0.4525)  loss_giou_4: 0.4900 (0.4579)  cardinality_error_unscaled: 8.3906 (4.8554)  cardinality_error_0_unscaled: 6.5000 (4.3573)  cardinality_error_1_unscaled: 9.4062 (5.9682)  cardinality_error_2_unscaled: 10.3125 (6.3725)  cardinality_error_3_unscaled: 9.2969 (5.2416)  cardinality_error_4_unscaled: 8.8281 (5.0047)  loss_bbox_unscaled: 0.1125 (0.1035)  loss_bbox_0_unscaled: 0.1301 (0.1195)  loss_bbox_1_unscaled: 0.1152 (0.1051)  loss_bbox_2_unscaled: 0.1141 (0.1055)  loss_bbox_3_unscaled: 0.1142 (0.1022)  loss_bbox_4_unscaled: 0.1130 (0.1028)  loss_ce_unscaled: 11.5826 (9.1250)  loss_ce_0_unscaled: 11.4480 (9.4460)  loss_ce_1_unscaled: 11.7980 (9.5058)  loss_ce_2_unscaled: 11.8104 (9.4749)  loss_ce_3_unscaled: 11.6550 (9.2512)  loss_ce_4_unscaled: 11.5774 (9.0949)  loss_contrastive_align_unscaled: 6.1482 (5.6187)  loss_contrastive_align_0_unscaled: 6.1950 (5.8909)  loss_contrastive_align_1_unscaled: 6.1946 (5.7864)  loss_contrastive_align_2_unscaled: 6.1133 (5.7674)  loss_contrastive_align_3_unscaled: 6.1261 (5.6713)  loss_contrastive_align_4_unscaled: 6.0199 (5.5644)  loss_giou_unscaled: 0.2445 (0.2289)  loss_giou_0_unscaled: 0.2821 (0.2545)  loss_giou_1_unscaled: 0.2512 (0.2289)  loss_giou_2_unscaled: 0.2483 (0.2309)  loss_giou_3_unscaled: 0.2543 (0.2263)  loss_giou_4_unscaled: 0.2450 (0.2289)
gathering on cpu
gathering on cpu
gathering on cpu
Traceback (most recent call last):
  File \"main.py\", line 655, in <module>
    main(args)
  File \"main.py\", line 598, in main
    curr_test_stats = evaluate(
  File \"/usr/local/lib/python3.8/site-packages/torch/autograd/grad_mode.py\", line 26, in decorate_context
    return func(*args, **kwargs)
  File \"/worksapce/mdetr/trainer/engine.py\", line 230, in evaluate
    evaluator.synchronize_between_processes()
  File \"/worksapce/mdetr/trainer/datasets/refexp.py\", line 38, in synchronize_between_processes
    all_predictions = dist.all_gather(self.predictions)
  File \"/worksapce/mdetr/trainer/util/dist.py\", line 86, in all_gather
    obj = torch.load(buffer)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 853, in _load
    result = unpickler.load()
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 834, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 175, in default_restore_location
    result = fn(storage, location)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File \"/usr/local/lib/python3.8/site-packages/torch/_utils.py\", line 79, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File \"/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py\", line 462, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

I use 32 GB V100 GPUs with 2 samples per GPU, following the default settings.
I also set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1.
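
For context on the traceback: the error is raised while the all_gather helper deserializes the pickled predictions. By default, torch.load restores tensors to the device they were saved from, so predictions serialized on CUDA are copied back onto the GPU on load. A minimal illustrative sketch of that behaviour (requires a CUDA device; not MDETR code):

    import io
    import torch

    buf = io.BytesIO()
    torch.save(torch.randn(4, device="cuda"), buf)  # predictions were on the GPU when serialized
    buf.seek(0)
    on_gpu = torch.load(buf)                        # restored to CUDA by default -> can OOM
    buf.seek(0)
    on_cpu = torch.load(buf, map_location="cpu")    # restored to CPU instead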

@ShoufaChen
Author

Kindly ping @alcinos @ashkamath

I found a similar issue #40.

Any help would be much appreciated. Please let me know if you need further information.

@alcinos
Collaborator

alcinos commented Sep 23, 2021

Did you try the solution in #40, namely

Try setting the env variable "MDETR_CPU_REDUCE" to "1", this should help with memory during reduce

@ShoufaChen
Author

Hi, @alcinos

Thanks for your reply.

As mentioned above, I have already set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1.

@alcinos
Collaborator

alcinos commented Sep 23, 2021

Apologies, I missed that.
I'm a bit puzzled why this is happening for you. Maybe you can try forcing the predictions to CPU at the end of the postprocessors.
Specifically, after this assert:

assert len(scores) == len(labels) == len(boxes)

move scores, labels, and boxes to CPU.

Hope this helps
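
For concreteness, a minimal sketch of that suggestion, assuming a DETR-style PostProcess.forward that ends with the assert quoted above (variable and dictionary key names are illustrative, not necessarily MDETR's exact code):

    assert len(scores) == len(labels) == len(boxes)

    # keep the per-image results on CPU so that gathering predictions across
    # processes later does not accumulate every rank's outputs in GPU memory
    scores = scores.cpu()
    labels = labels.cpu()
    boxes = boxes.cpu()

    results = [
        {"scores": s, "labels": l, "boxes": b}
        for s, l, b in zip(scores, labels, boxes)
    ]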

@ShoufaChen
Author

@alcinos Thanks.

I'll try it.

@ShoufaChen
Author

Hi, @alcinos

I found a related bug and opened pull request #42.

Please take a look.
Thanks.

@linhuixiao

I ran into the same bug. How was it resolved? Thank you!

@ShoufaChen
Author

Hi, @linhuixiao ,

I solved this problem with #42.

@linhuixiao

@ShoufaChen Thank you very much for your reply! I still don't quite understand the fix. In #42 it says "Use map_location=device solves this issue." Does that mean setting map_location in the main function's --resume checkpoint load, somewhere else, or as a bash environment variable? If it is the --resume load, how should it be set when training a model from scratch with no checkpoint? Both approaches I tried failed with 6 RTX 3090 GPUs (24 GB memory each) and a batch size of 4.
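
For readers landing here with the same question: judging from the traceback earlier in this thread, the relevant torch.load(buffer) call is the one inside util/dist.py's all_gather, not the --resume checkpoint load. A hedged sketch of such a helper, assuming a gloo (CPU) process group as in the MDETR_CPU_REDUCE path; this is illustrative only, not the exact #42 diff or MDETR's actual code:

    import io
    import torch
    import torch.distributed as dist

    def all_gather_cpu(data, group=None):
        """Gather arbitrary picklable data from all ranks, restoring tensors on CPU."""
        world_size = dist.get_world_size(group=group)
        if world_size == 1:
            return [data]

        # serialize the local object into a CPU byte tensor
        buffer = io.BytesIO()
        torch.save(data, buffer)
        payload = torch.ByteTensor(torch.ByteStorage.from_buffer(buffer.getvalue()))

        # exchange sizes, pad to the maximum, then gather the byte tensors
        local_size = torch.LongTensor([payload.numel()])
        sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
        dist.all_gather(sizes, local_size, group=group)
        max_size = int(max(int(s) for s in sizes))

        padded = torch.zeros(max_size, dtype=torch.uint8)
        padded[: payload.numel()] = payload
        gathered = [torch.zeros(max_size, dtype=torch.uint8) for _ in range(world_size)]
        dist.all_gather(gathered, padded, group=group)

        out = []
        for size, tensor in zip(sizes, gathered):
            buf = io.BytesIO(tensor[: int(size)].numpy().tobytes())
            # the map_location argument is the key change: without it, tensors
            # that were pickled while on CUDA are restored to CUDA here, which
            # is where the OOM in the traceback above occurred
            out.append(torch.load(buf, map_location="cpu"))
        return out

With map_location="cpu" the gathered predictions stay in host memory (consistent with the "gathering on cpu" lines in the log); map_location=device, the wording quoted from #42, would instead restore them onto each rank's own GPU.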
