
CUDA error: out of memory when synchronizing between processes on refexp at the evaluation stage #41

ShoufaChen opened this issue Sep 22, 2021 · 9 comments

Comments

@ShoufaChen

ShoufaChen commented Sep 22, 2021

Hi,

Thanks for your great work.

I hit an OOM error at the evaluation stage after the first epoch of pretraining. The log is:

Test: Total time: 0:01:42 (0.2476 s / it)
Averaged stats: loss: 113.5333 (96.1866)  loss_bbox: 0.5625 (0.5173)  loss_bbox_0: 0.6505 (0.5977)  loss_bbox_1: 0.5762 (0.5255)  loss_bbox_2: 0.5703 (0.5273)  loss_bbox_3: 0.5712 (0.5109)  loss_bbox_4: 0.5651 (0.5138)  loss_ce: 11.5826 (9.1250)  loss_ce_0: 11.4480 (9.4460)  loss_ce_1: 11.7980 (9.5058)  loss_ce_2: 11.8104 (9.4749)  loss_ce_3: 11.6550 (9.2512)  loss_ce_4: 11.5774 (9.0949)  loss_contrastive_align: 6.1482 (5.6187)  loss_contrastive_align_0: 6.1950 (5.8909)  loss_contrastive_align_1: 6.1946 (5.7864)  loss_contrastive_align_2: 6.1133 (5.7674)  loss_contrastive_align_3: 6.1261 (5.6713)  loss_contrastive_align_4: 6.0199 (5.5644)  loss_giou: 0.4890 (0.4578)  loss_giou_0: 0.5642 (0.5090)  loss_giou_1: 0.5024 (0.4579)  loss_giou_2: 0.4965 (0.4619)  loss_giou_3: 0.5086 (0.4525)  loss_giou_4: 0.4900 (0.4579)  cardinality_error_unscaled: 8.3906 (4.8554)  cardinality_error_0_unscaled: 6.5000 (4.3573)  cardinality_error_1_unscaled: 9.4062 (5.9682)  cardinality_error_2_unscaled: 10.3125 (6.3725)  cardinality_error_3_unscaled: 9.2969 (5.2416)  cardinality_error_4_unscaled: 8.8281 (5.0047)  loss_bbox_unscaled: 0.1125 (0.1035)  loss_bbox_0_unscaled: 0.1301 (0.1195)  loss_bbox_1_unscaled: 0.1152 (0.1051)  loss_bbox_2_unscaled: 0.1141 (0.1055)  loss_bbox_3_unscaled: 0.1142 (0.1022)  loss_bbox_4_unscaled: 0.1130 (0.1028)  loss_ce_unscaled: 11.5826 (9.1250)  loss_ce_0_unscaled: 11.4480 (9.4460)  loss_ce_1_unscaled: 11.7980 (9.5058)  loss_ce_2_unscaled: 11.8104 (9.4749)  loss_ce_3_unscaled: 11.6550 (9.2512)  loss_ce_4_unscaled: 11.5774 (9.0949)  loss_contrastive_align_unscaled: 6.1482 (5.6187)  loss_contrastive_align_0_unscaled: 6.1950 (5.8909)  loss_contrastive_align_1_unscaled: 6.1946 (5.7864)  loss_contrastive_align_2_unscaled: 6.1133 (5.7674)  loss_contrastive_align_3_unscaled: 6.1261 (5.6713)  loss_contrastive_align_4_unscaled: 6.0199 (5.5644)  loss_giou_unscaled: 0.2445 (0.2289)  loss_giou_0_unscaled: 0.2821 (0.2545)  loss_giou_1_unscaled: 0.2512 (0.2289)  loss_giou_2_unscaled: 0.2483 (0.2309)  loss_giou_3_unscaled: 0.2543 (0.2263)  loss_giou_4_unscaled: 0.2450 (0.2289)
gathering on cpu
gathering on cpu
gathering on cpu
Traceback (most recent call last):
  File \"main.py\", line 655, in <module>
    main(args)
  File \"main.py\", line 598, in main
    curr_test_stats = evaluate(
  File \"/usr/local/lib/python3.8/site-packages/torch/autograd/grad_mode.py\", line 26, in decorate_context
    return func(*args, **kwargs)
  File \"/worksapce/mdetr/trainer/engine.py\", line 230, in evaluate
    evaluator.synchronize_between_processes()
  File \"/worksapce/mdetr/trainer/datasets/refexp.py\", line 38, in synchronize_between_processes
    all_predictions = dist.all_gather(self.predictions)
  File \"/worksapce/mdetr/trainer/util/dist.py\", line 86, in all_gather
    obj = torch.load(buffer)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 853, in _load
    result = unpickler.load()
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 834, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 175, in default_restore_location
    result = fn(storage, location)
  File \"/usr/local/lib/python3.8/site-packages/torch/serialization.py\", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File \"/usr/local/lib/python3.8/site-packages/torch/_utils.py\", line 79, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File \"/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py\", line 462, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

I use 32 GB V100 GPUs with 2 samples per GPU, following the default settings.
I also set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1.
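
For context on the traceback: the error is raised while the all_gather helper deserializes the pickled predictions. By default, torch.load restores tensors to the device they were saved from, so predictions serialized on CUDA are copied back onto the GPU on load. A minimal illustrative sketch of that behaviour (requires a CUDA device; not MDETR code):

    import io
    import torch

    buf = io.BytesIO()
    torch.save(torch.randn(4, device="cuda"), buf)  # predictions were on the GPU when serialized
    buf.seek(0)
    on_gpu = torch.load(buf)                        # restored to CUDA by default -> can OOM
    buf.seek(0)
    on_cpu = torch.load(buf, map_location="cpu")    # restored to CPU instead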

@ShoufaChen
Author

Kindly ping @alcinos @ashkamath

I found a similar issue #40.

Any help would be much appreciated. Please let me know if you need further information.

@alcinos
Collaborator

alcinos commented Sep 23, 2021

Did you try the solution in #40, namely

Try setting the env variable "MDETR_CPU_REDUCE" to "1", this should help with memory during reduce

@ShoufaChen
Author

Hi, @alcinos

Thanks for your reply.

As mentioned above, I have already set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1.

@alcinos
Collaborator

alcinos commented Sep 23, 2021

Apologies, I missed that.
I'm a bit puzzled why this is happening for you. Maybe you can try forcing the predictions to CPU at the end of the postprocessors.
Specifically, after this assert:

assert len(scores) == len(labels) == len(boxes)

move scores, labels, and boxes to CPU.

Hope this helps
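
For concreteness, a minimal sketch of that suggestion, assuming a DETR-style PostProcess.forward that ends with the assert quoted above (variable and dictionary key names are illustrative, not necessarily MDETR's exact code):

    assert len(scores) == len(labels) == len(boxes)

    # keep the per-image results on CPU so that gathering predictions across
    # processes later does not accumulate every rank's outputs in GPU memory
    scores = scores.cpu()
    labels = labels.cpu()
    boxes = boxes.cpu()

    results = [
        {"scores": s, "labels": l, "boxes": b}
        for s, l, b in zip(scores, labels, boxes)
    ]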

@ShoufaChen
Author

@alcinos Thanks.

I'll try it.

@ShoufaChen
Author

Hi, @alcinos

I found a related bug and opened pull request #42.

Please take a look.
Thanks.

@linhuixiao

I ran into the same bug. How was it resolved? Thank you!

@ShoufaChen
Author

Hi, @linhuixiao ,

I solved this problem with #42.

@linhuixiao

@ShoufaChen Thank you very much for your reply! I still don't quite understand the fix. In #42 it says "Use map_location=device solves this issue." Does that mean setting map_location in the main function's --resume checkpoint load, somewhere else, or as a bash environment variable? If it is the --resume load, how should it be set when training a model from scratch with no checkpoint? Both approaches I tried failed with 6 RTX 3090 GPUs (24 GB memory each) and a batch size of 4.
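
For readers landing here with the same question: judging from the traceback earlier in this thread, the relevant torch.load(buffer) call is the one inside util/dist.py's all_gather, not the --resume checkpoint load. A hedged sketch of such a helper, assuming a gloo (CPU) process group as in the MDETR_CPU_REDUCE path; this is illustrative only, not the exact #42 diff or MDETR's actual code:

    import io
    import torch
    import torch.distributed as dist

    def all_gather_cpu(data, group=None):
        """Gather arbitrary picklable data from all ranks, restoring tensors on CPU."""
        world_size = dist.get_world_size(group=group)
        if world_size == 1:
            return [data]

        # serialize the local object into a CPU byte tensor
        buffer = io.BytesIO()
        torch.save(data, buffer)
        payload = torch.ByteTensor(torch.ByteStorage.from_buffer(buffer.getvalue()))

        # exchange sizes, pad to the maximum, then gather the byte tensors
        local_size = torch.LongTensor([payload.numel()])
        sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
        dist.all_gather(sizes, local_size, group=group)
        max_size = int(max(int(s) for s in sizes))

        padded = torch.zeros(max_size, dtype=torch.uint8)
        padded[: payload.numel()] = payload
        gathered = [torch.zeros(max_size, dtype=torch.uint8) for _ in range(world_size)]
        dist.all_gather(gathered, padded, group=group)

        out = []
        for size, tensor in zip(sizes, gathered):
            buf = io.BytesIO(tensor[: int(size)].numpy().tobytes())
            # the map_location argument is the key change: without it, tensors
            # that were pickled while on CUDA are restored to CUDA here, which
            # is where the OOM in the traceback above occurred
            out.append(torch.load(buf, map_location="cpu"))
        return out

With map_location="cpu" the gathered predictions stay in host memory (consistent with the "gathering on cpu" lines in the log); map_location=device, the wording quoted from #42, would instead restore them onto each rank's own GPU.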
