CUDA error: out of memory when synchronizing between processes on refexp at the evaluation stage #41
Comments
Kindly pinging @alcinos @ashkamath. I found a similar issue, #40. Any help would be much appreciated. Please let me know if you need further information. |
Did you try the solution in #40, namely setting the MDETR_CPU_REDUCE=1 environment variable? |
Hi @alcinos, thanks for your reply. As mentioned above, I have already set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1. |
Apologies, I missed that. At mdetr/models/postprocessors.py line 150 (commit 0b747b9), move scores, labels, and boxes to CPU. Hope this helps. |
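For anyone following along, here is a minimal sketch of what that change could look like. The variable names mirror DETR-style postprocessors and are assumptions for illustration, not the exact upstream diff:

```python
def format_results(scores, labels, boxes):
    """Package per-image detections, moved to CPU so the cross-process
    gather at evaluation time does not allocate extra GPU memory."""
    # The .cpu() calls are the suggested fix: the model forward stays on
    # the GPU, but the evaluator only ever sees CPU tensors.
    return [
        {"scores": s.cpu(), "labels": l.cpu(), "boxes": b.cpu()}
        for s, l, b in zip(scores, labels, boxes)
    ]
```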
@alcinos Thanks. I'll try it. |
I hit the same bug. How was it resolved? Thank you! |
Hi @linhuixiao, I solved this problem with #42. |
@ShoufaChen Thank you very much for your reply! I just don't follow the meaning: in issue #42, does "Use map_location=device solves this issue." mean setting map_location in the main function's --resume checkpoint loading, somewhere else, or as a bash environment variable? If it belongs in the --resume load, how should it be set when training a model from scratch with no checkpoint? Both approaches I tried fail when using 6 RTX 3090 GPUs (24 GB memory each) with batch size 4. |
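For what it's worth, map_location is an argument of torch.load, not an environment variable, so the change belongs where the checkpoint is read back in the resume path. A minimal sketch, assuming a DETR-style main() where the weights are stored under a "model" key (the function and argument names here are illustrative):

```python
import torch

def maybe_resume(model, args):
    # Without map_location, torch.load restores tensors to the device they
    # were saved from (typically cuda:0), so with several processes each
    # loading the same checkpoint, GPU 0 can run out of memory.
    if args.resume:
        checkpoint = torch.load(args.resume, map_location=torch.device(args.device))
        model.load_state_dict(checkpoint["model"])
    # Training from scratch never calls torch.load, so map_location is
    # simply not involved when there is no checkpoint to resume from.
```

Loading with map_location="cpu" and moving the model to the GPU afterwards achieves the same effect.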
Hi,
Thanks for your great work.
I met the OOM error at the evaluation stage after the first epoch of pretraining. The log is:

I use 32 GB V100 GPUs, with 2 samples per GPU following the default settings.
I also set CUBLAS_WORKSPACE_CONFIG=:4096:8 and MDETR_CPU_REDUCE=1.
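A closing note for readers landing here: both variables must be in the environment before CUDA is initialized, so they are normally exported by the shell that launches training. A sketch of the equivalent from Python, assuming it runs before torch is first imported:

```python
import os

# Set these before torch initializes CUDA/cuBLAS (or export them in the
# launching shell instead). CUBLAS_WORKSPACE_CONFIG=:4096:8 is the value
# PyTorch documents for bounding cuBLAS workspaces; per this thread,
# MDETR_CPU_REDUCE=1 makes MDETR gather evaluation results on CPU.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["MDETR_CPU_REDUCE"] = "1"

import torch  # noqa: E402  (deliberately imported after the env is set)
```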