find out nan in tensor #12

Closed
PJJie opened this issue Jan 14, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@PJJie

PJJie commented Jan 14, 2021

When I replace several max pooling layers with SoftPool layers, I find NaN values in the output tensor.

@alexandrosstergiou
Owner

Returned NaN values are quite common when using CUDA, as it is a low-level language that does not perform any internal checks for numerical overflow or underflow. PyTorch itself has a range of functions (e.g. torch.nan_to_num()) to deal with such cases. Simply wrapping your output with these functions should alleviate the issue.
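For illustration, a minimal sketch of what that wrapping could look like. The `SoftPool2d` import is an assumption about how the installed package exposes the layer; the exact import path may differ, but `torch.nan_to_num` is a standard PyTorch function:

```python
import torch

# Assumption: the package exposes a 2D SoftPool layer under this name;
# adjust the import to match the installed SoftPool package.
from SoftPool import SoftPool2d

pool = SoftPool2d(kernel_size=2, stride=2)

x = torch.randn(8, 64, 32, 32, device="cuda")
y = pool(x)

# Replace any NaN/Inf produced by the CUDA kernel with finite values.
y = torch.nan_to_num(y, nan=0.0, posinf=1e4, neginf=-1e4)
```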

I am also planning on including this in the coming repo commits.

Best,
Alex

@alexandrosstergiou added the bug (Something isn't working) label Jan 14, 2021
@MaxChanger

Hi @alexandrosstergiou, I would like to know whether this bug has been fixed or if there has been any progress. I am also using SoftPool in a project and have not run into this problem myself, but other people have reported it against my project: haomo-ai/MotionSeg3D#6

@alexandrosstergiou
Owner

alexandrosstergiou commented Sep 7, 2022

Hi @MaxChanger. Most NaN-value problems in the forward/backward calls have been fixed since torch 1.6, where torch.amp was integrated alongside its decorators for custom functions. After commit f49fd84, I have had stable runs on both full- and mixed-precision settings over different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring whilst training in other projects.

Perhaps it is worth suggesting to anyone opening an issue in your project that they re-install the latest version of SoftPool and make sure they are using torch >= 1.7 (preferably the latest release), just to be sure?
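For reference, a minimal sketch of how those amp decorators are typically attached to a custom autograd function (illustrative only; this is a placeholder op, not the actual SoftPool kernel):

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import custom_fwd, custom_bwd


class StablePoolFn(torch.autograd.Function):
    """Illustrative custom op showing the amp decorators; not the SoftPool kernel."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # run the op in fp32 even under autocast
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Placeholder forward pass (a real extension would call its CUDA kernel here).
        return F.avg_pool2d(x, kernel_size=2)

    @staticmethod
    @custom_bwd  # makes the backward pass respect the autocast state of forward
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Placeholder backward: spread each output gradient evenly over its 2x2 window,
        # which matches the gradient of non-overlapping average pooling.
        grad_in = F.interpolate(grad_out, scale_factor=2, mode="nearest") / 4.0
        return grad_in
```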

@MaxChanger

Hi @alexandrosstergiou. Thank you for your kind reply. I have run nearly a hundred experiments on 4~5 different GPU servers and have not encountered this issue (NaN) either, so I thought your project was robust enough.
After your confirmation I am more at ease, and I will also work with the others to confirm this issue.
