find out nan in tensor #12

Closed
PJJie opened this issue Jan 14, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@PJJie

PJJie commented Jan 14, 2021

When I replace several max pooling layers with SoftPool layers, I find NaN values in the output tensor.

@alexandrosstergiou
Owner

Returned NaN values are quite common when using CUDA, as it is a low-level language that does not perform any internal checks for numerical overflow or underflow. PyTorch itself has a range of functions (e.g. torch.nan_to_num()) to deal with such cases. Simply wrapping your output with these functions should alleviate the issue.
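For illustration, a minimal sketch of what that wrapping could look like. The `SoftPool2d` import is an assumption about how the installed package exposes the layer; the exact import path may differ, but `torch.nan_to_num` is a standard PyTorch function:

```python
import torch

# Assumption: the package exposes a 2D SoftPool layer under this name;
# adjust the import to match the installed SoftPool package.
from SoftPool import SoftPool2d

pool = SoftPool2d(kernel_size=2, stride=2)

x = torch.randn(8, 64, 32, 32, device="cuda")
y = pool(x)

# Replace any NaN/Inf produced by the CUDA kernel with finite values.
y = torch.nan_to_num(y, nan=0.0, posinf=1e4, neginf=-1e4)
```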

I am also planning on including this in the coming repo commits.

Best,
Alex

@alexandrosstergiou added the bug (Something isn't working) label Jan 14, 2021
@MaxChanger

Hi @alexandrosstergiou, I would like to know whether this bug has been fixed or if there has been any progress. I am also using SoftPool in a project and have not run into this problem myself, but other people have reported it against my project: haomo-ai/MotionSeg3D#6

@alexandrosstergiou
Owner

alexandrosstergiou commented Sep 7, 2022

Hi @MaxChanger. Most NaN-value problems in the forward/backward calls have been fixed since torch 1.6, where torch.amp was integrated alongside its decorators for custom functions. After commit f49fd84, I have had stable runs on both full- and mixed-precision settings over different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring whilst training in other projects.

Perhaps it is worth suggesting to anyone opening an issue in your project that they re-install the latest version of SoftPool and make sure they are using torch >= 1.7 (preferably the latest release), just to be sure?
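For reference, a minimal sketch of how those amp decorators are typically attached to a custom autograd function (illustrative only; this is a placeholder op, not the actual SoftPool kernel):

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import custom_fwd, custom_bwd


class StablePoolFn(torch.autograd.Function):
    """Illustrative custom op showing the amp decorators; not the SoftPool kernel."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # run the op in fp32 even under autocast
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Placeholder forward pass (a real extension would call its CUDA kernel here).
        return F.avg_pool2d(x, kernel_size=2)

    @staticmethod
    @custom_bwd  # makes the backward pass respect the autocast state of forward
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Placeholder backward: spread each output gradient evenly over its 2x2 window,
        # which matches the gradient of non-overlapping average pooling.
        grad_in = F.interpolate(grad_out, scale_factor=2, mode="nearest") / 4.0
        return grad_in
```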

@MaxChanger

Hi @alexandrosstergiou. Thank you for your kind reply. I have run nearly a hundred experiments on 4~5 different GPU servers and have not encountered this issue (NaN) either, so I thought your project was robust enough.
After your confirmation I am more at ease, and I will also work with the others to confirm this issue.
