Describe the bug

When run with one process, the code breaks down and returns:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
When run with more than one process, the code breaks down and returns:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
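To narrow down the NCCL timeout, more diagnostics can be enabled before launch. A hedged sketch (variable names follow recent PyTorch releases; older versions use `NCCL_BLOCKING_WAIT` without the `TORCH_` prefix):

```python
import os
from datetime import timedelta

# Print NCCL setup and error details to the logs.
os.environ["NCCL_DEBUG"] = "INFO"
# Fail with a Python exception instead of hanging on a stuck collective.
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# If the collectives are slow rather than deadlocked, a longer timeout
# can be passed when the process group is created, for example:
# torch.distributed.init_process_group("nccl", timeout=timedelta(minutes=30))
```

If the timeout only appears alongside the single-process unused-parameter error, the two symptoms may share a cause: one rank skipping a gradient reduction will stall the others until NCCL gives up.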
Reproduction
Just follow the instructions and it will be reproduced.
Logs
No response
System Info
diffusers v0.32
Who can help?
No response
Yesterday I tested on release 0.31.0 with a single A100 80G, and it works fine. See #9857.
I just tested successfully a few minutes ago on 0.32.0dev0. Could you provide your Python environment and command-line script so I can debug further? @liuyu19970607
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.