Describe the bug

When run with one process, the code breaks down and returns:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
When run with more than one process, the code breaks down and returns:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
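To narrow down the NCCL timeout, more diagnostics can be enabled before launch. A hedged sketch (variable names follow recent PyTorch releases; older versions use `NCCL_BLOCKING_WAIT` without the `TORCH_` prefix):

```python
import os
from datetime import timedelta

# Print NCCL setup and error details to the logs.
os.environ["NCCL_DEBUG"] = "INFO"
# Fail with a Python exception instead of hanging on a stuck collective.
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# If the collectives are slow rather than deadlocked, a longer timeout
# can be passed when the process group is created, for example:
# torch.distributed.init_process_group("nccl", timeout=timedelta(minutes=30))
```

If the timeout only appears alongside the single-process unused-parameter error, the two symptoms may share a cause: one rank skipping a gradient reduction will stall the others until NCCL gives up.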
Reproduction
Just follow the instructions and it will be reproduced.
Logs
No response
System Info
diffusers v0.32
Who can help?
No response
Yesterday I tested on release 0.31.0 with a single A100 80G, and it works fine. See #9857.
I just tested successfully a few minutes ago on 0.32.0dev0. Could you provide your Python environment and command-line script so I can debug further? @liuyu19970607
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.