
Multi-node training jobs for LightningContainer models can get stuck at inference time #493

Closed
ant0nsc opened this issue Jun 17, 2021 · 0 comments · Fixed by #494

ant0nsc commented Jun 17, 2021

It appears that using any of the PyTorch Lightning metrics in the test_step can cause multi-node jobs to hang indefinitely: the metrics try to synchronize their state with the other GPUs, but those processes have already terminated.

AB#4121
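
A minimal sketch of the pattern that appears to trigger the hang. Module and metric names are illustrative, not the actual InnerEye LightningContainer code, and the example assumes the 2021-era torchmetrics API where `Accuracy()` takes no `task` argument:

```python
import torch
import pytorch_lightning as pl
from torchmetrics import Accuracy  # assumption: 2021-era API, Accuracy() without `task`


class ExampleContainerModule(pl.LightningModule):
    """Illustrative module only; not the actual LightningContainer model code."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(16, 2)
        # Metric state is kept per process. Computing the metric triggers a
        # collective all_gather across every rank of the distributed job.
        self.test_accuracy = Accuracy()

    def test_step(self, batch, batch_idx):
        inputs, targets = batch
        logits = self.layer(inputs)
        self.test_accuracy(logits.softmax(dim=-1), targets)
        # Logging the metric schedules a cross-rank synchronization when the
        # metric is computed at the end of the test epoch. If inference runs
        # only on one rank and the other ranks have already exited after
        # training, that collective call never returns.
        self.log("test/accuracy", self.test_accuracy)
```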
