
Multi-node training jobs for LightningContainer models can get stuck at inference time #493

Closed
ant0nsc opened this issue Jun 17, 2021 · 0 comments · Fixed by #494

ant0nsc commented Jun 17, 2021

It appears that using any of the PyTorch Lightning metrics in the test_step can cause multi-node jobs to hang indefinitely: the metrics try to synchronize their state with the other GPUs, but those processes have already terminated.

AB#4121
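
A minimal sketch of the pattern that appears to trigger the hang. Module and metric names are illustrative, not the actual InnerEye LightningContainer code, and the example assumes the 2021-era torchmetrics API where `Accuracy()` takes no `task` argument:

```python
import torch
import pytorch_lightning as pl
from torchmetrics import Accuracy  # assumption: 2021-era API, Accuracy() without `task`


class ExampleContainerModule(pl.LightningModule):
    """Illustrative module only; not the actual LightningContainer model code."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(16, 2)
        # Metric state is kept per process. Computing the metric triggers a
        # collective all_gather across every rank of the distributed job.
        self.test_accuracy = Accuracy()

    def test_step(self, batch, batch_idx):
        inputs, targets = batch
        logits = self.layer(inputs)
        self.test_accuracy(logits.softmax(dim=-1), targets)
        # Logging the metric schedules a cross-rank synchronization when the
        # metric is computed at the end of the test epoch. If inference runs
        # only on one rank and the other ranks have already exited after
        # training, that collective call never returns.
        self.log("test/accuracy", self.test_accuracy)
```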
