
Fix for stuck test set inference for LightningContainer models #494

Merged
merged 5 commits into main from antonsc/inference_fix on Jun 17, 2021

Conversation

ant0nsc
Contributor

@ant0nsc ant0nsc commented Jun 17, 2021

This fixes an issue where test set inference for multi-GPU jobs with LightningContainer models got stuck, attempting to
communicate with processes that had already exited.

Closes #493 (Multi-node training jobs for LightningContainer models can get stuck at inference time).
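The typical failure mode behind this kind of hang is that after multi-GPU training, only rank 0 proceeds to test set inference while the worker processes have already exited, yet a collective call still tries to reach them. The sketch below illustrates one common mitigation under that assumption; it is not necessarily the change made in this PR, and the function and argument names (`run_test_inference`, `model`, `test_dataloader`) are illustrative rather than taken from the InnerEye code.

```python
# Hypothetical sketch: tear down the DDP process group before single-process
# test inference so rank 0 does not block on collectives with dead ranks.
import torch
import torch.distributed as dist


def run_test_inference(model: torch.nn.Module, test_dataloader) -> list:
    # After multi-GPU training, only rank 0 typically runs inference. If a
    # process group is still initialized, destroy it so no collective
    # operation waits on worker processes that have already exited.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()

    model.eval()
    outputs = []
    with torch.no_grad():
        for batch in test_dataloader:
            outputs.append(model(batch))
    return outputs
```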

Please follow the guidelines for PRs contained here. Checklist:

  • Ensure that your PR is small, and implements one change.
  • Add unit tests for all functions that you introduced or modified.
  • Run PyCharm's code cleanup tools on your Python files.
  • Link the correct GitHub issue for tracking.
  • Update the Changelog file: Describe your change in terms of
    Added/Changed/Removed/... in the "Upcoming" section.
  • When merging your PR, replace the default merge message with a description of your PR,
    and if needed a motivation why that change was required.

@ant0nsc ant0nsc enabled auto-merge (squash) June 17, 2021 15:02
@ant0nsc ant0nsc merged commit 9749954 into main Jun 17, 2021
@ant0nsc ant0nsc deleted the antonsc/inference_fix branch June 17, 2021 16:34