
Enabling distributed training for SSL online evaluator #612

Merged: 16 commits merged into main from antonsc/linearhead on Dec 10, 2021

Conversation

@ant0nsc (Contributor) commented on Dec 8, 2021:

The SSL online evaluator presently does not synchronize across GPUs. This PR enables DDP for it.
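For context, synchronizing the evaluator across GPUs amounts to wrapping its linear head in DistributedDataParallel, so that the backward pass all-reduces the head's gradients across processes. A minimal sketch of the general pattern, not the repository's actual change (the `evaluator` module below is a stand-in for the callback's real linear head):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Stand-in for the online evaluator's linear head; the real callback
# builds its evaluator module on top of the frozen SSL encoder.
evaluator = torch.nn.Linear(2048, 10)

if dist.is_available() and dist.is_initialized():
    if torch.cuda.is_available():
        device = torch.device("cuda", torch.cuda.current_device())
        evaluator = evaluator.to(device)
        # DDP all-reduces the head's gradients during backward(), so every
        # rank applies identical optimizer updates to identical weights.
        evaluator = DistributedDataParallel(evaluator, device_ids=[device.index])
    else:
        # CPU/gloo case: DDP without device_ids.
        evaluator = DistributedDataParallel(evaluator)
```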

Please follow the guidelines for PRs contained here. Checklist:

  • Ensure that your PR is small, and implements one change.
  • Add unit tests for all functions that you introduced or modified.
  • Run PyCharm's code cleanup tools on your Python files.
  • Link the correct GitHub issue for tracking.
  • Update the Changelog file: Describe your change in terms of
    Added/Changed/Removed/... in the "Upcoming" section.
  • When merging your PR, replace the default merge message with a description of your PR,
    and if needed a motivation why that change was required.

ant0nsc linked an issue on Dec 8, 2021 that may be closed by this pull request.
ozan-oktay (Contributor) previously approved these changes on Dec 8, 2021, commenting:

👍

maxilse (Contributor) previously approved these changes on Dec 9, 2021, commenting:

Looks good to me.

ozan-oktay previously approved these changes on Dec 9, 2021.
ant0nsc dismissed the stale reviews from ozan-oktay and maxilse via d958171 on December 9, 2021 at 11:50.
Review thread on the new checkpoint-restore test:

```python
trainer2.fit(model, datamodule=data)
# Read the parameters and check that they match what was stored in the first callback.
parameters2_after_training = list(callback2.evaluator.parameters())
assert torch.allclose(parameters2_after_training[0], parameters1[0])
```
A reviewer (Contributor) asked:

Could we also check if the optimizer state is restored correctly?
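One way such a check could look, as a sketch only (the helper name and comparison granularity are assumptions, not the PR's actual test code): compare the two optimizers' state dicts tensor by tensor.

```python
import torch

def assert_optimizer_states_match(opt1: torch.optim.Optimizer,
                                  opt2: torch.optim.Optimizer) -> None:
    """Assert that two optimizers hold identical hyperparameters and state tensors."""
    s1, s2 = opt1.state_dict(), opt2.state_dict()
    # Hyperparameters such as learning rate and momentum must match exactly.
    assert s1["param_groups"] == s2["param_groups"]
    # Per-parameter state (e.g. Adam's exp_avg buffers) must match numerically.
    assert s1["state"].keys() == s2["state"].keys()
    for key, state1 in s1["state"].items():
        for name, value1 in state1.items():
            value2 = s2["state"][key][name]
            if torch.is_tensor(value1):
                assert torch.allclose(value1, value2)
            else:
                assert value1 == value2
```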

maxilse (Contributor) left a comment:

Still looking good.

ant0nsc merged commit 46017f4 into main on Dec 10, 2021.
ant0nsc deleted the antonsc/linearhead branch on December 10, 2021 at 09:34.
Development

Successfully merging this pull request may close this issue:

SSL: Does linear head do distributed training?
4 participants