
Fix RNG reload in resume training from epoch checkpoint #17055

Merged · 2 commits merged into main on May 3, 2022

Conversation

@sgugger (Collaborator) commented on May 2, 2022

What does this PR do?

This PR fixes training reproducibility when checkpoints are saved every epoch. The main reason it was failing (as pointed out in #17032) is that the RNG states were never reloaded. They need to be reloaded right before iterating through the new epoch, because that iteration changes the global PyTorch RNG (even if the dataloader uses its own generator). The new test added makes sure this reproducibility is fully covered.
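
A minimal sketch of the mechanism, not the actual Trainer code (the helper names and the loop below are illustrative only):

```python
import random
import numpy as np
import torch

def checkpoint_rng_state():
    # Capture every RNG that can influence data order or model randomness.
    state = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    if torch.cuda.is_available():
        state["cuda"] = torch.cuda.get_rng_state_all()
    return state

def restore_rng_state(state):
    random.setstate(state["python"])
    np.random.set_state(state["numpy"])
    torch.set_rng_state(state["torch"])
    if torch.cuda.is_available() and "cuda" in state:
        torch.cuda.set_rng_state_all(state["cuda"])

# The restore has to happen *right before* iterating the dataloader for the
# resumed epoch, because that iteration advances the global torch RNG even
# when the dataloader uses its own generator.
def train(dataloader, saved_rng_state=None, resumed_epoch=0, num_epochs=3):
    for epoch in range(resumed_epoch, num_epochs):
        if saved_rng_state is not None and epoch == resumed_epoch:
            restore_rng_state(saved_rng_state)
        for batch in dataloader:
            ...  # training step
```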

While debugging this, two other issues surfaced, which this PR also fixes.

  1. There are multiple warnings for the computation of flos when the model is not an NLP model. This PR reduces them to one.
  2. The reproducibility test is flaky on multiple GPUs because it relies on randomness inside the model, and with DataParallel the two "copies" of the model call the PyTorch RNG in a nondeterministic order (this would not happen with DistributedDataParallel, but that would require running the test through a launcher). To remove this flakiness, the test only uses PyTorch randomness on zero or one GPU (sketches of both fixes appear after this list).
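
For point 1, a warn-once pattern along these lines would do; the function and guard flag are hypothetical, not the actual Trainer internals:

```python
import logging

logger = logging.getLogger(__name__)
_warned_about_flos = False  # hypothetical guard flag

def floating_point_ops(model, inputs):
    """Return the model's flos estimate, warning only once if it cannot."""
    global _warned_about_flos
    if hasattr(model, "floating_point_ops"):
        return model.floating_point_ops(inputs)
    if not _warned_about_flos:
        logger.warning(
            "The model does not implement `floating_point_ops`; total flos will not be tracked."
        )
        _warned_about_flos = True
    return 0
```

For point 2, a sketch of how the test model could gate its randomness; the class is a hypothetical stand-in for the regression model used in the test, not the actual test code:

```python
import torch

class NoisyRegressionModel(torch.nn.Module):
    """Hypothetical stand-in for the test's regression model."""

    def __init__(self, random_torch=True):
        super().__init__()
        self.a = torch.nn.Parameter(torch.tensor(1.0))
        self.b = torch.nn.Parameter(torch.tensor(0.0))
        self.random_torch = random_torch

    def forward(self, x, labels=None):
        y = self.a * x + self.b
        if self.random_torch:
            # This draw advances the global torch RNG; under DataParallel the
            # two replicas would consume it in a nondeterministic order, so the
            # test only enables it with zero or one GPU.
            y = y + 0.05 * torch.randn_like(y)
        if labels is None:
            return (y,)
        return (torch.nn.functional.mse_loss(y, labels), y)
```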

Fixes #17032

@sgugger requested a review from @LysandreJik on May 2, 2022, 20:22
@HuggingFaceDocBuilderDev commented on May 2, 2022

The documentation is not available anymore as the PR was closed or merged.

```python
# For more than 1 GPU, since the randomness is introduced in the model and with DataParallel (which is used
# in this test for more than 2 GPUs), the calls to the torch RNG will happen in a random order (sometimes
# GPU 0 will call first and sometimes GPU 1).
random_torch = torch.cuda.is_available() and torch.cuda.device_count() >= 1
```
@atreyasha commented on May 3, 2022
Sorry, just a question regarding this line. AFAICT random_torch would only be True if at least one GPU is available. But this would mean this test case will not cover torch randomness when using the CPU. The unit test before this commit, however, did test randomness on the CPU, or at least was able to if no GPU was available. Is this change intended?

@sgugger (Collaborator, Author) replied:
Good catch! I'll fix this :-)
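
A plausible correction, assuming the intent is "zero or one GPU" rather than "at least one GPU" (the actual follow-up commit may differ):

```python
# True on CPU-only and single-GPU machines, False only when DataParallel
# would replicate the model across several GPUs.
random_torch = not torch.cuda.is_available() or torch.cuda.device_count() <= 1
```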

@LysandreJik (Member) left a comment:
LGTM, thanks @sgugger!

@sgugger merged commit 1c9fcd0 into main on May 3, 2022
@sgugger deleted the randomness_resume_epocj branch on May 3, 2022, 14:31
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022
…17055)

* Fix RNG reload in resume training from epoch checkpoint

* Fix test
nandwalritik pushed a commit to nandwalritik/transformers that referenced this pull request May 4, 2022
…17055)

* Fix RNG reload in resume training from epoch checkpoint

* Fix test
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…17055)

* Fix RNG reload in resume training from epoch checkpoint

* Fix test
Successfully merging this pull request may close these issues.

[Trainer]: Resume training with save_strategy="epoch" does not load RNG state