[trainer] sharded _load_best_model #17150

Merged
merged 2 commits into from
May 10, 2022

stas00 (Contributor) commented May 10, 2022

Looks like a copy-and-paste issue. This code path is probably untested.

@sgugger

HuggingFaceDocBuilderDev commented May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) left a comment

Thanks for fixing. It is currently untested, as there is no way to activate checkpoint sharding from the Trainer without training a very large model, which would be infeasible on any of the CI runners.
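
For readers unfamiliar with the code path under discussion, here is a minimal sketch of how a loader might tell a sharded checkpoint apart from a single-file one. It assumes the PyTorch checkpoint file names used by transformers around the time of this PR (`pytorch_model.bin` and `pytorch_model.bin.index.json`); `checkpoint_kind` is a hypothetical helper for illustration, not the Trainer's actual code.

```python
import os
import tempfile

# Assumed file names for PyTorch checkpoints in transformers at the time.
WEIGHTS_NAME = "pytorch_model.bin"
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"


def checkpoint_kind(folder: str) -> str:
    """Hypothetical helper: classify a checkpoint folder by the files it holds."""
    if os.path.isfile(os.path.join(folder, WEIGHTS_NAME)):
        # One file holds the full state dict.
        return "single"
    if os.path.isfile(os.path.join(folder, WEIGHTS_INDEX_NAME)):
        # An index file maps parameter names to shard files.
        return "sharded"
    return "missing"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Simulate a sharded checkpoint by creating only the index file.
        open(os.path.join(folder, WEIGHTS_INDEX_NAME), "w").close()
        print(checkpoint_kind(folder))
```

The "sharded" branch is the one that is hard to reach in CI, since a real run only produces an index file when the saved model exceeds the shard size.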

stas00 (Contributor, Author) commented May 10, 2022

Thank you for explaining why the testing of this path is complicated, Sylvain.

I think I can make it partially tested by using ZeRO-3 without "stage3_gather_16bit_weights_on_model_save", which would make it fall through and at least test that condition. I will be adding these tests in #17151.
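
A rough sketch of the kind of DeepSpeed configuration described above: ZeRO stage 3 with `stage3_gather_16bit_weights_on_model_save` left off. The function name `make_zero3_config` and the output file name are illustrative assumptions; the actual tests in #17151 may build their config differently.

```python
import json


def make_zero3_config(gather_16bit_on_save: bool = False) -> dict:
    """Hypothetical helper: build a minimal ZeRO stage 3 config dict."""
    return {
        "zero_optimization": {
            "stage": 3,
            # With this off, no consolidated 16-bit weights file is written
            # on save (assumption: this is also DeepSpeed's default).
            "stage3_gather_16bit_weights_on_model_save": gather_16bit_on_save,
        }
    }


if __name__ == "__main__":
    config = make_zero3_config()
    # Write the config to a JSON file that could be passed to a training run.
    with open("ds_config_zero3.json", "w") as f:
        json.dump(config, f, indent=2)
```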

@stas00 stas00 merged commit 9aeacfe into main May 10, 2022
@stas00 stas00 deleted the stas00-patch-1 branch May 10, 2022 14:58
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022
* [trainer] sharded _load_best_model

probably needs a test?

* undo delete
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request May 12, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022