Improve error warning for dist_cp loading without optimizer state #3752

Open
wants to merge 7 commits into
base: main
from

j316chuck
Contributor

@j316chuck j316chuck commented Jan 24, 2025

What does this PR do?

Improves error logging for checkpoints saved without optimizer state (save_weights_only=True) and then loaded with load_weights_only=False under the sharded checkpointing code path.

What issue(s) does this change relate to?

https://databricks.atlassian.net/browse/GRT-2801

Tests

Before: 1-node-mpt-13b-monolithic-crusoe-EVyT86 - optimizer key error 🔴

[rank4]: "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/default_planner.py", line 354, in create_default_local_load_plan
[rank4]:     raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
[rank4]: RuntimeError: Missing key in checkpoint state_dict: state.model.model.transformer.blocks.5.norm_1.weight.

After: 1-node-mpt-13b-monolithic-crusoe-AF3tW4 - proper error warning ✅
The error about the missing optimizer state is then raised again.

2025-01-24 01:12:52,470: rank0[462][MainThread]: INFO: composer.utils.checkpoint: Optimizer states are not in the state_dict and won't be loaded.
2025-01-24 01:12:52,470: rank0[462][MainThread]: INFO: Consider setting load_weights_only=True or ensure that the optimizer state is saved in the checkpoint.
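
For reviewers, here is a minimal sketch of the kind of check that could produce the log lines above. It is not the code in this PR: the helper name `optimizer_state_present` and the key prefix `state.optimizers.` are assumptions made for illustration (the traceback above only shows keys under `state.model.`), and the checkpoint is represented as a plain list of key names rather than real dist_cp metadata.

```python
import logging

log = logging.getLogger(__name__)

# Assumed prefix for optimizer entries; hypothetical, not confirmed by the PR.
OPTIMIZER_PREFIX = "state.optimizers."


def optimizer_state_present(checkpoint_keys, load_weights_only):
    """Return True if optimizer state should be loaded from the checkpoint.

    `checkpoint_keys` is any iterable of fully qualified key names found in
    the saved checkpoint. This helper is illustrative only.
    """
    if load_weights_only:
        # The caller asked for weights only, so optimizer state is skipped.
        return False

    has_optimizer = any(
        key.startswith(OPTIMIZER_PREFIX) for key in checkpoint_keys
    )
    if not has_optimizer:
        # Same wording as the log lines quoted in the "After" run.
        log.info(
            "Optimizer states are not in the state_dict and won't be loaded."
        )
        log.info(
            "Consider setting load_weights_only=True or ensure that the "
            "optimizer state is saved in the checkpoint."
        )
    return has_optimizer
```

A caller could use the return value to decide whether to add optimizer entries to the state_dict it requests from the checkpoint, so that a weights-only checkpoint produces the informational message before any later failure.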

@j316chuck j316chuck marked this pull request as draft January 24, 2025 01:13
@j316chuck j316chuck requested a review from dakinggg January 24, 2025 01:16
@j316chuck j316chuck marked this pull request as ready for review January 24, 2025 05:36
@j316chuck j316chuck changed the title from "Improve error logging for dist_cp loading without optimizer state" to "Improve error warning for dist_cp loading without optimizer state" Jan 24, 2025
@j316chuck j316chuck requested a review from a team as a code owner January 24, 2025 21:28