Rank local checkpointing in DCP internal without collectives #989

Open
wants to merge 1 commit into
base: master

Conversation

saumishr (Contributor) commented Apr 8, 2025

Summary:

Context

DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing (XLFormers-style checkpointing), in which each rank saves and loads its checkpoint without any collectives. The trade-off for now is that deduplication and resharding are not supported. Support for both will be introduced soon.

Differential Revision: D72390326
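To illustrate the idea only, here is a minimal sketch of collective-free, rank-local save and load. It is not the code in this PR and does not use the DCP API: it uses `pickle` from the standard library, and the helper names (`rank_file`, `save_rank_local`, `load_rank_local`) and the `rank_{rank}.pkl` file layout are hypothetical. Each rank derives its own file path from its rank number, so no metadata has to be gathered or broadcast.

```python
import pickle
import tempfile
from pathlib import Path


def rank_file(checkpoint_dir: str, rank: int) -> Path:
    # Hypothetical naming scheme: the path is computed locally from the rank,
    # so no communication is needed to agree on where to write.
    return Path(checkpoint_dir) / f"rank_{rank}.pkl"


def save_rank_local(state: dict, checkpoint_dir: str, rank: int) -> Path:
    # Each rank writes only its own state. Values replicated across ranks are
    # written once per rank (no dedupe).
    path = rank_file(checkpoint_dir, rank)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    # Rename into place so a partially written file is never left under the
    # final name.
    tmp.replace(path)
    return path


def load_rank_local(checkpoint_dir: str, rank: int) -> dict:
    # Each rank reads back exactly the file it wrote, so the job must resume
    # with the same rank layout (no resharding).
    with open(rank_file(checkpoint_dir, rank), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        # Simulate two ranks in a single process for demonstration.
        for rank in range(2):
            save_rank_local({"shard": [rank] * 3, "step": 10}, ckpt_dir, rank)
        print(load_rank_local(ckpt_dir, 1))
```

The sketch makes the stated trade-off visible: since no rank knows what any other rank wrote, replicated values cannot be deduplicated and a checkpoint cannot be redistributed across a different number of ranks.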

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D72390326

saumishr added a commit to saumishr/tnt that referenced this pull request Apr 16, 2025
…#989)

…#989)

Summary:
Pull Request resolved: pytorch#989

