ZeRO stage 1 refresh #1042

Merged
merged 12 commits into from
May 19, 2021


@jeffra jeffra commented May 5, 2021

This PR changes the default underlying implementation of ZeRO stage 1. We have had trouble keeping up with bug fixes and maintaining feature parity across ZeRO implementations (e.g., CPU offload, communication overlap, FP32 support). Instead, we are deprecating the original stage 1 implementation in favor of a new mode of stage 2 that supports optimizer state partitioning only. There are a few implications to be aware of once this PR is merged.

ZeRO stage 1 now supports

  • Optimizer state offload to CPU
  • Communication overlap
    • As in stage 2, this uses backward hooks to reduce gradients as soon as they become available. It can be turned on or off with "overlap_comm": [true|false]
  • Pipeline parallelism + ZeRO stage 1 now supports optimizer state offload to CPU
  • FP32/TF32 support
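
As a rough illustration of the options above, a config selecting stage 1 with communication overlap might be built as follows. Only "overlap_comm" is named in this description. The "zero_optimization" and "stage" layout is assumed from the usual DeepSpeed config schema, and the helper name and batch size are hypothetical placeholders.

```python
import json


def make_stage1_config(overlap_comm: bool = True) -> dict:
    """Build a minimal DeepSpeed-style config dict for ZeRO stage 1.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    return {
        "train_batch_size": 8,  # placeholder value
        "zero_optimization": {
            "stage": 1,                    # optimizer state partitioning only
            "overlap_comm": overlap_comm,  # toggle named in this PR description
        },
    }


config = make_stage1_config(overlap_comm=True)
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and used as the DeepSpeed config; setting overlap_comm=False disables the overlap.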

Warnings

  • ZeRO stage 1 checkpoints created before this PR will not be compatible with the new ZeRO stage 1
    • Until v0.4.0 is released, we will support a new ZeRO JSON config parameter, "legacy_stage": true, which selects the old stage 1 codebase.
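
A minimal sketch of opting back into the old implementation, for example to keep loading older stage 1 checkpoints. The key name is copied as written in this description; placing it inside the "zero_optimization" section is an assumption, so check the config documentation for your DeepSpeed version.

```python
import json

# Assumption: the legacy switch sits next to "stage" in the ZeRO section.
zero_section = {
    "stage": 1,
    "legacy_stage": True,  # key name as written in this PR description
}

ds_config = {"zero_optimization": zero_section}
print(json.dumps(ds_config, indent=2))
```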

Misc

  • Memory overhead bug fix related to both ZeRO stage 1 and 2. Fixes a bug where, if a single model parameter was larger than our reduce bucket size, we would error out. In certain cases this required arbitrarily inflating the bucket size (x4.5) to accommodate the large param(s).
  • Fixes a perf bug in pipeline parallelism where we were all-reducing all model gradients instead of relying on reduce(scatter) in ZeRO for reduced communication volume.
  • Adds DeepSpeed version info to checkpoints
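
The bucket-size fix can be pictured with a small standalone sketch. This is not DeepSpeed's actual code: bucket_params is a hypothetical function that groups parameter sizes into reduce buckets and gives an oversized parameter its own bucket instead of raising an error, so the bucket size never has to be inflated to fit it.

```python
from typing import List


def bucket_params(param_sizes: List[int], bucket_size: int) -> List[List[int]]:
    """Group parameter sizes (element counts) into reduce buckets.

    A parameter larger than bucket_size is placed alone in its own bucket
    rather than causing an error.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in param_sizes:
        # Flush the current bucket if adding this parameter would overflow it.
        if current and current_total + size > bucket_size:
            buckets.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        buckets.append(current)
    return buckets


buckets = bucket_params([100, 200, 5000, 50], bucket_size=500)
print(buckets)  # [[100, 200], [5000], [50]]
```

Here the 5000-element parameter exceeds the 500-element bucket size but is still reduced, in a bucket of its own.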

@jeffra jeffra merged commit cfa63f5 into master May 19, 2021
@mrwyattii mrwyattii deleted the jeffra/z1-refresh-4 branch July 7, 2023 02:40