Fix the MoE-params gradient-scaling (#4957)
This PR fixes a bug that I introduced in a previous [PR](#4695). The MoE parameters' gradients were accidentally double-scaled because `self.ipg_bucket_has_moe_params` was passed to the all_reduce functions. Since the MoE parameters are already scaled [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1054), we can safely pass `divide=False`. The `divide` argument may no longer be needed, but I have kept it because I think it may still be needed for the sequence-parallelism accuracy/stability adjustments. cc: @tjruwase
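To illustrate the double-scaling issue, here is a minimal sketch (not the actual DeepSpeed code; `scale_moe_grads` and `allreduce_bucket` are hypothetical stand-ins) showing how passing a divide flag to the reduction helper after the gradients have already been scaled divides them twice:

```python
import torch

def scale_moe_grads(grads, ep_world_size):
    # Stand-in for the pre-scaling step that already divides MoE gradients
    # by the (expert-parallel) world size before the reduction is issued.
    return [g / ep_world_size for g in grads]

def allreduce_bucket(grads, world_size, divide=True):
    # Simplified stand-in for an all-reduce helper that optionally divides
    # by the world size as part of the reduction.
    if divide:
        grads = [g / world_size for g in grads]
    # ... collective all-reduce would happen here ...
    return grads

grads = [torch.ones(4)]
already_scaled = scale_moe_grads(grads, ep_world_size=2)

correct = allreduce_bucket(already_scaled, world_size=2, divide=False)  # scaled once
buggy = allreduce_bucket(already_scaled, world_size=2, divide=True)     # scaled twice
print(correct[0][0].item(), buggy[0][0].item())  # 0.5 vs 0.25
```

Under these assumptions, passing `divide=False` (as this PR does) leaves the already-scaled MoE gradients untouched, while `divide=True` applies the division a second time.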