
fix memcpy issue on backward for zero-infinity #6670

Merged
merged 7 commits into deepspeedai:master on Oct 31, 2024

Conversation

xylian86
Contributor

This PR is similar to PR #5301; it optimizes the D2H copy time by using pinned memory.

Previously, the D2H memcpy was the bottleneck during the final backward pass of each iteration for ZeRO-Infinity (offload), as shown in Trace-1. The new version eliminates the bottleneck, as shown in Trace-2.

Trace-1: [image]

Trace-2: [image]
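
To make the mechanism concrete, here is a minimal, self-contained sketch of the staged copy described above. The names grad_buffer, pinned_grad_buffer, and fp32_grad_tensor mirror the PR discussion, but the snippet is illustrative only, not the actual DeepSpeed code:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Accelerator-side bf16 gradient partition produced during backward.
    grad_buffer = torch.randn(1 << 20, dtype=torch.bfloat16, device=device)

    # CPU-side fp32 gradient partition used by the offloaded optimizer.
    fp32_grad_tensor = torch.empty(grad_buffer.numel(), dtype=torch.float32)

    # Preallocated, reusable staging buffer in pinned (page-locked) host memory.
    pinned_grad_buffer = torch.empty(grad_buffer.numel(),
                                     dtype=torch.float32,
                                     pin_memory=(device == "cuda"))

    buffer_numel = grad_buffer.numel()
    # 1) Upcast on the accelerator, then D2H copy into the pinned staging buffer.
    pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
    # 2) H2H copy from the pinned staging buffer into the optimizer's fp32 gradient.
    fp32_grad_tensor.copy_(pinned_grad_buffer[:buffer_numel])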

cc @tjruwase

@xylian86 xylian86 requested a review from tjruwase as a code owner October 26, 2024 05:58
@loadams loadams requested a review from GuanhuaWang October 28, 2024 16:02
@loadams loadams enabled auto-merge October 30, 2024 23:44
@loadams loadams disabled auto-merge October 31, 2024 17:55
@loadams loadams merged commit ff1c543 into deepspeedai:master Oct 31, 2024
13 of 14 checks passed
@deepcharm
Contributor

deepcharm commented Dec 11, 2024

Hi @xylian86, we have encountered a significant performance degradation when applying this patch on the Gaudi HPU accelerator.
With the patch, partition_grads() takes approximately 2x longer.

The issues that we see are:

  1. A device synchronize after the first copy
  2. The second H2H copy takes a very long time

Below is the partition_grads profile before the patch:

partition_grads_before [image]

And with the patch:

partition_grads_with_pr6670 [image]

The above traces are taken for a small variant of BERT. For larger models, the performance degradation is more severe.

Could you please share some details of the performance improvement that you observed (e.g., accelerator/workload/some stats)? Thanks

@tjruwase
Contributor

@deepcharm, thanks for reporting this degradation. It is strange, since this PR provided good speedups on CUDA, so I am guessing this is some device-specific effect.

While waiting for @xylian86 to respond, I wonder if you could test the following modification to L1508-1512, which removes non_blocking=True and get_accelerator().synchronize():

                        # Upcast on the accelerator, then blocking D2H copy into the pinned staging buffer
                        self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
                        # H2H copy from the pinned buffer into the offloaded fp32 gradient partition
                        fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])
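
For contrast, the unmodified lines under discussion look roughly like this (a reconstruction inferred from the description above, not the verbatim source):

                        # Asynchronous D2H copy into pinned memory, followed by an explicit device sync
                        self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32),
                                                                     non_blocking=True)
                        get_accelerator().synchronize()
                        fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])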

@xylian86
Contributor Author

@deepcharm Thank you for reporting it. The setup on my side: 4 GH200 GPUs, GBS=128, MBS=16, ZeRO 3. With this pull request, we observed a 4x reduction in backward pass time during the accumulation step, as demonstrated in the figure attached to the PR.

As Tunji mentioned, it might be a device-specific effect; what is the bandwidth between CPU and GPU in your testing environment? This PR is particularly effective for newer GPU clusters with high-bandwidth interconnects (e.g., PCIe 5.0).
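
For reference, a quick way to estimate D2H bandwidth with pinned vs. pageable host memory is a small PyTorch timing loop like the one below (an illustrative sketch, not part of DeepSpeed; sizes and results will vary by platform):

    import time
    import torch

    def d2h_bandwidth_gbps(pinned: bool, numel: int = 1 << 27, iters: int = 10) -> float:
        """Time device-to-host copies of an fp32 tensor and return GB/s."""
        src = torch.randn(numel, dtype=torch.float32, device="cuda")
        dst = torch.empty(numel, dtype=torch.float32, pin_memory=pinned)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dst.copy_(src)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return numel * 4 * iters / elapsed / 1e9

    if torch.cuda.is_available():
        print(f"pageable D2H: {d2h_bandwidth_gbps(False):.1f} GB/s")
        print(f"pinned   D2H: {d2h_bandwidth_gbps(True):.1f} GB/s")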

@nelyahu
Contributor

nelyahu commented Dec 12, 2024

@xylian86 which model did you use?

@xylian86
Contributor Author

@nelyahu LlaMA 7B

@deepcharm
Contributor

@xylian86 Thanks for providing the details! Indeed, it seems that it's a device-specific effect.

@tjruwase We've tried making the first copy non-blocking while also removing the device synchronize.
Unfortunately, no change in performance was observed.

Can we add a zero config option (default=True) for this new behavior, so that HPU customers are able to
disable it and recover the performance?

@deepcharm
Contributor

Hi @tjruwase, could you please approve/comment on adding a config option to enable/disable this feature? Many thanks

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, apologies for the delayed response. Yes, we will add a ds_config option to control this feature.

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301?

@tjruwase
Contributor

@xylian86, did you try the pin_memory=True option of the optimizer_offload config before creating this PR?

https://github.com/microsoft/DeepSpeed/blob/7f3d669b40f8d29010efd9578d4a2cdd0f16b20e/deepspeed/runtime/zero/stage3.py#L545

@nelyahu
Contributor

nelyahu commented Jan 19, 2025

@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301?

@tjruwase We will check ZeRO-1/2 + offload-optimizer. We observed it only in our nightly regression for ZeRO-Infinity and ZeRO-3 + offload-optimizer.

@xylian86
Contributor Author

@tjruwase Hi Tunji, I tried both the pin_memory=True and pin_memory=False options before creating the PR, and the traces I shared are from pin_memory=True.

Here are the benchmark results using one GH200 GPU with batch size=1, ZeRO 3, and a 3B model.

Before this PR:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 76.96 | 25603.30 | 858.70 |
| False | 74.46 | 24053.81 | 951.50 |

After this PR:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 75.09 | 1174.31 | 842.62 |
| False | 76.88 | 1287.00 | 938.85 |

Legend:

  • FWD: Forward pass time (ms)
  • BWD: Backward pass time (ms)
  • STEP: Optimizer step time (ms)

I agree that we should add a config option to ensure compatibility across different hardware.

@tjruwase
Contributor

@xylian86, thanks for the details. However, the results are confusing: if pin_memory=True, then self.fp32_partitioned_groups_flat[i].grad should be pinned memory, and so using self.pinned_grad_buffer should not have any advantage. I suspect that the implicit upcast (bf16 -> fp32) in fp32_grad_tensor.copy_(grad_buffer) could be the cause. Can you please try the following with pin_memory=True before this PR:

  1. Confirm that self.fp32_partitioned_groups_flat[i].grad is pinned
  2. Measure the perf of doing an explicit GPU upcast like below (see also the timing sketch after this snippet)

                        fp32_grad_tensor = self.fp32_partitioned_groups_flat[i].grad.narrow(
                            0, dest_offset, grad_buffer.numel())
                        # Upcast bf16 -> fp32 on the GPU, then copy the fp32 result D2H
                        fp32_grad_tensor.copy_(grad_buffer.float())
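
To make the comparison concrete, here is a small standalone timing sketch of the two copy variants. The tensor size and the pinned destination are assumptions for illustration, not the DeepSpeed code paths:

    import time
    import torch

    def time_copy(explicit_upcast: bool, numel: int = 1 << 26, iters: int = 10) -> float:
        """Time a bf16 GPU -> fp32 pinned-CPU copy, in milliseconds per iteration."""
        grad_buffer = torch.randn(numel, dtype=torch.bfloat16, device="cuda")
        fp32_grad_tensor = torch.empty(numel, dtype=torch.float32, pin_memory=True)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            if explicit_upcast:
                # Upcast on the GPU first, then copy the fp32 result D2H.
                fp32_grad_tensor.copy_(grad_buffer.float())
            else:
                # Implicit upcast: dtype conversion handled by PyTorch inside the cross-device copy_.
                fp32_grad_tensor.copy_(grad_buffer)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters * 1e3

    if torch.cuda.is_available():
        print(f"implicit upcast: {time_copy(False):.2f} ms")
        print(f"explicit upcast: {time_copy(True):.2f} ms")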

@xylian86
Contributor Author

xylian86 commented Jan 19, 2025

@tjruwase

  1. Yes, I confirm that the fp32 grad tensor is pinned via the following check:

    (Pdb) pp self.offload_optimizer_pin_memory
    (Pdb) pp self.grad_partitions_flat_buffer.is_pinned()

  2. After doing the explicit GPU upcast:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 82.2 | 309.2 | 834.93 |

From the above results: 1) the current implementation (pinning a large buffer and reusing it) incurs additional cost compared to explicit upcasting, which is quite interesting; I need to take some time to investigate it, so please let me know if you have any suggestions. 2) There appears to be a similar issue in the optimizer step when copying gradient tensors back to the GPU.

@xylian86
Contributor Author

As a follow-up to the previous post, the similar issue in the optimizer step occurs in the following two lines: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2075-L2076

            # Implicit fp32 -> fp16 downcast inside copy_ (fp32 source on CPU, fp16 destination on GPU)
            self.fp16_partitioned_groups_flat[sub_group_id].data.copy_(
                self.fp32_partitioned_groups_flat[sub_group_id].data)

When the CPU-GPU bandwidth is high, it would be more efficient to first copy the FP32 tensor to the GPU and then downcast it to FP16, as sketched below.
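
A minimal sketch of that alternative, assuming an fp32 source in pinned CPU memory and an fp16 destination on the GPU (names are illustrative, not the DeepSpeed code):

    import torch

    numel = 1 << 26
    fp32_cpu_params = torch.randn(numel, dtype=torch.float32, pin_memory=True)
    fp16_gpu_params = torch.empty(numel, dtype=torch.float16, device="cuda")

    # Variant discussed above: move the fp32 bytes H2D first, then downcast on the GPU.
    fp32_on_gpu = fp32_cpu_params.to("cuda", non_blocking=True)
    fp16_gpu_params.copy_(fp32_on_gpu)  # fp32 -> fp16 downcast happens on the GPU

    # Existing behavior, for comparison: implicit downcast inside the cross-device copy_.
    # fp16_gpu_params.copy_(fp32_cpu_params)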

@tjruwase
Contributor

2. After doing explicit GPU upcast

@xylian86, thanks for the update. The impact of explicit upcast on BWD is quite dramatic. Can you please confirm that the following results using pin_memory=True are correct?

| Option | BWD | Speedup |
|--------|-----|---------|
| Before PR | 25603.30 | 1X |
| After PR | 1174.31 | 21.8X |
| Before PR + GPU upcast | 309.2 | 82.8X |

@xylian86
Contributor Author

@tjruwase Yes, the results are correct.

@tjruwase
Contributor

@xylian86, thanks for the confirmation. Given the HPU perf regression, we urgently need a PR that applies the explicit GPU upcast and undoes this one. Are you able to help with this?

@xylian86
Contributor Author

@tjruwase For sure. I will open a PR now.

@xylian86
Contributor Author

@tjruwase I have opened PR #6962; please let me know if you have further suggestions on it.

@tjruwase
Contributor

@xylian86, thanks for the quick action. The PR looks great.

@deepcharm, after more investigation we are replacing this PR with #6962. Can you please check for any perf issues with the new PR? Thanks!

github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2025
Following discussion in [PR-6670](#6670), the explicit upcast is much more efficient than the implicit
upcast; this PR replaces the implicit upcast with an explicit one.

The results on 3B model are shown below:

| Option | BWD (ms) | Speed up |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8X |
| After this PR| 309.2 | 82.8X |
tjruwase pushed a commit that referenced this pull request Feb 6, 2025
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request Feb 7, 2025
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request Feb 9, 2025