
fix memcpy issue on backward for zero-infinity #6670

Merged
merged 7 commits into deepspeedai:master on Oct 31, 2024

Conversation

xylian86
Contributor

This PR is similar to PR #5301; it optimizes the D2H copy time by using pinned memory.

Previously, the D2H memcpy was the bottleneck during the final backward pass of each iteration for ZeRO-Infinity (offload), as shown in Trace-1. The new version eliminates the bottleneck, as shown in Trace-2.

Trace-1: [image]

Trace-2: [image]
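
To make the mechanism concrete, here is a minimal, self-contained sketch of the staged copy described above. The names grad_buffer, pinned_grad_buffer, and fp32_grad_tensor mirror the PR discussion, but the snippet is illustrative only, not the actual DeepSpeed code:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Accelerator-side bf16 gradient partition produced during backward.
    grad_buffer = torch.randn(1 << 20, dtype=torch.bfloat16, device=device)

    # CPU-side fp32 gradient partition used by the offloaded optimizer.
    fp32_grad_tensor = torch.empty(grad_buffer.numel(), dtype=torch.float32)

    # Preallocated, reusable staging buffer in pinned (page-locked) host memory.
    pinned_grad_buffer = torch.empty(grad_buffer.numel(),
                                     dtype=torch.float32,
                                     pin_memory=(device == "cuda"))

    buffer_numel = grad_buffer.numel()
    # 1) Upcast on the accelerator, then D2H copy into the pinned staging buffer.
    pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
    # 2) H2H copy from the pinned staging buffer into the optimizer's fp32 gradient.
    fp32_grad_tensor.copy_(pinned_grad_buffer[:buffer_numel])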

cc @tjruwase

@xylian86 xylian86 requested a review from tjruwase as a code owner October 26, 2024 05:58
@loadams loadams requested a review from GuanhuaWang October 28, 2024 16:02
@loadams loadams enabled auto-merge October 30, 2024 23:44
@loadams loadams disabled auto-merge October 31, 2024 17:55
@loadams loadams merged commit ff1c543 into deepspeedai:master Oct 31, 2024
13 of 14 checks passed
@deepcharm
Contributor

deepcharm commented Dec 11, 2024

Hi @xylian86, we have encountered a significant performance degradation when applying this patch on the Gaudi HPU accelerator.
With the patch, partition_grads() takes approximately 2x longer.

The issues that we see are:

  1. A device synchronize after the first copy
  2. The second H2H copy takes a very long time

Below is the partition_grads profile before the patch:

partition_grads_before [image]

And with the patch:

partition_grads_with_pr6670 [image]

The above traces are taken for a small variant of BERT. For larger models, the performance degradation is more severe.

Could you please share some details of the performance improvement that you observed (e.g., accelerator/workload/some stats)? Thanks

@tjruwase
Contributor

@deepcharm, thanks for reporting this degradation. It is strange, since this PR provided good speedups on CUDA, so I am guessing this is some device-specific effect.

While waiting for @xylian86 to respond, I wonder if you could test the following modification to L1508-1512, which removes non_blocking=True and get_accelerator().synchronize():

                        # Upcast on the accelerator, then blocking D2H copy into the pinned staging buffer
                        self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
                        # H2H copy from the pinned buffer into the offloaded fp32 gradient partition
                        fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])
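
For contrast, the unmodified lines under discussion look roughly like this (a reconstruction inferred from the description above, not the verbatim source):

                        # Asynchronous D2H copy into pinned memory, followed by an explicit device sync
                        self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32),
                                                                     non_blocking=True)
                        get_accelerator().synchronize()
                        fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])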

@xylian86
Contributor Author

@deepcharm Thank you for reporting it. The setup on my side: 4 GH200 GPUs, GBS=128, MBS=16, ZeRO 3. With this pull request, we observed a 4x reduction in backward pass time during the accumulation step, as demonstrated in the figure attached to the PR.

As Tunji mentioned, it might be a device-specific effect; what is the bandwidth between CPU and GPU in your testing environment? This PR is particularly effective for newer GPU clusters with high-bandwidth interconnects (e.g., PCIe 5.0).
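
For reference, a quick way to estimate D2H bandwidth with pinned vs. pageable host memory is a small PyTorch timing loop like the one below (an illustrative sketch, not part of DeepSpeed; sizes and results will vary by platform):

    import time
    import torch

    def d2h_bandwidth_gbps(pinned: bool, numel: int = 1 << 27, iters: int = 10) -> float:
        """Time device-to-host copies of an fp32 tensor and return GB/s."""
        src = torch.randn(numel, dtype=torch.float32, device="cuda")
        dst = torch.empty(numel, dtype=torch.float32, pin_memory=pinned)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dst.copy_(src)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return numel * 4 * iters / elapsed / 1e9

    if torch.cuda.is_available():
        print(f"pageable D2H: {d2h_bandwidth_gbps(False):.1f} GB/s")
        print(f"pinned   D2H: {d2h_bandwidth_gbps(True):.1f} GB/s")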

@nelyahu
Contributor

nelyahu commented Dec 12, 2024

@xylian86 which model did you use?

@xylian86
Contributor Author

@nelyahu LlaMA 7B

@deepcharm
Contributor

@xylian86 Thanks for providing the details! Indeed, it seems that it's a device-specific effect.

@tjruwase We've tried making the first copy non-blocking while also removing the device synchronize.
Unfortunately, no change in performance was observed.

Can we add a zero config option (default=True) for this new behavior, so that HPU customers are able to
disable it and recover the performance?

@deepcharm
Contributor

Hi @tjruwase, could you please approve/comment on adding a config option to enable/disable this feature? Many thanks

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, apologies for the delayed response. Yes, we will add a ds_config option to control this feature.

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301?

@tjruwase
Contributor

@xylian86, did you try the pin_memory=True option of the optimizer_offload config before creating this PR?

https://github.com/microsoft/DeepSpeed/blob/7f3d669b40f8d29010efd9578d4a2cdd0f16b20e/deepspeed/runtime/zero/stage3.py#L545

@nelyahu
Contributor

nelyahu commented Jan 19, 2025

@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301?

@tjruwase We will check ZeRO-1/2 + offload-optimizer. We observed it only in our nightly regression for ZeRO-Infinity and ZeRO-3 + offload-optimizer.

@xylian86
Contributor Author

@tjruwase Hi Tunji, I tried both the pin_memory=True and pin_memory=False options before creating the PR, and the traces I shared are from pin_memory=True.

Here are the benchmark results using one GH200 GPU with batch size=1, ZeRO 3, and a 3B model.

Before this PR:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 76.96 | 25603.30 | 858.70 |
| False | 74.46 | 24053.81 | 951.50 |

After this PR:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 75.09 | 1174.31 | 842.62 |
| False | 76.88 | 1287.00 | 938.85 |

Legend:

  • FWD: Forward pass time (ms)
  • BWD: Backward pass time (ms)
  • STEP: Optimizer step time (ms)

I agree that we should add a config option to ensure compatibility across different hardware.

@tjruwase
Contributor

@xylian86, thanks for the details. However, the results are confusing: if pin_memory=True, then self.fp32_partitioned_groups_flat[i].grad should be pinned memory, and so using self.pinned_grad_buffer should not have any advantage. I suspect that the implicit upcast (bf16 -> fp32) in fp32_grad_tensor.copy_(grad_buffer) could be the cause. Can you please try the following with pin_memory=True before this PR:

  1. Confirm that self.fp32_partitioned_groups_flat[i].grad is pinned
  2. Measure the perf of doing an explicit GPU upcast like below (see also the timing sketch after this snippet)

                        fp32_grad_tensor = self.fp32_partitioned_groups_flat[i].grad.narrow(
                            0, dest_offset, grad_buffer.numel())
                        # Upcast bf16 -> fp32 on the GPU, then copy the fp32 result D2H
                        fp32_grad_tensor.copy_(grad_buffer.float())
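
To make the comparison concrete, here is a small standalone timing sketch of the two copy variants. The tensor size and the pinned destination are assumptions for illustration, not the DeepSpeed code paths:

    import time
    import torch

    def time_copy(explicit_upcast: bool, numel: int = 1 << 26, iters: int = 10) -> float:
        """Time a bf16 GPU -> fp32 pinned-CPU copy, in milliseconds per iteration."""
        grad_buffer = torch.randn(numel, dtype=torch.bfloat16, device="cuda")
        fp32_grad_tensor = torch.empty(numel, dtype=torch.float32, pin_memory=True)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            if explicit_upcast:
                # Upcast on the GPU first, then copy the fp32 result D2H.
                fp32_grad_tensor.copy_(grad_buffer.float())
            else:
                # Implicit upcast: dtype conversion handled by PyTorch inside the cross-device copy_.
                fp32_grad_tensor.copy_(grad_buffer)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters * 1e3

    if torch.cuda.is_available():
        print(f"implicit upcast: {time_copy(False):.2f} ms")
        print(f"explicit upcast: {time_copy(True):.2f} ms")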

@xylian86
Contributor Author

xylian86 commented Jan 19, 2025

@tjruwase

  1. Yes, I confirm that the fp32 grad tensor is pinned via the following check:

    (Pdb) pp self.offload_optimizer_pin_memory
    (Pdb) pp self.grad_partitions_flat_buffer.is_pinned()

  2. After doing the explicit GPU upcast:

| pin_memory | FWD | BWD | STEP |
|------------|-----|-----|------|
| True | 82.2 | 309.2 | 834.93 |

From the above results: 1) the current implementation (pinning a large buffer and reusing it) incurs additional cost compared to explicit upcasting, which is quite interesting; I need to take some time to investigate it, so please let me know if you have any suggestions. 2) There appears to be a similar issue in the optimizer step when copying gradient tensors back to the GPU.

@xylian86
Contributor Author

As a follow-up to the previous post, the similar issue in the optimizer step occurs in the following two lines: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2075-L2076

            # Implicit fp32 -> fp16 downcast inside copy_ (fp32 source on CPU, fp16 destination on GPU)
            self.fp16_partitioned_groups_flat[sub_group_id].data.copy_(
                self.fp32_partitioned_groups_flat[sub_group_id].data)

When the CPU-GPU bandwidth is high, it would be more efficient to first copy the FP32 tensor to the GPU and then downcast it to FP16, as sketched below.
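
A minimal sketch of that alternative, assuming an fp32 source in pinned CPU memory and an fp16 destination on the GPU (names are illustrative, not the DeepSpeed code):

    import torch

    numel = 1 << 26
    fp32_cpu_params = torch.randn(numel, dtype=torch.float32, pin_memory=True)
    fp16_gpu_params = torch.empty(numel, dtype=torch.float16, device="cuda")

    # Variant discussed above: move the fp32 bytes H2D first, then downcast on the GPU.
    fp32_on_gpu = fp32_cpu_params.to("cuda", non_blocking=True)
    fp16_gpu_params.copy_(fp32_on_gpu)  # fp32 -> fp16 downcast happens on the GPU

    # Existing behavior, for comparison: implicit downcast inside the cross-device copy_.
    # fp16_gpu_params.copy_(fp32_cpu_params)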

@tjruwase
Contributor

2. After doing explicit GPU upcast

@xylian86, thanks for the update. The impact of explicit upcast on BWD is quite dramatic. Can you please confirm that the following results using pin_memory=True are correct?

| Option | BWD | Speedup |
|--------|-----|---------|
| Before PR | 25603.30 | 1X |
| After PR | 1174.31 | 21.8X |
| Before PR + GPU upcast | 309.2 | 82.8X |

@xylian86
Contributor Author

@tjruwase Yes, the results are correct.

@tjruwase
Contributor

@xylian86, thanks for the confirmation. Given the HPU perf regression, we urgently need a PR that applies the explicit GPU upcast and undoes this one. Are you able to help with this?

@xylian86
Contributor Author

@tjruwase For sure. I will open a PR now.

@xylian86
Contributor Author

@tjruwase I have opened PR #6962; please let me know if you have further suggestions on it.

@tjruwase
Contributor

@xylian86, thanks for the quick action. The PR looks great.

@deepcharm, after more investigation we are replacing this PR with #6962. Can you please check for any perf issues with the new PR? Thanks!

github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2025
Following discussion in [PR-6670](#6670), the explicit upcast is much more efficient than the implicit
upcast; this PR replaces the implicit upcast with an explicit one.

The results on 3B model are shown below:

| Option | BWD (ms) | Speed up |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8X |
| After this PR| 309.2 | 82.8X |
tjruwase pushed a commit that referenced this pull request Feb 6, 2025
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request Feb 7, 2025
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request Feb 9, 2025