support split qkv linear and sp overlap comm #415
base: main
Conversation
SP is a fantastic piece of work, very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication with computation:

- Forward pass: the QKV matrix multiplications can be pipelined alongside some of the all-to-all communications.
- Backward pass: the DQ, DK, and DV all-to-all communications can be pipelined alongside matrix multiplications.
- Backward pass: DO_w can run in parallel with DO_input, covering both matrix multiplications and all-to-all communications.

Similar overlap-comm strategies are used in Megatron for TP/TP-sp parallelism; a minimal sketch of the overlap pattern is shown below.

I tested under the conditions 1N8C (1 node, 8 GPUs), ZeRO stage 1, activation checkpointing disabled, ds-sp=8, and gbs=16, with two configurations: 1B at 64K sequence length and 7B at 16K sequence length. Both showed over 10% improvement, despite some of the TFLOPs numbers already being at a relatively good level. (I also found that for Megatron-DeepSpeed, split QKV by itself can improve performance by removing slice + cat operations in the forward and backward passes.)

Co-work with deepspeedai/Megatron-DeepSpeed#415

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
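To make the overlap idea concrete, here is a minimal, hypothetical sketch (not the code from this PR) of overlapping one asynchronous all-to-all with independent matrix multiplications. It assumes `torch.distributed` is initialized with the NCCL backend and that the tensors are on the GPU; the function name, weights, and shapes are illustrative only, and the sketch omits the K/V all-to-alls and the sequence/head scatter-gather that DeepSpeed-Ulysses actually performs.

```python
# Minimal sketch: overlap an async all-to-all for Q with the K/V projections.
# Assumes torch.distributed is initialized with NCCL and `group` is the
# sequence-parallel process group (names here are illustrative only).
import torch
import torch.distributed as dist


def qkv_with_overlapped_alltoall(hidden, wq, wk, wv, group=None):
    """Project Q, start its all-to-all asynchronously, and compute the
    K/V projections while the communication is in flight."""
    q = torch.matmul(hidden, wq)
    q_out = torch.empty_like(q)
    # async_op=True lets NCCL run the collective on its own internal stream,
    # so the matmuls below can execute concurrently with the communication.
    work = dist.all_to_all_single(q_out, q, group=group, async_op=True)

    k = torch.matmul(hidden, wk)   # overlaps with the in-flight Q all-to-all
    v = torch.matmul(hidden, wv)   # overlaps with the in-flight Q all-to-all

    work.wait()                    # make the current stream wait for Q
    return q_out, k, v
```

The same pattern applies in the backward pass, where the DQ/DK/DV all-to-alls can be launched asynchronously while the remaining gradient matmuls proceed.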
deepspeedai/DeepSpeed#5691 is merged. @inkcherry do you still need this PR to be reviewed? Can you resolve the conflicts on this branch?
@tohtana, @loadams, deepspeedai/DeepSpeed#5691 is merged; could you merge this one? Thanks!
Hello, when I run pretrain_gpt.py I run into the following bugs:
@yingtongxiong If using this branch,
Hi @inkcherry - could you take a look at resolving the merge conflicts on this?
Hi @loadams, currently:
master mds + master ds (197~200 steps):
this branch + ds fix patch + overlap enabled (197~200 steps):
Hello, now I am hitting this problem; the script I run is pretrain_gpt.py.
I can run this shell script (with flash-attention v2 enabled and activation checkpointing disabled) if I don't enable the two overlap options.
@yingtongxiong
Works with deepspeedai/DeepSpeed#5691.
When using ds_sequence_parallel, enable the following two flags to turn on overlap comm (see the sketch after the flag list):
--split-qkv-linear
--ds-sequence-parallel-overlap-comm
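For reference, below is a minimal, hypothetical sketch of how two such flags are typically declared in Megatron-style argparse argument groups; the flag names come from this PR, but the group title, help text, and wiring are assumptions and may differ from the actual implementation.

```python
# Hypothetical sketch of declaring the two flags with argparse; only the
# flag names come from the PR, the rest is illustrative.
import argparse


def add_sp_overlap_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    group = parser.add_argument_group(title="ds sequence parallel overlap")
    group.add_argument("--split-qkv-linear", action="store_true",
                       help="Use separate Q/K/V linear projections instead of a "
                            "fused QKV projection, avoiding slice + cat in fwd/bwd.")
    group.add_argument("--ds-sequence-parallel-overlap-comm", action="store_true",
                       help="Overlap all-to-all communication with QKV matmuls "
                            "when DeepSpeed sequence parallelism is enabled.")
    return parser


if __name__ == "__main__":
    args = add_sp_overlap_args(argparse.ArgumentParser()).parse_args()
    print(args.split_qkv_linear, args.ds_sequence_parallel_overlap_comm)
```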