Enabled Qwen2-MoE Tensor Parallelism (TP) inference #6551
Conversation
Hi @Yejing-Lai, do you want to provide some comments on this PR for Qwen2-MoE AutoTP support?
Could you try to modify this line if it can meet your needs? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/auto_tp.py#L336
Yes. It can provide the same function and result if properly coded.
Thank you for your comments.
…() for uniform code management. Both have the same function and the same result.
Hi @gyou2021, can you also add
Added. Thank you for your comment.
Hi @tjruwase, this PR adds AutoTP support for Qwen2-MoE. @Yejing-Lai and I have reviewed this change. Thanks!
Modified _replace_module in auto_tp.py:
The modification keeps the layers 'shared_expert_gate' and 'gate' in Qwen2-MoE as their original type, torch.nn.Linear, instead of converting them to LinearLayer. This way their weights are not split across multiple HPU/GPU cards, and Qwen2-MoE can run correctly on multiple HPU/GPU cards.
Since the weights of 'gate' are not split across cards, no all-gather operation is needed for its output, which may improve performance.
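A minimal sketch of the idea, not the exact DeepSpeed code: the function name _replace_module_sketch, the replace_fn parameter, and the KEEP_AS_LINEAR name list below are illustrative assumptions. It shows how, during the recursive module walk, children named 'shared_expert_gate' or 'gate' can be skipped so they stay plain torch.nn.Linear and each rank keeps the full gate weight.

```python
# Illustrative sketch only; the real logic lives in
# deepspeed/module_inject/auto_tp.py. Names here are hypothetical.
import torch.nn as nn

# Qwen2-MoE gate layers that should keep full (unsharded) weights.
KEEP_AS_LINEAR = ("shared_expert_gate", "gate")

def _replace_module_sketch(module, replace_fn):
    """Recursively swap nn.Linear children for tensor-parallel layers,
    except MoE gate layers, which stay as plain nn.Linear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in KEEP_AS_LINEAR:
            # Shard this layer's weight across the tensor-parallel ranks.
            setattr(module, name, replace_fn(child))
        else:
            # Gate layers (and non-Linear modules) are left as-is and
            # recursed into, so every rank holds the full gate weight and
            # no all-gather of the router output is required.
            _replace_module_sketch(child, replace_fn)
    return module
```

Because the gate output is computed identically on every rank, the routing decision needs no cross-rank communication; only the expert layers themselves are sharded.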