reduce cpu host overhead when using moe #5578
@ranzhejiang Thank you for your contribution! I have a few questions about your changes. Can you clarify them?
e9e32f4 to d860d2c
Hi @tohtana, I have clarified the modifications you mentioned and retested this PR with Megatron-DeepSpeed on a GPU platform (8xA800). It runs well, and the loss remains consistent with the original method. Could you please review it again? Thanks!
686f511 to 23ec4a1
23ec4a1 to 1cb0efd
#5881 also adopts this approach to reduce CPU time.
The .to('cpu') operation is not necessary for exp_counts. It forces a device-to-host synchronization, which hurts performance.
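To illustrate the point, here is a minimal sketch, not the code from this PR. It assumes exp_counts is computed by summing a one-hot routing mask over tokens. The function name count_tokens_per_expert and the move_to_cpu flag are hypothetical and exist only to contrast the two patterns.

```python
import torch
import torch.nn.functional as F


def count_tokens_per_expert(expert_index: torch.Tensor,
                            num_experts: int,
                            move_to_cpu: bool = False) -> torch.Tensor:
    """Count how many tokens were routed to each expert.

    Hypothetical helper for illustration; not part of DeepSpeed's API.
    """
    # One-hot mask of shape [num_tokens, num_experts].
    mask = F.one_hot(expert_index, num_classes=num_experts)
    exp_counts = torch.sum(mask, dim=0).detach()
    if move_to_cpu:
        # Pattern being removed: on a GPU, this copy makes the host wait
        # until all previously queued kernels have finished.
        return exp_counts.to('cpu')
    # Pattern being proposed: leave the result on its device so the host
    # can keep launching work without waiting.
    return exp_counts


device = "cuda" if torch.cuda.is_available() else "cpu"
expert_index = torch.tensor([0, 2, 2, 1, 0, 2], device=device)
exp_counts = count_tokens_per_expert(expert_index, num_experts=4)
```

With the default move_to_cpu=False, exp_counts stays on the same device as the routing indices. A caller that really needs host values can still copy them later, at a point where a synchronization is acceptable.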