[torch.compile] support moe models #9632
Conversation
("nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Dyn-Per-Token-2048-Samples", | ||
["--quantization", "compressed-tensors" | ||
], 1, 1, "FLASH_ATTN", "generate", True), | ||
("google/gemma-2-2b-it", [], 1, 2, "FLASHINFER", "generate", True), | ||
("ibm/PowerMoE-3b", [], 1, 2, "FLASH_ATTN", "generate", True), |
it does not work with FLASHINFER due to some head size config.
    gating_output=router_logits,
    renormalize=renormalize)

forward_native = forward_cuda
What is this for?
when we use inductor, we will use forward_native. this function needs to be implemented.
you can also write a pytorch native implementation, but I doubt if inductor can optimize it. that's why I use forward_cuda as forward_native.
basically, forward_native is the function we compile when we use inductor.
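For illustration, a minimal sketch of the pattern (class and method names here are made up, not the actual vLLM layer): the module exposes backend-specific forwards, and forward_native is what the inductor path compiles; in this PR it is simply aliased to the CUDA implementation.

import torch
from torch import nn


class ToyFusedMoE(nn.Module):
    """Illustrative stand-in, not the real vLLM FusedMoE layer."""

    def forward_cuda(self, hidden_states: torch.Tensor,
                     router_logits: torch.Tensor) -> torch.Tensor:
        # Placeholder for the fused Triton/CUDA MoE kernel call.
        weights, _ = torch.softmax(router_logits, dim=-1).max(dim=-1, keepdim=True)
        return hidden_states * weights

    # Class-level alias: forward_native is the same function object as
    # forward_cuda, so compiling the "native" path compiles the CUDA forward.
    forward_native = forward_cuda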
I see. I was just surprised because I didn't know that a class method can be defined this way
So it's essentially the same as

def forward_native(self, *args, **kwargs):
    return self.forward_cuda(*args, **kwargs)

?
yes.
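A tiny standalone check of the class-body aliasing semantics (toy class, purely to illustrate):

class Toy:
    def forward_cuda(self):
        return "cuda path"

    # Assigning in the class body just binds a second name to the same
    # function object in the class namespace.
    forward_native = forward_cuda


assert Toy.forward_native is Toy.forward_cuda
assert Toy().forward_native() == "cuda path"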
LGTM!
MoE models will read a config file to determine the Triton config to run. Reading files during forward is a disaster for torch.compile. This PR wraps the config-reading part inside a custom op, so that it can pass torch.compile (although torch.compile will not be able to optimize it).
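A minimal sketch of the idea (the op name, config path, and placeholder "kernel" are assumptions for illustration, not the actual vLLM code; torch.library.custom_op requires PyTorch 2.4+): registering the MoE forward as a custom op means torch.compile treats the call as opaque, so the file read inside it never enters the traced graph.

import json
import os

import torch
from torch.library import custom_op


@custom_op("toy::moe_forward", mutates_args=())
def moe_forward(hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
    # The file I/O stays inside the op body, which torch.compile never traces into.
    config_path = "moe_tuning_config.json"  # hypothetical path
    if os.path.exists(config_path):
        with open(config_path) as f:
            _config = json.load(f)  # would select a Triton block config here
    # Stand-in for the fused MoE kernel launch.
    weights, _ = torch.softmax(router_logits, dim=-1).max(dim=-1, keepdim=True)
    return hidden_states * weights


@moe_forward.register_fake
def _(hidden_states, router_logits):
    # Shape/dtype-only implementation used while tracing/compiling.
    return torch.empty_like(hidden_states)


@torch.compile
def run(x, logits):
    return moe_forward(x, logits)


out = run(torch.randn(8, 16), torch.randn(8, 4))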