Have any plans to optimize the prefill kernel for the Hopper architecture? #521

Closed
alexngng opened this issue Oct 10, 2024 · 5 comments

Comments

@alexngng

I noticed that the FlashInfer prefill kernel is much slower than FA3 and TRT-LLM FMHA on SM90.
Do you have any plans to use SM90-specific features to optimize it?

Here are some measurements I took on an SM90 device (single H20 GPU, Llama2 7B):

| Number of tokens | TRT-LLM FMHA | FA3 | FlashInfer |
| --- | --- | --- | --- |
| 512 x 1 | 37638.6 | 39334.6 | 74966.6 |
| 512 x 2 | 54729.9 | 61680.4 | 114800.0 |
| 512 x 4 | 103388.8 | 113056.2 | 190688.4 |

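
To put a number on "much slower", the figures in the table can be turned into ratios. The sketch below is only an illustration: the issue does not state the unit, so it assumes the values are latencies (lower is better), which is consistent with the report that FlashInfer is the slowest of the three. The `slowdown` helper is a hypothetical name introduced here, not part of any library.

```python
# Figures copied from the table in this issue; the unit is not stated there.
# Assumption: they are latencies (lower is better), consistent with the
# report that FlashInfer is the slowest of the three kernels.
results = {
    "512 x 1": {"TRT-LLM FMHA": 37638.6, "FA3": 39334.6, "FlashInfer": 74966.6},
    "512 x 2": {"TRT-LLM FMHA": 54729.9, "FA3": 61680.4, "FlashInfer": 114800.0},
    "512 x 4": {"TRT-LLM FMHA": 103388.8, "FA3": 113056.2, "FlashInfer": 190688.4},
}


def slowdown(row, baseline):
    """Ratio of the FlashInfer figure to the baseline figure (hypothetical helper)."""
    return row["FlashInfer"] / row[baseline]


for tokens, row in results.items():
    print(
        f"{tokens}: {slowdown(row, 'FA3'):.2f}x vs FA3, "
        f"{slowdown(row, 'TRT-LLM FMHA'):.2f}x vs TRT-LLM FMHA"
    )
```

Under that assumption, the reported FlashInfer figures are roughly 1.7x to 2.1x those of the other two kernels across the three batch sizes.
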
@yzh119
Collaborator

yzh119 commented Oct 10, 2024

Hi @alexngng, yes, for sure. I still have a few minor bugs to fix, and it's coming soon :)

@jason-huang03

Really looking forward to it!

@taegeonum

@yzh119 Hello, any update?

@yzh119
Collaborator

yzh119 commented Dec 16, 2024

@alexngng @taegeonum @jason-huang03
Done in #667.

@zhyncs closed this as completed Dec 16, 2024

@taegeonum

taegeonum commented Dec 26, 2024

@yzh119 Do you have plans to support FP8 Q, K, and V?
