Load all MoE experts during warmup #11571
Conversation
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5 cores, so possibly due to avoiding random access on the SSD?).
I'll consider adding proper support for this in #11213.
@ggerganov if you are going to work on warmup, then take a look at this: #11733. TL;DR: Using a 1-token-long sequence (instead of the current 2 tokens, BOS and EOS) in the warmup batch fixes a token generation performance bottleneck (+80% tg t/s with llama-3.1 70b f16) on dual Epyc systems.
@fairydreaming Any chance you can resolve the conflicts for this PR? I was just about to do the final tests on the MLA PR but need this and #11397 to do it! :)
@jukofyork It's not just a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would have to be reimplemented basically from scratch on top of the current code. I guess I will close it for now, as it's no longer a valid solution.
@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮 I've just resorted to capturing a copy of master from before all the changes and am going to wait until things settle down.
I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
This PR adds a new API call that allows enabling and disabling model warmup mode:
```c
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```
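For illustration, here is a minimal sketch of how client code might drive the new call during its own warmup pass. It assumes `ctx` and `vocab` have already been created; `llama_vocab_bos()` and `llama_batch_get_one()` match the llama.cpp C API around the time of this PR, though helper signatures have shifted between versions:

```c++
// Enable warmup mode: MoE layers will route through every expert.
llama_set_warmup(ctx, true);

// One dummy decode touches all expert tensors, paging them into memory.
llama_token tok = llama_vocab_bos(vocab);
llama_decode(ctx, llama_batch_get_one(&tok, 1));

// Switch back to normal routing (n_expert_used experts per token).
llama_set_warmup(ctx, false);
```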
This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup. The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, so let me know if one exists.
If warmup mode is enabled, then n_expert_used is set to n_expert, which causes all existing experts to be loaded into memory during the llama_decode() call. Fixes #11163.
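Conceptually, the override inside the MoE FFN graph build amounts to something like the following. This is a simplified sketch rather than the actual llama.cpp code; `cparams.warmup` stands for the flag toggled by `llama_set_warmup()`, and `router_logits`/`ctx0` are illustrative names:

```c++
// During warmup, pretend every expert is "used" so top-k selects them all.
const int32_t n_expert_used = cparams.warmup ? hparams.n_expert
                                             : hparams.n_expert_used;

// Router top-k: with k == n_expert, the first llama_decode() reads the
// weights of every expert, forcing them all to be loaded into memory.
ggml_tensor * selected_experts = ggml_top_k(ctx0, router_logits, n_expert_used);
```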