Load all MoE experts during warmup #11571


Merged

merged 4 commits into ggml-org:master on Mar 14, 2025

Conversation

@fairydreaming (Collaborator) commented Feb 1, 2025

This PR adds a new API call that allows enabling and disabling model warmup mode:

LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);

This PR is a somewhat crude hack that allows loading all experts of MoE models during warmup.

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup.
I couldn't find a better way to do it; let me know if one exists.

If warmup mode is enabled, n_expert_used is set to n_expert, which causes all existing experts to be loaded into memory during the llama_decode() call.
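
For illustration, a minimal sketch of the mechanism (the struct and function names here are hypothetical, not the actual PR diff): while warmup is active, the MoE routing width is widened to the full expert count, so a single llama_decode() call reads every expert tensor.

#include <cstdint>

struct moe_hparams {             // stand-in for the model hyperparameters
    int32_t n_expert      = 256; // total number of experts in the model
    int32_t n_expert_used = 8;   // experts normally routed per token
};

// How many experts the router should select for this decode.
static int32_t effective_n_expert_used(const moe_hparams & hp, bool warmup) {
    // In warmup mode select every expert, so all expert weights are read
    // (and therefore loaded/paged into memory) by the warmup decode.
    return warmup ? hp.n_expert : hp.n_expert_used;
}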

Fixes #11163

@cpumaxx (Contributor) commented Feb 3, 2025

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I will try a test on a non-MoE large model as well to make sure there are no regressions in that case.
Thanks for this fix!

@jukofyork (Collaborator)

I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5 cores, so possibly due to avoiding random access on the SSD?).

@ggerganov (Member)

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it; let me know if one exists.

I'll consider adding proper support for this in #11213.

@fairydreaming (Collaborator, Author)

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it; let me know if one exists.

I'll consider adding proper support for this in #11213.

@ggerganov if you are going to work on warmup, take a look at this: #11733

TL;DR: Using a 1-token-long sequence in the warmup batch (instead of the current 2 tokens, BOS and EOS) fixes a token-generation performance bottleneck (+80% tg t/s with llama-3.1 70b f16) on dual Epyc systems.
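
For illustration only (a simplified, hypothetical sketch rather than the actual common.cpp code; helper signatures may differ between llama.cpp versions), the change amounts to decoding a single-token warmup batch instead of the BOS+EOS pair:

#include "llama.h"
#include <vector>

// Decode a warmup batch: either the original BOS+EOS pair or a single token.
static void warmup_decode(llama_context * ctx, llama_token bos, llama_token eos,
                          bool single_token) {
    std::vector<llama_token> tmp;
    tmp.push_back(bos);
    if (!single_token) {
        tmp.push_back(eos); // original behaviour: 2-token warmup sequence
    }
    llama_decode(ctx, llama_batch_get_one(tmp.data(), (int32_t) tmp.size()));
}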

@jukofyork (Collaborator)

@fairydreaming Any chance you can resolve the conflicts for this PR?

I was just about to do the final tests on the MLA PR but need this and #11397 to do it! :)

@fairydreaming (Collaborator, Author)

@jukofyork It's not just a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would basically have to be reimplemented from scratch on top of the current code.

I guess I will close it for now, as it's no longer a valid solution.

@jukofyork (Collaborator)

@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮

I've just resorted to keeping a copy of master from before all the changes, and I'm going to wait until things settle down.

fairydreaming reopened this on Mar 14, 2025
@fairydreaming (Collaborator, Author)

I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:

LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
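
A hedged usage sketch (the helper function here is illustrative; the real integration lives in the common model-warmup path): wrap the warmup decode with the new call so all MoE experts are routed during that single decode, then restore normal routing.

#include "llama.h"

// Run the warmup decode with warmup mode enabled, then restore normal routing.
static void run_warmup(llama_context * ctx, llama_token bos) {
    llama_set_warmup(ctx, true);   // route through all experts during warmup

    llama_token tok = bos;         // a short warmup sequence is enough
    llama_decode(ctx, llama_batch_get_one(&tok, 1));

    llama_set_warmup(ctx, false);  // back to the model's normal n_expert_used
}

The common warmup path also clears the KV cache after the warmup decode, so the warmup tokens do not affect later generations.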

fairydreaming merged commit 8fcb563 into ggml-org:master on Mar 14, 2025
47 checks passed
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Successfully merging this pull request may close these issues.

Misc. bug: model warmup doesn't work correctly for MoE models