Load all MoE experts during warmup #11571
Conversation
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5 cores, so possibly due to avoiding random access on the SSD?).
I'll consider adding proper support for this in #11213.
@ggerganov if you are going to work on warmup, then take a look at this: #11733. TL;DR: Using a 1-token-long sequence (instead of the current 2 tokens, BOS and EOS) in the warmup batch fixes a token generation performance bottleneck (+80% tg t/s with llama-3.1 70b f16) on dual Epyc systems.
@fairydreaming Any chance you can resolve the conflicts for this PR? I was just about to do the final tests on the MLA PR but need this and #11397 to do it! :)
@jukofyork It's not just a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would have to be reimplemented basically from scratch on top of the current code. I guess I will close it for now, as it's no longer a valid solution.
@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮 I've just resorted to capturing a copy of master from before all the changes and am going to wait until things settle down.
I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
This PR adds a new API call that allows enabling and disabling model warmup mode:
```c
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```
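For illustration, here is a minimal sketch of how client code might drive the new call during its own warmup pass. It assumes `ctx` and `vocab` have already been created; `llama_vocab_bos()` and `llama_batch_get_one()` match the llama.cpp C API around the time of this PR, though helper signatures have shifted between versions:

```c++
// Enable warmup mode: MoE layers will route through every expert.
llama_set_warmup(ctx, true);

// One dummy decode touches all expert tensors, paging them into memory.
llama_token tok = llama_vocab_bos(vocab);
llama_decode(ctx, llama_batch_get_one(&tok, 1));

// Switch back to normal routing (n_expert_used experts per token).
llama_set_warmup(ctx, false);
```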
This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup. The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, so let me know if one exists.
If warmup mode is enabled, then n_expert_used is set to n_expert, which causes all existing experts to be loaded into memory during the llama_decode() call. Fixes #11163.
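Conceptually, the override inside the MoE FFN graph build amounts to something like the following. This is a simplified sketch rather than the actual llama.cpp code; `cparams.warmup` stands for the flag toggled by `llama_set_warmup()`, and `router_logits`/`ctx0` are illustrative names:

```c++
// During warmup, pretend every expert is "used" so top-k selects them all.
const int32_t n_expert_used = cparams.warmup ? hparams.n_expert
                                             : hparams.n_expert_used;

// Router top-k: with k == n_expert, the first llama_decode() reads the
// weights of every expert, forcing them all to be loaded into memory.
ggml_tensor * selected_experts = ggml_top_k(ctx0, router_logits, n_expert_used);
```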