Support speculative decoding in server example #5877

Open
mscheong01 opened this issue Mar 5, 2024 · 18 comments
Labels: enhancement, good first issue, server/webui

Comments

@mscheong01
Collaborator

mscheong01 commented Mar 5, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Provide speculative decoding through the server example.

Motivation

I noticed this topic has popped up in several comments (1, 2, 3), but it seems we haven't officially opened an issue for it. I'm creating this one to provide a space for focused discussion on how we can implement this feature and to actually get it started.

Possible Implementation

Perhaps move the speculative sampling implementation to common or sampling?
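
For context, below is a minimal, self-contained sketch of the greedy draft-then-verify loop this feature is about. The two stand-in "models" are toy functions, not llama.cpp calls, and all names here are placeholders; a real server integration would reuse the existing speculative/sampling code rather than anything in this sketch.

// Toy, self-contained illustration of the greedy draft-then-verify loop.
// The two "models" below are deterministic stand-ins, not llama.cpp calls.
#include <cstdio>
#include <vector>

using token = int;

// hypothetical stand-in for the small draft model (greedy next token)
static token draft_next(const std::vector<token> & ctx) {
    return (ctx.back() * 7 + 3) % 100;                 // arbitrary toy rule
}

// hypothetical stand-in for the large target model (greedy next token)
static token target_next(const std::vector<token> & ctx) {
    token t = (ctx.back() * 7 + 3) % 100;
    return (ctx.size() % 5 == 0) ? (t + 1) % 100 : t;  // disagrees now and then
}

int main() {
    std::vector<token> ctx = {42};                     // the "prompt"
    const int n_draft = 8, n_predict = 32;
    int n_drafted = 0, n_accept = 0;

    while ((int) ctx.size() < n_predict) {
        // 1) the draft model cheaply proposes up to n_draft tokens
        std::vector<token> proposal = ctx;
        for (int i = 0; i < n_draft; ++i) {
            proposal.push_back(draft_next(proposal));
        }
        n_drafted += n_draft;

        // 2) the target model verifies the proposals position by position;
        //    a real implementation scores them all in one batched decode
        std::vector<token> verified = ctx;
        for (int i = 0; i < n_draft; ++i) {
            const token want = target_next(verified);
            const token got  = proposal[ctx.size() + i];
            verified.push_back(want);                  // the target's token is always kept
            if (want != got) {
                break;                                 // first mismatch ends the accepted run
            }
            ++n_accept;
        }
        ctx = verified;                                // accepted prefix + one corrected token
    }

    printf("n_drafted = %d, n_accept = %d, accept = %.1f%%\n",
           n_drafted, n_accept, 100.0 * n_accept / n_drafted);
    return 0;
}

The actual speculative example verifies the whole draft against the target with batched decoding and also handles non-greedy sampling; the sketch only shows the shape of the loop that a server integration would presumably need to drive per request.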

mscheong01 added the enhancement label Mar 5, 2024
@vietanh125

Any updates on this?

@mscheong01
Collaborator Author

@vietanh125 Not yet, but contributions are welcome 😃

@ggerganov
Owner

There is ongoing related work in #6828, though I haven't had time to look into the details yet.

@kerthcet

kerthcet commented Sep 4, 2024

Sorry, does that mean the server doesn't support speculative decoding? I can, however, run it with a command like the one below in Kubernetes.

Just a sample:

spec:
  containers:
  - args:
    - -m
    - /workspace/models/llama-2-7b.Q8_0.gguf
    - -md
    - /workspace/models/llama-2-7b.Q2_K.gguf
    - --port
    - "8080"
    - --host
    - 0.0.0.0
    - -fa
    command:
    - ./llama-server

@ggerganov
Owner

Not yet supported

@kerthcet

kerthcet commented Sep 4, 2024

OK, so the -md flag doesn't work here 😀

@etafund

etafund commented Sep 18, 2024

Also interested in this PR. Thank you to everyone contributing to a solution here.

@theo77186

The #6828 PR is a distinct technique that uses a lookup file to speculate tokens instead of a draft model; it seems to give less speedup than draft-based speculative decoding.
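
For anyone unfamiliar with that approach, here is a rough toy sketch of the general n-gram lookup idea (not the exact lookup-cache/file format of #6828, and every name here is a placeholder): instead of querying a draft model, the tail of the current context is matched against earlier tokens, and whatever followed the earlier occurrence is proposed as the draft.

// Toy sketch of n-gram lookup speculation: propose draft tokens by finding
// the most recent earlier occurrence of the trailing n-gram in the context.
#include <cstdio>
#include <vector>

using token = int;

// returns up to n_draft proposed tokens, or an empty vector if there is no match
static std::vector<token> lookup_propose(const std::vector<token> & ctx, size_t ngram, size_t n_draft) {
    if (ctx.size() < ngram + 1) {
        return {};
    }
    const size_t tail = ctx.size() - ngram;            // start of the trailing n-gram
    for (size_t pos = tail; pos-- > 0; ) {             // scan backwards for an earlier match
        bool match = true;
        for (size_t j = 0; j < ngram; ++j) {
            if (ctx[pos + j] != ctx[tail + j]) { match = false; break; }
        }
        if (!match) {
            continue;
        }
        std::vector<token> out;                        // propose what followed the match
        for (size_t j = pos + ngram; j < ctx.size() && out.size() < n_draft; ++j) {
            out.push_back(ctx[j]);
        }
        return out;
    }
    return {};
}

int main() {
    // token ids standing in for "the cat sat on the cat"
    const std::vector<token> ctx = {10, 20, 30, 40, 10, 20};
    for (token t : lookup_propose(ctx, /*ngram =*/ 2, /*n_draft =*/ 3)) {
        printf("proposed %d\n", t);                    // prints 30, 40, 10
    }
    return 0;
}

The proposals are then verified by the target model just like draft-model proposals, which is presumably why the speedup is more workload-dependent: it helps most when the output repeats material already in the context.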

@Hoernchen

Support would be really nice to have, because there is now the official Llama 3.2 in 1B and 3B, which should be suitable as drafts for the 3.1 8B/70B models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb

@gelim
Contributor

gelim commented Oct 9, 2024

Support would be really nice to have, because there is now the official Llama 3.2 in 1B and 3B, which should be suitable as drafts for the 3.1 8B/70B models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb

Yeah, definitely. With Llama-3.2-3B Q8 as the draft and Llama-3.1-70B-Instruct (Q5_K, to fit on two 32 GB Tesla V100s) as the target model, we go from 10 t/s to 30 t/s. Very impressive, I'd say.

CUDA_VISIBLE_DEVICES=0,1 ./llama-speculative \
-m Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf \
-md Llama-3.2-3B-Instruct-Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage" \
 -t 4  -n 512 -c 8192 -s 8 --top_k 1 \
--draft 16 -ngl 88 -ngld 30 --temp 0
encoded   18 tokens in    0.286 seconds, speed:   62.924 t/s
decoded  514 tokens in   15.774 seconds, speed:   32.586 t/s

n_draft   = 16
n_predict = 514
n_drafted = 688
n_accept  = 470
accept    = 68.314%

draft:

llama_perf_context_print:        load time =    2505.46 ms
llama_perf_context_print: prompt eval time =    9569.42 ms /   103 tokens (   92.91 ms per token,    10.76 tokens per second)
llama_perf_context_print:        eval time =    5622.49 ms /   645 runs   (    8.72 ms per token,   114.72 tokens per second)
llama_perf_context_print:       total time =   16079.15 ms /   748 tokens

target:

llama_perf_sampler_print:    sampling time =     112.92 ms /   514 runs   (    0.22 ms per token,  4551.81 tokens per second)
llama_perf_context_print:        load time =   23527.56 ms
llama_perf_context_print: prompt eval time =    9077.52 ms /   749 tokens (   12.12 ms per token,    82.51 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   18584.66 ms /   750 tokens

EDIT: with Llama-3.2-1B Q8 as the draft, that can go up to 40 t/s

@enn-nafnlaus

Wait, what happened? I used to run llama-server with speculative decoding via -md. I just "upgraded" and -md went away. Now there's a separate program called llama-speculative, but it doesn't appear to be a server. Sigh :( Guess I have to downgrade and find the version where it went away...

@etafund

etafund commented Oct 25, 2024

@enn-nafnlaus Did you find the version where it went away? Would appreciate any leads.

@theo77186

The last commit with -md in llama-server was 554c247 but it never worked anyway. The speculative decoding flags were silently discarded and no speculator model was loaded.

@sammcj

sammcj commented Nov 12, 2024

Came to ask the same thing other folks have stated here: it looks like -md is no longer an option for the server. @ggerganov, do you have any plans to implement speculative decoding for the server component?

@oxfighterjet

Is anyone working on this issue? Or is this possibly blocked by something?

I am already preparing for this feature to be implemented in Ollama, but that depends on it being implemented in llama-server here.

I don't mind giving this issue a shot; it is labeled as a good first issue, and if that's true it would be suitable for my first commit.

I had a quick look, and from what I see there is already an example implementation in speculative. I assume I can use that as a hint for implementing it at the server level.

Are there any additional pointers or specific considerations for the implementation I should be aware of?

@ggerganov
Owner

At the very least, the llama-speculative example has to be fixed first (#10176 (comment)), and we then need to demonstrate some meaningful gains from having this feature implemented in the server.

@m9e

m9e commented Nov 13, 2024

FWIW, I went to test this a.m. before I went hunting and stumbled into this thread:

encoded   25 tokens in    1.049 seconds, speed:   23.830 t/s
decoded  922 tokens in   60.516 seconds, speed:   15.236 t/s

n_draft   = 8
n_predict = 922
n_drafted = 1024
n_accept  = 793
accept    = 77.441%

draft:

llama_perf_context_print:        load time =    1968.76 ms
llama_perf_context_print: prompt eval time =   44285.19 ms /   280 tokens (  158.16 ms per token,     6.32 tokens per second)
llama_perf_context_print:        eval time =   15870.80 ms /   896 runs   (   17.71 ms per token,    56.46 tokens per second)
llama_perf_context_print:       total time =   61568.07 ms /  1176 tokens

target:

llama_perf_sampler_print:    sampling time =      48.98 ms /   922 runs   (    0.05 ms per token, 18822.47 tokens per second)
llama_perf_context_print:        load time =    2290.46 ms
llama_perf_context_print: prompt eval time =   40887.53 ms /  1177 tokens (   34.74 ms per token,    28.79 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   63536.87 ms /  1178 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating



real    1m6.106s
user    0m1.702s
sys     0m3.050s
(venv) bash-3.2$ 

using a Q4_K_L Qwen2.5-Coder-7B-Instruct draft with a Q4_K_L Qwen2.5-Coder-32B-Instruct main model (bartowski quants from HF)

llama_perf_sampler_print:    sampling time =      57.81 ms /  1049 runs   (    0.06 ms per token, 18145.02 tokens per second)
llama_perf_context_print:        load time =    1841.75 ms
llama_perf_context_print: prompt eval time =     311.89 ms /    25 tokens (   12.48 ms per token,    80.16 tokens per second)
llama_perf_context_print:        eval time =   99573.78 ms /  1023 runs   (   97.34 ms per token,    10.27 tokens per second)
llama_perf_context_print:       total time =  100001.10 ms /  1048 tokens
ggml_metal_free: deallocating

real    1m41.974s
user    0m1.666s
sys     0m1.412s
(venv) bash-3.2$ 

The above was the perf without the draft model.

M3 Max MBP, 128 GB.

~53% performance increase when using the draft model, based on time including the double-warmup for the speculative run.

I went immediately to see if I could add it on the server, since I remembered abetlen merging draft-model support way back when (although that required the Python bindings), and found this thread.

In case I was doing something errant, my CLI was:

./llama-speculative -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -md /var/tmp/models/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -ngld 999 -c 0 -t 4 -n 1024 --draft 8

and

time ./llama-cli -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -c 0 -t 4 -n 1024

@sammcj

sammcj commented Nov 13, 2024

~53% performance increase when using the draft model

Glad to hear this; it's pretty similar to ExLlamaV2.

The Qwen 2.5 model family is a good example for this as well: you can basically use the small 1.5B or even 0.5B model as the draft with the big 72B model and get an excellent boost.
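
For example (the file names below are illustrative placeholders, and the flags mirror the llama-speculative invocations earlier in this thread), a pairing like this could be tried with the standalone example:

./llama-speculative \
  -m Qwen2.5-72B-Instruct-Q4_K_M.gguf \
  -md Qwen2.5-0.5B-Instruct-Q8_0.gguf \
  -p "Write a quicksort implementation in C with detailed comments" \
  -n 512 -c 8192 --draft 16 -ngl 99 -ngld 99 --temp 0

Greedy settings (--temp 0), as in gelim's run above, tend to give the highest acceptance rates.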
