Support speculative decoding in server example #5877

Open
mscheong01 opened this issue Mar 5, 2024 · 18 comments
Labels: enhancement, good first issue, server/webui

Comments

@mscheong01
Collaborator

mscheong01 commented Mar 5, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Provide speculative decoding through the server example.

Motivation

I noticed this topic has popped up in several comments (1, 2, 3), but it seems we haven't officially opened an issue for it. I'm creating this one to provide a space for focused discussion on how we can implement this feature and to actually get it started.

Possible Implementation

Perhaps move the speculative sampling implementation to common or sampling?
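
For context, below is a minimal, self-contained sketch of the greedy draft-then-verify loop this feature is about. The two stand-in "models" are toy functions, not llama.cpp calls, and all names here are placeholders; a real server integration would reuse the existing speculative/sampling code rather than anything in this sketch.

// Toy, self-contained illustration of the greedy draft-then-verify loop.
// The two "models" below are deterministic stand-ins, not llama.cpp calls.
#include <cstdio>
#include <vector>

using token = int;

// hypothetical stand-in for the small draft model (greedy next token)
static token draft_next(const std::vector<token> & ctx) {
    return (ctx.back() * 7 + 3) % 100;                 // arbitrary toy rule
}

// hypothetical stand-in for the large target model (greedy next token)
static token target_next(const std::vector<token> & ctx) {
    token t = (ctx.back() * 7 + 3) % 100;
    return (ctx.size() % 5 == 0) ? (t + 1) % 100 : t;  // disagrees now and then
}

int main() {
    std::vector<token> ctx = {42};                     // the "prompt"
    const int n_draft = 8, n_predict = 32;
    int n_drafted = 0, n_accept = 0;

    while ((int) ctx.size() < n_predict) {
        // 1) the draft model cheaply proposes up to n_draft tokens
        std::vector<token> proposal = ctx;
        for (int i = 0; i < n_draft; ++i) {
            proposal.push_back(draft_next(proposal));
        }
        n_drafted += n_draft;

        // 2) the target model verifies the proposals position by position;
        //    a real implementation scores them all in one batched decode
        std::vector<token> verified = ctx;
        for (int i = 0; i < n_draft; ++i) {
            const token want = target_next(verified);
            const token got  = proposal[ctx.size() + i];
            verified.push_back(want);                  // the target's token is always kept
            if (want != got) {
                break;                                 // first mismatch ends the accepted run
            }
            ++n_accept;
        }
        ctx = verified;                                // accepted prefix + one corrected token
    }

    printf("n_drafted = %d, n_accept = %d, accept = %.1f%%\n",
           n_drafted, n_accept, 100.0 * n_accept / n_drafted);
    return 0;
}

The actual speculative example verifies the whole draft against the target with batched decoding and also handles non-greedy sampling; the sketch only shows the shape of the loop that a server integration would presumably need to drive per request.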

mscheong01 added the enhancement label Mar 5, 2024
@vietanh125

Any updates on this?

@mscheong01
Collaborator Author

@vietanh125 Not yet, but contributions are welcome 😃

@ggerganov
Owner

There is ongoing related work in #6828, though I haven't had time to look into the details yet.

@kerthcet

kerthcet commented Sep 4, 2024

Sorry, does that mean the server doesn't support speculative decoding? I can, however, run it with a command like the one below in Kubernetes.

Just a sample:

spec:
  containers:
  - args:
    - -m
    - /workspace/models/llama-2-7b.Q8_0.gguf
    - -md
    - /workspace/models/llama-2-7b.Q2_K.gguf
    - --port
    - "8080"
    - --host
    - 0.0.0.0
    - -fa
    command:
    - ./llama-server

@ggerganov
Owner

Not yet supported

@kerthcet

kerthcet commented Sep 4, 2024

OK, so the -md flag doesn't work here 😀

@etafund

etafund commented Sep 18, 2024

Also interested in this PR. Thank you to everyone contributing to a solution here.

@theo77186

The #6828 PR is a distinct technique that uses a lookup file to speculate tokens instead of a draft model; it seems to give less speedup than draft-based speculative decoding.
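
For anyone unfamiliar with that approach, here is a rough toy sketch of the general n-gram lookup idea (not the exact lookup-cache/file format of #6828, and every name here is a placeholder): instead of querying a draft model, the tail of the current context is matched against earlier tokens, and whatever followed the earlier occurrence is proposed as the draft.

// Toy sketch of n-gram lookup speculation: propose draft tokens by finding
// the most recent earlier occurrence of the trailing n-gram in the context.
#include <cstdio>
#include <vector>

using token = int;

// returns up to n_draft proposed tokens, or an empty vector if there is no match
static std::vector<token> lookup_propose(const std::vector<token> & ctx, size_t ngram, size_t n_draft) {
    if (ctx.size() < ngram + 1) {
        return {};
    }
    const size_t tail = ctx.size() - ngram;            // start of the trailing n-gram
    for (size_t pos = tail; pos-- > 0; ) {             // scan backwards for an earlier match
        bool match = true;
        for (size_t j = 0; j < ngram; ++j) {
            if (ctx[pos + j] != ctx[tail + j]) { match = false; break; }
        }
        if (!match) {
            continue;
        }
        std::vector<token> out;                        // propose what followed the match
        for (size_t j = pos + ngram; j < ctx.size() && out.size() < n_draft; ++j) {
            out.push_back(ctx[j]);
        }
        return out;
    }
    return {};
}

int main() {
    // token ids standing in for "the cat sat on the cat"
    const std::vector<token> ctx = {10, 20, 30, 40, 10, 20};
    for (token t : lookup_propose(ctx, /*ngram =*/ 2, /*n_draft =*/ 3)) {
        printf("proposed %d\n", t);                    // prints 30, 40, 10
    }
    return 0;
}

The proposals are then verified by the target model just like draft-model proposals, which is presumably why the speedup is more workload-dependent: it helps most when the output repeats material already in the context.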

@Hoernchen

Support would be really nice to have, because there is now the official Llama 3.2 in 1B and 3B, which should be suitable as drafts for the 3.1 8B/70B models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb

@gelim
Contributor

gelim commented Oct 9, 2024

Support would be really nice to have, because there is now the official Llama 3.2 in 1B and 3B, which should be suitable as drafts for the 3.1 8B/70B models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb

Yeah, definitely. With Llama-3.2-3B Q8 as the draft and Llama-3.1-70B-Instruct (Q5_K, to fit on two 32 GB Tesla V100s) as the target model, we go from 10 t/s to 30 t/s. Very impressive, I'd say.

CUDA_VISIBLE_DEVICES=0,1 ./llama-speculative \
-m Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf \
-md Llama-3.2-3B-Instruct-Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage" \
 -t 4  -n 512 -c 8192 -s 8 --top_k 1 \
--draft 16 -ngl 88 -ngld 30 --temp 0
encoded   18 tokens in    0.286 seconds, speed:   62.924 t/s
decoded  514 tokens in   15.774 seconds, speed:   32.586 t/s

n_draft   = 16
n_predict = 514
n_drafted = 688
n_accept  = 470
accept    = 68.314%

draft:

llama_perf_context_print:        load time =    2505.46 ms
llama_perf_context_print: prompt eval time =    9569.42 ms /   103 tokens (   92.91 ms per token,    10.76 tokens per second)
llama_perf_context_print:        eval time =    5622.49 ms /   645 runs   (    8.72 ms per token,   114.72 tokens per second)
llama_perf_context_print:       total time =   16079.15 ms /   748 tokens

target:

llama_perf_sampler_print:    sampling time =     112.92 ms /   514 runs   (    0.22 ms per token,  4551.81 tokens per second)
llama_perf_context_print:        load time =   23527.56 ms
llama_perf_context_print: prompt eval time =    9077.52 ms /   749 tokens (   12.12 ms per token,    82.51 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   18584.66 ms /   750 tokens

EDIT: with Llama-3.2-1B Q8 as the draft, that can go up to 40 t/s

@enn-nafnlaus

Wait, what happened? I used to run llama-server with speculative decoding via -md. I just "upgraded" and -md went away. Now there's a separate program called llama-speculative, but it doesn't appear to be a server. Sigh :( Guess I have to downgrade and find the version where it went away...

@etafund

etafund commented Oct 25, 2024

@enn-nafnlaus Did you find the version where it went away? Would appreciate any leads.

@theo77186

The last commit with -md in llama-server was 554c247 but it never worked anyway. The speculative decoding flags were silently discarded and no speculator model was loaded.

@sammcj

sammcj commented Nov 12, 2024

Came to ask the same thing other folks have stated here: it looks like -md is no longer an option for the server. @ggerganov, do you have any plans to implement speculative decoding for the server component?

@oxfighterjet

Is anyone working on this issue? Or is this possibly blocked by something?

I am already preparing for this feature to be implemented in Ollama, but that depends on it being implemented in llama-server here.

I don't mind giving this issue a shot; it is labeled as a good first issue, and if that's true it would be suitable for my first commit.

I had a quick look, and from what I see there is already an example implementation in speculative. I assume I can use that as a hint for implementing it at the server level.

Are there any additional pointers or specific considerations for the implementation I should be aware of?

@ggerganov
Owner

At the very least, the llama-speculative example has to be fixed first (#10176 (comment)), and we then need to demonstrate some meaningful gains from having this feature implemented in the server.

@m9e

m9e commented Nov 13, 2024

FWIW, I went to test this a.m. before I went hunting and stumbled into this thread:

encoded   25 tokens in    1.049 seconds, speed:   23.830 t/s
decoded  922 tokens in   60.516 seconds, speed:   15.236 t/s

n_draft   = 8
n_predict = 922
n_drafted = 1024
n_accept  = 793
accept    = 77.441%

draft:

llama_perf_context_print:        load time =    1968.76 ms
llama_perf_context_print: prompt eval time =   44285.19 ms /   280 tokens (  158.16 ms per token,     6.32 tokens per second)
llama_perf_context_print:        eval time =   15870.80 ms /   896 runs   (   17.71 ms per token,    56.46 tokens per second)
llama_perf_context_print:       total time =   61568.07 ms /  1176 tokens

target:

llama_perf_sampler_print:    sampling time =      48.98 ms /   922 runs   (    0.05 ms per token, 18822.47 tokens per second)
llama_perf_context_print:        load time =    2290.46 ms
llama_perf_context_print: prompt eval time =   40887.53 ms /  1177 tokens (   34.74 ms per token,    28.79 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   63536.87 ms /  1178 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating



real    1m6.106s
user    0m1.702s
sys     0m3.050s
(venv) bash-3.2$ 

using a Q4_K_L Qwen2.5-Coder-7B-Instruct draft with a Q4_K_L Qwen2.5-Coder-32B-Instruct main model (bartowski quants from HF)

llama_perf_sampler_print:    sampling time =      57.81 ms /  1049 runs   (    0.06 ms per token, 18145.02 tokens per second)
llama_perf_context_print:        load time =    1841.75 ms
llama_perf_context_print: prompt eval time =     311.89 ms /    25 tokens (   12.48 ms per token,    80.16 tokens per second)
llama_perf_context_print:        eval time =   99573.78 ms /  1023 runs   (   97.34 ms per token,    10.27 tokens per second)
llama_perf_context_print:       total time =  100001.10 ms /  1048 tokens
ggml_metal_free: deallocating

real    1m41.974s
user    0m1.666s
sys     0m1.412s
(venv) bash-3.2$ 

The above was the perf without the draft model.

M3 Max MBP, 128 GB.

~53% performance increase when using the draft model, based on time including the double-warmup for the speculative run.

I went immediately to see if I could add it on the server, since I remembered abetlen merging draft-model support way back when (although that required the Python bindings), and found this thread.

In case I was doing something errant, my CLI was:

./llama-speculative -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -md /var/tmp/models/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -ngld 999 -c 0 -t 4 -n 1024 --draft 8

and

time ./llama-cli -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -c 0 -t 4 -n 1024

@sammcj

sammcj commented Nov 13, 2024

~53% performance increase when using the draft model

Glad to hear this; it's pretty similar to ExLlamaV2.

The Qwen 2.5 model family is a good example for this as well: you can basically use the small 1.5B or even 0.5B model as the draft with the big 72B model and get an excellent boost.
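
For example (the file names below are illustrative placeholders, and the flags mirror the llama-speculative invocations earlier in this thread), a pairing like this could be tried with the standalone example:

./llama-speculative \
  -m Qwen2.5-72B-Instruct-Q4_K_M.gguf \
  -md Qwen2.5-0.5B-Instruct-Q8_0.gguf \
  -p "Write a quicksort implementation in C with detailed comments" \
  -n 512 -c 8192 --draft 16 -ngl 99 -ngld 99 --temp 0

Greedy settings (--temp 0), as in gelim's run above, tend to give the highest acceptance rates.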
