Add "-e"/"--eval-threads" to distinguish thread counts for single-token eval and prompt eval #744
Conversation
…umber of threads for single-token eval than for prompt eval.
I think it's great that you address power consumption. We have been looking at tokens per second, but tokens per watt is also important, especially on battery-powered devices. Though I think it would be less surprising for users if the current -t behaviour kept controlling inference and the new option applied to prompt eval instead. Your code seems to be working fine here on a Core i3, and the eval times change in a reasonable manner as I play with the numbers, but I haven't looked at it very closely.
Unless there's a bug, the behavior of -t is not changed and -e is optional. If -e is not set, its value defaults to the value of -t (or -t's default), which gives the same behavior as -t alone: prompt processing and inference run with the same number of threads, except in the case of BLAS.
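A minimal sketch of that defaulting rule, assuming hypothetical field names (this is not the PR's actual code, just an illustration of the fallback described above):

```cpp
// Illustrative only: field names are assumptions, not the PR's identifiers.
struct cli_params {
    int n_threads      = 4;   // -t: threads used for prompt processing
    int n_eval_threads = -1;  // -e: threads for single-token eval (-1 = unset)
};

// After argument parsing: if -e was not given, fall back to the -t value,
// which reproduces the old single-knob behavior.
void apply_thread_defaults(cli_params & p) {
    if (p.n_eval_threads < 0) {
        p.n_eval_threads = p.n_threads;
    }
}
```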
I may have misunderstood this. I have 4 cores and don't usually give a -t argument, so I compared (none), -e1 -t4, and -e4 -t1.
Maybe we're talking past each other, but it looks like -e simply overrides the eval thread count while -t keeps controlling the prompt. edit: this may have been what you intended; I'm just saying that essentially keeping -t as the knob for inference would match what people expect.
Yeah, it's possible we're talking past each other! Think of -t as doing what it used to do (prompt == eval), except now with -e we can modify the number of threads used for eval. Typically, we'd maximize the number of prompt threads because that's almost always beneficial, then tweak the number of eval threads to match our situation (model size, number of physical cores, heat production, power consumption, etc.).

In your case, (none) is effectively "-e 4 -t 4" and is intended to be equivalent to "-t 4". For the "-e1 -t4" case, you're specifying 1 eval thread and 4 prompt threads, and seeing slow eval and fast prompt as expected. For the "-e4 -t1" case (which would be uncommon), you're seeing the opposite: fast eval and slow prompt, as expected.

I think if you tried the "-e2 -t4" case, you'd notice similar timings to the (none) case for both prompt and eval, but with half the CPU usage. That is, fast eval and fast prompt, but twice as efficient.
Perfectly fine.
I'm saying that for many people the prompt eval is not that important, and when they specify a thread count they mean the threads used for inference.

edit: let me put it another way. You essentially went: "I have 8 cores, so I'll set -t 8 and then dial eval back down with -e." I would rather prefer (and I think this would be more in line with the current behaviour and also what people expect): "I have 8 cores, but inference with 8 threads is power-inefficient, so I'll use a lower -t and raise the prompt thread count separately if I care about prompt speed."
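In code terms, the preference described above would amount to flipping the defaulting direction; a purely hypothetical sketch (the option name and fields are invented for illustration, not taken from the PR):

```cpp
// Hypothetical alternative: -t keeps its current meaning (inference threads),
// and a new prompt-thread option defaults to it unless explicitly raised.
struct alt_params {
    int n_threads        = 4;   // -t: threads for single-token inference (unchanged meaning)
    int n_prompt_threads = -1;  // hypothetical --prompt-threads (-1 = unset)
};

void apply_alt_defaults(alt_params & p) {
    if (p.n_prompt_threads < 0) {
        p.n_prompt_threads = p.n_threads; // old behavior unless the user opts in
    }
}
```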
Why not go full MPI? It's probably easier to parallelize the tensor operations, but I think the tokens can also be done as a parallel prefix: https://gist.github.com/chadbrewbaker/ffe95290fc945af63611693688dfe54d

You should see super-linear speedup because of cache locality for the matrix operations. MPI_IO will also be a boon for larger models. On a Mac M1 you should be able to hit 7/8 cores, plus the GPUs, plus the matrix unit, and still maybe find a way to abuse the video codec silicon for more horsepower, just on localhost.
I started an mpi branch just to include mpi.h and get it compiling on supported platforms. Probably gate everything with a USING_MPI ifdef so it stays out of everyone's way: https://github.com/chadbrewbaker/llama.cpp/tree/mpi
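A rough sketch of what that gating might look like; only the USING_MPI macro and mpi.h come from the comment above, and the function name is made up for illustration:

```cpp
// Illustrative only: keep all MPI usage behind USING_MPI so builds without
// MPI are completely unaffected. init_parallel_backend is a hypothetical name.
#ifdef USING_MPI
#include <mpi.h>
#endif

static void init_parallel_backend(int * argc, char *** argv) {
#ifdef USING_MPI
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized) {
        MPI_Init(argc, argv);
    }
#else
    (void) argc;
    (void) argv; // no-op when built without MPI
#endif
}
```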
Thanks for your PR! I was using BLAS to achieve something similar to what you are doing: tweaking the BLAS environment variable to set the number of threads for prompt eval, while using a lower thread count for inference via the usual -t parameter.
I thought about GNU Parallel and sharding the model. I don't think anyone has a clear picture in their mind's eye and it is going to take several benchmarks.
This is not a bad idea at the moment, but I am hoping we can solve the threading issue with some proper and efficient thread pausing/waking mechanism. In that case, we won't need to change the number of threads at all.
Potential upside: Improves tokens per watt by >50% for 7B models (on a Ryzen 7/DDR4 system).
Downside: Likely a breaking change to the C API.
On my system, the speed of prompt evaluation scales roughly linearly with the number of threads (up to the physical core count), but inference evaluation does not.
By separating the two thread counts, prompt evaluation can remain fast while cutting back on the number of cores pegged at 100% that contribute nothing to inference evaluation (and can even slow it down).
My suspicion is that inference evaluation is constrained by memory bandwidth rather than CPU, so adding cores beyond the point of memory saturation generates contention and effectively hot-loops those cores.
While I don't have access to an M1/M2 system, I suspect that their high memory bandwidth means they'd see little benefit from this change.
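For context, here is a minimal sketch of the idea at the call site. It is illustrative only, not the PR's actual diff; it assumes the llama_eval() call as it existed around the time of this PR, and the parameter names for the two thread counts are invented:

```cpp
// Illustrative sketch, not the PR's diff: pick the thread count per eval call
// depending on whether we're processing a multi-token prompt batch (-t) or
// generating a single token (-e).
#include <vector>
#include "llama.h"

static void eval_batch(llama_context * ctx,
                       const std::vector<llama_token> & embd,
                       int n_past,
                       int n_prompt_threads,   // -t
                       int n_eval_threads) {   // -e
    const bool is_prompt_batch = embd.size() > 1;
    const int  n_threads_used  = is_prompt_batch ? n_prompt_threads : n_eval_threads;

    llama_eval(ctx, embd.data(), (int) embd.size(), n_past, n_threads_used);
}
```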
Example timings for 7B/int4 (with 8 physical cores):
| Config (-t prompt / -e eval) | Prompt eval | Inference eval |
| --- | --- | --- |
| 8t/8e | 40 ms per token | 132 ms per token (8 cores at 100%) |
| 8t/7e | 40 ms per token | 128 ms per token (7 cores at 100%) |
| 8t/6e | 40 ms per token | 120 ms per token (6 cores at 100%) |
| 8t/5e | 40 ms per token | 124 ms per token (5 cores at 100%) |
| 8t/4e | 40 ms per token | 125 ms per token (4 cores at 100%) |
| 8t/3e | 40 ms per token | 133 ms per token (3 cores at 100%) |
| 8t/2e | 40 ms per token | 168 ms per token (2 cores at 100%) |
Notice that 8t/3e generates at the same speed as 8t/8e but uses "5 cores" fewer watts. Prompt evaluation remains at maximum speed.
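As a rough sanity check, and crudely assuming power draw scales with the number of active cores, the active-core time per generated token works out to

$$8 \times 132\ \text{ms} \approx 1056\ \text{core-ms/token at 8t/8e} \quad\text{vs.}\quad 3 \times 133\ \text{ms} \approx 399\ \text{core-ms/token at 8t/3e},$$

i.e. roughly 2.6x less active-core time per token at the same generation speed, which is consistent with the ">50% tokens per watt" estimate once fixed package and memory power are factored in.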
I also saw efficiency improvements with the 13B/int4 model: 8t/4e uses half as many cores as before while doing inference evaluation slightly faster.