Add "-e"/"--eval-threads" to distinguish thread counts for single-token eval and prompt eval #744
Conversation
…umber of threads for single-token eval than for prompt eval.
I think it's great that you address power consumption. We have been looking at tokens per second, but tokens per watt is also important, especially on battery-powered devices. Though I think it would be less surprising for users if the current -t behaviour kept controlling inference and the new option applied to prompt eval instead. Your code seems to be working fine here on a Core i3, and the eval times change in a reasonable manner as I play with the numbers, but I haven't looked at it very closely.
Unless there's a bug, the behavior of -t is not changed and -e is optional. If -e is not set, its value defaults to the value of -t (or -t's default), which gives the same behavior as -t alone: prompt processing and inference run with the same number of threads, except in the case of BLAS.
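A minimal sketch of that defaulting rule, assuming hypothetical field names (this is not the PR's actual code, just an illustration of the fallback described above):

```cpp
// Illustrative only: field names are assumptions, not the PR's identifiers.
struct cli_params {
    int n_threads      = 4;   // -t: threads used for prompt processing
    int n_eval_threads = -1;  // -e: threads for single-token eval (-1 = unset)
};

// After argument parsing: if -e was not given, fall back to the -t value,
// which reproduces the old single-knob behavior.
void apply_thread_defaults(cli_params & p) {
    if (p.n_eval_threads < 0) {
        p.n_eval_threads = p.n_threads;
    }
}
```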
I may have misunderstood this. I have 4 cores and don't usually give a -t argument, so I compared (none), -e1 -t4, and -e4 -t1.
Maybe we're talking past each other, but it looks like -e simply overrides the eval thread count while -t keeps controlling the prompt. edit: this may have been what you intended; I'm just saying that essentially keeping -t as the knob for inference would match what people expect.
Yeah, it's possible we're talking past each other! Think of -t as doing what it used to do (prompt == eval), except now with -e we can modify the number of threads used for eval. Typically, we'd maximize the number of prompt threads because that's almost always beneficial, then tweak the number of eval threads to match our situation (model size, number of physical cores, heat production, power consumption, etc.).

In your case, (none) is effectively "-e 4 -t 4" and is intended to be equivalent to "-t 4". For the "-e1 -t4" case, you're specifying 1 eval thread and 4 prompt threads, and seeing slow eval and fast prompt as expected. For the "-e4 -t1" case (which would be uncommon), you're seeing the opposite: fast eval and slow prompt, as expected.

I think if you tried the "-e2 -t4" case, you'd notice similar timings to the (none) case for both prompt and eval, but with half the CPU usage. That is, fast eval and fast prompt, but twice as efficient.
Perfectly fine.
I'm saying that for many people the prompt eval is not that important, and when they specify a thread count they mean the threads used for inference.

edit: let me put it another way. You essentially went: "I have 8 cores, so I'll set -t 8 and then dial eval back down with -e." I would rather prefer (and I think this would be more in line with the current behaviour and also what people expect): "I have 8 cores, but inference with 8 threads is power-inefficient, so I'll use a lower -t and raise the prompt thread count separately if I care about prompt speed."
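In code terms, the preference described above would amount to flipping the defaulting direction; a purely hypothetical sketch (the option name and fields are invented for illustration, not taken from the PR):

```cpp
// Hypothetical alternative: -t keeps its current meaning (inference threads),
// and a new prompt-thread option defaults to it unless explicitly raised.
struct alt_params {
    int n_threads        = 4;   // -t: threads for single-token inference (unchanged meaning)
    int n_prompt_threads = -1;  // hypothetical --prompt-threads (-1 = unset)
};

void apply_alt_defaults(alt_params & p) {
    if (p.n_prompt_threads < 0) {
        p.n_prompt_threads = p.n_threads; // old behavior unless the user opts in
    }
}
```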
Why not go full MPI? It's probably easier to parallelize the tensor operations, but I think the tokens can also be done as a parallel prefix: https://gist.github.com/chadbrewbaker/ffe95290fc945af63611693688dfe54d

You should see super-linear speedup because of cache locality for the matrix operations. MPI_IO will also be a boon for larger models. On a Mac M1 you should be able to hit 7/8 cores, plus the GPUs, plus the matrix unit, and still maybe find a way to abuse the video codec silicon for more horsepower, just on localhost.
I started an mpi branch just to include mpi.h and get it compiling on supported platforms. Probably gate everything with a USING_MPI ifdef so it stays out of everyone's way: https://github.com/chadbrewbaker/llama.cpp/tree/mpi
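A rough sketch of what that gating might look like; only the USING_MPI macro and mpi.h come from the comment above, and the function name is made up for illustration:

```cpp
// Illustrative only: keep all MPI usage behind USING_MPI so builds without
// MPI are completely unaffected. init_parallel_backend is a hypothetical name.
#ifdef USING_MPI
#include <mpi.h>
#endif

static void init_parallel_backend(int * argc, char *** argv) {
#ifdef USING_MPI
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized) {
        MPI_Init(argc, argv);
    }
#else
    (void) argc;
    (void) argv; // no-op when built without MPI
#endif
}
```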
Thanks for your PR! I was using BLAS to achieve something similar to what you are doing: tweaking the BLAS environment variable to set the number of threads for prompt eval, while using a lower thread count for inference via the usual -t parameter.
I thought about GNU Parallel and sharding the model. I don't think anyone has a clear picture in their mind's eye and it is going to take several benchmarks.
This is not a bad idea at the moment, but I am hoping we can solve the threading issue with some proper and efficient thread pausing/waking mechanism. In that case, we won't need to change the number of threads at all.
Potential upside: Improves tokens per watt by >50% for 7B models (on a Ryzen 7/DDR4 system).
Downside: Likely a breaking change to the C API.
On my system, the speed of prompt evaluation scales roughly linearly with the number of threads (up to the physical core count), but inference evaluation does not.
By separating the two thread counts, prompt evaluation can remain fast while cutting back on the number of cores pegged at 100% that contribute nothing to inference evaluation (and can even slow it down).
My suspicion is that inference evaluation is constrained by memory bandwidth rather than CPU, so adding cores beyond the point of memory saturation generates contention and effectively hot-loops those cores.
While I don't have access to an M1/M2 system, I suspect that their high memory bandwidth means they'd see little benefit from this change.
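For context, here is a minimal sketch of the idea at the call site. It is illustrative only, not the PR's actual diff; it assumes the llama_eval() call as it existed around the time of this PR, and the parameter names for the two thread counts are invented:

```cpp
// Illustrative sketch, not the PR's diff: pick the thread count per eval call
// depending on whether we're processing a multi-token prompt batch (-t) or
// generating a single token (-e).
#include <vector>
#include "llama.h"

static void eval_batch(llama_context * ctx,
                       const std::vector<llama_token> & embd,
                       int n_past,
                       int n_prompt_threads,   // -t
                       int n_eval_threads) {   // -e
    const bool is_prompt_batch = embd.size() > 1;
    const int  n_threads_used  = is_prompt_batch ? n_prompt_threads : n_eval_threads;

    llama_eval(ctx, embd.data(), (int) embd.size(), n_past, n_threads_used);
}
```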
Example timings for 7B/int4 (with 8 physical cores):
| Config (-t prompt / -e eval) | Prompt eval | Inference eval |
| --- | --- | --- |
| 8t/8e | 40 ms per token | 132 ms per token (8 cores at 100%) |
| 8t/7e | 40 ms per token | 128 ms per token (7 cores at 100%) |
| 8t/6e | 40 ms per token | 120 ms per token (6 cores at 100%) |
| 8t/5e | 40 ms per token | 124 ms per token (5 cores at 100%) |
| 8t/4e | 40 ms per token | 125 ms per token (4 cores at 100%) |
| 8t/3e | 40 ms per token | 133 ms per token (3 cores at 100%) |
| 8t/2e | 40 ms per token | 168 ms per token (2 cores at 100%) |
Notice that 8t/3e generates at the same speed as 8t/8e but uses "5 cores" fewer watts. Prompt evaluation remains at maximum speed.
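As a rough sanity check, and crudely assuming power draw scales with the number of active cores, the active-core time per generated token works out to

$$8 \times 132\ \text{ms} \approx 1056\ \text{core-ms/token at 8t/8e} \quad\text{vs.}\quad 3 \times 133\ \text{ms} \approx 399\ \text{core-ms/token at 8t/3e},$$

i.e. roughly 2.6x less active-core time per token at the same generation speed, which is consistent with the ">50% tokens per watt" estimate once fixed package and memory power are factored in.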
I also saw efficiency improvements with the 13B/int4 model: 8t/4e uses half as many cores as before while doing inference evaluation slightly faster.