Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32() #1081
Labels: enhancement, good first issue, performance
This part takes about 10% of the total inference time for 7B and it is currently single-threaded:
https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884
Try to multi-thread this by splitting the work across rows.
Since `GGML_TASK_INIT` currently runs on only 1 thread, either:
- update `ggml` to support multi-threaded `GGML_TASK_INIT`
- do the quantization in `GGML_TASK_COMPUTE` (might be difficult since there is no barrier mechanism)