Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32() #1081

Closed
@ggerganov

Description

This part takes about 10% of the total inference time for 7B and it is currently single-threaded:

https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884

Try to multi-thread this by splitting the work across rows.
Since GGML_TASK_INIT currently runs on only 1 thread, either:

  • update ggml to support multi-threaded GGML_TASK_INIT
  • move the quantization into GGML_TASK_COMPUTE (might be difficult since there is no barrier mechanism)
