Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32() #1081

Closed
@ggerganov

Description

This part takes about 10% of the total inference time for 7B and it is currently single-threaded:

https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884

Try to multi-thread this by splitting the work across rows.
Since GGML_TASK_INIT currently runs on only 1 thread, either:

  • update ggml to support multi-threaded GGML_TASK_INIT
  • move the quantization into GGML_TASK_COMPUTE (might be difficult since there is no barrier mechanism)
