Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32() #1081

Closed
ggerganov opened this issue Apr 20, 2023 · 1 comment
Labels
enhancement New feature or request good first issue Good for newcomers performance Speed related topics

Comments

@ggerganov
Member

This part takes about 10% of the total inference time for 7B and it is currently single-threaded:

https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884

Try to multi-thread this by splitting the work across rows.
Since GGML_TASK_INIT currently runs on only 1 thread, either:

  • update ggml to support multi-threaded GGML_TASK_INIT
  • move the quantization into GGML_TASK_COMPUTE (might be difficult since there is no barrier mechanism)
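
To make "splitting the work across rows" concrete, here is a rough sketch of the idea, not the actual ggml code. The `block_q8` struct, `quantize_row()`, `row_range()` and `quantize_rows_thread()` are hypothetical names made up for illustration, and the block layout (one float scale plus 32 int8 values) is a simplifying assumption. Each thread `ith` out of `nth` quantizes its own contiguous slice of rows, so threads never write to the same output blocks:

```c
#include <stddef.h>
#include <stdint.h>

#define QK 32  /* values per quantization block (assumed for this sketch) */

/* Hypothetical simplified block: one float scale plus QK signed 8-bit values. */
typedef struct {
    float  d;
    int8_t qs[QK];
} block_q8;

/* Quantize one row of n floats into n/QK blocks (n must be a multiple of QK). */
static void quantize_row(const float *x, block_q8 *y, int n) {
    for (int b = 0; b < n / QK; b++) {
        /* Largest absolute value in the block determines the scale. */
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            const float v = x[b*QK + i];
            const float a = v < 0.0f ? -v : v;
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int i = 0; i < QK; i++) {
            /* Scale into [-127, 127] and round half away from zero. */
            const float s = x[b*QK + i] * id;
            y[b].qs[i] = (int8_t) (s < 0.0f ? s - 0.5f : s + 0.5f);
        }
    }
}

/* Compute the half-open row range [*ir0, *ir1) owned by thread ith of nth. */
static void row_range(int nrows, int ith, int nth, int *ir0, int *ir1) {
    const int dr    = (nrows + nth - 1) / nth;  /* rows per thread, rounded up */
    const int start = dr * ith;
    *ir0 = start < nrows ? start : nrows;
    *ir1 = *ir0 + dr < nrows ? *ir0 + dr : nrows;
}

/* Work done by a single thread: quantize only the rows in its own range. */
static void quantize_rows_thread(const float *src, block_q8 *dst,
                                 int nrows, int ncols, int ith, int nth) {
    int ir0, ir1;
    row_range(nrows, ith, nth, &ir0, &ir1);
    for (int ir = ir0; ir < ir1; ir++) {
        quantize_row(src + (size_t) ir * ncols,
                     dst + (size_t) ir * (ncols / QK),
                     ncols);
    }
}
```

Note that the sketch only covers the partitioning. Whichever option is chosen, all threads still have to finish quantizing before any of them starts the matrix multiplication that reads the quantized rows, which is exactly where the missing barrier becomes a problem.
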
@ggerganov ggerganov added enhancement New feature or request performance Speed related topics labels Apr 20, 2023
@ggerganov ggerganov added the good first issue Good for newcomers label Apr 20, 2023
@ggerganov ggerganov self-assigned this Apr 23, 2023
@ggerganov
Member Author

Testing with the latest code base, the Q8_0 quantization part is quite negligible. I'm not really sure how I measured 10% back when I created the issue, but now I do 2 separate runs, with and without calling quantize_row_q_dot(), and the time per token is pretty much the same.

Also, multi-threading it via the second approach only degrades the performance.
The first approach would need more changes and I don't think it is really worth it.
