Closed
Description
This part takes about 10% of the total inference time for 7B and it is currently single-threaded:
Try to multi-thread this by splitting the work across rows.
Since `GGML_TASK_INIT` currently runs on only 1 thread, either:
- update `ggml` to support multi-threaded `GGML_TASK_INIT`
- move the quantization into `GGML_TASK_COMPUTE` (might be difficult since there is no barrier mechanism)
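
To make the row-splitting idea concrete, here is a minimal sketch in plain C with pthreads. Everything in it is an assumption for illustration: `quantize_row_sketch`, `struct quant_job`, `quant_worker` and `quantize_rows_mt` are hypothetical names, the absmax-to-int8 quantizer is a stand-in rather than ggml's real block format, and it spawns its own threads instead of using ggml's worker pool. Each thread takes a contiguous range of rows derived from its index `ith` and the thread count `nth`, and joining the threads plays the role of the barrier that would be needed before the matrix multiplication reads the quantized data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical stand-in for a per-row quantizer: scales one row of floats
// to int8 by its absolute maximum. NOT ggml's actual quantization format.
static void quantize_row_sketch(const float *src, int8_t *dst, float *scale, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        const float a = src[i] < 0.0f ? -src[i] : src[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < n; i++) {
        const float v = src[i] * id;
        dst[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f); // round half away from zero
    }
    *scale = d;
}

// Per-thread description of the work (hypothetical, for this sketch only).
struct quant_job {
    const float *src;    // nrows x ncols input, row-major
    int8_t      *dst;    // nrows x ncols quantized output
    float       *scales; // one scale per row
    int nrows, ncols;
    int ith, nth;        // this thread's index and the total thread count
    pthread_t thread;
    int started;         // nonzero if a thread was actually created
};

static void *quant_worker(void *arg) {
    struct quant_job *job = arg;
    // Contiguous row range [r0, r1) owned by thread ith; rows never overlap,
    // so no locking is needed while quantizing.
    const int per = (job->nrows + job->nth - 1) / job->nth;
    int r0 = job->ith * per;
    int r1 = r0 + per;
    if (r0 > job->nrows) r0 = job->nrows;
    if (r1 > job->nrows) r1 = job->nrows;
    for (int r = r0; r < r1; r++) {
        quantize_row_sketch(job->src + (size_t)r * job->ncols,
                            job->dst + (size_t)r * job->ncols,
                            &job->scales[r], job->ncols);
    }
    return NULL;
}

// Quantizes all rows using nth threads. Returns 0 on success, -1 on allocation failure.
static int quantize_rows_mt(const float *src, int8_t *dst, float *scales,
                            int nrows, int ncols, int nth) {
    if (nth < 1) nth = 1;
    struct quant_job *jobs = malloc((size_t)nth * sizeof(*jobs));
    if (jobs == NULL) return -1;
    for (int t = 0; t < nth; t++) {
        jobs[t] = (struct quant_job){ .src = src, .dst = dst, .scales = scales,
                                      .nrows = nrows, .ncols = ncols,
                                      .ith = t, .nth = nth };
        jobs[t].started =
            pthread_create(&jobs[t].thread, NULL, quant_worker, &jobs[t]) == 0;
        if (!jobs[t].started) quant_worker(&jobs[t]); // fall back to the calling thread
    }
    // Joining acts as the synchronization point: all rows are quantized after this loop.
    for (int t = 0; t < nth; t++) {
        if (jobs[t].started) pthread_join(jobs[t].thread, NULL);
    }
    free(jobs);
    return 0;
}
```

Calling `quantize_rows_mt(src, dst, scales, nrows, ncols, nth)` should produce the same output as quantizing each row in a single loop, since rows are independent. On some toolchains the sketch needs `-pthread` at compile time. Inside ggml itself the thread creation and join would presumably be replaced by whichever of the two options above is chosen.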