move BLAS to a separate backend #6210
Conversation
Will just need to adapt …
@mofosyne I appreciate that you are trying to help, but please don't do that on my PRs. I very often have not pushed local changes and I prefer to deal with the merge conflicts myself.
Force-pushed 2b5c73d to ca91205
@ggerganov I am thinking about how Accelerate should interact with the BLAS backend. I think this would make sense: …
Conversely: …
Currently …
Yes, that makes sense. With only …
On M2 Ultra there is a similar effect with:
LLAMA_NO_LLAMAFILE=1 LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master sl/blas-backend -m models/tinyllama-1b/ggml-model-q4_0.gguf -m models/tinyllama-1b/ggml-model-q8_0.gguf -m models/tinyllama-1b/ggml-model-f16.gguf -m models/tinyllama-1b/ggml-model-f32.gguf -p 32,64,128,256,512 -n 0 -t 4,8,16
I realized that there is an issue that causes the …
This should be good now. I have updated the PR description with more details about the changes included here.
Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading
On macOS with Metal enabled, when I build with LLAMA_BLAS=OFF and run with partial offloading (-ngl 28), the non-offloaded layers are running on the CPU backend:
...
node # 32 ( ADD): l_out-0 ( 8M) [ CPU ]: ffn_out-0 ( 8M) [ CPU ] ffn_inp-0 ( 8M) [ CPU ]
node # 33 ( RMS_NORM): norm-1 ( 8M) [ CPU ]: l_out-0 ( 8M) [ CPU ]
node # 34 ( MUL): attn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [ CPU ] blk.1.attn_norm.weig ( 16K) [ CPU ]
node # 35 ( MUL_MAT): Qcur-1 ( 8M) [ CPU ]: blk.1.attn_q.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 37 ( ROPE): Qcur-1 ( 8M) [ CPU ]: Qcur-1 (reshaped) ( 8M) [ CPU ] inp_pos ( 2K) [ CPU ]
node # 38 ( MUL_MAT): Kcur-1 ( 8M) [ CPU ]: blk.1.attn_k.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 40 ( ROPE): Kcur-1 ( 8M) [ CPU ]: Kcur-1 (reshaped) ( 8M) [ CPU ] inp_pos ( 2K) [ CPU ]
node # 41 ( MUL_MAT): Vcur-1 ( 8M) [ CPU ]: blk.1.attn_v.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 43 ( CPY): k_cache_view-1 (copy ( 4M) [ CPU ]: Kcur-1 ( 8M) [ CPU ] k_cache_view-1 ( 4M) [ CPU ]
node # 46 ( CPY): v_cache_view-1 (copy ( 4M) [ CPU ]: Vcur-1 (transposed) ( 8M) [ CPU ] v_cache_view-1 ( 4M) [ CPU ]
node # 50 ( MUL_MAT): kq-1 ( 32M) [ CPU ]: k-1 ( 4M) [ CPU ] q-1 ( 8M) [ CPU ]
node # 51 ( SOFT_MAX): kq_soft_max_ext-1 ( 32M) [ CPU ]: kq-1 ( 32M) [ CPU ] KQ_mask ( 1M) [ CPU ]
node # 52 ( MUL_MAT): kqv-1 ( 8M) [ CPU ]: v-1 ( 4M) [ CPU ] kq_soft_max_ext-1 ( 32M) [ CPU ]
node # 54 ( CONT): kqv_merged_cont-1 ( 8M) [ CPU ]: kqv_merged-1 ( 8M) [ CPU ]
node # 55 ( MUL_MAT): kqv_out-1 ( 8M) [ CPU ]: blk.1.attn_output.we ( 17M) [ CPU ] kqv_merged_cont-1 ( 8M) [ CPU ]
node # 56 ( ADD): ffn_inp-1 ( 8M) [ CPU ]: kqv_out-1 ( 8M) [ CPU ] l_out-0 ( 8M) [ CPU ]
node # 57 ( RMS_NORM): norm-1 ( 8M) [ CPU ]: ffn_inp-1 ( 8M) [ CPU ]
node # 58 ( MUL): ffn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [ CPU ] blk.1.ffn_norm.weigh ( 16K) [ CPU ]
node # 59 ( MUL_MAT): ffn_gate-1 ( 21M) [ CPU ]: blk.1.ffn_gate.weigh ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
node # 60 ( UNARY): ffn_silu-1 ( 21M) [ CPU ]: ffn_gate-1 ( 21M) [ CPU ]
node # 61 ( MUL_MAT): ffn_up-1 ( 21M) [ CPU ]: blk.1.ffn_up.weight ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
node # 62 ( MUL): ffn_gate_par-1 ( 21M) [ CPU ]: ffn_silu-1 ( 21M) [ CPU ] ffn_up-1 ( 21M) [ CPU ]
node # 63 ( MUL_MAT): ffn_out-1 ( 8M) [ CPU ]: blk.1.ffn_down.weigh ( 45M) [ CPU ] ffn_gate_par-1 ( 21M) [ CPU ]
node # 64 ( ADD): l_out-1 ( 8M) [ CPU ]: ffn_out-1 ( 8M) [ CPU ] ffn_inp-1 ( 8M) [ CPU ]
node # 65 ( RMS_NORM): norm-2 ( 8M) [ CPU ]: l_out-1 ( 8M) [ CPU ]
...
With LLAMA_BLAS=ON it uses the BLAS backend for the matrix multiplications:
...
## SPLIT #16: Metal # 1 inputs: [ffn_out-0 ( 8M)]
node # 32 ( ADD): l_out-0 ( 8M) [Metal ]: Metal#ffn_out-0#0 ( 8M) [ NULL ] ffn_inp-0 ( 8M) [Metal ]
node # 33 ( RMS_NORM): norm-1 ( 8M) [Metal ]: l_out-0 ( 8M) [Metal ]
## SPLIT #17: CPU # 0 inputs:
node # 34 ( MUL): attn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [Metal ] blk.1.attn_norm.weig ( 16K) [ CPU ]
## SPLIT #18: BLAS # 0 inputs:
node # 35 ( MUL_MAT): Qcur-1 ( 8M) [ BLAS ]: blk.1.attn_q.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #19: Metal # 1 inputs: [Qcur-1 (reshaped) ( 8M)]
node # 37 ( ROPE): Qcur-1 ( 8M) [Metal ]: Metal#Qcur-1 (reshap ( 8M) [ NULL ] Metal#inp_pos#0 ( 2K) [ NULL ]
## SPLIT #20: BLAS # 0 inputs:
node # 38 ( MUL_MAT): Kcur-1 ( 8M) [ BLAS ]: blk.1.attn_k.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #21: Metal # 1 inputs: [Kcur-1 (reshaped) ( 8M)]
node # 40 ( ROPE): Kcur-1 ( 8M) [Metal ]: Metal#Kcur-1 (reshap ( 8M) [ NULL ] Metal#inp_pos#0 ( 2K) [ NULL ]
## SPLIT #22: BLAS # 0 inputs:
node # 41 ( MUL_MAT): Vcur-1 ( 8M) [ BLAS ]: blk.1.attn_v.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #23: CPU # 0 inputs:
node # 43 ( CPY): k_cache_view-1 (copy ( 4M) [ CPU ]: Kcur-1 ( 8M) [Metal ] k_cache_view-1 ( 4M) [ CPU ]
node # 46 ( CPY): v_cache_view-1 (copy ( 4M) [ CPU ]: Vcur-1 (transposed) ( 8M) [ BLAS ] v_cache_view-1 ( 4M) [ CPU ]
## SPLIT #24: Metal # 2 inputs: [k-1 ( 4M)] [v-1 ( 4M)]
node # 50 ( MUL_MAT): kq-1 ( 32M) [Metal ]: Metal#k-1#0 ( 4M) [ NULL ] q-1 ( 8M) [Metal ]
node # 51 ( SOFT_MAX): kq_soft_max_ext-1 ( 32M) [Metal ]: kq-1 ( 32M) [Metal ] Metal#KQ_mask#0 ( 1M) [ NULL ]
node # 52 ( MUL_MAT): kqv-1 ( 8M) [Metal ]: Metal#v-1#0 ( 4M) [ NULL ] kq_soft_max_ext-1 ( 32M) [Metal ]
node # 54 ( CONT): kqv_merged_cont-1 ( 8M) [Metal ]: kqv_merged-1 ( 8M) [Metal ]
## SPLIT #25: BLAS # 0 inputs:
node # 55 ( MUL_MAT): kqv_out-1 ( 8M) [ BLAS ]: blk.1.attn_output.we ( 17M) [ CPU ] kqv_merged_cont-1 ( 8M) [Metal ]
## SPLIT #26: Metal # 1 inputs: [kqv_out-1 ( 8M)]
node # 56 ( ADD): ffn_inp-1 ( 8M) [Metal ]: Metal#kqv_out-1#0 ( 8M) [ NULL ] l_out-0 ( 8M) [Metal ]
node # 57 ( RMS_NORM): norm-1 ( 8M) [Metal ]: ffn_inp-1 ( 8M) [Metal ]
## SPLIT #27: CPU # 0 inputs:
node # 58 ( MUL): ffn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [Metal ] blk.1.ffn_norm.weigh ( 16K) [ CPU ]
## SPLIT #28: BLAS # 0 inputs:
node # 59 ( MUL_MAT): ffn_gate-1 ( 21M) [ BLAS ]: blk.1.ffn_gate.weigh ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
## SPLIT #29: Metal # 1 inputs: [ffn_gate-1 ( 21M)]
node # 60 ( UNARY): ffn_silu-1 ( 21M) [Metal ]: Metal#ffn_gate-1#0 ( 21M) [ NULL ]
## SPLIT #30: BLAS # 0 inputs:
node # 61 ( MUL_MAT): ffn_up-1 ( 21M) [ BLAS ]: blk.1.ffn_up.weight ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
## SPLIT #31: Metal # 1 inputs: [ffn_up-1 ( 21M)]
node # 62 ( MUL): ffn_gate_par-1 ( 21M) [Metal ]: ffn_silu-1 ( 21M) [Metal ] Metal#ffn_up-1#0 ( 21M) [ NULL ]
## SPLIT #32: BLAS # 0 inputs:
node # 63 ( MUL_MAT): ffn_out-1 ( 8M) [ BLAS ]: blk.1.ffn_down.weigh ( 45M) [ CPU ] ffn_gate_par-1 ( 21M) [Metal ]
## SPLIT #33: Metal # 1 inputs: [ffn_out-1 ( 8M)]
node # 64 ( ADD): l_out-1 ( 8M) [Metal ]: Metal#ffn_out-1#0 ( 8M) [ NULL ] ffn_inp-1 ( 8M) [Metal ]
node # 65 ( RMS_NORM): norm-2 ( 8M) [Metal ]: l_out-1 ( 8M) [Metal ]
...
Is this the expectation? It seems like using BLAS together with GPU offloading leads to an improvement in this case, or did I misunderstand this comment?
Specifically, this applies to backends that implement the …
Metal should not be used for the operations in between the BLAS backend in non-offloaded layers though; I will try to fix that.
@zhouwg I already considered it and rejected it. Spamming more about it is not going to help your cause.
@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before having to block you.
Thanks for your reminder. I see.
In that same example, if I allow the MUL ops to be offloaded with this patch:
diff --git a/ggml-metal.m b/ggml-metal.m
index 7786acd6..665eae15 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -3178,6 +3178,12 @@ GGML_CALL static bool ggml_backend_metal_supports_buft(ggml_backend_t backend, g
UNUSED(backend);
}
+GGML_CALL static bool ggml_backend_metal_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
+ return (op->op == GGML_OP_MUL);
+
+ GGML_UNUSED(backend);
+}
+
static struct ggml_backend_i ggml_backend_metal_i = {
/* .get_name = */ ggml_backend_metal_name,
/* .free = */ ggml_backend_metal_free,
@@ -3193,7 +3199,7 @@ static struct ggml_backend_i ggml_backend_metal_i = {
/* .graph_compute = */ ggml_backend_metal_graph_compute,
/* .supports_op = */ ggml_backend_metal_supports_op,
/* .supports_buft = */ ggml_backend_metal_supports_buft,
- /* .offload_op = */ NULL,
+ /* .offload_op = */ ggml_backend_metal_offload_op,
/* .event_new = */ NULL,
/* .event_free = */ NULL,
/* .event_record = */ NULL,
I get the following schedule:
How does the logic decide to also offload nodes …
In the first pass, ops with weights are assigned the backend of the weight.
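As a toy illustration of that first pass (made-up types and fields, not ggml's actual scheduler code):

```c
#include <stddef.h>

#define MAX_SRC 4

// Toy stand-ins for ggml's tensor/backend bookkeeping.
struct tensor {
    struct tensor * src[MAX_SRC]; // inputs of this op
    int  backend;                 // backend holding this tensor (-1 = unassigned)
    int  is_weight;               // 1 if this is a pre-allocated model weight
};

// First pass: an op that reads a weight is assigned the backend that
// holds that weight; ops without weights are decided in later passes
// (and offload_op can override the choice, as in the Metal patch above).
static int pick_backend_pass1(const struct tensor * node) {
    for (int i = 0; i < MAX_SRC; i++) {
        const struct tensor * s = node->src[i];
        if (s != NULL && s->is_weight) {
            return s->backend;
        }
    }
    return -1; // no weight among the sources
}
```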
This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.
Moves BLAS support from ggml.c to a separate backend, and adds the necessary changes to ggml-backend to support backends that only implement matrix multiplication.
- ggml_backend_sched supports backends that only implement a subset of operations, as reported by the supports_op function of the backend
- Changes supports_backend to the backend function supports_buft; backends can check ggml_backend_buft_is_host from supports_buft, and ggml_backend_sched will avoid copies between backends when the backend supports the buffer type
- The GGML_SCHED_DEBUG environment variable can be used to view the graph splits. This is useful to see what operations are being run on each backend
- The BLAS backend uses the same number of threads as the CPU backend (-t or -tb)
- Enabled with LLAMA_BLAS when using cmake, or when using make, LLAMA_OPENBLAS, LLAMA_OPENBLAS64 or LLAMA_BLIS
- BLAS support is removed from ggml.c. Applications that want to support BLAS will need to use the BLAS backend with ggml_backend_sched alongside the CPU or other backends to provide support for other operations (see the sketch after this list)
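For the last point, a minimal sketch of how an application might set this up, assuming the post-PR headers and function names shown here (ggml-blas.h, ggml_backend_blas_init, ggml_backend_sched_new); check the actual headers of your ggml revision:

```c
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-blas.h" // assumed header exposing ggml_backend_blas_init()

// Create a scheduler that runs matrix multiplications on the BLAS
// backend and falls back to the CPU backend for all other operations.
static ggml_backend_sched_t make_blas_cpu_sched(void) {
    ggml_backend_t backends[2] = {
        ggml_backend_blas_init(), // implements (mostly) mul_mat
        ggml_backend_cpu_init(),  // last backend acts as the fallback
    };
    // NULL: use each backend's default buffer type; false: no parallel
    // execution across backends.
    return ggml_backend_sched_new(backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false);
}

// Then, for each graph: ggml_backend_sched_graph_compute(sched, graph);
```

Running such a program with the GGML_SCHED_DEBUG environment variable set prints graph splits like the ones shown earlier in this thread.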