perf(ggml): tall and skinny GEMM for LoRA: F32 mul_mat([16 X 5120], [16 X 5120]) takes 120ms - 24x slower than expected
#956
See https://arxiv.org/pdf/2208.08088.pdf, a specialized GEMM for tall and skinny matrices. It reports around a 4X speedup over BLAS, but that is not even close to the 24x we need to explain here.
To my understanding, the vectorization in BLAS and ggml happens along the row dimension, so tall and skinny matrices get essentially zero useful vectorization. Correct? @ggerganov I will try to unroll the loops to get the vectorization.
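For illustration, here is a minimal sketch of the dot-product formulation (not ggml's or BLAS's actual code; the function name and layout are hypothetical). The SIMD lanes run along the K dimension, so with K = 16 the vectorized inner loop does only two 8-wide FMAs per output element and then pays a horizontal reduction, which is where the time goes.

```c
// Hypothetical per-row dot-product kernel, for illustration only.
// C[i][j] = sum_k A[i][k] * B[j][k], with A: M x K and B: N x K, row-major.
// Assumes K is a multiple of 8. Compile with e.g. gcc -O2 -mavx2 -mfma.
#include <immintrin.h>

static void matmul_rowdot_f32(const float *A, const float *B, float *C,
                              int M, int N, int K) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            __m256 acc = _mm256_setzero_ps();
            for (int k = 0; k < K; k += 8) {              // only 2 iterations when K == 16
                __m256 a = _mm256_loadu_ps(&A[i * K + k]);
                __m256 b = _mm256_loadu_ps(&B[j * K + k]);
                acc = _mm256_fmadd_ps(a, b, acc);
            }
            // horizontal reduction: paid once per output element
            __m128 lo = _mm256_castps256_ps128(acc);
            __m128 hi = _mm256_extractf128_ps(acc, 1);
            __m128 s  = _mm_add_ps(lo, hi);
            s = _mm_hadd_ps(s, s);
            s = _mm_hadd_ps(s, s);
            C[i * N + j] = _mm_cvtss_f32(s);
        }
    }
}
```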
Where are you seeing dimensions like that? I'm seeing outputs more like this:
The timings you show are from multiplying square-ish matrices with short and wide matrices (vectors). We are multiplying tall and skinny matrices to obtain the delta update from LoRA adapters (low-rank adapters from fine-tuning).
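To make the contrast concrete, a rough sketch of the two shapes (dimensions taken from the issue title and the timing quoted in the issue body; the exact operand order inside ggml's mul_mat may differ):

```c
// Regular inference matmul (the kind shown in typical perf logs):
//   [5120 x 5120] weight   x  [5120 x 8] activations  ->  [5120 x 8] output
// LoRA delta-update matmul (this issue):
//   [5120 x 16] adapter B  x  [16 x 5120] adapter A   ->  [5120 x 5120] delta
// In the LoRA case both operands are tall and skinny: one dimension is the
// adapter rank (16), the other is the model's hidden size (5120).
```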
LOLOL
Oh dang, I thought you were talking about general optimizations relating to the general mat_mul / q_f32. Didn't know it was about LoRA. I've been looking at it all day haha
It's about segfaults. Let me collapse it as it's off-topic.
Actually, we are not even using BLAS here, just the ggml implementation, since these shapes do not fit the BLAS criteria. Hopefully that's enough to account for the difference.
This manages to run about 4X faster.

```c
// A = K x M, BT = B^T stored as K x N (row-major)
void multiply_tall_skinny_matrices(const float * A, const float * BT, float * C, int M, int N, int K, int ir0, int ir1) {
    for (int i = ir0; i < ir1; ++i) {
        for (int j = 0; j < N; j += 8) { // process 8 elements of C's row at a time (256-bit register / 32-bit float)
            __m256 c_row = _mm256_setzero_ps();                 // initialize the result vector to all zeros
            for (int k = 0; k < K; ++k) {
                __m256 a = _mm256_broadcast_ss(&A[i + k * M]);  // broadcast the k-th element of the i-th column of A
                __m256 b = _mm256_load_ps(&BT[j + k * N]);      // load a segment of the k-th row of B^T (the k-th column of B)
                c_row = _mm256_fmadd_ps(a, b, c_row);           // FMA: c_row = a * b + c_row
            }
            // store the result in the corresponding row of C
            _mm256_store_ps(&C[i * N + j], c_row);
        }
    }
}
```

GPT4 almost delivered after all. Luckily I'm still smarter than it.
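For what it's worth, a rough correctness check against a scalar reference could look like the sketch below (my addition, not from the thread). It assumes the kernel above is visible in the same translation unit, that N is a multiple of 8, and that BT and C are 32-byte aligned, since the kernel uses aligned 256-bit loads/stores. Compile with something like gcc -O2 -std=c11 -mavx2 -mfma ... -lm.

```c
// Rough correctness check for multiply_tall_skinny_matrices().
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
    const int M = 5120, N = 5120, K = 16;
    float *A  = malloc((size_t)K * M * sizeof(float));               // A  stored as K x M
    float *BT = aligned_alloc(32, (size_t)K * N * sizeof(float));    // B^T stored as K x N
    float *C  = aligned_alloc(32, (size_t)M * N * sizeof(float));
    float *R  = malloc((size_t)M * N * sizeof(float));

    for (size_t i = 0; i < (size_t)K * M; ++i) A[i]  = (float)rand() / RAND_MAX;
    for (size_t i = 0; i < (size_t)K * N; ++i) BT[i] = (float)rand() / RAND_MAX;

    multiply_tall_skinny_matrices(A, BT, C, M, N, K, 0, M);

    // scalar reference with the same indexing: R[i][j] = sum_k A[i + k*M] * BT[j + k*N]
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < K; ++k) s += A[i + k * M] * BT[j + k * N];
            R[i * N + j] = s;
        }

    double max_err = 0.0;
    for (size_t i = 0; i < (size_t)M * N; ++i)
        max_err = fmax(max_err, fabs((double)C[i] - R[i]));
    printf("max abs error: %g\n", max_err);

    free(A); free(BT); free(C); free(R);
    return 0;
}
```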
The AVX (128-bit) version is about 3X faster.

```c
void multiply_tall_skinny_matrices_avx(const float * A, const float * BT, float * C, int M, int N, int K, int ir0, int ir1) {
    for (int i = ir0; i < ir1; ++i) {
        for (int j = 0; j < N; j += 4) { // process 4 elements of C's row at a time (128-bit register / 32-bit float)
            __m128 c_row = _mm_setzero_ps();                 // initialize the result vector to all zeros
            for (int k = 0; k < K; ++k) {
                __m128 a = _mm_broadcast_ss(&A[i + k * M]);  // broadcast the k-th element of the i-th column of A
                __m128 b = _mm_loadu_ps(&BT[j + k * N]);     // load a segment of the k-th row of B^T (the k-th column of B)
                c_row = _mm_fmadd_ps(a, b, c_row);           // FMA: c_row = a * b + c_row
            }
            // store the result in the corresponding row of C
            _mm_store_ps(&C[i * N + j], c_row);
        }
    }
}
```
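The win in both kernels appears to be that broadcasting an element of A and streaming rows of B^T makes the SIMD lanes run along N = 5120 instead of K = 16, avoiding the per-element horizontal reduction of the dot-product formulation. The ir0/ir1 arguments look intended for splitting the rows of C across threads; a hedged sketch of such a split (the even partitioning and the ith/nth naming are my assumptions, not ggml's actual scheme):

```c
// Illustrative partitioning of C's rows across nth worker threads; ith is the
// index of the current thread. The even split is an assumption for this sketch.
void multiply_tall_skinny_split(const float * A, const float * BT, float * C,
                                int M, int N, int K, int ith, int nth) {
    const int rows_per_thread = (M + nth - 1) / nth;
    const int ir0 = ith * rows_per_thread;
    const int ir1 = (ir0 + rows_per_thread < M) ? ir0 + rows_per_thread : M;
    if (ir0 < ir1) {
        multiply_tall_skinny_matrices(A, BT, C, M, N, K, ir0, ir1);
    }
}
```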
I hope this can be optimized further to close the gap.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Context
LoRA requires computing the delta update `B A`, where `B` and `A` are tall and skinny (the low-rank adapter matrices).
See #820 for the use-case.
Problem
The estimated FLOPs of each of these matmuls is
16 * 5120 * 5120 ~= 0.419 GFLOPs
My setup can do 360 GFLOPS for f16 mul_mat (OpenBLAS). This would imply a computing time of ~1.2 ms for float16, so float32 should be in the same ballpark.
The model's
mul_mat([5120 X 5120], [5120 X 8])
of FLOP complexity 8 * 5120 * 5120 ~= 0.210 GFLOPs
takes 2.647 ms. So at the very least we should be getting ~5 ms, not 120 ms: we are observing a 24x deviation from the expected compute time.
The above observed time is roughly the same with or without OpenBLAS.
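For completeness, a back-of-the-envelope recomputation of the figures above (the 360 GFLOPS rate and the 2.647 ms timing are the measurements quoted in this issue, not mine):

```c
// Sanity check of the expected-time estimates and the ~24x gap.
#include <stdio.h>

int main(void) {
    const double lora_gflop  = 16.0 * 5120 * 5120 / 1e9;   // ~0.419 GFLOPs per LoRA matmul
    const double model_gflop =  8.0 * 5120 * 5120 / 1e9;   // ~0.210 GFLOPs per model matmul
    const double blas_rate   = 360.0;                      // GFLOPS, f16 OpenBLAS (quoted above)
    const double model_ms    = 2.647;                      // observed time for the model matmul

    const double expected_ms = model_ms * lora_gflop / model_gflop;
    printf("BLAS-based estimate : %.2f ms\n", lora_gflop / blas_rate * 1e3); // ~1.2 ms
    printf("scaled ggml estimate: %.2f ms\n", expected_ms);                  // ~5.3 ms
    printf("observed / expected : %.0fx\n", 120.0 / expected_ms);            // roughly the 24x in the title
    return 0;
}
```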
EDIT: with the optimizations discussed in the comments above, we are now at about 30-35 ms. So we have about 6-7x left to go.