cuBLAS: refactor and optimize f16 mat mul performance #1259
Conversation
Specifically, this adds vector versions of …
ggml-cuda.cu (Outdated)

```diff
-__half d;        // delta
-__half m;        // min
+half d;          // delta
+half m;          // min
 uint32_t qh;     // 5-th bit of quants
```
At some point, we should sync the CUDA `block_q5_1` with the CPU one:
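For reference, the CPU-side struct looked roughly like this at the time (paraphrased from `ggml.c`; treat it as a sketch rather than a verbatim copy):

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // f16 value stored as raw bits on the CPU side

#define QK5_1 32

typedef struct {
    ggml_fp16_t d;              // delta
    ggml_fp16_t m;              // min
    uint8_t     qh[4];          // 5-th bit of quants
    uint8_t     qs[QK5_1 / 2];  // nibbles / quants
} block_q5_1;
```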
I am not entirely sure why this isn't the case already. Did you have any problems with alignment or anything else?
I updated it in the same way as q5_0 and didn't notice any issues.
For `Q5_1` it works both ways. For `Q5_0`, the `uint32_t` way does not work due to alignment issues, so we changed `Q5_1` to `uint8_t[4]` for consistency.
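To illustrate the alignment point, here is a small sketch (made-up struct names, f16 stored as `uint16_t`; not the upstream structs): with only one 2-byte field in front, a `uint32_t qh` forces 4-byte alignment and 2 bytes of padding, so the struct no longer matches the packed quantized data, while `uint8_t qh[4]` keeps it dense. For `Q5_1` the two 2-byte fields already add up to 4 bytes, so either representation lines up.

```c
#include <stdint.h>
#include <stdio.h>

#define QK5_0 32

// Q5_0 with a uint32_t bit field: the uint32_t requires 4-byte alignment,
// so 2 bytes of padding are inserted after the 2-byte delta and the struct
// no longer matches the packed quantized data layout.
typedef struct {
    uint16_t d;              // delta (f16 bits, 2 bytes)
    uint32_t qh;             // 5-th bits, 4-byte aligned -> 2 bytes of padding before it
    uint8_t  qs[QK5_0 / 2];  // nibbles
} q5_0_padded;               // sizeof == 24

// Q5_0 with uint8_t[4]: only 1-byte alignment is required, no padding.
typedef struct {
    uint16_t d;              // delta (f16 bits, 2 bytes)
    uint8_t  qh[4];          // 5-th bits
    uint8_t  qs[QK5_0 / 2];  // nibbles
} q5_0_packed;               // sizeof == 22

int main(void) {
    printf("with uint32_t qh: %zu bytes, with uint8_t qh[4]: %zu bytes\n",
           sizeof(q5_0_padded), sizeof(q5_0_packed));
    return 0;
}
```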
Moves all the cuBLAS-specific code from `ggml.c` to `ggml-cuda.cu`. This also makes `ggml-cuda.h` much simpler, since fewer definitions have to be exposed now. Additionally, it improves mat mul performance by using multiple streams where possible (when multiplying 3- or 4-dimensional tensors), and by choosing between doing f16 x f32 mat muls either as f16 x f16 or as f32 x f32, depending on which requires less data to be transferred to the GPU.

Overall, this improves perplexity times with cuBLAS by ~15%.
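As a rough illustration of the multi-stream part (not the actual `ggml-cuda.cu` code; the function, the f32-only GEMM, and the preallocated device buffers are assumptions for the sketch), each 2D slice of a 3D/4D mat mul is processed on its own stream so uploads, GEMMs, and downloads of different slices can overlap:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical constant: how many CUDA streams to round-robin over.
#define MAX_STREAMS 8

// Multiply the i-th 2D slices of two batched f32 tensors on their own stream:
//   dst[i] = src0[i] * src1[i]   for i in [0, n_batch)
// src0: n_batch slices of (m x k), src1: n_batch slices of (k x n), host pointers;
// d_* are preallocated device buffers large enough for all slices.
static void mul_mat_batched_streams(
        cublasHandle_t handle,
        const float * src0, const float * src1, float * dst,
        float * d_src0, float * d_src1, float * d_dst,
        int m, int n, int k, int n_batch) {
    cudaStream_t streams[MAX_STREAMS];
    for (int i = 0; i < MAX_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    const float alpha = 1.0f, beta = 0.0f;
    const size_t sz0 = (size_t) m * k, sz1 = (size_t) k * n, szd = (size_t) m * n;

    for (int i = 0; i < n_batch; ++i) {
        cudaStream_t stream = streams[i % MAX_STREAMS];

        // async uploads of this slice on its own stream
        cudaMemcpyAsync(d_src0 + i*sz0, src0 + i*sz0, sz0 * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_src1 + i*sz1, src1 + i*sz1, sz1 * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        // run the GEMM for this slice on the same stream (column-major layout)
        cublasSetStream(handle, stream);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, d_src0 + i*sz0, m,
                            d_src1 + i*sz1, k,
                    &beta,  d_dst  + i*szd, m);

        // async download of the result slice
        cudaMemcpyAsync(dst + i*szd, d_dst + i*szd, szd * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < MAX_STREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
    }
}
```

For the copies to actually overlap, the host buffers need to be pinned (e.g. allocated with `cudaMallocHost`); the real implementation also handles the f16/f32 choice and non-contiguous tensors, which this sketch ignores.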
🤖 Generated by Copilot at 4e54943
Summary
🚀🧹🛠️
This pull request improves the performance, compatibility, and readability of the GGML library and the llama model loader. It refactors the CUDA and BLAS code, simplifies the error checking and memory management, and exposes some useful functions and macros. The main files affected are `ggml-cuda.h`, `ggml.c`, `ggml.h`, `llama-util.h`, and `llama.cpp`.

Walkthrough
- Moving the cuBLAS-specific code to `ggml-cuda.cu`, declaring its interface in `ggml-cuda.h`, and calling it from `ggml.c` with conditional compilation
- Moving … to `ggml.h` and removing them from `ggml.c`
- Moving … from `ggml.c` to `ggml.h`, to make it available for other source files that use the GGML library
- Cleaning up `ggml.c` by removing unused variables, empty lines, and redundant conditional compilation

From #1233:

- Updating the `llama_buffer` and `llama_ctx_buffer` structs in `llama-util.h` by adding default constructors and disabling copy and move constructors and assignment operators, to prevent memory leaks or errors (see the sketch after this list)
- Simplifying the `llama_model_loader` struct in `llama.cpp` by using the constructor of the `std::vector` instead of the `resize` method
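A minimal sketch of the non-copyable buffer pattern described above (not the exact `llama-util.h` code; the member names and `resize` helper are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// A buffer that owns its allocation. Copy and move operations are deleted so
// the raw pointer can never be freed twice; the default constructor leaves it
// empty until resize() is called.
struct llama_buffer {
    uint8_t * addr = nullptr;
    size_t    size = 0;

    llama_buffer() = default;

    // non-copyable, non-movable: ownership stays with exactly one object
    llama_buffer(const llama_buffer &) = delete;
    llama_buffer & operator=(const llama_buffer &) = delete;
    llama_buffer(llama_buffer &&) = delete;
    llama_buffer & operator=(llama_buffer &&) = delete;

    void resize(size_t new_size) {
        delete[] addr;
        addr = new uint8_t[new_size];
        size = new_size;
    }

    ~llama_buffer() {
        delete[] addr;
    }
};
```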