cuBLAS: refactor and optimize f16 mat mul performance #1259
Conversation
Specifically, this adds vector versions of …
ggml-cuda.cu (Outdated)

```diff
-__half d;        // delta
-__half m;        // min
+half d;          // delta
+half m;          // min
 uint32_t qh;     // 5-th bit of quants
```
At some point, we should sync the CUDA `block_q5_1` with the CPU one:
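For reference, the CPU-side struct looked roughly like this at the time (paraphrased from `ggml.c`; treat it as a sketch rather than a verbatim copy):

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // f16 value stored as raw bits on the CPU side

#define QK5_1 32

typedef struct {
    ggml_fp16_t d;              // delta
    ggml_fp16_t m;              // min
    uint8_t     qh[4];          // 5-th bit of quants
    uint8_t     qs[QK5_1 / 2];  // nibbles / quants
} block_q5_1;
```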
I am not entirely sure why this isn't the case already. Did you have any problems with alignment or anything else?
I updated it in the same way as q5_0 and didn't notice any issues.
For `Q5_1` it works both ways. For `Q5_0`, the `uint32_t` way does not work due to alignment issues, so we changed `Q5_1` to `uint8_t[4]` for consistency.
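To illustrate the alignment point, here is a small sketch (made-up struct names, f16 stored as `uint16_t`; not the upstream structs): with only one 2-byte field in front, a `uint32_t qh` forces 4-byte alignment and 2 bytes of padding, so the struct no longer matches the packed quantized data, while `uint8_t qh[4]` keeps it dense. For `Q5_1` the two 2-byte fields already add up to 4 bytes, so either representation lines up.

```c
#include <stdint.h>
#include <stdio.h>

#define QK5_0 32

// Q5_0 with a uint32_t bit field: the uint32_t requires 4-byte alignment,
// so 2 bytes of padding are inserted after the 2-byte delta and the struct
// no longer matches the packed quantized data layout.
typedef struct {
    uint16_t d;              // delta (f16 bits, 2 bytes)
    uint32_t qh;             // 5-th bits, 4-byte aligned -> 2 bytes of padding before it
    uint8_t  qs[QK5_0 / 2];  // nibbles
} q5_0_padded;               // sizeof == 24

// Q5_0 with uint8_t[4]: only 1-byte alignment is required, no padding.
typedef struct {
    uint16_t d;              // delta (f16 bits, 2 bytes)
    uint8_t  qh[4];          // 5-th bits
    uint8_t  qs[QK5_0 / 2];  // nibbles
} q5_0_packed;               // sizeof == 22

int main(void) {
    printf("with uint32_t qh: %zu bytes, with uint8_t qh[4]: %zu bytes\n",
           sizeof(q5_0_padded), sizeof(q5_0_packed));
    return 0;
}
```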
Moves all the cuBLAS-specific code from `ggml.c` to `ggml-cuda.cu`. This also makes `ggml-cuda.h` much simpler, since fewer definitions have to be exposed now. Additionally, it improves mat mul performance by using multiple streams where possible (when multiplying 3- or 4-dimensional tensors), and by choosing between doing f16 x f32 mat muls either as f16 x f16 or as f32 x f32, depending on which requires less data to be transferred to the GPU.

Overall, this improves perplexity times with cuBLAS by ~15%.
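As a rough illustration of the multi-stream part (not the actual `ggml-cuda.cu` code; the function, the f32-only GEMM, and the preallocated device buffers are assumptions for the sketch), each 2D slice of a 3D/4D mat mul is processed on its own stream so uploads, GEMMs, and downloads of different slices can overlap:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical constant: how many CUDA streams to round-robin over.
#define MAX_STREAMS 8

// Multiply the i-th 2D slices of two batched f32 tensors on their own stream:
//   dst[i] = src0[i] * src1[i]   for i in [0, n_batch)
// src0: n_batch slices of (m x k), src1: n_batch slices of (k x n), host pointers;
// d_* are preallocated device buffers large enough for all slices.
static void mul_mat_batched_streams(
        cublasHandle_t handle,
        const float * src0, const float * src1, float * dst,
        float * d_src0, float * d_src1, float * d_dst,
        int m, int n, int k, int n_batch) {
    cudaStream_t streams[MAX_STREAMS];
    for (int i = 0; i < MAX_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    const float alpha = 1.0f, beta = 0.0f;
    const size_t sz0 = (size_t) m * k, sz1 = (size_t) k * n, szd = (size_t) m * n;

    for (int i = 0; i < n_batch; ++i) {
        cudaStream_t stream = streams[i % MAX_STREAMS];

        // async uploads of this slice on its own stream
        cudaMemcpyAsync(d_src0 + i*sz0, src0 + i*sz0, sz0 * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_src1 + i*sz1, src1 + i*sz1, sz1 * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        // run the GEMM for this slice on the same stream (column-major layout)
        cublasSetStream(handle, stream);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, d_src0 + i*sz0, m,
                            d_src1 + i*sz1, k,
                    &beta,  d_dst  + i*szd, m);

        // async download of the result slice
        cudaMemcpyAsync(dst + i*szd, d_dst + i*szd, szd * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < MAX_STREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
    }
}
```

For the copies to actually overlap, the host buffers need to be pinned (e.g. allocated with `cudaMallocHost`); the real implementation also handles the f16/f32 choice and non-contiguous tensors, which this sketch ignores.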
🤖 Generated by Copilot at 4e54943
Summary
🚀🧹🛠️
This pull request improves the performance, compatibility, and readability of the GGML library and the llama model loader. It refactors the CUDA and BLAS code, simplifies the error checking and memory management, and exposes some useful functions and macros. The main files affected are `ggml-cuda.h`, `ggml.c`, `ggml.h`, `llama-util.h`, and `llama.cpp`.

Walkthrough
- Moving the cuBLAS-specific code to `ggml-cuda.cu`, declaring its interface in `ggml-cuda.h`, and calling it from `ggml.c` with conditional compilation
- Moving … to `ggml.h` and removing them from `ggml.c`
- Moving … from `ggml.c` to `ggml.h`, to make it available for other source files that use the GGML library
- Cleaning up `ggml.c` by removing unused variables, empty lines, and redundant conditional compilation

From #1233:

- Updating the `llama_buffer` and `llama_ctx_buffer` structs in `llama-util.h` by adding default constructors and disabling copy and move constructors and assignment operators, to prevent memory leaks or errors (see the sketch after this list)
- Simplifying the `llama_model_loader` struct in `llama.cpp` by using the constructor of the `std::vector` instead of the `resize` method
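A minimal sketch of the non-copyable buffer pattern described above (not the exact `llama-util.h` code; the member names and `resize` helper are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// A buffer that owns its allocation. Copy and move operations are deleted so
// the raw pointer can never be freed twice; the default constructor leaves it
// empty until resize() is called.
struct llama_buffer {
    uint8_t * addr = nullptr;
    size_t    size = 0;

    llama_buffer() = default;

    // non-copyable, non-movable: ownership stays with exactly one object
    llama_buffer(const llama_buffer &) = delete;
    llama_buffer & operator=(const llama_buffer &) = delete;
    llama_buffer(llama_buffer &&) = delete;
    llama_buffer & operator=(llama_buffer &&) = delete;

    void resize(size_t new_size) {
        delete[] addr;
        addr = new uint8_t[new_size];
        size = new_size;
    }

    ~llama_buffer() {
        delete[] addr;
    }
};
```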