CUDA: add bf16 and f32 support to cublas_mul_mat_batched #14361

Open

am17an wants to merge 4 commits into master from add_bp16_fp32_to_cublas_batched

Conversation

am17an (Collaborator) commented Jun 24, 2025

Add bf16 and f32 support to batched cuBLAS mul_mat. This gives a speedup when running llama-bench with `--cache_type_v bf16 --cache_type_k bf16`:

| Model | Test | t/s master | t/s add_bp16_fp32_to_cublas_batched | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q5_K_M | pp512 | 3681.87 | 4807.34 | 1.31 |
| llama 7B Q5_K_M | tg128 | 129.40 | 129.28 | 1.00 |
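
For context, here is a minimal sketch of the kind of type dispatch such a path needs, written against the public cuBLAS API. The helper `pick_cublas_types()` and its structure are illustrative assumptions, not this PR's actual code:

```cpp
// Illustrative sketch only: mapping ggml tensor types to cuBLAS data/compute
// types for a batched GEMM. pick_cublas_types() is a hypothetical helper,
// not code from this PR.
#include <cublas_v2.h>
#include "ggml.h"

struct cublas_types {
    cudaDataType_t      data;    // element type of the A/B operands
    cublasComputeType_t compute; // accumulation precision
};

static cublas_types pick_cublas_types(ggml_type t) {
    switch (t) {
        case GGML_TYPE_F16:  return { CUDA_R_16F,  CUBLAS_COMPUTE_16F };
        // cuBLAS has no bf16 compute mode; bf16 operands accumulate in fp32
        case GGML_TYPE_BF16: return { CUDA_R_16BF, CUBLAS_COMPUTE_32F };
        case GGML_TYPE_F32:  return { CUDA_R_32F,  CUBLAS_COMPUTE_32F };
        default:             GGML_ABORT("type not supported by batched cuBLAS mul_mat");
    }
}
```

The selected operand type would then be passed as the `Atype`/`Btype` arguments and the compute type as the `computeType` argument of `cublasGemmStridedBatchedEx` (or `cublasGemmBatchedEx`).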

am17an marked this pull request as ready for review June 24, 2025 10:07
am17an requested a review from JohannesGaessler June 24, 2025 10:07
github-actions bot added the testing, Nvidia GPU, and ggml labels Jun 24, 2025
am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from 79ca9fd to 789c697 June 24, 2025 12:37
am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from fe14807 to b7225ec June 25, 2025 08:00
am17an requested a review from JohannesGaessler June 25, 2025 08:01
am17an (Collaborator, Author) commented Jun 25, 2025

mul_mat_batched with bf16 is failing for ubuntu-22-cmake-vulkan; should I remove the extra tests?

am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from 2b83788 to 87aeacf June 25, 2025 10:16
am17an requested a review from JohannesGaessler June 25, 2025 10:45
am17an (Collaborator, Author) commented Jun 25, 2025

> mul_mat_batched with bf16 is failing for ubuntu-22-cmake-vulkan; should I remove the extra tests?

@JohannesGaessler The bf16 mul-mat tests fail for Vulkan because of an assert: `/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5134: GGML_ASSERT(ggml_vk_dim01_contiguous(src0) || src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16) failed`. I'm not familiar with the Vulkan code, so I'm not sure what to do.

JohannesGaessler (Collaborator) commented

Sorry, I didn't see the Vulkan comment. From what I can tell, the problem is that the logic in ggml_backend_vk_device_supports_op and the assert are inconsistent. Presumably one of the two was changed without updating the other (and this simply wasn't noticed until now), so the fix should be to update the outdated one. @0cc4m @jeffbolznv can either of you weigh in on which version is correct?
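
For illustration, a heavily simplified sketch of the inconsistency being described, with hypothetical stand-ins for both checks; the real code in ggml-vulkan.cpp is more involved:

```cpp
// Hypothetical, simplified illustration of the mismatch; not the actual
// ggml-vulkan.cpp code. vk_supports_mul_mat() is an invented stand-in.
#include "ggml.h"

// If the capability check reports BF16 mul_mat as supported...
static bool vk_supports_mul_mat(const ggml_tensor * op) {
    switch (op->src[0]->type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
        case GGML_TYPE_BF16: // reported as supported here...
            return true;
        default:
            return false;
    }
}

// ...then the assert guarding the same path must accept BF16 too, otherwise a
// BF16 test case passes supports_op and then trips the assert (shown as a
// comment because ggml_vk_dim01_contiguous is internal to ggml-vulkan.cpp):
//
//   GGML_ASSERT(ggml_vk_dim01_contiguous(src0) ||
//               src0->type == GGML_TYPE_F32 ||
//               src0->type == GGML_TYPE_F16 ||
//               src0->type == GGML_TYPE_BF16); // assumed addition to reconcile
```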

jeffbolznv (Collaborator) commented

I think this was supposed to work, but after just changing the assert I still see the test fail. I'll debug it.

jeffbolznv (Collaborator) commented

#14378 should fix the new tests.

Labels: testing, Nvidia GPU, ggml
3 participants