CUDA: add bf16 and f32 support to cublas_mul_mat_batched #14361

Open

am17an wants to merge 4 commits into master from add_bp16_fp32_to_cublas_batched

Conversation

am17an (Collaborator) commented Jun 24, 2025

Add bf16 and f32 support to batched cuBLAS mul_mat. This gives a speedup when running llama-bench with `--cache_type_v bf16 --cache_type_k bf16`:

| Model | Test | t/s master | t/s add_bp16_fp32_to_cublas_batched | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q5_K_M | pp512 | 3681.87 | 4807.34 | 1.31 |
| llama 7B Q5_K_M | tg128 | 129.40 | 129.28 | 1.00 |
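
For context, here is a minimal sketch of the kind of type dispatch such a path needs, written against the public cuBLAS API. The helper `pick_cublas_types()` and its structure are illustrative assumptions, not this PR's actual code:

```cpp
// Illustrative sketch only: mapping ggml tensor types to cuBLAS data/compute
// types for a batched GEMM. pick_cublas_types() is a hypothetical helper,
// not code from this PR.
#include <cublas_v2.h>
#include "ggml.h"

struct cublas_types {
    cudaDataType_t      data;    // element type of the A/B operands
    cublasComputeType_t compute; // accumulation precision
};

static cublas_types pick_cublas_types(ggml_type t) {
    switch (t) {
        case GGML_TYPE_F16:  return { CUDA_R_16F,  CUBLAS_COMPUTE_16F };
        // cuBLAS has no bf16 compute mode; bf16 operands accumulate in fp32
        case GGML_TYPE_BF16: return { CUDA_R_16BF, CUBLAS_COMPUTE_32F };
        case GGML_TYPE_F32:  return { CUDA_R_32F,  CUBLAS_COMPUTE_32F };
        default:             GGML_ABORT("type not supported by batched cuBLAS mul_mat");
    }
}
```

The selected operand type would then be passed as the `Atype`/`Btype` arguments and the compute type as the `computeType` argument of `cublasGemmStridedBatchedEx` (or `cublasGemmBatchedEx`).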

am17an marked this pull request as ready for review June 24, 2025 10:07
am17an requested a review from JohannesGaessler June 24, 2025 10:07
github-actions bot added the testing, Nvidia GPU, and ggml labels Jun 24, 2025
am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from 79ca9fd to 789c697 June 24, 2025 12:37
am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from fe14807 to b7225ec June 25, 2025 08:00
am17an requested a review from JohannesGaessler June 25, 2025 08:01
am17an (Collaborator, Author) commented Jun 25, 2025

mul_mat_batched with bf16 is failing for ubuntu-22-cmake-vulkan; should I remove the extra tests?

am17an force-pushed the add_bp16_fp32_to_cublas_batched branch from 2b83788 to 87aeacf June 25, 2025 10:16
am17an requested a review from JohannesGaessler June 25, 2025 10:45
am17an (Collaborator, Author) commented Jun 25, 2025

> mul_mat_batched with bf16 is failing for ubuntu-22-cmake-vulkan; should I remove the extra tests?

@JohannesGaessler The bf16 mul-mat tests fail for Vulkan because of an assert: `/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5134: GGML_ASSERT(ggml_vk_dim01_contiguous(src0) || src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16) failed`. I'm not familiar with the Vulkan code, so I'm not sure what to do.

JohannesGaessler (Collaborator) commented

Sorry, I didn't see the Vulkan comment. From what I can tell, the problem is that the logic in ggml_backend_vk_device_supports_op and the assert are inconsistent. Presumably one of the two was changed without updating the other (and this simply wasn't noticed until now), so the fix should be to update the outdated one. @0cc4m @jeffbolznv can either of you weigh in on which version is correct?
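
For illustration, a heavily simplified sketch of the inconsistency being described, with hypothetical stand-ins for both checks; the real code in ggml-vulkan.cpp is more involved:

```cpp
// Hypothetical, simplified illustration of the mismatch; not the actual
// ggml-vulkan.cpp code. vk_supports_mul_mat() is an invented stand-in.
#include "ggml.h"

// If the capability check reports BF16 mul_mat as supported...
static bool vk_supports_mul_mat(const ggml_tensor * op) {
    switch (op->src[0]->type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
        case GGML_TYPE_BF16: // reported as supported here...
            return true;
        default:
            return false;
    }
}

// ...then the assert guarding the same path must accept BF16 too, otherwise a
// BF16 test case passes supports_op and then trips the assert (shown as a
// comment because ggml_vk_dim01_contiguous is internal to ggml-vulkan.cpp):
//
//   GGML_ASSERT(ggml_vk_dim01_contiguous(src0) ||
//               src0->type == GGML_TYPE_F32 ||
//               src0->type == GGML_TYPE_F16 ||
//               src0->type == GGML_TYPE_BF16); // assumed addition to reconcile
```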

jeffbolznv (Collaborator) commented

I think this was supposed to work, but after just changing the assert I still see the test fail. I'll debug it.

jeffbolznv (Collaborator) commented

#14378 should fix the new tests.

Labels: testing, Nvidia GPU, ggml
3 participants