Adjust mul_mat_f16 work memory #1226

Merged

ggerganov merged 3 commits into master from adjust-mul-mat-f16-work-memory on Apr 29, 2023

Conversation

@ggerganov (Member) commented Apr 29, 2023

Haven't tested this yet. The goal is to allocate just the needed amount of memory when not using cuBLAS.
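
A minimal sketch of the idea, assuming ggml's usual layout: choose the work-buffer size for mul_mat_f16 per build configuration instead of always reserving the worst case. The helper name `mul_mat_f16_work_size`, the simplified tensor struct, and the exact size formulas are hypothetical illustrations, not the code in this PR.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ggml_fp16_t;  // stand-in for ggml's half-precision type

// Simplified 2D tensor holding just the element counts needed here.
struct tensor {
    int64_t ne0;  // number of columns
    int64_t ne1;  // number of rows
};

// Hypothetical work-size helper: reserve a full f16 copy of src1 only on
// the code path that actually needs it, rather than sizing every build
// for the worst case.
static size_t mul_mat_f16_work_size(const struct tensor * src1) {
#if defined(GGML_USE_CUBLAS)
    // cuBLAS path (assumed): all of src1 is converted up front, so the
    // buffer must hold every element as f16.
    return sizeof(ggml_fp16_t) * (size_t)(src1->ne0 * src1->ne1);
#else
    // CPU path (assumed): only one row is converted at a time, so a
    // single row's worth of f16 values suffices.
    return sizeof(ggml_fp16_t) * (size_t)src1->ne0;
#endif
}
```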

@ggerganov requested review from slaren and 0cc4m on April 29, 2023 09:45
@ggerganov force-pushed the adjust-mul-mat-f16-work-memory branch from 638651a to 0ffcd89 on April 29, 2023 10:54
@slaren (Member) commented Apr 29, 2023

Looks good. I didn't realize that this could increase the maximum size of the work memory, so I had set it to the worst-case maximum to make testing easier.
In the future we shouldn't need any work memory at all for this with cuBLAS: I have been testing converting between f16 and f32 on the GPU, and it is faster that way.
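
A minimal sketch of the device-side conversion described here: widening f16 to f32 directly on the GPU, so the data never round-trips through a host work buffer. The kernel name and launch shape are illustrative assumptions, not the code that later landed in ggml.

```cuda
#include <cuda_fp16.h>

// One thread per element: read a half, widen it to float on the device.
__global__ void convert_fp16_to_fp32(const __half * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __half2float(src[i]);
    }
}

// Illustrative launch: 256 threads per block, enough blocks to cover n.
// convert_fp16_to_fp32<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
```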

@ggerganov merged commit 214b6a3 into master on Apr 29, 2023
@ggerganov deleted the adjust-mul-mat-f16-work-memory branch on April 29, 2023 15:43