It appears context memory usage can be trivially halved by using fp16? #146

Closed
jarcen opened this issue Mar 14, 2023 · 0 comments · Fixed by #294

Labels
enhancement New feature or request

Comments

jarcen commented Mar 14, 2023

I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.

LLaMA 7B quantized to 4 bits reports `ggml ctx size = 8113.34 MB`.

I went into the code and changed the data type for `memory_k` and `memory_v` from `GGML_TYPE_F32` to `GGML_TYPE_F16`.

These are the changed lines:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v

And these:

        model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
        model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

The new memory usage is reported as `ggml ctx size = 6065.34 MB`, and Task Manager agrees. That's 2 GB down.
So far everything is working: no crashes and no degradation in quality. Is there any reason not to do this?
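
As a quick sanity check on the numbers, here is a minimal standalone sketch (not code from this repo; the `n_layer = 32` and `n_embd = 4096` values are the usual LLaMA-7B shapes, assumed rather than read from the log above) that reproduces the 2 GB difference at a 4096-token context:

    #include <stdio.h>

    int main(void) {
        // Assumed LLaMA-7B shape: 32 layers, 4096-dim embeddings, 4096-token context
        const long n_ctx = 4096, n_layer = 32, n_embd = 4096;
        const long n_elements = n_ctx * n_layer * n_embd; // per memory_k or memory_v

        // Combined K + V cache size in MB: 4 bytes per element for F32, 2 for F16
        double f32_mb = 2.0 * n_elements * 4 / (1024.0 * 1024.0);
        double f16_mb = 2.0 * n_elements * 2 / (1024.0 * 1024.0);

        printf("fp32 KV cache: %.2f MB\n", f32_mb);          // 4096.00 MB
        printf("fp16 KV cache: %.2f MB\n", f16_mb);          // 2048.00 MB
        printf("difference:    %.2f MB\n", f32_mb - f16_mb); // 2048.00 MB
        return 0;
    }

The 2048 MB difference matches the drop from 8113.34 MB to 6065.34 MB exactly.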
