It appears context memory usage can be trivially halved by using fp16? #146

Closed
jarcen opened this issue Mar 14, 2023 · 0 comments · Fixed by #294

Labels
enhancement New feature or request

Comments

jarcen commented Mar 14, 2023

I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.

LLaMA 7B quantized to 4 bits reports `ggml ctx size = 8113.34 MB`.

I went into the code and changed the data type for `memory_k` and `memory_v` from `GGML_TYPE_F32` to `GGML_TYPE_F16`.

These are the changed lines:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v

And these:

        model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
        model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

The new memory usage is reported as `ggml ctx size = 6065.34 MB`, and Task Manager agrees. That's 2 GB down.
So far everything is working: no crashes and no degradation in quality. Is there any reason not to do this?
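
As a quick sanity check on the numbers, here is a minimal standalone sketch (not code from this repo; the `n_layer = 32` and `n_embd = 4096` values are the usual LLaMA-7B shapes, assumed rather than read from the log above) that reproduces the 2 GB difference at a 4096-token context:

    #include <stdio.h>

    int main(void) {
        // Assumed LLaMA-7B shape: 32 layers, 4096-dim embeddings, 4096-token context
        const long n_ctx = 4096, n_layer = 32, n_embd = 4096;
        const long n_elements = n_ctx * n_layer * n_embd; // per memory_k or memory_v

        // Combined K + V cache size in MB: 4 bytes per element for F32, 2 for F16
        double f32_mb = 2.0 * n_elements * 4 / (1024.0 * 1024.0);
        double f16_mb = 2.0 * n_elements * 2 / (1024.0 * 1024.0);

        printf("fp32 KV cache: %.2f MB\n", f32_mb);          // 4096.00 MB
        printf("fp16 KV cache: %.2f MB\n", f16_mb);          // 2048.00 MB
        printf("difference:    %.2f MB\n", f32_mb - f16_mb); // 2048.00 MB
        return 0;
    }

The 2048 MB difference matches the drop from 8113.34 MB to 6065.34 MB exactly.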
