
Command line switch to use F16 for memory_k and memory_v (refactor of #154) #294

Merged: 2 commits into ggml-org:master from f16_memory_cli on Mar 19, 2023

Conversation

@Green-Sky Green-Sky (Collaborator) commented Mar 19, 2023

Made the changes requested by @ggerganov in #154. Fixes #146.

With this change, the KV cache reported by llama_model_load is halved: memory_size = 512.00 MB -> memory_size = 256.00 MB
(ctx 512, 7B, q4_0)
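
For reference, those numbers line up with the size of the K/V cache: two tensors (memory_k and memory_v) of n_layer × n_ctx × n_embd elements each, at 4 bytes per element for F32 and 2 bytes for F16. A minimal back-of-the-envelope sketch of that arithmetic, assuming the standard 7B hyperparameters (n_layer = 32, n_embd = 4096) and n_ctx = 512; the variable names are illustrative, not taken from the llama.cpp source:

```cpp
// Sanity-check of the memory_size numbers above (assumed 7B hyperparameters).
#include <cstdio>
#include <cstddef>

int main() {
    const size_t n_layer = 32;   // LLaMA 7B
    const size_t n_embd  = 4096; // LLaMA 7B
    const size_t n_ctx   = 512;

    // memory_k and memory_v each hold n_layer * n_ctx * n_embd elements.
    const size_t n_elements = 2 * n_layer * n_ctx * n_embd;

    const double mb = 1024.0 * 1024.0;
    printf("F32 KV cache: %.2f MB\n", n_elements * 4 / mb); // -> 512.00 MB
    printf("F16 KV cache: %.2f MB\n", n_elements * 2 / mb); // -> 256.00 MB
    return 0;
}
```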

A quick, non-empirical comparison does not seem to show any degradation in prediction quality, but that might not mean much. (Waiting on #270.)

@ggerganov ggerganov merged commit 0b366e7 into ggml-org:master Mar 19, 2023
@Green-Sky Green-Sky deleted the f16_memory_cli branch March 22, 2023 12:15
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023: Bump uvicorn from 0.21.1 to 0.22.0
Development

Successfully merging this pull request may close these issues:

It appears context memory usage can be trivially halved by using fp16?
3 participants