Seriously, convert ggml to ggjt v1 #10
P.S. I got the tokenizer.model from Hugging Face and convert.py from the llama.cpp repo, put them in the parent folder of my alpaca7b ggml model (named model.bin), and ran the conversion from the shell. |
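For reference, a minimal sketch of that conversion step driven from Python. The exact convert.py flags vary between llama.cpp checkouts, so treat `--outfile` and the filenames here as assumptions; check `python convert.py --help` for your checkout.

```python
import subprocess

# Assumes convert.py (from llama.cpp) and tokenizer.model (from Hugging Face)
# sit next to the ggml model, as described above; flags are illustrative.
subprocess.run(
    ["python", "convert.py", "model.bin", "--outfile", "model-ggjt.bin"],
    check=True,
)
```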
Awesome. Looks like another weekend without any sleep haha. I think Vicuna-13B should be our goal, since it's the best-performing model at this point. Also might be worth taking a look at FastChat. If you could craft a routine to convert ggml models, it would increase accessibility and keep things bootstrapped and simple. Also feel free to commit your benchmark .txt file // I'm using the default demo files.
|
This is starLLM, automated to ask "What is my name?" against a document I had ingested into it.

# use_mmap=True
llm = LlamaCpp(use_mmap=True, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 8441.23 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 8440.31 ms / 6 tokens ( 1406.72 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 8500.26 ms
It sounds like your name is Alex.
> Question:
What is my name?
> Answer:
It sounds like your name is Alex.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 47.66585969924927 seconds

# use_mmap=False
llm = LlamaCpp(use_mmap=False, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 6395.35 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 6394.58 ms / 6 tokens ( 1065.76 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 6507.05 ms
Your name is Alexandra.
> Question:
What is my name?
> Answer:
Your name is Alexandra.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 42.63529133796692 seconds

So I'm not sure mmap does much here; I'm also not sure why or how langchain integrates that argument yet. |
I'm going to craft an auto-convert for when your model shows up as an older format like ggml (see the format-detection sketch below). I could probably even support .pth and such. People will be thankful; I can't believe the performance difference. I'll also work on / look into Vicuna if you can test it. I'll try to download the model, but my area's internet is slow and unstable. |
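A minimal sketch of how that auto-detection could work, based on the little-endian file magics llama.cpp writes at byte 0 (the function name is hypothetical; the magic constants are the well-known ones, but verify against your llama.cpp version):

```python
import struct

GGML_MAGIC = 0x67676D6C  # unversioned 'ggml' (not mmap-able)
GGMF_MAGIC = 0x67676D66  # 'ggmf' v1
GGJT_MAGIC = 0x67676A74  # 'ggjt' v1 (mmap-able)

def detect_format(path: str) -> str:
    """Read the 4-byte magic and report which container format this is."""
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return {
        GGML_MAGIC: "ggml",
        GGMF_MAGIC: "ggmf",
        GGJT_MAGIC: "ggjt",
    }.get(magic, "unknown")
```

An auto-convert routine could call this on load and only kick off the conversion when it sees the older ggml/ggmf magics.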
Why are your runtimes at ~1000 ms per token? Can you send me your hardware specs, please? Also, are you using :memory: for testing? Then we'd be able to craft a benchmark script (a rough sketch follows below). Yep, auto-convert seems reasonable. |
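A rough idea of what such a benchmark script could look like, assuming langchain's LlamaCpp wrapper and a local model at ./model.bin (both the path and the prompt are placeholders):

```python
import time

from langchain.llms import LlamaCpp

def bench(use_mmap: bool, prompt: str = "What is my name?") -> float:
    """Time the whole run (load + inference), as discussed above."""
    start = time.time()
    llm = LlamaCpp(model_path="./model.bin", use_mmap=use_mmap, verbose=False)
    llm(prompt)
    return time.time() - start

for flag in (True, False):
    print(f"use_mmap={flag}: {bench(flag):.2f} s total")
```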
No, I haven't messed around with that yet, just using the DB from the SSD.
IDK if that's what you need? |
I'm getting >60 ms-per-token hits. Running six threads. Haven't touched ggml conversion yet. Also did not force it into RAM, since I'm only at 16 GiB. @alxspiker, did you try f16_kv=True? Also, ggml-vic7b-uncensored-q4 has format=ggjt baked in. This might be a reason for this speed (see the snippet below). |
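For reference, those settings on langchain's LlamaCpp wrapper might look like this (a sketch; parameter names follow llama-cpp-python, so verify against your installed version):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./model.bin",  # placeholder path
    n_threads=6,       # six threads, as above
    f16_kv=True,       # half-precision key/value cache
    use_mlock=False,   # don't pin the model into RAM on a 16 GiB machine
)
```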
823.11 ms per token |
Your issue changed my life. My terminal session is close to real time. This is incredible. I'm going to upload the converted ggjt-v1 models to Hugging Face so it's way easier for people to interact with them. |
A change in models would be the first; then we should tweak the argument
Originally posted by @su77ungr in #8 (comment)
Okay, not kidding, I've been digging and trying so many things, and learning a lot about how binary files are handled and loaded into memory. Still working on it, but here's another find: I converted my alpaca7b model from ggml to ggjt v1 using convert.py from the llama.cpp repo. Instead of mlock-ing it every time, the model is now loaded with mmap, so it seems to load only what it needs, and it has produced slower results:
I was confused at first because LlamaCppEmbeddings() doesn't support the use_mmap argument, but LlamaCpp() does. I haven't messed with LlamaCpp() yet, but I changed use_mlock to True in LlamaCppEmbeddings() and got the quick results back (see the sketch below).
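Concretely, that change is just one flag on the embeddings wrapper (a sketch; the model path is a placeholder):

```python
from langchain.embeddings import LlamaCppEmbeddings

# use_mlock=True pins the whole model into RAM, which restores the fast,
# fully-loaded behavior; there is no use_mmap parameter to toggle here.
embeddings = LlamaCppEmbeddings(model_path="./model.bin", use_mlock=True)
```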
But then...
I realized that because the model didn't have to load completely into memory when using a converted model, and use_mlock was set to its default (False), the initial load time seemed instant. So I needed to measure the entire script time, including model loading, instead of just the ingestion time to get accurate speed results.

Results
So for a small ingestion, the converted model doesn't impact performance as much as I thought, but it DOES INSANELY REDUCE MEMORY USAGE. I might be able to load way bigger models now (lord have mercy on my RAM). That minor improvement might add up with bigger documents, though; I just don't have the time to test large files.
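One way to see that memory difference for yourself (a sketch; assumes psutil is installed and a converted ggjt model at ./model.bin): with mmap, resident memory stays low because pages are only faulted in as inference actually touches them.

```python
import os

import psutil  # assumption: psutil is available for RSS measurement
from langchain.llms import LlamaCpp

proc = psutil.Process(os.getpid())
print(f"RSS before load: {proc.memory_info().rss / 2**20:.0f} MiB")

# With a ggjt model and use_mmap=True, pages are mapped rather than copied,
# so resident memory grows only as the model is actually used.
llm = LlamaCpp(model_path="./model.bin", use_mmap=True)
print(f"RSS after load:  {proc.memory_info().rss / 2**20:.0f} MiB")
```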