server : add VSCode's Github Copilot Chat support #12896
Conversation
Sounds like someone just got Edison'd 🤡
There are a lot of tools like this that work but don't explicitly say llama.cpp; open-webui is another one (ramalama serve is just vanilla llama-server, but we try to make it easier to use, and easier to pull accelerator runtimes and models): https://github.com/open-webui/docs/pull/455/files
In RamaLama we are going to create a proxy that forks llama-server processes to mimic Ollama, to make everyday use of llama-server even easier. With most tools, if you select the generic OpenAI endpoint, llama-server works.
* server : add VSCode's Github Copilot Chat support
* cont : update handler name
@ggerganov, it seems the GET /api/tags API is missing. At least, my vscode-insiders with github.copilot version 1.308.1532 (updated
It's probably some new logic - should be easy to add support. Feel free to open a PR if you are interested.
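For reference, a quick way to check whether that endpoint is exposed (a sketch, assuming the server is running locally on port 11434 as in the usage example below):

```sh
# Ollama's GET /api/tags returns a JSON list of models; if llama-server does
# not implement it, this request fails with an error status instead
curl http://localhost:11434/api/tags
```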
This seems to be broken now. When I open the model selection dialog it shows no models with the following error in the logs:
I used the same command mentioned initially:
Overview
VSCode recently added support for using local models with GitHub Copilot Chat:
https://code.visualstudio.com/updates/v1_99#_bring-your-own-key-byok-preview
This PR adds compatibility of `llama-server` with this feature.
Usage
Start a `llama-server` on port 11434 with an instruct model of your choice. For example, using `Qwen 2.5 Coder Instruct 3B`:

```sh
# downloads ~3GB of data
llama-server \
    -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
    --port 11434 -fa -ngl 99 -c 0
```
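If no models show up later, a quick sanity check (a sketch, assuming the default localhost setup above) is to query the server's OpenAI-compatible model listing directly:

```sh
# Should list the loaded model if llama-server is reachable on port 11434
curl http://localhost:11434/v1/models
```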
In VSCode -> Chat -> Manage models -> select "Ollama" (not sure why it is called like this):
Select the available model from the list and click "OK":
Enjoy local AI assistance using vanilla `llama.cpp`.
Advanced context reuse for faster prompt reprocessing can be enabled by adding `--cache-reuse 256` to the `llama-server` command.
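For example, appending the flag to the command used above (one possible invocation):

```sh
llama-server \
    -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
    --port 11434 -fa -ngl 99 -c 0 \
    --cache-reuse 256
```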
commandSpeculative decoding is also supported. Simply start the
llama-server
like this for example:
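(The command below is a minimal sketch rather than the exact invocation from the PR: it assumes a larger main model paired with a small draft model, and the repo names and the draft-model flags `-hfd` / `-ngld` are illustrative assumptions.)

```sh
# Sketch: pair a larger main model with a small draft model for speculative
# decoding. Repo names and the draft-model flags are assumptions, not the
# exact command from the PR.
llama-server \
    -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF \
    -hfd ggml-org/Qwen2.5-Coder-0.5B-Instruct-Q8_0-GGUF \
    --port 11434 -fa -ngl 99 -ngld 99 -c 0
```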