
server : add VSCode's Github Copilot Chat support #12896


Merged

merged 2 commits into master from gg/vscode-integration on Apr 11, 2025

Conversation

@ggerganov (Member) commented Apr 11, 2025

Overview

VSCode recently added support for using local models with GitHub Copilot Chat:

https://code.visualstudio.com/updates/v1_99#_bring-your-own-key-byok-preview

This PR makes llama-server compatible with this feature.

Usage

  • Start a llama-server on port 11434 with an instruct model of your choice (a quick curl check to verify the server is reachable is sketched after this list). For example, using Qwen 2.5 Coder Instruct 3B:

    # downloads ~3GB of data
    
    llama-server \
        -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
        --port 11434 -fa -ngl 99 -c 0
  • In VSCode -> Chat -> Manage models -> select "Ollama" (not sure why it is called like this):

    [screenshot]

  • Select the available model from the list and click "OK":

    [screenshot]

  • Enjoy local AI assistance using vanilla llama.cpp:

    [screenshot]

  • Advanced context reuse for faster prompt reprocessing can be enabled by adding --cache-reuse 256 to the llama-server command

  • Speculative decoding is also supported. For example, start llama-server like this:

    llama-server \
        -m  ./models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
        -md ./models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf \
        --port 11434 -fa -ngl 99 -ngld 99 -c 0 --cache-reuse 256
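
To double-check the setup before configuring VSCode, you can poke the server directly with curl, as mentioned in the first step above. This is a minimal sketch assuming the stock llama-server HTTP API on port 11434; the /health and /v1/models routes are regular llama-server endpoints, not something added by this PR:

    # quick reachability check of the llama-server instance
    curl http://localhost:11434/health

    # list the model(s) exposed through the OpenAI-compatible API
    curl http://localhost:11434/v1/models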

@ggerganov ggerganov merged commit c94085d into master Apr 11, 2025
50 checks passed
@ggerganov ggerganov deleted the gg/vscode-integration branch April 11, 2025 20:37
@ExtReMLapin (Contributor)

select "Ollama" (not sure why it is called like this):

Sounds like someone just got Edison'd 🤡

@ericcurtin (Collaborator) commented Apr 16, 2025

There are a lot of tools like this that work but don't explicitly mention llama.cpp; open-webui is another one (ramalama serve is just vanilla llama-server, but we try to make it easier to use and easier to pull accelerator runtimes and models):

https://github.com/open-webui/docs/pull/455/files

In RamaLama we are going to create a proxy that forks llama-server processes and mimics Ollama, to make everyday use of plain llama-server even easier.

With most tools, if you select a generic OpenAI endpoint, llama-server works.
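
For illustration, here is what such a generic OpenAI-style request looks like against llama-server; a sketch assuming the server from the description above is still running on port 11434, with the model name in the body being just a placeholder (llama-server serves whatever model it was started with):

    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "placeholder",
              "messages": [
                {"role": "user", "content": "Write a hello world program in C"}
              ]
            }'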

colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* server : add VSCode's Github Copilot Chat support

* cont : update handler name
@kabakaev

@ggerganov, it seems the GET /api/tags API is missing.

At least, my vscode-insiders with github.copilot version 1.308.1532 (updated 2025-04-25, 18:46:22) requests /api/tags and gets an HTTP 404 response.
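
For reference, this can be reproduced from the command line; a sketch assuming the server is still on port 11434 (the expected payload would be Ollama's documented /api/tags format, i.e. a JSON object with a "models" array):

    # llama-server listening on port 11434
    curl -i http://localhost:11434/api/tags
    # currently answers 404; a real Ollama instance would reply with JSON
    # of the form {"models":[{"name":"...","model":"...", ...}]}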

@ggerganov (Member, Author)

It's probably some new logic - should be easy to add support. Feel free to open a PR if you are interested.

@theoparis

This seems to be broken now. When I open the model selection dialog, it shows no models, with the following error in the logs:

srv  log_server_r: request: GET /api/version 127.0.0.1 404

I used the same command mentioned initially: llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 11434 -fa -ngl 99 -c 0
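
A quick way to see which of the Ollama-style discovery routes mentioned in this thread the server currently answers is to probe them directly; a small sketch assuming the same instance on port 11434:

    for route in /api/version /api/tags; do
        code=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:11434$route")
        echo "$route -> $code"
    done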
