
[FR] Full llama.cpp integration local / remote #44


Open
fszontagh opened this issue Mar 20, 2025 · 9 comments

@fszontagh
Owner

A dedicated tab on the GUI for interacting with language models:

  • add a chat box with multiple-session handling
  • implement model management
@fszontagh fszontagh added enhancement New feature or request good first issue Good for newcomers labels Mar 20, 2025
@fszontagh fszontagh self-assigned this Mar 20, 2025
@fszontagh fszontagh moved this to Planning in Stable Diffusion GUI Mar 20, 2025
@iwr-redmond

Nomic's GPT4All desktop application is written in C++ with a Qt frontend. It's also MIT-licensed, which means that anything useful for this FR can be easily adopted.

@fszontagh fszontagh moved this from Planning to In progress in Stable Diffusion GUI Mar 22, 2025
@fszontagh
Owner Author

[image]

@iwr-redmond

iwr-redmond commented Mar 29, 2025

Record time! Make sure to provide a facility for setting up chat templates and system prompts.

GPT4All recently migrated to minja from its own simplified template format, which I reckon was easier to understand.

@fszontagh
Owner Author

There is template handling in llama.cpp:

const char* tmpl = llama_model_chat_template(model, /* name */ nullptr);

prev_len = llama_chat_apply_template(tmpl, messages.data(), messages.size(), false, nullptr, 0);

Currently this can only load the template embedded in the model; I need to investigate further.

There is still a lot of fine-tuning to do.
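
For reference, a minimal sketch (untested) of how this could fall back to a user-editable template from the GUI when the model does not embed one; format_chat and ui_fallback_template are hypothetical names for this example, only the llama.cpp calls are real:

    #include <string>
    #include <vector>
    #include "llama.h"

    // Sketch: prefer the model's embedded chat template, otherwise use a
    // template supplied by the user through the GUI settings.
    static std::string format_chat(const llama_model * model,
                                   const std::vector<llama_chat_message> & messages,
                                   const std::string & ui_fallback_template) {
        const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);
        if (tmpl == nullptr && !ui_fallback_template.empty()) {
            tmpl = ui_fallback_template.c_str();   // older GGUFs may ship without a template
        }
        // Calling with a NULL buffer only returns the required length.
        const int32_t len = llama_chat_apply_template(tmpl, messages.data(), messages.size(),
                                                      /* add_ass */ true, nullptr, 0);
        if (len < 0) {
            return "";   // template could not be applied
        }
        std::vector<char> buf(len);
        llama_chat_apply_template(tmpl, messages.data(), messages.size(),
                                  true, buf.data(), (int32_t) buf.size());
        return std::string(buf.data(), buf.size());
    }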

@iwr-redmond

iwr-redmond commented Mar 29, 2025

IIRC older GGUF models don't have built-in templates. You can confirm this by loading the same file in the current GPT4All release.

EDIT: I compared the default prompt template for Zephyr-7B in GPT4All: the late-2023 GGUF shows no template, while the late-2024 GGUF shows a built-in Jinja2 template.

@fszontagh
Owner Author

fszontagh commented Mar 30, 2025

@fszontagh
Owner Author

fszontagh commented Apr 1, 2025

A small reminder

Steps for starting a chat session

  • The user selects a model from the list; the GUI sends a command to llama's extprocess (if it is ready and available) to load the selected model. [image]
  • Once the model is loaded into RAM/VRAM, llama's extprocess reads the metadata from the model file and fills in some settings (template, max. context size, etc., if they exist in the model). [image] [image]
  • When the user sends a prompt, llama's extprocess creates the context using the editable settings from the UI (batch size, context size, number of threads); see the sketch after this list.
  • The prompt template is only applied when the prompt is sent to the process. Settings related to the context or the model cannot be modified in an already started chat session (it would be possible by reloading the context or the model to apply the new config, but that is not implemented yet).
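
For reference, a rough sketch of the context-creation step with the llama.cpp API; UiChatSettings and its defaults are made up for this example, only the llama.cpp calls and fields are real:

    #include "llama.h"

    // Hypothetical container for the UI-editable settings mentioned above.
    struct UiChatSettings {
        int n_ctx     = 4096;   // context size
        int n_batch   = 512;    // batch size
        int n_threads = 8;      // number of threads (shared with the SD settings)
    };

    // Sketch: create the chat context from the UI settings once the user sends a prompt.
    static llama_context * create_chat_context(llama_model * model, const UiChatSettings & ui) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx           = ui.n_ctx;
        cparams.n_batch         = ui.n_batch;
        cparams.n_threads       = ui.n_threads;
        cparams.n_threads_batch = ui.n_threads;
        // llama_init_from_model() is the current entry point; older releases
        // name it llama_new_context_with_model().
        return llama_init_from_model(model, cparams);
    }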

FYI:

  • the prompt template can be changed after the model has sent its response; the "history" is always re-formatted with the current template
  • the number of threads comes from the settings already used for Stable Diffusion

TODO:

  • add more fine-tuning settings to the GUI (sampler settings); see the sketch after this list:
    • temp
    • min p
    • top k
    • dist
  • implement a KV cache to store already used tokens (save / restore the chat history)
  • use one webview per chat session
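
A possible starting point for the sampler settings, using the llama.cpp sampler-chain API; the values passed in are only illustrative, not the project's actual defaults:

    #include "llama.h"

    // Sketch: build a sampler chain from the planned GUI settings (top k, min p, temp, dist).
    static llama_sampler * create_sampler_chain(int top_k, float min_p, float temp) {
        llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(top_k));
        llama_sampler_chain_add(chain, llama_sampler_init_min_p(min_p, /* min_keep */ 1));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(temp));
        // "dist" is the final sampler that actually picks a token from the distribution.
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
        return chain;
    }

For the KV-cache save/restore item, llama_state_save_file() / llama_state_load_file() could be a starting point for persisting a chat session.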

@fszontagh
Owner Author

Here is an all-in-one template (Llama-3.2-3B-Instruct)

{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
    {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
    {{- "Do not use variables.\n\n" }}
    {%- for t in tools %}
        {{- t | tojson(indent=4) }}
        {{- "\n\n" }}
    {%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}

{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
    {#- Extract the first user message so we can plug it in here #}
    {%- if messages | length != 0 %}
        {%- set first_user_message = messages[0]['content']|trim %}
        {%- set messages = messages[1:] %}
    {%- else %}
        {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
    {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
    {{- "Given the following functions, please respond with a JSON for a function call " }}
    {{- "with its proper arguments that best answers the given prompt.\n\n" }}
    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
    {{- "Do not use variables.\n\n" }}
    {%- for t in tools %}
        {{- t | tojson(indent=4) }}
        {{- "\n\n" }}
    {%- endfor %}
    {{- first_user_message + "<|eot_id|>"}}
{%- endif %}

{%- for message in messages %}
    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
    {%- elif 'tool_calls' in message %}
        {%- if not message.tool_calls|length == 1 %}
            {{- raise_exception("This model only supports single tool-calls at once!") }}
        {%- endif %}
        {%- set tool_call = message.tool_calls[0].function %}
        {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
        {{- '{"name": "' + tool_call.name + '", ' }}
        {{- '"parameters": ' }}
        {{- tool_call.arguments | tojson }}
        {{- "}" }}
        {{- "<|eot_id|>" }}
    {%- elif message.role == "tool" or message.role == "ipython" %}
        {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
        {%- if message.content is mapping or message.content is iterable %}
            {{- message.content | tojson }}
        {%- else %}
            {{- message.content }}
        {%- endif %}
        {{- "<|eot_id|>" }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}

@iwr-redmond

You may wish to consider allowing the kv_cache_type to be set. At Q8_0, this can save a lot of VRAM without noticeably reducing quality.
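
For illustration, that would roughly mean exposing the type_k / type_v fields of llama_context_params (note that a quantized V cache generally requires flash attention to be enabled):

    // Sketch: quantize the KV cache to Q8_0 via the context parameters.
    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true;             // quantized V cache needs flash attention
    cparams.type_k     = GGML_TYPE_Q8_0;
    cparams.type_v     = GGML_TYPE_Q8_0;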

Labels
enhancement New feature or request good first issue Good for newcomers
Projects
Status: In progress

2 participants