Add Jinja template support #11016

ochafik · 2024-12-30T03:48:15Z

Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow up PR.

- Copies minja.hpp & chat-template.hpp from google/minja (created for this 😅) at this commit
- Adds --jinja flag to llama-server, llama-cli, llama-run
- Adds --chat-template-file flag to llama-server, llama-cli (related: Added chat template support to llama-run #11215)
- Loads tokenizer.chat_template (or tokenizer.chat_template.tool_use if defined, only when the request has tools).
- Dual testing in test-chat-template.cpp of legacy ad hoc templating & jinja route. Wherever the expected outputs diverge, the jinja expectations should be more correct (note that templates are run w/ trim_blocks = true, lstrip_blocks = true)
  - Sent Refactor test-chat-template.cpp #11224 separately
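
To illustrate the template selection rule in the list above, here is a minimal Python sketch. It is not the C++ code from this PR: `pick_chat_template` and the plain dict standing in for the GGUF metadata are hypothetical, and only the two key names come from the description.

```python
from typing import Mapping, Optional

# Metadata keys named in the PR description.
DEFAULT_KEY = "tokenizer.chat_template"
TOOL_USE_KEY = "tokenizer.chat_template.tool_use"


def pick_chat_template(metadata: Mapping[str, str], has_tools: bool) -> Optional[str]:
    """Return the template source to render with, or None if the model has none.

    Hypothetical helper: the tool_use variant is used only when it is defined
    AND the request carries tools; otherwise the default template is used.
    """
    if has_tools and TOOL_USE_KEY in metadata:
        return metadata[TOOL_USE_KEY]
    return metadata.get(DEFAULT_KEY)
```

A request without tools always gets the default template, even when the model also ships a tool_use variant.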

Example usage:

# Launch in background
./build/bin/llama-server \
  -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
  -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --jinja &

curl http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ipython",
          "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
          "parameters": {
            "type": "object",
            "properties": {
              "code": {
                "type": "string",
                "description": "The code to run in the ipython interpreter."
              }
            },
            "required": ["code"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
      }
    ]
  }'

Output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
        "role": "assistant"
      }
    }
  ],
  "created": 1736811609,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4494-a57bb94e",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 205,
    "total_tokens": 230
  },
  "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 155.151,
    "prompt_per_token_ms": 155.151,
    "prompt_per_second": 6.445333900522716,
    "predicted_n": 25,
    "predicted_ms": 419.714,
    "predicted_per_token_ms": 16.78856,
    "predicted_per_second": 59.56437002339688
  }
}
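
Since tool call parsing is left to the follow-up PR, the tool call above comes back as plain text in `content`. A client could extract it itself in the meantime. The sketch below is one way to do that in Python, assuming the `<tool_call>...</tool_call>` wrapper shown in this particular output (other models use other formats); `extract_tool_calls` is a hypothetical helper, not part of llama.cpp.

```python
import json
import re
from typing import Any, Dict, List

# Matches one JSON object wrapped in <tool_call> tags, as in the output above.
# The lazy match grows until a '}' is directly followed by the closing tag,
# so nested objects such as "arguments" are captured whole.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> List[Dict[str, Any]]:
    """Return every well-formed tool call found in an assistant message."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole response.
            continue
    return calls


# The message content from the response above, as a Python string.
content = (
    '<tool_call>\n{"name": "ipython", "arguments": '
    '{"code": "print(\'Hello world!\')"}}\n</tool_call>'
)
calls = extract_tool_calls(content)
```

Here `calls` holds a single dict with the `name` and `arguments` fields of the call.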

TODO:

Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
Add some instructions here
Add more server tests to exercise the template overrides.

ericcurtin · 2025-01-13T17:23:33Z

Feel free to add the option to llama-run for basic testing also @ochafik

examples/server/server.cpp

ngxson · 2025-01-21T09:19:36Z

One small thing to note: some jinja templates are not "linear", meaning a conversation turn is not self-contained and can modify the content before it.

For example, the new DeepSeek-R1 distilled models use {% set content = content.split('</think>')[-1] %} to remove the thinking process from the conversation history. I also once saw a template that adds an EOS token after each formatted chat, which also breaks this logic.

The consequence is that it will break common_chat_format_single (used in llama-cli) and apply_chat_template (used by llama-run), since they assume that each new message is self-contained (i.e. it only appends to the formatted conversation and never modifies it).

A solution is to also track the cache at the token level (not the conversation level), which I introduced in #11203. @ericcurtin feel free to port this to llama-run if you want. This approach is similar to the server implementation.
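
A toy Python sketch of the problem described above, under loud assumptions: `render` is a made-up stand-in for a template that strips thinking from assistant turns, the `<role>...<end>` markup is invented, and characters stand in for tokens. It is not the code from #11203, only an illustration of why a prefix assumption fails and how comparing against the cache still finds the reusable part.

```python
from typing import Dict, List

Message = Dict[str, str]


def render(messages: List[Message], add_generation_prompt: bool = False) -> str:
    """Toy non-linear template: assistant turns lose everything before '</think>'."""
    parts = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant":
            # History rewrite, mirroring content.split('</think>')[-1].
            content = content.split("</think>")[-1]
        parts.append(f"<{msg['role']}>{content}<end>")
    if add_generation_prompt:
        parts.append("<assistant>")
    return "".join(parts)


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix (characters stand in for tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Turn 1: what was actually evaluated, including the raw reply with thinking.
history = [{"role": "user", "content": "2+2?"}]
prompt = render(history, add_generation_prompt=True)
raw_reply = "<think>easy</think>4"
cached = prompt + raw_reply + "<end>"

# Turn 2: a fresh render of the full history no longer extends the cache.
history.append({"role": "assistant", "content": raw_reply})
history.append({"role": "user", "content": "And 3+3?"})
fresh = render(history, add_generation_prompt=True)

is_linear = fresh.startswith(cached)      # False for this template
reuse = common_prefix_len(cached, fresh)  # part of the cache that is still valid
to_evaluate = fresh[reuse:]               # everything after it must be re-evaluated
```

In this example only the first prompt is reusable; the rewritten assistant turn and the new user turn are evaluated again.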

ochafik · 2025-01-21T14:05:35Z

Thanks everyone for the insightful reviews! More from #9639 to come soon :-)

fairydreaming · 2025-01-21T18:23:49Z

Not sure if this is a special case or the template is broken, but when I load minimax-text-01 (my work-in-progress) with the following template:

{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{{ '<end_of_sentence>\\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% endif %}

with this PR llama.cpp crashes during model loading:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Expected block keyword at row 1, column 492:
{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\n' + message['content'][0]['text'] + '<end_of_sentence>\n'}}{% elif message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\n' + message['content'][0]['text'] + '<end_of_sentence>\n'}}{% elif message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% endif %}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^

ochafik · 2025-01-21T18:33:38Z

> Not sure if this is a special case or the template is broken, but when I load minimax-text-01 (my work-in-progress) with the following template:
>
> {% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\\n' + message['content'][0]['text'] + '<end_of_sentence>\\n'}}{% elif message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{{ '<end_of_sentence>\\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<beginning_of_sentence>ai name=assistant\\n' }}{% endif %}

Hey @fairydreaming, thanks for testing & reporting! Your template contains an exotic {% generation %}...{% endgeneration %} syntax that doesn't seem to be supported by, say, this online jinja parser either.

> terminate called after throwing an instance of 'std::runtime_error'
> what(): Expected block keyword at row 1, column 492:

I could certainly make the error more informative though, feel free to file something on https://github.com/google/minja to that end (and/or any feature request).

Looking forward to testing your model, good luck with it!

fairydreaming · 2025-01-21T19:06:08Z

@ochafik I did some research and it seems to be a custom keyword introduced in HF transformers: huggingface/transformers#30650

Fortunately among all the models I have currently on disk only MiniMax-Text-01 uses this.

ochafik · 2025-01-22T03:00:09Z

> @ochafik I did some research and it seems to be a custom keyword introduced in HF transformers: huggingface/transformers#30650
>
> Fortunately among all the models I have currently on disk only MiniMax-Text-01 uses this.

@fairydreaming thanks for researching that, will track support in google/minja#28
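
Until that is supported, one possible stopgap is to strip the tags from the template text before loading it. The Python sketch below assumes the `{% generation %}` / `{% endgeneration %}` tags only mark a span and emit nothing themselves; `strip_generation_tags` is a hypothetical helper, not something minja or llama.cpp provides.

```python
import re

# Matches {% generation %} and {% endgeneration %}, including the
# whitespace-control variants such as {%- generation -%}.
GENERATION_TAG_RE = re.compile(r"\{%-?\s*(?:end)?generation\s*-?%\}")


def strip_generation_tags(template: str) -> str:
    """Remove generation markers and leave the wrapped content in place."""
    return GENERATION_TAG_RE.sub("", template)


snippet = "{% for c in m %}{% generation %}{{ c['text'] }}{% endgeneration %}{% endfor %}"
cleaned = strip_generation_tags(snippet)
```

Note that dropping a `{%- ... -%}` variant also drops its whitespace trimming, so the rendered output may differ slightly for templates that rely on it.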

* Copy minja from google/minja@58f0ca6
* Add --jinja and --chat-template-file flags
* Add missing <optional> include
* Avoid print in get_hf_chat_template.py
* No designated initializers yet
* Try and work around msvc++ non-macro max resolution quirk
* Update test_chat_completion.py
* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template
* Refactor test-chat-template
* Test templates w/ minja
* Fix deprecation
* Add --jinja to llama-run
* Update common_chat_format_example to use minja template wrapper
* Test chat_template in e2e test
* Update utils.py
* Update test_chat_completion.py
* Update run.cpp
* Update arg.cpp
* Refactor common_chat_* functions to accept minja template + use_jinja option
* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE
* Revert LLAMA_CHATML_TEMPLATE refactor
* Normalize newlines in test-chat-templates for windows tests
* Forward decl minja::chat_template to avoid eager json dep
* Flush stdout in chat template before potential crash
* Fix copy elision warning
* Rm unused optional include
* Add missing optional include to server.cpp
* Disable jinja test that has a cryptic windows failure
* minja: fix vigogne (google/minja#22)
* Apply suggestions from code review

  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Finish suggested renamings
* Move chat_templates inside server_context + remove mutex
* Update --chat-template-file w/ recent change to --chat-template
* Refactor chat template validation
* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)
* Warn against missing eos / bos tokens when jinja template references them
* rename: common_chat_template[s]
* reinstate assert on chat_templates.template_default
* Update minja to google/minja@b8437df
* Update minja to google/minja#25
* Update minja from google/minja#27
* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggerganov · 2025-02-15T09:26:36Z

@ochafik I think we should take some time to wrap the jinja / json functionality better. I am taking a more detailed look now, and I am afraid that these large headers are spreading across the examples codebase more than they should.

Here is what I think needs to be changed:

- All common_chat_ interfaces from the common/common.h header should be moved to common/chat.h
- common/common.cpp should stop including json.hpp and the common_chat_ implementations there should be moved to common/chat.cpp. There is some curl-related functionality in common/common.cpp that still requires json.hpp, so this can stay for a while but ideally it should also stop using json.hpp at some point.
- common/chat.h should not include json.hpp. We have to be really careful with this header and not allow it to spread across the source files. The exception is the server example where the json functionality is already at its core and cannot be fixed anymore. But for example, main.cpp should not need to include json.hpp directly. With the change in this PR, it now includes common/chat-template.hpp which brings all the jinja/json stuff. Instead, this should be wrapped and accessed only through common/chat.h.
- The minja sources (i.e. common/minja.hpp and common/chat-template.hpp) should only be included in common/chat.cpp and nowhere else. The minja sources could be moved to a separate folder common/minja and it can be included for things like test-chat.cpp, but it should not be included by any of the other sources.

ochafik · 2025-02-15T11:49:01Z

> Here is what I think needs to be changed:
>
> - All common_chat_ interfaces from the common/common.h header should be moved to common/chat.h
> - common/common.cpp should stop including json.hpp and the common_chat_ implementations there should be moved to common/chat.cpp. There is some curl-related functionality in common/common.cpp that still requires json.hpp, so this can stay for a while but ideally it should also stop using json.hpp at some point.
> - common/chat.h should not include json.hpp. We have to be really careful with this header and not allow it to spread across the source files. The exception is the server example where the json functionality is already at its core and cannot be fixed anymore. But for example, main.cpp should not need to include json.hpp directly. With the change in this PR, it now includes common/chat-template.hpp which brings all the jinja/json stuff. Instead, this should be wrapped and accessed only through common/chat.h.

@ggerganov Thanks! I think this works great if we start passing tools & json_schema as JSON strings (slight inefficiency to dump in server then parse again in chat, but hopefully negligible cost - will try to measure it). Preparing a cleanup.

(cc/ @bandoti, heads up re/ #11556: big internal changes / cleanup looming ahead that should make it easier to wire into the cli)

> - The minja sources (i.e. common/minja.hpp and common/chat-template.hpp) should only be included in common/chat.cpp and nowhere else. The minja sources could be moved to a separate folder common/minja and it can be included for things like test-chat.cpp, but it should not be included by any of the other sources.

👍
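
For a rough feel of the "dump in server, parse again in chat" boundary mentioned above, here is a small Python sketch. Everything in it is illustrative: `server_side` and `chat_side` are hypothetical names, and timings from Python's `json` module say nothing about the actual cost with json.hpp in C++. It only shows that the data survives the string round trip and how one might measure it.

```python
import json
import timeit
from typing import Any, List

# The tool definition from the example request in the PR description.
tools: List[Any] = [
    {
        "type": "function",
        "function": {
            "name": "ipython",
            "description": "Runs code in an ipython interpreter and returns "
                           "the result of the execution after 60 seconds.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The code to run in the ipython interpreter.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]


def server_side(parsed_tools: List[Any]) -> str:
    """Hypothetical server boundary: hand the tools over as a JSON string."""
    return json.dumps(parsed_tools)


def chat_side(tools_json: str) -> List[Any]:
    """Hypothetical chat boundary: parse the string back into a JSON value."""
    return json.loads(tools_json)


round_trip = chat_side(server_side(tools))

runs = 10_000
seconds = timeit.timeit(lambda: chat_side(server_side(tools)), number=runs)
print(f"{seconds / runs * 1e6:.2f} microseconds per dump + parse round trip")
```

The point of the string boundary is that only the implementation file needs the JSON type, while callers just pass text.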


github-actions bot added script examples python server labels Dec 30, 2024

ochafik added 2 commits December 30, 2024 03:50

Copy minja from google/minja@58f0ca6

abd274a

Add --jinja and --chat-template-file flags

e5113e8

ochafik force-pushed the jinja branch from 4ec6151 to e5113e8 Compare December 30, 2024 03:50

ochafik added 4 commits December 30, 2024 04:10

Add missing <optional> include

80138d9

Avoid print in get_hf_chat_template.py

06b5159

No designated initializers yet

ce48584

Try and work around msvc++ non-macro max resolution quirk

389d79b

ochafik force-pushed the jinja branch from c3b07a8 to 389d79b Compare December 30, 2024 04:50

Update test_chat_completion.py

238b968

ochafik mentioned this pull request Dec 30, 2024

Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars #9639

Merged

41 tasks

slaren mentioned this pull request Dec 31, 2024

llama : add support for Cohere2ForCausalLM #10900

Merged

ngxson mentioned this pull request Jan 13, 2025

Added chat template support to llama-run #11215

Closed

ochafik added 4 commits January 13, 2025 19:56

Merge remote-tracking branch 'origin/master' into jinja

cb72cf1

Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

78861a3

Refactor test-chat-template

1aac99a

Test templates w/ minja

7c84ebc

github-actions bot added the testing label Jan 13, 2025

ochafik added 8 commits January 13, 2025 21:30

Fix deprecation

18f257b

Add --jinja to llama-run

8dd4f33

Merge remote-tracking branch 'origin/master' into jinja

c04c50e

Update common_chat_format_example to use minja template wrapper

a6afb27

Test chat_template in e2e test

b4083e4

Update utils.py

b7e2171

Update test_chat_completion.py

a57bb94

Update run.cpp

4daae0b

ggerganov approved these changes Jan 21, 2025

ngxson mentioned this pull request Jan 21, 2025

Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf #11325

Open

ochafik force-pushed the jinja branch from cec1ad7 to 9d8ebd6 Compare January 21, 2025 12:26

rm unused optional header

cbb9b81

ochafik merged commit 6171c9d into ggml-org:master Jan 21, 2025
47 checks passed

ochafik mentioned this pull request Jan 22, 2025

Support MiniMaxAI/MiniMax-Text-01 ({% generation %} blocks) google/minja#28

Closed

ochafik mentioned this pull request Jan 22, 2025

sync: minja #11352

Merged

getnamo mentioned this pull request Jan 23, 2025

Add Prompt templating structures getnamo/Llama-Unreal#3

Closed

3 tasks

engelmi mentioned this pull request Jan 25, 2025

Added --jinja to llama-run command containers/ramalama#625

Merged

This was referenced Jan 31, 2025

Eval bug: Release b4524 breaks serving of granite-code models #11500

Closed

Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) #11533

Merged

neurer mentioned this pull request Jan 31, 2025

Alpaca uses my CPU instead of my GPU (AMD) Jeffser/Alpaca#139

Closed

LorenDB mentioned this pull request Jan 31, 2025

Support RamaLama as default instead of Ollama open-webui/open-webui#9162

Closed

ochafik mentioned this pull request Jan 31, 2025

Eval bug: c4ai-command-r7b-12-2024 unable to use #11443

Closed

phil-scott-78 mentioned this pull request Jan 31, 2025

[Feature]: DeepSeek-R1-Distill-Qwen or similar distilled DeepSeek gguf support SciSharp/LLamaSharp#1059

Closed

ochafik mentioned this pull request Feb 3, 2025

Integration tests w/ downstream projects google/minja#44

Open

2 tasks

ochafik mentioned this pull request Feb 16, 2025

tool-call: refactor common chat / tool-call api (+ tests / fixes) #11900

Merged

3 tasks

fszontagh mentioned this pull request Apr 1, 2025

[FR] Full llama.cpp integration local / remote fszontagh/sd.cpp.gui.wx#44

Open
