I am integrating node-llama-cpp into my project.

### Core Challenge

How to implement custom message handling while maximizing reuse of the existing session management capabilities? Key technical constraints:
### Current Approach

```typescript
class PPEChatWrapper extends ChatWrapper {
    public readonly wrapperName = "PPEChat";

    constructor(public options: {filename: string, stops: string[], fileInfo: GgufFileInfo}) {
        super();
    }

    async generateContextState({chatHistory}: {chatHistory: AIChatMessageParam[]}) {
        // Safety handling: wrap content with control characters
        const processedHistory = chatHistory.map((msg) => ({
            ...msg,
            content: `\x01${msg.content.replace(/[\x01]/g, "")}\x01`
        }));

        // Use HuggingFace template conversion
        const contextText = await formatPromptToLLamaText(processedHistory, this.options);

        return {
            contextText,
            stopGenerationTriggers: [LlamaText(this.options.stops)]
        };
    }
}
```
### Challenges Encountered
1. **Reading the Model's Built-in System Template from Metadata**: (resolved) `model.fileInfo.metadata.tokenizer.chat_template`
2. **generateContextState**: The current `generateContextState` signature doesn't support async operations. How are others handling template processing that requires async I/O?
3. **Session State Reuse Pattern**: What's the recommended way to leverage existing LlamaChatSession state management when using custom wrappers? Are there any workaround patterns that have worked for others?
4. **Dynamic Role Handling**: Has anyone implemented a system supporting custom role names (beyond system/user/assistant)? Our template needs to handle conversations like the following (a rough rendering sketch follows the example):
```yaml
<|im_start|>system
This is a conversation between Mike and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.<|im_end|>
<|im_start|>Llama
What can I do for you, sir?<|im_end|>
<|im_start|>Mike
Nice to meet you, Llama!<|im_end|>
<|im_start|>Llama
Hello! It's nice to meet you too, Mr. Mike. How may I assist you today?
<|im_start|>Mike
Why the sky is blue?<|im_end|>
<|im_start|>Llama
```
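For the dynamic-role question above, here is a minimal sketch of rendering such ChatML-style turns, assuming node-llama-cpp's `LlamaText` and `SpecialTokensText` exports; `Message` and `renderChatML` are illustrative names, not library APIs:

```typescript
import {LlamaText, SpecialTokensText} from "node-llama-cpp";

type Message = {role: string, content: string};

// The role name is emitted as plain text between the special tokens,
// so custom names like "Mike" or "Llama" work the same way as "user".
function renderChatML(messages: Message[]) {
    return LlamaText(
        ...messages.flatMap((msg) => [
            new SpecialTokensText("<|im_start|>"),
            msg.role + "\n" + msg.content,
            new SpecialTokensText("<|im_end|>\n")
        ])
    );
}
```

The resulting `LlamaText` could then be returned as the `contextText` of a custom chat wrapper's `generateContextState`.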
### Proposed Solutions
---
Most of your questions would be answered by this documentation: https://node-llama-cpp.withcat.ai/guide/external-chat-state

You can create an adaptation yourself from `ChatHistoryItem[]` to the OpenAI format and vice versa. Note that the `ChatHistoryItem` type contains more information than the OpenAI format, so doing that will mean you'll miss out on some features, but mostly things that you can only do with node-llama-cpp and not with an OpenAI API, so this may be fine for your use case. The main features pertain to content segmentation and the stability of the context state, to reuse it as much as possible and avoid redundant…

This is a non-standard feature, and isn't supported by most model chat templates.

Any text in a …

This is already done by …

I've made chat wrappers independent on purpose, so you can use them without depending on a model, or even calling …
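For reference, a minimal sketch of such an adaptation, assuming the `ChatHistoryItem` union from node-llama-cpp v3 (system and user items carrying `text`, model items carrying a `response` array); `OpenAIMessage` and `toChatHistory` are illustrative names:

```typescript
import type {ChatHistoryItem} from "node-llama-cpp";

type OpenAIMessage = {role: "system" | "user" | "assistant", content: string};

// Map OpenAI-style messages onto node-llama-cpp chat history items.
// Function calls, segments and other ChatHistoryItem-only details are omitted,
// which is exactly the information loss mentioned above.
function toChatHistory(messages: OpenAIMessage[]): ChatHistoryItem[] {
    return messages.map((msg): ChatHistoryItem => {
        if (msg.role === "system")
            return {type: "system", text: msg.content};
        else if (msg.role === "user")
            return {type: "user", text: msg.content};

        return {type: "model", response: [msg.content]};
    });
}
```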
---
No. This is general for "completion" models: the closer the conversation is to the system template format, the better the generated quality. The essence of an LLM is just completion, and an "instruct" model is a fine-tune for following the system template.
Async support is needed for converting the message object to a string. IMO: use C for more speed, JS for more flexibility.
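One possible workaround while `generateContextState` stays synchronous is to resolve the async work up front. A sketch, assuming the chat template can be compiled ahead of time with `@huggingface/jinja`; `compileTemplate` is a hypothetical helper, not a node-llama-cpp API:

```typescript
import {Template} from "@huggingface/jinja";

// Do the async work (reading metadata, fetching template files, ...) up front,
// then hand the chat wrapper a synchronous render function it can call inside
// generateContextState.
async function compileTemplate(chatTemplate: string) {
    const template = new Template(chatTemplate);

    return (messages: Array<{role: string, content: string}>) =>
        template.render({messages, add_generation_prompt: true});
}

// usage, reading the template as in challenge 1:
// const render = await compileTemplate(model.fileInfo.metadata.tokenizer.chat_template);
```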
I've already implemented it a long time ago. It's easy to add a new model with default parameter support without coding. And I've implemented general tool calls and thinking mode (including deep thinking) for any model at a higher level, as plugins.
---
@giladgd Now that I have done my basic research on prompting and thought clearly about the layering on top of llama.cpp, I am ready to integrate it. Your node-llama-cpp project is very well organized at the low level, especially the separation of the sampler and predictor from C++ into JS, which makes it more flexible to use from JS. But the lack of layering at a higher level makes it unable to adapt to a wider range of needs. Although I am willing to join in the work, it seems that you have your own set of thinking patterns, and I would have to fork and rewrite to make it simpler, which is the last thing I want to do because it wastes everyone's time. Anyway, thank you very much for your hard work.
---
Yes, but the premise is without any guidance, e.g. the Llama 3.1 8B Instruct result.

The only role that can be fixed is the system role, but gemma2 does not have it. I only have two days left to integrate it, so I have to use some ugly tricks that you will not be interested in. I just have a little idea:
---
IMO, the purpose of OO is to maximize the reuse of code and data. For large language models (LLMs), the relationship between classes should be:

```mermaid
graph TD
    LLMChat --> LLMInstructCompletion
    LLMInstructCompletion --> LLMCompletion
    LLMInfillCompletion --> LLMCompletion
```
Why do you think that only by introducing the …? Safety ultimately depends on the implementation of the upper layer. Using a simple string type and wrapping the safe content with control characters, the only additional operation required of the upper layer is to filter the control character out of the safe content. But because no new types are introduced, the entire API will be cleaner and clearer.

```typescript
// upper-level processing using strings is easy:
for (const msg of messages) {
    // keep msg.content safe
    msg.content = CTRL_CHAR + trimControlChar(msg.content) + CTRL_CHAR;
    // keep the dynamic role safe
    msg.role = CTRL_CHAR + trimControlChar(msg.role) + CTRL_CHAR;
    // msg.content = LLamaText(msg.content)
}

const data = await getTemplateData();
// how to do this with LlamaText here?
// the upper level should make sure system_template is safe too
const text_content = await formatMessagesWithTemplate(system_template, {...data, messages});

// this could be introduced into llamaModel.tokenize when addSpecial == null
const tokens = processTextWithSafeWrapper(
    text_content,
    (text, addSpecial) => llamaModel.tokenize(text, addSpecial, trimLeadingSpace)
);
```

Only when there are multiple different token types that need to be handled differently might you have to introduce a new type. For the low-level API, it is sufficient to ensure that …
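For concreteness, here is a hypothetical sketch of the helpers referenced in the snippet above; `CTRL_CHAR`, `trimControlChar` and `processTextWithSafeWrapper` are this proposal's illustrative names, not node-llama-cpp APIs:

```typescript
// Control character used to mark the boundaries of untrusted content.
const CTRL_CHAR = "\x01";

// Strip the control character from untrusted content before wrapping it.
function trimControlChar(text: string): string {
    return text.split(CTRL_CHAR).join("");
}

// Split the rendered prompt on CTRL_CHAR: even-indexed segments come from the
// template itself (special tokens allowed), odd-indexed segments are wrapped
// user content and must be tokenized with special tokens disabled.
function processTextWithSafeWrapper(
    text: string,
    tokenize: (text: string, allowSpecialTokens: boolean) => number[]
): number[] {
    const tokens: number[] = [];
    const segments = text.split(CTRL_CHAR);

    for (let i = 0; i < segments.length; i++) {
        if (segments[i].length === 0)
            continue;

        tokens.push(...tokenize(segments[i], i % 2 === 0));
    }

    return tokens;
}
```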
Isn't it safe and simple to use strings with control characters?
Yes, but I had to rewrite a lot of code with similar functionality, such as the LlamaSampler that is not public; and the result returned by completionWithMeta lacks the used seed, temperature, etc. IMO, it would be more appropriate to separate the lowest layer into an independent npm package.
Yes, it is not easy, but your current function tool implementation has embedded a lot of code in the bottom layer, and this code does not need to exist if the feature is not used. The function tool code should be extracted, rather than trick-embedded into the underlying layer.