
feat: add to tokenizer chat configuration. #76

Merged — 9 commits merged into main on May 31, 2024

Conversation

@b4rtaz (Owner) commented May 30, 2024

This PR extends the tokenizer file format. It is now possible to include the chat configuration in the tokenizer file.

...
 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 chatTemplate[0]: 
📄 chatTemplate[1]: <|start_header_id|>
📄 chatTemplate[2]: <|end_header_id|>


📄 chatTemplate[3]: <|eot_id|>
📄 chatTemplate[4]: <|start_header_id|>assistant<|end_header_id|>


📄 bosId: 128000
📄 eosId: 128001
📄 chatEosId: 128009
🕒 ropeCache: 131072 kB
⏩ Loaded 6175568 kB
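For illustration, the chat configuration could be stored as a fixed header followed by length-prefixed template strings. This is only a sketch under an assumed binary layout (field order taken from the converter output in this PR); the actual dllama file format may differ.

```python
import struct

def write_chat_config(path, header, chat_template):
    """Hypothetical writer: fixed int header, then length-prefixed
    UTF-8 chat template strings. Layout is an assumption, not the
    project's actual tokenizer format."""
    with open(path, "wb") as f:
        # header fields, in the order the converter prints them
        for key in ("version", "vocab_size", "max_token_length",
                    "bos_id", "eos_id", "chat_eos_id"):
            f.write(struct.pack("<i", header[key]))
        # number of chat template strings, then each string
        f.write(struct.pack("<i", len(chat_template)))
        for value in chat_template:
            data = value.encode("utf-8")
            f.write(struct.pack("<i", len(data)))
            f.write(data)
```

Length-prefixing lets the loader read the template strings (including empty ones, like `chatTemplate[0]` above) without any escaping rules.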

@DifferentialityDevelopment (Contributor)

Do you maybe know how I'd do the tokenizer conversion for models that don't have a tokenizer.model file?

@b4rtaz (Owner, Author) commented May 30, 2024

@DifferentialityDevelopment I think there is always a tokenizer somewhere, but the format is not always obvious.

I'm trying to convert the tokenizer of the hermes model that you linked. I created a new converter that uses tokenizer_config.json and tokenizer.json files.

How to convert the tokenizer:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":

⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 256, 'chat_template': 5}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n'}
✅ Created dllama_tokenizer_hermes.t
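The five values entered above presumably get concatenated around each message the same way the Jinja template does. A minimal sketch of that rendering — the function name and exact concatenation order are assumptions inferred from the key names, not dllama's actual code:

```python
def render_chat(messages, tpl, add_generation_prompt=True):
    """Build a prompt from (role, content) pairs using the chat
    template values the converter asks for. Concatenation order is
    inferred from the key names and the Jinja template above."""
    out = []
    for role, content in messages:
        out.append(tpl["chat_message_start"])              # "" for Hermes
        out.append(tpl["chat_role_start"] + role + tpl["chat_role_end"])
        out.append(content + tpl["chat_message_end"])
    if add_generation_prompt:
        out.append(tpl["chat_generation_prompt"])
    return "".join(out)

# the values entered in the converter session above
tpl = {
    "chat_message_start": "",
    "chat_role_start": "<|im_start|>",
    "chat_role_end": "\n",
    "chat_message_end": "<|im_end|>\n",
    "chat_generation_prompt": "<|im_start|>assistant\n",
}
```

For the system/user exchange shown below, this produces the same string the Jinja template would.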

So far I have:

b4rtaz@b4rtazs-MacBook-Pro examples % node chat-api-client.js
> system: You are an excellent math teacher.
> user: What is 1 + 2?
{ completion_tokens: 128, prompt_tokens: 54, total_tokens: 182 }
ĠD1Ġ+ĠD2ĠisĠtheĠsumĠofĠtwoĠdistances,ĠD1ĠandĠD2.ĠItĠisĠaĠconceptĠusedĠinĠgeometryĠandĠtrigonometryĠtoĠrelateĠtheĠlengthsĠofĠtwoĠsidesĠofĠaĠtriangle.ĠTheĠformulaĠforĠD1Ġ+ĠD2Ġis:ĠD1Ġ+ĠD2Ġ=Ġsqrt((x2Ġ-Ġx1)^2Ġ+Ġ(y2Ġ-Ġy1)^2),ĠwhereĠ(x1,Ġy1)ĠandĠ(x2,Ġy2)ĠareĠtheĠcoordinatesĠofĠtheĠtwoĠpoints.ĠThisĠformulaĠisĠusedĠtoĠfindĠtheĠdistanceĠbetweenĠtwoĠpointsĠinĠaĠtwo-dimensionalĠspace.ĠDoĠyouĠhaveĠanyĠspecificĠquestionsĠaboutĠthisĠconcept?Ġ<|im_end

If I manually replace each Ġ with a space:

 D1 + D2 is the sum of two distances, D1 and D2. It is a concept used in geometry and trigonometry to relate the lengths of two sides of a triangle. The formula for D1 + D2 is: D1 + D2 = sqrt((x2 - x1)^2 + (y2 - y1)^2), where (x1, y1) and (x2, y2) are the coordinates of the two points. This formula is used to find the distance between two points in a two-dimensional space. Do you have any specific questions about this concept? <|im_end

The tokenizer is not the easy part here. :)
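Those Ġ characters are an artifact of GPT-2-style byte-level BPE: the vocabulary stores raw bytes as printable codepoints, and U+0120 ("Ġ") stands for the space byte 0x20. Token strings therefore need to be mapped back through the inverse byte table before printing. A minimal sketch of that decoding:

```python
def bytes_to_unicode():
    """The standard GPT-2 byte-to-unicode table: printable bytes map to
    themselves, the rest (including 0x20) are shifted to codepoints >= 256,
    which is why a leading space shows up as 'Ġ' (U+0120)."""
    bs = (list(range(ord("!"), ord("~") + 1)) +
          list(range(ord("¡"), ord("¬") + 1)) +
          list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

def decode_bpe_text(text):
    """Map each visible BPE character back to its raw byte, then decode."""
    inv = {v: k for k, v in bytes_to_unicode().items()}
    return bytes(inv[ch] for ch in text).decode("utf-8", errors="replace")
```

So a proper decoder would turn `ĠD1Ġ+ĠD2` into ` D1 + D2` without any manual find-and-replace.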

@DifferentialityDevelopment (Contributor)


You're definitely closer than I got; mine flat-out crashed when trying to use the converted tokenizer.

I'll see what I can do to help.

@b4rtaz (Owner, Author) commented May 31, 2024

OK, now after I manually replaced every Ġ with a space in tokenizer.config and ran the converter:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":

⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
⏩ Enter value for chat template key "chat_extra_stop":
<|im_start|>
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 192, 'chat_template': 6}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n', 'chat_extra_stop': '<|im_start|>'}
✅ Created dllama_tokenizer_hermes.t
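The new `chat_extra_stop` value gives the runtime an additional stop string (`<|im_start|>`), useful for models like Hermes that sometimes open a new turn instead of emitting an EOS token. A hypothetical sketch of how a generation loop could check it — the actual dllama stop logic may differ:

```python
def should_stop(generated_text, token_id, eos_ids, extra_stops):
    """Stop generation on any end-of-sequence token id, or when the
    decoded output ends with one of the extra stop strings.
    (Illustrative only; names and signature are assumptions.)"""
    if token_id in eos_ids:
        return True
    return any(generated_text.endswith(s) for s in extra_stops)
```

Checking the decoded text (rather than a single token id) matters because a stop marker like `<|im_start|>` is not guaranteed to arrive as one token.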

It seems Hermes 2 works quite well.

[screenshot: Hermes 2 chat session output]

@DifferentialityDevelopment (Contributor)

Awesome stuff @b4rtaz!

@b4rtaz b4rtaz merged commit 6eccd30 into main May 31, 2024
3 checks passed