
feat: add to tokenizer chat configuration. #76

Merged — 9 commits merged into main on May 31, 2024

Conversation

@b4rtaz (Owner) commented May 30, 2024

This PR extends the tokenizer file format. It is now possible to include the chat configuration in the tokenizer file.

...
 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 chatTemplate[0]: 
📄 chatTemplate[1]: <|start_header_id|>
📄 chatTemplate[2]: <|end_header_id|>


📄 chatTemplate[3]: <|eot_id|>
📄 chatTemplate[4]: <|start_header_id|>assistant<|end_header_id|>


📄 bosId: 128000
📄 eosId: 128001
📄 chatEosId: 128009
🕒 ropeCache: 131072 kB
⏩ Loaded 6175568 kB
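For illustration, the chat configuration could be stored as a fixed header followed by length-prefixed template strings. This is only a sketch under an assumed binary layout (field order taken from the converter output in this PR); the actual dllama file format may differ.

```python
import struct

def write_chat_config(path, header, chat_template):
    """Hypothetical writer: fixed int header, then length-prefixed
    UTF-8 chat template strings. Layout is an assumption, not the
    project's actual tokenizer format."""
    with open(path, "wb") as f:
        # header fields, in the order the converter prints them
        for key in ("version", "vocab_size", "max_token_length",
                    "bos_id", "eos_id", "chat_eos_id"):
            f.write(struct.pack("<i", header[key]))
        # number of chat template strings, then each string
        f.write(struct.pack("<i", len(chat_template)))
        for value in chat_template:
            data = value.encode("utf-8")
            f.write(struct.pack("<i", len(data)))
            f.write(data)
```

Length-prefixing lets the loader read the template strings (including empty ones, like `chatTemplate[0]` above) without any escaping rules.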

@DifferentialityDevelopment (Contributor)

Do you maybe know how I'd do the tokenizer conversion for models that don't have a tokenizer.model file?

@b4rtaz (Owner, Author) commented May 30, 2024

@DifferentialityDevelopment I think there is always a tokenizer somewhere, but the format is not always obvious.

I'm trying to convert the tokenizer of the hermes model that you linked. I created a new converter that uses tokenizer_config.json and tokenizer.json files.

How to convert the tokenizer:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":

⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 256, 'chat_template': 5}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n'}
✅ Created dllama_tokenizer_hermes.t
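The five values entered above presumably get concatenated around each message the same way the Jinja template does. A minimal sketch of that rendering — the function name and exact concatenation order are assumptions inferred from the key names, not dllama's actual code:

```python
def render_chat(messages, tpl, add_generation_prompt=True):
    """Build a prompt from (role, content) pairs using the chat
    template values the converter asks for. Concatenation order is
    inferred from the key names and the Jinja template above."""
    out = []
    for role, content in messages:
        out.append(tpl["chat_message_start"])              # "" for Hermes
        out.append(tpl["chat_role_start"] + role + tpl["chat_role_end"])
        out.append(content + tpl["chat_message_end"])
    if add_generation_prompt:
        out.append(tpl["chat_generation_prompt"])
    return "".join(out)

# the values entered in the converter session above
tpl = {
    "chat_message_start": "",
    "chat_role_start": "<|im_start|>",
    "chat_role_end": "\n",
    "chat_message_end": "<|im_end|>\n",
    "chat_generation_prompt": "<|im_start|>assistant\n",
}
```

For the system/user exchange shown below, this produces the same string the Jinja template would.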

So far I have:

b4rtaz@b4rtazs-MacBook-Pro examples % node chat-api-client.js
> system: You are an excellent math teacher.
> user: What is 1 + 2?
{ completion_tokens: 128, prompt_tokens: 54, total_tokens: 182 }
ĠD1Ġ+ĠD2ĠisĠtheĠsumĠofĠtwoĠdistances,ĠD1ĠandĠD2.ĠItĠisĠaĠconceptĠusedĠinĠgeometryĠandĠtrigonometryĠtoĠrelateĠtheĠlengthsĠofĠtwoĠsidesĠofĠaĠtriangle.ĠTheĠformulaĠforĠD1Ġ+ĠD2Ġis:ĠD1Ġ+ĠD2Ġ=Ġsqrt((x2Ġ-Ġx1)^2Ġ+Ġ(y2Ġ-Ġy1)^2),ĠwhereĠ(x1,Ġy1)ĠandĠ(x2,Ġy2)ĠareĠtheĠcoordinatesĠofĠtheĠtwoĠpoints.ĠThisĠformulaĠisĠusedĠtoĠfindĠtheĠdistanceĠbetweenĠtwoĠpointsĠinĠaĠtwo-dimensionalĠspace.ĠDoĠyouĠhaveĠanyĠspecificĠquestionsĠaboutĠthisĠconcept?Ġ<|im_end

If I manually replace each Ġ with a space:

 D1 + D2 is the sum of two distances, D1 and D2. It is a concept used in geometry and trigonometry to relate the lengths of two sides of a triangle. The formula for D1 + D2 is: D1 + D2 = sqrt((x2 - x1)^2 + (y2 - y1)^2), where (x1, y1) and (x2, y2) are the coordinates of the two points. This formula is used to find the distance between two points in a two-dimensional space. Do you have any specific questions about this concept? <|im_end

The tokenizer is not the easy part here. :)
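Those Ġ characters are an artifact of GPT-2-style byte-level BPE: the vocabulary stores raw bytes as printable codepoints, and U+0120 ("Ġ") stands for the space byte 0x20. Token strings therefore need to be mapped back through the inverse byte table before printing. A minimal sketch of that decoding:

```python
def bytes_to_unicode():
    """The standard GPT-2 byte-to-unicode table: printable bytes map to
    themselves, the rest (including 0x20) are shifted to codepoints >= 256,
    which is why a leading space shows up as 'Ġ' (U+0120)."""
    bs = (list(range(ord("!"), ord("~") + 1)) +
          list(range(ord("¡"), ord("¬") + 1)) +
          list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

def decode_bpe_text(text):
    """Map each visible BPE character back to its raw byte, then decode."""
    inv = {v: k for k, v in bytes_to_unicode().items()}
    return bytes(inv[ch] for ch in text).decode("utf-8", errors="replace")
```

So a proper decoder would turn `ĠD1Ġ+ĠD2` into ` D1 + D2` without any manual find-and-replace.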

@DifferentialityDevelopment (Contributor)


You're definitely closer than I got; mine flat-out crashed when trying to use the converted tokenizer.

I'll see what I can do to help.

@b4rtaz (Owner, Author) commented May 31, 2024

OK, now after I manually replaced every Ġ with a space in tokenizer.config and ran the converter:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":

⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
⏩ Enter value for chat template key "chat_extra_stop":
<|im_start|>
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 192, 'chat_template': 6}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n', 'chat_extra_stop': '<|im_start|>'}
✅ Created dllama_tokenizer_hermes.t
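The new `chat_extra_stop` value gives the runtime an additional stop string (`<|im_start|>`), useful for models like Hermes that sometimes open a new turn instead of emitting an EOS token. A hypothetical sketch of how a generation loop could check it — the actual dllama stop logic may differ:

```python
def should_stop(generated_text, token_id, eos_ids, extra_stops):
    """Stop generation on any end-of-sequence token id, or when the
    decoded output ends with one of the extra stop strings.
    (Illustrative only; names and signature are assumptions.)"""
    if token_id in eos_ids:
        return True
    return any(generated_text.endswith(s) for s in extra_stops)
```

Checking the decoded text (rather than a single token id) matters because a stop marker like `<|im_start|>` is not guaranteed to arrive as one token.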

It seems Hermes 2 works quite well.

[screenshot: Hermes 2 chat session output]

@DifferentialityDevelopment (Contributor)

Awesome stuff @b4rtaz!

@b4rtaz b4rtaz merged commit 6eccd30 into main May 31, 2024
3 checks passed