Merge pull request #824 from OptimalScale/yizhenjia-template-update
Support yi and yi1.5 template
research4pan authored May 13, 2024
2 parents abaec4b + 3dd7ab2 commit d5125fb
Showing 5 changed files with 94 additions and 11 deletions.
4 changes: 3 additions & 1 deletion docs/source/examples/DATASETS.md
@@ -148,8 +148,10 @@ Conversations should be formatted before feeding into the model. As of now, we'v
| `internlm2` | `<s><\|im_start\|>system`<br>`You are a chatbot developed by LMFlow team.<\|im_end\|>`<br>`<\|im_start\|>user`<br>`Who are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I am a chatbot developed by LMFlow team.<\|im_end\|>`<br>`<\|im_start\|>user`<br>`How old are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|im_end\|>`<br> | [Link](./supported_conversation_template.md#internlm2) |
| `llama3` | `<\|begin_of_text\|><\|start_header_id\|>system<\|end_header_id\|>`<br><br>`You are a chatbot developed by LMFlow team.<\|eot_id\|><\|start_header_id\|>user<\|end_header_id\|>`<br><br>`Who are you?<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>`<br><br>`I am a chatbot developed by LMFlow team.<\|eot_id\|><\|start_header_id\|>user<\|end_header_id\|>`<br><br>`How old are you?<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>`<br><br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|eot_id\|>` | [Link](./supported_conversation_template.md#llama-3) |
| `llama2` | `<s>[INST] <<SYS>>`<br>`You are a chatbot developed by LMFlow team.`<br>`<</SYS>>`<br><br>`Who are you? [/INST] I am a chatbot developed by LMFlow team.</s><s>[INST] How old are you? [/INST] I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.</s>` | [Link](./supported_conversation_template.md#llama-2) |
| `phi3` | `<s><\|system\|>`<br>`You are a chatbot developed by LMFlow team.<\|end\|>`<br>`<\|user\|>\nWho are you?<\|end\|>`<br>`<\|assistant\|>`<br>`I am a chatbot developed by LMFlow team.<\|end\|>`<br>`<\|user\|>`<br>`How old are you?<\|end\|>`<br>`<\|assistant\|>`<br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|end\|>`<br>`<\|endoftext\|>` | [Link](./supported_conversation_template.md#phi-3) |
| `phi3` | `<s><\|system\|>`<br>`You are a chatbot developed by LMFlow team.<\|end\|>`<br>`<\|user\|>`<br>`Who are you?<\|end\|>`<br>`<\|assistant\|>`<br>`I am a chatbot developed by LMFlow team.<\|end\|>`<br>`<\|user\|>`<br>`How old are you?<\|end\|>`<br>`<\|assistant\|>`<br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|end\|>`<br>`<\|endoftext\|>` | [Link](./supported_conversation_template.md#phi-3) |
| `qwen2` | `<\|im_start\|>system`<br>`You are a chatbot developed by LMFlow team.<\|im_end\|>`<br>`<\|im_start\|>user`<br>`Who are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I am a chatbot developed by LMFlow team.<\|im_end\|>`<br>`<\|im_start\|>user`<br>`How old are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|im_end\|>`<br> | [Link](./supported_conversation_template.md#qwen-2) |
| `yi` | Same as `chatml` | [Link](./supported_conversation_template.md#yi) |
| `yi1_5`| `You are a chatbot developed by LMFlow team.<\|im_start\|>user`<br>`Who are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I am a chatbot developed by LMFlow team.<\|im_end\|>`<br>`<\|im_start\|>user`<br>`How old are you?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<\|im_end\|>`<br> | [Link](./supported_conversation_template.md#yi-15) |
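The key difference between the `yi` and `yi1_5` rows above is how the system message is rendered: `yi` follows `chatml` and wraps it in role tokens, while `yi1_5` emits it bare. A minimal sketch of that difference (the helper name is hypothetical, not LMFlow's API):

```python
def render_system(template_name, system_message):
    """Render only the system portion of a conversation for two template styles.

    Illustrative helper only -- not part of LMFlow's actual interface.
    """
    if template_name == "chatml":
        # chatml (and `yi`, which is identical to it) wraps the system
        # message in <|im_start|>/<|im_end|> role tokens
        return f"<|im_start|>system\n{system_message}<|im_end|>\n"
    if template_name == "yi1_5":
        # yi1_5 emits the system message with no surrounding tokens
        return system_message
    raise ValueError(f"unknown template: {template_name}")
```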

Pass the template name to the `--conversation_template` argument to apply the corresponding conversation template:
```sh
# (remainder of this hunk truncated in the diff view)
```
42 changes: 32 additions & 10 deletions docs/source/examples/supported_conversation_template.md
@@ -12,14 +12,10 @@
- [Phi-3](#phi-3)
- [Qwen-2](#qwen-2)
- [Yi](#yi)
- [Yi-1.5](#yi-15)


## ChatGLM-3
```{admonition} **Work in Progress**
:class: info
This template is not preset in LMFlow currently. We are working on it and will update it soon.
```
**With a system message**
```
[gMASK]sop<|system|>\n {{system_message}}<|user|>\n {{user_message_0}}
@@ -322,11 +318,6 @@ The conversation template for Mixtral 8x7B is slightly different from the templa


## Yi
```{admonition} **Work in Progress**
:class: info
This template is not preset in LMFlow currently. We are working on it and will update it soon.
```
**With a system message**
```
<|im_start|>system\n{{system_message}}<|im_end|>\n<|im_start|>user\n{{user_message_0}}<|im_end|>\n
@@ -359,3 +350,34 @@ This template is not preset in LMFlow currently. We are working on it and will
```


## Yi-1.5
**With a system message**
```
{{system_message}}<|im_start|>user\n{{user_message_0}}<|im_end|>\n
```

**Without a system message**
```
<|im_start|>user\n{{user_message_0}}<|im_end|>\n
```

**A complete conversation**
```
{{system_message}}<|im_start|>user\n{{user_message_0}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_0}}<|im_end|>\n
```

**Multiple rounds**
```
{{system_message}}<|im_start|>user\n{{user_message_0}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_0}}<|im_end|>\n<|im_start|>user\n{{user_message_1}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_1}}<|im_end|>\n
```

**jinja template**
[[Reference](https://huggingface.co/01-ai/Yi-1.5-6B-Chat/blob/d68dab90947a3c869e28c9cb2806996af99a6080/tokenizer_config.json#L40)]
```
{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}
```
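The jinja logic above can be mirrored in plain Python, which makes the rendering order easier to follow (an illustrative sketch, not the tokenizer's actual code path):

```python
def render_yi15(messages):
    """Plain-Python mirror of the Yi-1.5 jinja chat template shown above."""
    out = ""
    if messages and messages[0]["role"] == "system":
        # the system message is emitted bare, with no wrapper tokens
        out += messages[0]["content"]
    for m in messages:
        if m["role"] == "user":
            # each user turn also opens the assistant block for the reply
            out += "<|im_start|>user\n" + m["content"] + "<|im_end|>\n<|im_start|>assistant\n"
        elif m["role"] == "assistant":
            out += m["content"] + "<|im_end|>\n"
    return out
```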

**Filled Example**
```
You are a chatbot developed by LMFlow team.<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a chatbot developed by LMFlow team.<|im_end|>\n<|im_start|>user\nHow old are you?<|im_end|>\n<|im_start|>assistant\nI don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<|im_end|>\n
```
3 changes: 3 additions & 0 deletions src/lmflow/utils/conversation_template/__init__.py
@@ -9,6 +9,7 @@
from .llama import LLAMA2_TEMPLATE, LLAMA3_TEMPLATE
from .phi import PHI3_TEMPLATE
from .qwen import QWEN2_TEMPLATE
from .yi import YI1_5_TEMPLATE


PRESET_TEMPLATES = {
@@ -22,4 +23,6 @@
'llama3': LLAMA3_TEMPLATE,
'phi3': PHI3_TEMPLATE,
'qwen2': QWEN2_TEMPLATE,
'yi': CHATML_TEMPLATE,
'yi1_5': YI1_5_TEMPLATE
}
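`PRESET_TEMPLATES` is a plain dict keyed by the `--conversation_template` names, with `yi` simply reusing the chatml template object. A stand-in sketch of that lookup (the string values are placeholders for the real template objects):

```python
# Stand-in registry mirroring the PRESET_TEMPLATES mapping above.
PRESETS = {
    "chatml": "CHATML_TEMPLATE",
    "yi": "CHATML_TEMPLATE",   # `yi` reuses the chatml template
    "yi1_5": "YI1_5_TEMPLATE",
}

def get_template(name):
    """Resolve a template name, failing with the list of valid choices."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(
            f"Unknown conversation template {name!r}; available: {sorted(PRESETS)}"
        ) from None
```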
24 changes: 24 additions & 0 deletions src/lmflow/utils/conversation_template/yi.py
@@ -0,0 +1,24 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 Statistics and Machine Learning Research Group. All rights reserved.
from .base import StringFormatter, TemplateComponent, ConversationTemplate


YI1_5_TEMPLATE = ConversationTemplate(
template_name='yi1_5',
user_formatter=StringFormatter(
template=[
TemplateComponent(type='string', content='<|im_start|>user\n{{content}}<|im_end|>\n')
]
),
assistant_formatter=StringFormatter(
template=[
TemplateComponent(type='string', content='<|im_start|>assistant\n{{content}}<|im_end|>\n')
]
),
system_formatter=StringFormatter(
template=[
TemplateComponent(type='string', content='{{content}}')
]
)
)
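Each `TemplateComponent` above carries a string with a `{{content}}` slot that the formatter fills per message. A simplified stand-in for that substitution (not LMFlow's actual `StringFormatter` implementation):

```python
def fill_slot(template, content):
    """StringFormatter-style slot filling: swap {{content}} for the message text.

    Simplified sketch for illustration only.
    """
    return template.replace("{{content}}", content)

# Render one user turn with the yi1_5 user component string
user_text = fill_slot("<|im_start|>user\n{{content}}<|im_end|>\n", "Who are you?")
```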
32 changes: 32 additions & 0 deletions tests/models/test_hf_decoder_model.py
@@ -111,6 +111,13 @@
)
]

CONVERSATION_SINGLETURN_YI1_5_IDS = [
(
[15692, 5885, 59666, 59705, 622, 59593, 5858, 46826, 3903, 144, 25102, 59666, 59705, 622, 59593, 701, 46826, 144],
[59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 25070, 99, 59666, 59705, 622, 59593, 701, 46826, 144]
)
]

CONVERSATION_MULTITURN = {
"system": "sysinfo",
"messages": [
@@ -222,6 +229,17 @@
)
]

CONVERSATION_MULTITURN_YI1_5_IDS = [
(
[15692, 5885, 59666, 59705, 622, 59593, 5858, 46826, 3903, 144, 25102, 59666, 59705, 622, 59593, 701, 46826, 144],
[59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 25070, 99, 59666, 59705, 622, 59593, 701, 46826, 144]
),
(
[59666, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 100, 59666, 59705, 622, 59593, 701, 46826, 144],
[59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 59597, 59610, 59583, 1226, 97, 5867, 99, 59666, 59705, 622, 59593, 701, 46826, 144]
)
]
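The groundtruth constants above are lists of per-turn `(input_ids, label_ids)` tuples: each tuple pairs an encoded user turn with its encoded assistant reply. A hypothetical helper (not the repository's `make_gt_from_conversation_ids_batch`) showing how such tuples could flatten into one training sequence, assuming the common convention of masking non-assistant tokens with `-100`:

```python
def flatten_turns(turn_ids):
    """Concatenate per-turn (user_ids, assistant_ids) tuples into flat sequences.

    Illustrative only; the -100 label masking is an assumed convention.
    """
    inputs, labels = [], []
    for user_ids, assistant_ids in turn_ids:
        inputs.extend(user_ids + assistant_ids)          # model sees the full dialogue
        labels.extend([-100] * len(user_ids) + assistant_ids)  # loss on replies only
    return inputs, labels
```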

test_encode_input = "Question: Which of the following is not true for myelinated nerve fibers: (A) Impulse through myelinated fibers is slower than non-myelinated fibers (B) Membrane currents are generated at nodes of Ranvier (C) Saltatory conduction of impulses is seen (D) Local anesthesia is effective only when the nerve is not covered by myelin sheath."
test_encode_output = [24361, 25, 9022, 286, 262, 1708, 318, 407, 2081, 329, 616, 417, 3898, 16384, 26742, 25, 357, 32, 8, 9855, 9615, 832, 616, 417, 3898, 26742, 318, 13611, 621, 1729, 12, 1820, 417, 3898, 26742, 357, 33, 8, 4942, 1671, 1531, 28629, 389, 7560, 379, 13760, 286, 23075, 49663, 357, 34, 8, 13754, 2870, 369, 11124, 286, 37505, 318, 1775, 357, 35, 8, 10714, 49592, 318, 4050, 691, 618, 262, 16384, 318, 407, 5017, 416, 616, 27176, 673, 776, 13]
test_decode_input = [24361, 25, 9022, 286, 262, 1708, 318, 407, 2081, 329, 616, 417, 3898, 16384, 26742, 25, 357, 32, 8, 9855, 9615, 832, 616, 417, 3898, 26742, 318, 13611, 621, 1729, 12, 1820, 417, 3898, 26742, 357, 33, 8, 4942, 1671, 1531, 28629, 389, 7560, 379, 13760, 286, 23075, 49663, 357, 34, 8, 13754, 2870, 369, 11124, 286, 37505, 318, 1775, 357, 35, 8, 10714, 49592, 318, 4050, 691, 618, 262, 16384, 318, 407, 5017, 416, 616, 27176, 673, 776, 13]
@@ -430,6 +448,13 @@ def test_tokenize_conversation(self):
trust_remote_code=True
)

self._test_tokenize(
model_name='01-ai/Yi-1.5-6B',
groundtruth_dataset={"type": "conversation", "instances": [CONVERSATION_SINGLETURN]},
groundtruth_tokenized_dataset=make_gt_from_conversation_ids_batch([CONVERSATION_SINGLETURN_YI1_5_IDS]),
conversation_template=PRESET_TEMPLATES['yi1_5'],
)


def test_tokenize_conversation_multiple(self):
conversation_dataset = {
@@ -525,6 +550,13 @@ def test_tokenize_conversation_multiple(self):
conversation_template=PRESET_TEMPLATES['chatglm3'],
trust_remote_code=True
)

self._test_tokenize(
model_name='01-ai/Yi-1.5-6B',
groundtruth_dataset={"type": "conversation", "instances": [CONVERSATION_MULTITURN]},
groundtruth_tokenized_dataset=make_gt_from_conversation_ids_batch([CONVERSATION_MULTITURN_YI1_5_IDS]),
conversation_template=PRESET_TEMPLATES['yi1_5'],
)


def test_encode(self):
