
Custom conversation template improvement and document update #783

Merged · 1 commit · Apr 22, 2024
114 changes: 63 additions & 51 deletions docs/source/examples/DATASETS.md
@@ -62,55 +62,6 @@ supported types are listed as follows.

## Supported Dataset and Detailed Formats

### TextOnly

This is the most common dataset type, which contains only raw text in each
sample. This type of dataset can be used as the training set for text decoder
models, or as the input of decoder models / encoder-decoder models. Its format
is as follows (three instances shown as an example):

```json
{
"type": "text_only",
"instances": [
{ "text": "SAMPLE_TEXT_1" },
{ "text": "SAMPLE_TEXT_2" },
{ "text": "SAMPLE_TEXT_3" }
]
}
```

For example, `data/example_dataset/train/train_50.json` follows the above format.
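The format above can be sanity-checked with a few lines of standard-library Python. This is an illustrative sketch, not part of LMFlow; the `validate_text_only` helper is hypothetical:

```python
import json

def validate_text_only(dataset: dict) -> None:
    # A text_only dataset must declare its type and provide a list of
    # instances, each carrying a string "text" field.
    assert dataset["type"] == "text_only"
    for instance in dataset["instances"]:
        assert isinstance(instance["text"], str)

raw = '''
{
  "type": "text_only",
  "instances": [
    { "text": "SAMPLE_TEXT_1" },
    { "text": "SAMPLE_TEXT_2" }
  ]
}
'''
validate_text_only(json.loads(raw))
```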

### Text2Text

This is the dataset type most commonly used for inference. It contains a pair of
texts in each sample. This type of dataset can be used as the training set for
text encoder-decoder models, or as question-answer pairs for evaluating model
inference. Its format is as follows (three instances shown as an example):

```json
{
"type": "text2text",
"instances": [
{
"input": "SAMPLE_INPUT_1",
"output": "SAMPLE_OUTPUT_1"
},
{
"input": "SAMPLE_INPUT_2",
"output": "SAMPLE_OUTPUT_2"
},
{
"input": "SAMPLE_INPUT_3",
"output": "SAMPLE_OUTPUT_3"
}
]
}
```

For example, `data/example_dataset/test/test_13.json` follows the above format.
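A text2text file can likewise be flattened into (input, output) tuples, e.g. for feeding an evaluation loop. This is an illustrative sketch outside of LMFlow; the `to_pairs` helper is hypothetical:

```python
import json

def to_pairs(dataset: dict) -> list:
    # Flatten a text2text dataset into (input, output) tuples.
    assert dataset["type"] == "text2text"
    return [(ins["input"], ins["output"]) for ins in dataset["instances"]]

raw = '''
{
  "type": "text2text",
  "instances": [
    { "input": "SAMPLE_INPUT_1", "output": "SAMPLE_OUTPUT_1" },
    { "input": "SAMPLE_INPUT_2", "output": "SAMPLE_OUTPUT_2" }
  ]
}
'''
pairs = to_pairs(json.loads(raw))
```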

### Conversation

```{admonition} **Work in Progress**
@@ -168,12 +119,22 @@ Conversational data are commonly used in sft process. We currently support conve
]
}
```
Data types:
- `conversation_id`: `Optional[Any]`. An identifier for the conversation. `conversation_id` is only for convenience of tracking the conversation and will not be used in the pipeline.
- `system`: `Optional[string]`. A system prompt that is used to start the conversation.
- `tools`: `Optional[List[string]]`. A list of tools that are used in the conversation.
- `messages`: `List[Dict]`. A list of messages in the conversation. Each message contains the following fields:
- `role`: `string`. The role of the message. It can be either `user` or `assistant`.
- `content`: `string`. The content of the message.

> We are working on supporting customized message keys and role names. Please stay tuned.

Tips:
- `system`, `tools`, and `conversation_id` are OPTIONAL. `conversation_id` is only for convenience of tracking the conversation and will not be used in the pipeline.
- Please make sure the messages:
  1. Start with a user message.
  2. Are in the correct order. The pipeline will not check the order of the messages.
  3. Come in user-assistant pairs (i.e., the number of messages should be even). If the conversation ends with a user message, the pipeline will trim that last message.
  4. Have non-empty `content`s. If a `content` is empty, the pipeline will replace it with a space.
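The checks above can be sketched as a small pre-processing step. This is illustrative only (the actual handling lives in LMFlow's formatter); the `normalize_messages` helper is hypothetical:

```python
def normalize_messages(messages: list) -> list:
    # Enforce the dataset rules: conversations start with a user message,
    # roles alternate user/assistant, a trailing user message is trimmed,
    # and empty contents are replaced with a single space.
    assert messages[0]["role"] == "user", "conversation must start with a user message"
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        assert msg["role"] == expected, "roles must alternate user/assistant"
    if len(messages) % 2 == 1:  # odd length: ends with a user message, trim it
        messages = messages[:-1]
    return [
        {"role": m["role"], "content": m["content"] or " "}
        for m in messages
    ]

example = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": ""},
    {"role": "user", "content": "dangling"},
]
cleaned = normalize_messages(example)
# cleaned == [{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": " "}]
```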

#### Conversation Template

@@ -244,4 +205,55 @@ For dataset that system prompts, tool prompts and templates are already applied

#### Customize Conversation Template

Please refer to the [Customize Conversation Template](./customize_conversation_template.md) for more details.


31 changes: 29 additions & 2 deletions docs/source/examples/customize_conversation_template.md
@@ -11,7 +11,7 @@ We are rapidly working on this page.

We provide the flexibility to customize the conversation template. You can customize your own conversation template by following the steps below:

## Knowing the conversation template of your model

The conversation template varies according to the model you are using. For example:

@@ -23,4 +23,31 @@ The template for Llama-2 looks like:
Find more templates [here](./supported_conversation_template.md).


## Make your own template

`TemplateComponent`s are to a conversation template what bricks are to a LEGO house. You can build your own template by combining different components.

The following provides an example of building a conversation template for the ChatML format:

1. Decompose the official template
The official template looks like:
```
<|im_start|>system\n{{system_message}}<|im_end|>\n<|im_start|>user\n{{user_message_0}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_0}}<|im_end|>\n<|im_start|>user\n{{user_message_1}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_1}}<|im_end|>\n
```
It is easy to recognize the format for each message:
- System message: `<|im_start|>system\n{{system_message}}<|im_end|>\n`
- User message: `<|im_start|>user\n{{user_message}}<|im_end|>\n`
- Assistant message: `<|im_start|>assistant\n{{assistant_reply}}<|im_end|>\n`

2. Choose a proper `Formatter`
Recall the requirements for a conversation dataset:
> - `system`: `Optional[string]`.
> - `tools`: `Optional[List[string]]`.
> - `messages`: `List[Dict]`.
> - `role`: `string`.
> - `content`: `string`.
System, user, and assistant messages are all strings, so we can use a `StringFormatter` for each of them.

3. Build the template
A sketch of the assembled template (the import paths and argument names follow the LMFlow source at the time of writing and may differ in later versions; treat this as illustrative rather than authoritative):
```python
from lmflow.utils.conversation_template import ConversationTemplate
from lmflow.utils.conversation_formatter import StringFormatter, TemplateComponent

# One StringFormatter per role, each wrapping the corresponding
# ChatML message format identified in step 1.
chatml_template = ConversationTemplate(
    system_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>system\n{{content}}<|im_end|>\n')]
    ),
    user_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>user\n{{content}}<|im_end|>\n')]
    ),
    assistant_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>assistant\n{{content}}<|im_end|>\n')]
    ),
)
```
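The decomposition above can also be exercised without any LMFlow classes. A standalone sketch that renders a conversation in ChatML (single braces here are Python `str.format` placeholders, unlike the `{{...}}` notation in the templates above):

```python
CHATML = {
    "system": "<|im_start|>system\n{content}<|im_end|>\n",
    "user": "<|im_start|>user\n{content}<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n{content}<|im_end|>\n",
}

def render_chatml(system: str, messages: list) -> str:
    # Concatenate the per-role templates in conversation order,
    # starting with the system prompt.
    out = CHATML["system"].format(content=system)
    for msg in messages:
        out += CHATML[msg["role"]].format(content=msg["content"])
    return out

text = render_chatml("Be helpful.", [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
])
```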
4 changes: 4 additions & 0 deletions src/lmflow/utils/conversation_formatter.py
@@ -97,6 +97,10 @@ def format(self, **kwargs) -> list:
if component.type == 'string':
for key, value in kwargs.items():
templated = component.content.replace("{{" + key + "}}", value)
if len(templated) == 0:
logger.warning("Found empty string after formatting, adding a space instead. "
"If this is not intended, please check the dataset.")
templated = " "
formatted_template.append(TemplateComponent(type='string', content=templated))
else:
formatted_template.append(component)