
Custom conversation template improvement and document update #783

Merged · 1 commit · Apr 22, 2024
114 changes: 63 additions & 51 deletions docs/source/examples/DATASETS.md
@@ -62,55 +62,6 @@ supported types are listed as follows.

## Supported Dataset and Detailed Formats

### TextOnly

This is the most common dataset type, which contains only raw text in each
sample. This type of dataset can be used as the training set for text decoder
models, or as the input of decoder models / encoder-decoder models. Its format
is as follows (three instances shown as an example):

```json
{
"type": "text_only",
"instances": [
{ "text": "SAMPLE_TEXT_1" },
{ "text": "SAMPLE_TEXT_2" },
{ "text": "SAMPLE_TEXT_3" }
]
}
```

For example, `data/example_dataset/train/train_50.json` follows the above format.
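The format above can be sanity-checked with a few lines of standard-library Python. This is an illustrative sketch, not part of LMFlow; the `validate_text_only` helper is hypothetical:

```python
import json

def validate_text_only(dataset: dict) -> None:
    # A text_only dataset must declare its type and provide a list of
    # instances, each carrying a string "text" field.
    assert dataset["type"] == "text_only"
    for instance in dataset["instances"]:
        assert isinstance(instance["text"], str)

raw = '''
{
  "type": "text_only",
  "instances": [
    { "text": "SAMPLE_TEXT_1" },
    { "text": "SAMPLE_TEXT_2" }
  ]
}
'''
validate_text_only(json.loads(raw))
```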

### Text2Text

This is the dataset type most commonly used for inference. It contains a pair of
texts in each sample. This type of dataset can be used as the training set for
text encoder-decoder models, or as question-answer pairs for evaluating model
inference. Its format is as follows (three instances shown as an example):

```json
{
"type": "text2text",
"instances": [
{
"input": "SAMPLE_INPUT_1",
"output": "SAMPLE_OUTPUT_1"
},
{
"input": "SAMPLE_INPUT_2",
"output": "SAMPLE_OUTPUT_2"
},
{
"input": "SAMPLE_INPUT_3",
"output": "SAMPLE_OUTPUT_3"
}
]
}
```

For example, `data/example_dataset/test/test_13.json` follows the above format.
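A text2text file can likewise be flattened into (input, output) tuples, e.g. for feeding an evaluation loop. This is an illustrative sketch outside of LMFlow; the `to_pairs` helper is hypothetical:

```python
import json

def to_pairs(dataset: dict) -> list:
    # Flatten a text2text dataset into (input, output) tuples.
    assert dataset["type"] == "text2text"
    return [(ins["input"], ins["output"]) for ins in dataset["instances"]]

raw = '''
{
  "type": "text2text",
  "instances": [
    { "input": "SAMPLE_INPUT_1", "output": "SAMPLE_OUTPUT_1" },
    { "input": "SAMPLE_INPUT_2", "output": "SAMPLE_OUTPUT_2" }
  ]
}
'''
pairs = to_pairs(json.loads(raw))
```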

### Conversation

```{admonition} **Work in Progress**
@@ -168,12 +119,22 @@ Conversational data are commonly used in sft process. We currently support conve
]
}
```
Data types:
- `conversation_id`: `Optional[Any]`. An identifier for the conversation. `conversation_id` is only for convenience of tracking the conversation and will not be used in the pipeline.
- `system`: `Optional[string]`. A system prompt that is used to start the conversation.
- `tools`: `Optional[List[string]]`. A list of tools that are used in the conversation.
- `messages`: `List[Dict]`. A list of messages in the conversation. Each message contains the following fields:
- `role`: `string`. The role of the message. It can be either `user` or `assistant`.
- `content`: `string`. The content of the message.

> We are working on supporting customized message keys and role names. Please stay tuned.

Tips:
- `system`, `tools`, and `conversation_id` are OPTIONAL. `conversation_id` is only for convenience of tracking the conversation and will not be used in the pipeline.
- Please make sure the messages:
  1. Start with a user message.
  2. Are in the correct order. The pipeline will not check the order of the messages.
  3. Come in user-assistant pairs (i.e., the number of messages should be even). If the conversation ends with a user message, the pipeline will trim that last message.
  4. Have non-empty `content`s. If a `content` is empty, the pipeline will replace it with a space.
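The checks above can be sketched as a small pre-processing step. This is illustrative only (the actual handling lives in LMFlow's formatter); the `normalize_messages` helper is hypothetical:

```python
def normalize_messages(messages: list) -> list:
    # Enforce the dataset rules: conversations start with a user message,
    # roles alternate user/assistant, a trailing user message is trimmed,
    # and empty contents are replaced with a single space.
    assert messages[0]["role"] == "user", "conversation must start with a user message"
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        assert msg["role"] == expected, "roles must alternate user/assistant"
    if len(messages) % 2 == 1:  # odd length: ends with a user message, trim it
        messages = messages[:-1]
    return [
        {"role": m["role"], "content": m["content"] or " "}
        for m in messages
    ]

example = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": ""},
    {"role": "user", "content": "dangling"},
]
cleaned = normalize_messages(example)
# cleaned == [{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": " "}]
```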

#### Conversation Template

@@ -244,4 +205,55 @@ For dataset that system prompts, tool prompts and templates are already applied

#### Customize Conversation Template

Please refer to the [Customize Conversation Template](./customize_conversation_template.md) for more details.


31 changes: 29 additions & 2 deletions docs/source/examples/customize_conversation_template.md
@@ -11,7 +11,7 @@ We are rapidly working on this page.

We provide the flexibility to customize the conversation template. You can customize your own conversation template by following the steps below:

## Knowing the conversation template of your model

The conversation template varies according to the model you are using. For example:

@@ -23,4 +23,31 @@ The template for Llama-2 looks like:
Find more templates [here](./supported_conversation_template.md).


## Make your own template

`TemplateComponent`s are to a conversation template what bricks are to a LEGO house. You can build your own template by combining different components.

The following provides an example of building a conversation template for the ChatML format:

1. Decompose the official template
The official template looks like:
```
<|im_start|>system\n{{system_message}}<|im_end|>\n<|im_start|>user\n{{user_message_0}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_0}}<|im_end|>\n<|im_start|>user\n{{user_message_1}}<|im_end|>\n<|im_start|>assistant\n{{assistant_reply_1}}<|im_end|>\n
```
It is easy to recognize the format for each message:
- System message: `<|im_start|>system\n{{system_message}}<|im_end|>\n`
- User message: `<|im_start|>user\n{{user_message}}<|im_end|>\n`
- Assistant message: `<|im_start|>assistant\n{{assistant_reply}}<|im_end|>\n`

2. Choose a proper `Formatter`
Recall the requirements for a conversation dataset:
> - `system`: `Optional[string]`.
> - `tools`: `Optional[List[string]]`.
> - `messages`: `List[Dict]`.
> - `role`: `string`.
> - `content`: `string`.
System, user, and assistant messages are all strings, so we can use a `StringFormatter` for each of them.

3. Build the template
A sketch of the assembled template (the import paths and argument names follow the LMFlow source at the time of writing and may differ in later versions; treat this as illustrative rather than authoritative):
```python
from lmflow.utils.conversation_template import ConversationTemplate
from lmflow.utils.conversation_formatter import StringFormatter, TemplateComponent

# One StringFormatter per role, each wrapping the corresponding
# ChatML message format identified in step 1.
chatml_template = ConversationTemplate(
    system_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>system\n{{content}}<|im_end|>\n')]
    ),
    user_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>user\n{{content}}<|im_end|>\n')]
    ),
    assistant_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='<|im_start|>assistant\n{{content}}<|im_end|>\n')]
    ),
)
```
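The decomposition above can also be exercised without any LMFlow classes. A standalone sketch that renders a conversation in ChatML (single braces here are Python `str.format` placeholders, unlike the `{{...}}` notation in the templates above):

```python
CHATML = {
    "system": "<|im_start|>system\n{content}<|im_end|>\n",
    "user": "<|im_start|>user\n{content}<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n{content}<|im_end|>\n",
}

def render_chatml(system: str, messages: list) -> str:
    # Concatenate the per-role templates in conversation order,
    # starting with the system prompt.
    out = CHATML["system"].format(content=system)
    for msg in messages:
        out += CHATML[msg["role"]].format(content=msg["content"])
    return out

text = render_chatml("Be helpful.", [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
])
```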
4 changes: 4 additions & 0 deletions src/lmflow/utils/conversation_formatter.py
@@ -97,6 +97,10 @@ def format(self, **kwargs) -> list:
if component.type == 'string':
for key, value in kwargs.items():
templated = component.content.replace("{{" + key + "}}", value)
if len(templated) == 0:
logger.warning("Found empty string after formatting, adding a space instead. "
"If this is not intended, please check the dataset.")
templated = " "
formatted_template.append(TemplateComponent(type='string', content=templated))
else:
formatted_template.append(component)