Adding data formatting examples to docs

Maxusmusti · commit 26116b99b886 · 2025-08-15T16:15:39.000-04:00
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
diff --git a/examples/docs/sft_usage.md b/examples/docs/sft_usage.md
@@ -2,6 +2,42 @@
 
 This document shows how to use the SFT (Supervised Fine-Tuning) algorithm in training_hub.
 
+## Data Format Requirements
+
+Training Hub supports **messages format** data via the instructlab-training backend. Your training data must be a **JSON Lines (.jsonl)** file containing messages data.
+
+### Required Format: JSONL with Messages
+
+Each line in your JSONL file should contain a conversation sample:
+
+```json
+{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there! How can I help you?"}]}
+{"messages": [{"role": "user", "content": "What is SFT?"}, {"role": "assistant", "content": "SFT stands for Supervised Fine-Tuning..."}]}
+```
+
+### Message Structure
+
+- **`role`**: One of `"system"`, `"user"`, `"assistant"`, or `"pretraining"`
+- **`content`**: The text content of the message
+- **`reasoning_content`** (optional): Additional reasoning traces
+
+### Masking Control with `unmask` Field
+
+Control training behavior with the optional `unmask` metadata field:
+
+**Standard instruction tuning (default):**
+```json
+{"messages": [...]}  // Only assistant responses used for loss
+{"messages": [...], "unmask": false}  // Same as above
+```
+
+**Pretraining mode:**
+```json
+{"messages": [...], "unmask": true}  // All content except system messages used for loss
+```
+
+When `unmask=true`, the model learns from both user and assistant messages (pretraining-style). When `unmask=false` or absent, only assistant messages are used for training loss (classic instruction-tuning).
+
 ## Simple Usage with Convenience Function
 
 The easiest way to run SFT training is using the convenience function:
diff --git a/examples/notebooks/sft_comprehensive_tutorial.ipynb b/examples/notebooks/sft_comprehensive_tutorial.ipynb
@@ -43,6 +43,11 @@
     "from pathlib import Path"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "source": "## Data Format Requirements\n\nBefore configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n\n### Required Format: JSONL with Messages\n\nYour training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n```\n\n### Message Structure\n\nEach conversation contains a `messages` array with message objects having:\n- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n- **`content`**: The text content of the message\n- **`reasoning_content`** (optional): Additional reasoning traces\n\n### Masking Behavior with `unmask` Field\n\nYou can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n\n#### Standard Instruction Tuning (default)\n```json\n{\"messages\": [...]}\n```\nor\n```json\n{\"messages\": [...], \"unmask\": false}\n```\n- **Trains only on assistant responses** (standard instruction-following)\n- System messages are always masked (ignored for loss)\n- User messages are masked\n- Assistant messages are unmasked (used for loss calculation)\n\n#### Pretraining Mode\n```json\n{\"messages\": [...], \"unmask\": true}\n```\n- **Trains on all content except system messages**\n- System messages are always masked\n- User and assistant messages are both unmasked\n- Useful for pretraining-style data where the model should learn from all text\n\n### Example Data Formats\n\n**Standard SFT (instruction-following):**\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n    if n == 0 or n == 1:\\n        return 1\\n    return n * factorial(n - 1)\\n```\"}]}\n```\n\n**Pretraining-style (learn from all content):**\n```json\n{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n```\n\n### Data Path Configuration\n\nWhen configuring your training, point to your JSONL file:\n\n```python\ndata_path = \"/path/to/your/training_data.jsonl\"  # Your messages-format JSONL file\n```\n\nThe training pipeline will automatically:\n1. Load and validate your JSONL data\n2. Apply chat templates based on your model\n3. Handle masking according to the `unmask` setting\n4. Process the data for efficient training",
+   "metadata": {}
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
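
The format rules documented in this commit can be checked before training with a short script. The following is a minimal sketch, not code from training_hub or instructlab-training: the helper names (`validate_sample`, `loss_mask`, `check_jsonl`) are hypothetical, the checks cover only the fields named in the docs above, and the masking helper simply restates the documented `unmask` rule rather than reproducing the backend's token-level behavior.

```python
import json

# Roles listed in the "Message Structure" section of the docs.
ALLOWED_ROLES = {"system", "user", "assistant", "pretraining"}


def validate_sample(sample):
    """Return a list of problems found in one parsed JSONL record (empty if none)."""
    if not isinstance(sample, dict):
        return ["record is not a JSON object"]
    messages = sample.get("messages")
    if not isinstance(messages, list):
        return ["'messages' must be a list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string 'content'")
    if "unmask" in sample and not isinstance(sample["unmask"], bool):
        problems.append("'unmask' must be true or false")
    return problems


def loss_mask(sample):
    """One boolean per message: True if the docs say it contributes to the loss.

    unmask true           -> everything except system messages
    unmask false / absent -> assistant messages only
    """
    if sample.get("unmask", False):
        return [m["role"] != "system" for m in sample["messages"]]
    return [m["role"] == "assistant" for m in sample["messages"]]


def check_jsonl(text):
    """Validate JSONL text; return {line_number: [problems]} for bad lines only."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption of this sketch)
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_sample(sample)
        if problems:
            report[lineno] = problems
    return report
```

To check a file, read it as text and pass the contents to `check_jsonl`; an empty dictionary means every line passed these checks. The backend may enforce additional constraints that this sketch does not cover.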
