Commit 26116b9

Adding data formatting examples to docs
Signed-off-by: Mustafa Eyceoz <[email protected]>
1 parent 098dd79 commit 26116b9

2 files changed: 41 additions, 0 deletions

examples/docs/sft_usage.md

Lines changed: 36 additions & 0 deletions
@@ -2,6 +2,42 @@
This document shows how to use the SFT (Supervised Fine-Tuning) algorithm in training_hub.

## Data Format Requirements

Training Hub accepts training data in the **messages format** through the instructlab-training backend. Your training data must be a **JSON Lines (.jsonl)** file containing messages data.

### Required Format: JSONL with Messages

Each line in your JSONL file should contain one conversation sample:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there! How can I help you?"}]}
{"messages": [{"role": "user", "content": "What is SFT?"}, {"role": "assistant", "content": "SFT stands for Supervised Fine-Tuning..."}]}
```
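
A file in this shape can be produced with the standard library alone. The sketch below is illustrative, not part of training_hub: the helper names `write_messages_jsonl` and `read_messages_jsonl` are hypothetical, and the only assumption is that each sample is a plain dict with a `messages` list as shown above.

```python
import json
from pathlib import Path


def write_messages_jsonl(samples, path):
    """Write one JSON object per line and return the number of lines written.

    Hypothetical helper for illustration; not a training_hub function.
    """
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for sample in samples:
            # json.dumps escapes newlines inside strings, so every sample
            # stays on a single physical line, as JSON Lines requires.
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_messages_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back and comparing it with the input is a cheap way to confirm that multi-line message content survived the round trip.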

### Message Structure

- **`role`**: One of `"system"`, `"user"`, `"assistant"`, or `"pretraining"`
- **`content`**: The text content of the message
- **`reasoning_content`** (optional): Additional reasoning traces
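
A quick structural check before training can catch malformed samples early. The following is a minimal sketch under stated assumptions, not the backend's own validation: the function name `find_sample_errors` is hypothetical, and it assumes that `messages` is a non-empty list and that `content` and `reasoning_content` are strings, which the examples above suggest but do not state outright.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "pretraining"}


def find_sample_errors(sample):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper; the string-type checks are assumptions, see lead-in.
    """
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["sample must be an object with a non-empty 'messages' list"]

    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            errors.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            errors.append(f"message {i}: 'content' must be a string")
        # reasoning_content is optional, so only check it when present.
        if "reasoning_content" in msg and not isinstance(msg["reasoning_content"], str):
            errors.append(f"message {i}: 'reasoning_content' must be a string")
    return errors
```

Running this over every parsed line and reporting the line number with each error makes it easy to locate a bad sample in a large file.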

### Masking Control with `unmask` Field

Control which messages contribute to the training loss with the optional `unmask` metadata field, set at the top level of each sample alongside `messages`. In the examples below, `[...]` stands in for a message list and the `//` annotations are explanatory only; JSON does not allow comments, so leave them out of real data.

**Standard instruction tuning (default):**
```json
{"messages": [...]} // Only assistant responses used for loss
{"messages": [...], "unmask": false} // Same as above
```

**Pretraining mode:**
```json
{"messages": [...], "unmask": true} // All content except system messages used for loss
```

When `unmask` is `true`, the model learns from both user and assistant messages (pretraining-style); system messages stay masked. When `unmask` is `false` or absent, only assistant messages contribute to the training loss (classic instruction tuning).
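
The rule can be restated as a small function. This is only a sketch of the behavior described in this section, at message granularity, and is not the instructlab-training implementation; the names `contributes_to_loss` and `loss_mask` are hypothetical. This section does not describe how the `pretraining` role is treated, so the sketch applies the two stated rules literally to whatever role it sees.

```python
def contributes_to_loss(role, unmask=False):
    """Apply the rule from this section to a single message role."""
    if unmask:
        # Pretraining mode: everything except system messages is used for loss.
        return role != "system"
    # Default: only assistant messages are used for loss.
    return role == "assistant"


def loss_mask(sample):
    """Return one boolean per message: True if it contributes to the loss."""
    unmask = bool(sample.get("unmask", False))
    return [contributes_to_loss(m["role"], unmask) for m in sample["messages"]]


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
print(loss_mask({"messages": conversation}))                  # [False, False, True]
print(loss_mask({"messages": conversation, "unmask": True}))  # [False, True, True]
```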

## Simple Usage with Convenience Function

The easiest way to run SFT training is using the convenience function:

examples/notebooks/sft_comprehensive_tutorial.ipynb

Lines changed: 5 additions & 0 deletions
@@ -43,6 +43,11 @@
"from pathlib import Path"
]
},
{
"cell_type": "markdown",
"source": "## Data Format Requirements\n\nBefore configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n\n### Required Format: JSONL with Messages\n\nYour training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n```\n\n### Message Structure\n\nEach conversation contains a `messages` array with message objects having:\n- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n- **`content`**: The text content of the message\n- **`reasoning_content`** (optional): Additional reasoning traces\n\n### Masking Behavior with `unmask` Field\n\nYou can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n\n#### Standard Instruction Tuning (default)\n```json\n{\"messages\": [...]}\n```\nor\n```json\n{\"messages\": [...], \"unmask\": false}\n```\n- **Trains only on assistant responses** (standard instruction-following)\n- System messages are always masked (ignored for loss)\n- User messages are masked\n- Assistant messages are unmasked (used for loss calculation)\n\n#### Pretraining Mode\n```json\n{\"messages\": [...], \"unmask\": true}\n```\n- **Trains on all content except system messages**\n- System messages are always masked\n- User and assistant messages are both unmasked\n- Useful for pretraining-style data where the model should learn from all text\n\n### Example Data Formats\n\n**Standard SFT (instruction-following):**\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n    if n == 0 or n == 1:\\n        return 1\\n    return n * factorial(n - 1)\\n```\"}]}\n```\n\n**Pretraining-style (learn from all content):**\n```json\n{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n```\n\n### Data Path Configuration\n\nWhen configuring your training, point to your JSONL file:\n\n```python\ndata_path = \"/path/to/your/training_data.jsonl\" # Your messages-format JSONL file\n```\n\nThe training pipeline will automatically:\n1. Load and validate your JSONL data\n2. Apply chat templates based on your model\n3. Handle masking according to the `unmask` setting\n4. Process the data for efficient training",
"metadata": {}
},
{
"cell_type": "markdown",
"metadata": {},
