+ "source": "## Data Format Requirements\n\nBefore configuring your training, ensure your data is in the correct format. Training Hub uses the instructlab-training backend, which expects data in a specific **messages format**.\n\n### Required Format: JSONL with Messages\n\nYour training data must be a **JSON Lines (.jsonl)** file where each line contains a conversation sample:\n\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"I'm doing well, thank you! How can I help you today?\"}]}\n{\"messages\": [{\"role\": \"user\", \"content\": \"What is machine learning?\"}, {\"role\": \"assistant\", \"content\": \"Machine learning is a subset of artificial intelligence...\"}]}\n```\n\n### Message Structure\n\nEach conversation contains a `messages` array with message objects having:\n- **`role`**: One of `\"system\"`, `\"user\"`, `\"assistant\"`, or `\"pretraining\"`\n- **`content`**: The text content of the message\n- **`reasoning_content`** (optional): Additional reasoning traces\n\n### Masking Behavior with `unmask` Field\n\nYou can control which parts of the conversation are used for training loss by adding an `unmask` metadata field:\n\n#### Standard Instruction Tuning (default)\n```json\n{\"messages\": [...]}\n```\nor\n```json\n{\"messages\": [...], \"unmask\": false}\n```\n- **Trains only on assistant responses** (standard instruction-following)\n- System messages are always masked (ignored for loss)\n- User messages are masked\n- Assistant messages are unmasked (used for loss calculation)\n\n#### Pretraining Mode\n```json\n{\"messages\": [...], \"unmask\": true}\n```\n- **Trains on all content except system messages**\n- System messages are always masked\n- User and assistant messages are both unmasked\n- Useful for pretraining-style data where the model should learn from all text\n\n### Example Data Formats\n\n**Standard SFT (instruction-following):**\n```json\n{\"messages\": [{\"role\": \"system\", \"content\": \"You are a coding assistant.\"}, {\"role\": \"user\", \"content\": \"Write a Python function to calculate factorial\"}, {\"role\": \"assistant\", \"content\": \"Here's a Python function to calculate factorial:\\n\\n```python\\ndef factorial(n):\\n    if n == 0 or n == 1:\\n        return 1\\n    return n * factorial(n - 1)\\n```\"}]}\n```\n\n**Pretraining-style (learn from all content):**\n```json\n{\"messages\": [{\"role\": \"user\", \"content\": \"The capital of France is\"}, {\"role\": \"assistant\", \"content\": \"Paris.\"}], \"unmask\": true}\n```\n\n### Data Path Configuration\n\nWhen configuring your training, point to your JSONL file:\n\n```python\ndata_path = \"/path/to/your/training_data.jsonl\"  # Your messages-format JSONL file\n```\n\nThe training pipeline will automatically:\n1. Load and validate your JSONL data\n2. Apply chat templates based on your model\n3. Handle masking according to the `unmask` setting\n4. Process the data for efficient training",