
Commit 965c54f

Merge pull request #3 from RobotSail/add-notebooks
Adds notebooks for OSFT
2 parents 28e52df + 30af193

12 files changed: +3738 -11 lines

examples/notebooks/lab_multiphase_training_tutorial.ipynb

Lines changed: 51 additions & 3 deletions
@@ -45,7 +45,22 @@
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## Logging Configuration\n\nSet up logging to prevent notebook crashes from excessive output while still showing essential progress and error information.\n\n**Note:** While this notebook will walk you through a breakdown of all the steps and contains the end-to-end pipeline, we also provide an example script for any significantly long-running jobs for reproducibility, flexibility, and logging consistency in case of notebook disconnects. You can find the script at `scripts/lab_multiphase_training.py`.\n\n**Quick script usage:**\n```bash\npython scripts/lab_multiphase_training.py \\\n    --base-model-path /path/to/model \\\n    --phase07-data-path /path/to/knowledge.jsonl \\\n    --phase10-data-path /path/to/skills_replay.jsonl \\\n    --ckpt-output-base-dir /path/to/checkpoints\n```"
+"source": [
+"## Logging Configuration\n",
+"\n",
+"Set up logging to prevent notebook crashes from excessive output while still showing essential progress and error information.\n",
+"\n",
+"**Note:** While this notebook will walk you through a breakdown of all the steps and contains the end-to-end pipeline, we also provide an example script for any significantly long-running jobs for reproducibility, flexibility, and logging consistency in case of notebook disconnects. You can find the script at `scripts/lab_multiphase_training.py`.\n",
+"\n",
+"**Quick script usage:**\n",
+"```bash\n",
+"python scripts/lab_multiphase_training.py \\\n",
+"    --base-model-path /path/to/model \\\n",
+"    --phase07-data-path /path/to/knowledge.jsonl \\\n",
+"    --phase10-data-path /path/to/skills_replay.jsonl \\\n",
+"    --ckpt-output-base-dir /path/to/checkpoints\n",
+"```"
+]
 },
 {
 "cell_type": "code",
@@ -147,7 +162,40 @@
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": "# LAB Multi-Phase Training Configuration\nexperiment_prefix = \"lab_multiphase_training_demo\"\nckpt_output_base_dir = \"/path/to/your/checkpoints\"  # Update this path\n\n# Model and data paths - Update these to your actual paths\nbase_model_path = \"/path/to/your/base/model\"  # e.g., granite-3.1-8b-starter-v2.1\nphase07_data_path = \"/path/to/knowledge_data.jsonl\"  # Knowledge/facts data for Phase07\nphase10_data_path = \"/path/to/skills_plus_replay_data.jsonl\"  # Skills + replay data for Phase10\n# Note: Phase10 data should include:\n# - New skills/task data\n# - Replay of Phase07 knowledge data \n# - Replay of base model's original instruction tuning data\n\n# Training hyperparameters\nmax_tokens_per_gpu = 25_000  # Memory limit per GPU (reduce if hitting OOM errors)\nmax_seq_len = 20_000  # Maximum sequence length\n\n# Distributed training setup (adjust for your hardware)\nnproc_per_node = 8  # Number of GPUs per node\nnnodes = 1  # Number of nodes\nnode_rank = 0  # This node's rank\nrdzv_id = 420  # Rendezvous ID\nrdzv_endpoint = \"0.0.0.0:12345\"  # Master endpoint\n\nprint(f\"LAB Multi-Phase Experiment: {experiment_prefix}\")\nprint(f\"Output directory: {ckpt_output_base_dir}\")\nprint(f\"GPUs per node: {nproc_per_node}\")\nprint(f\"Max tokens per GPU: {max_tokens_per_gpu:,}\")\nprint(f\"\\nData composition:\")\nprint(f\"  Phase07: Knowledge data only\")\nprint(f\"  Phase10: Skills + Phase07 replay + Base model instruction replay\")\nprint(f\"\\n💡 Note: If you encounter OOM (Out of Memory) errors, reduce max_tokens_per_gpu\")"
+"source": [
+"# LAB Multi-Phase Training Configuration\n",
+"experiment_prefix = \"lab_multiphase_training_demo\"\n",
+"ckpt_output_base_dir = \"/path/to/your/checkpoints\"  # Update this path\n",
+"\n",
+"# Model and data paths - Update these to your actual paths\n",
+"base_model_path = \"/path/to/your/base/model\"  # e.g., granite-3.1-8b-starter-v2.1\n",
+"phase07_data_path = \"/path/to/knowledge_data.jsonl\"  # Knowledge/facts data for Phase07\n",
+"phase10_data_path = \"/path/to/skills_plus_replay_data.jsonl\"  # Skills + replay data for Phase10\n",
+"# Note: Phase10 data should include:\n",
+"# - New skills/task data\n",
+"# - Replay of Phase07 knowledge data \n",
+"# - Replay of base model's original instruction tuning data\n",
+"\n",
+"# Training hyperparameters\n",
+"max_tokens_per_gpu = 25_000  # Memory limit per GPU (reduce if hitting OOM errors)\n",
+"max_seq_len = 20_000  # Maximum sequence length\n",
+"\n",
+"# Distributed training setup (adjust for your hardware)\n",
+"nproc_per_node = 8  # Number of GPUs per node\n",
+"nnodes = 1  # Number of nodes\n",
+"node_rank = 0  # This node's rank\n",
+"rdzv_id = 47  # Rendezvous ID\n",
+"rdzv_endpoint = \"0.0.0.0:12345\"  # Master endpoint\n",
+"\n",
+"print(f\"LAB Multi-Phase Experiment: {experiment_prefix}\")\n",
+"print(f\"Output directory: {ckpt_output_base_dir}\")\n",
+"print(f\"GPUs per node: {nproc_per_node}\")\n",
+"print(f\"Max tokens per GPU: {max_tokens_per_gpu:,}\")\n",
+"print(f\"\\nData composition:\")\n",
+"print(f\"  Phase07: Knowledge data only\")\n",
+"print(f\"  Phase10: Skills + Phase07 replay + Base model instruction replay\")\n",
+"print(f\"\\n💡 Note: If you encounter OOM (Out of Memory) errors, reduce max_tokens_per_gpu\")"
+]
 },
 {
 "cell_type": "markdown",
@@ -511,4 +559,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}
