examples/notebooks/lab_multiphase_training_tutorial.ipynb
+51 −3 (51 additions, 3 deletions)
@@ -45,7 +45,22 @@
 {
  "cell_type": "markdown",
  "metadata": {},
- "source": "## Logging Configuration\n\nSet up logging to prevent notebook crashes from excessive output while still showing essential progress and error information.\n\n**Note:** While this notebook will walk you through a breakdown of all the steps and contains the end-to-end pipeline, we also provide an example script for any significantly long-running jobs for reproducibility, flexibility, and logging consistency in case of notebook disconnects. You can find the script at `scripts/lab_multiphase_training.py`.\n\n**Quick script usage:**\n```bash\npython scripts/lab_multiphase_training.py \\\n --base-model-path /path/to/model \\\n --phase07-data-path /path/to/knowledge.jsonl \\\n --phase10-data-path /path/to/skills_replay.jsonl \\\n --ckpt-output-base-dir /path/to/checkpoints\n```"
+ "source": [
+  "## Logging Configuration\n",
+  "\n",
+  "Set up logging to prevent notebook crashes from excessive output while still showing essential progress and error information.\n",
+  "\n",
+  "**Note:** While this notebook will walk you through a breakdown of all the steps and contains the end-to-end pipeline, we also provide an example script for any significantly long-running jobs for reproducibility, flexibility, and logging consistency in case of notebook disconnects. You can find the script at `scripts/lab_multiphase_training.py`.\n",
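The markdown cell above describes configuring logging so the notebook shows progress and errors without flooding the output. A minimal sketch of that kind of setup, assuming standard-library `logging` and illustrative third-party logger names (not the notebook's exact code):

```python
import logging

# Keep output at INFO level so progress and errors are visible
# without the verbose DEBUG chatter that can crash a notebook UI.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Quiet noisy third-party loggers (names here are illustrative).
for noisy in ("urllib3", "filelock"):
    logging.getLogger(noisy).setLevel(logging.WARNING)

logger = logging.getLogger("lab_multiphase")
logger.info("Logging configured")
```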
"source": "# LAB Multi-Phase Training Configuration\nexperiment_prefix = \"lab_multiphase_training_demo\"\nckpt_output_base_dir = \"/path/to/your/checkpoints\" # Update this path\n\n# Model and data paths - Update these to your actual paths\nbase_model_path = \"/path/to/your/base/model\" # e.g., granite-3.1-8b-starter-v2.1\nphase07_data_path = \"/path/to/knowledge_data.jsonl\" # Knowledge/facts data for Phase07\nphase10_data_path = \"/path/to/skills_plus_replay_data.jsonl\" # Skills + replay data for Phase10\n# Note: Phase10 data should include:\n# - New skills/task data\n# - Replay of Phase07 knowledge data \n# - Replay of base model's original instruction tuning data\n\n# Training hyperparameters\nmax_tokens_per_gpu = 25_000 # Memory limit per GPU (reduce if hitting OOM errors)\nmax_seq_len = 20_000 # Maximum sequence length\n\n# Distributed training setup (adjust for your hardware)\nnproc_per_node = 8 # Number of GPUs per node\nnnodes = 1 # Number of nodes\nnode_rank = 0 # This node's rank\nrdzv_id = 420 # Rendezvous ID\nrdzv_endpoint = \"0.0.0.0:12345\" # Master endpoint\n\nprint(f\"LAB Multi-Phase Experiment: {experiment_prefix}\")\nprint(f\"Output directory: {ckpt_output_base_dir}\")\nprint(f\"GPUs per node: {nproc_per_node}\")\nprint(f\"Max tokens per GPU: {max_tokens_per_gpu:,}\")\nprint(f\"\\nData composition:\")\nprint(f\" Phase07: Knowledge data only\")\nprint(f\" Phase10: Skills + Phase07 replay + Base model instruction replay\")\nprint(f\"\\n💡 Note: If you encounter OOM (Out of Memory) errors, reduce max_tokens_per_gpu\")"
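The configuration cell above defines the distributed launch parameters (`nproc_per_node`, `nnodes`, `node_rank`, `rdzv_id`, `rdzv_endpoint`). A hypothetical sketch of how those values might be assembled into a `torchrun` invocation for the companion script; the flag names are standard `torchrun` options, but the exact launch command is an assumption, not taken from the notebook:

```python
# Values copied from the configuration cell (adjust for your hardware).
nproc_per_node = 8              # GPUs per node
nnodes = 1                      # number of nodes
node_rank = 0                   # this node's rank
rdzv_id = 420                   # rendezvous ID
rdzv_endpoint = "0.0.0.0:12345" # master endpoint

# Assemble the torchrun launch command; the script path mirrors the
# notebook's example, and the trailing script flags are omitted here.
cmd = [
    "torchrun",
    f"--nnodes={nnodes}",
    f"--nproc-per-node={nproc_per_node}",
    f"--node-rank={node_rank}",
    f"--rdzv-id={rdzv_id}",
    f"--rdzv-endpoint={rdzv_endpoint}",
    "scripts/lab_multiphase_training.py",
]
print(" ".join(cmd))
```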