add OSFT notebook for different batch sizes #5
Conversation
Walkthrough
Adds a new Jupyter notebook example that demonstrates how to scale OSFT training hyperparameters with dataset size (Small/Medium/Large). The notebook includes a common configuration, per-dataset example configs with computed steps, training invocation skeletons, strategy notes, and a summary table. No code or API changes elsewhere.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
🔇 Additional comments (2)
Actionable comments posted: 2
🧹 Nitpick comments (3)
examples/notebooks/osft_dataset_scaling_guide.ipynb (3)
233-246: Remove extraneous f-strings (no placeholders). These trigger Ruff F541 and add noise. Keep f-strings only where {} interpolation occurs.
-print(f"result = osft(")
+print("result = osft(")
-print(f" # Model and data")
+print(" # Model and data")
-print(f" ckpt_output_dir='/path/to/checkpoints/osft_1k_dataset',")
+print(" ckpt_output_dir='/path/to/checkpoints/osft_1k_dataset',")
-print(f" ")
+print(" ")
-print(f" # OSFT parameters")
+print(" # OSFT parameters")
-print(f" ")
+print(" ")
-print(f" # Batch size scaled for small dataset")
+print(" # Batch size scaled for small dataset")
-print(f" ")
+print(" ")
-print(f" # Other training parameters")
+print(" # Other training parameters")
-print(f" ")
+print(" ")
-print(f" # Distributed training")
+print(" # Distributed training")
-print(f")")
+print(")")
@@
-print(f"result = osft(")
+print("result = osft(")
-print(f" # Model and data")
+print(" # Model and data")
-print(f" ckpt_output_dir='/path/to/checkpoints/osft_10k_dataset',")
+print(" ckpt_output_dir='/path/to/checkpoints/osft_10k_dataset',")
-print(f" ")
+print(" ")
-print(f" # OSFT parameters")
+print(" # OSFT parameters")
-print(f" ")
+print(" ")
-print(f" # Batch size scaled for medium dataset")
+print(" # Batch size scaled for medium dataset")
-print(f" ")
+print(" ")
-print(f" # Other training parameters")
+print(" # Other training parameters")
-print(f" ")
+print(" ")
-print(f" # Distributed training")
+print(" # Distributed training")
-print(f")")
+print(")")
@@
-print(f"result = osft(")
+print("result = osft(")
-print(f" # Model and data")
+print(" # Model and data")
-print(f" ckpt_output_dir='/path/to/checkpoints/osft_100k_dataset',")
+print(" ckpt_output_dir='/path/to/checkpoints/osft_100k_dataset',")
-print(f" ")
+print(" ")
-print(f" # OSFT parameters")
+print(" # OSFT parameters")
-print(f" ")
+print(" ")
-print(f" # Batch size scaled for large dataset")
+print(" # Batch size scaled for large dataset")
-print(f" ")
+print(" ")
-print(f" # Other training parameters")
+print(" # Other training parameters")
-print(f" ")
+print(" ")
-print(f" # Distributed training")
+print(" # Distributed training")
-print(f")")
+print(")")
Also applies to: 252-259, 361-374, 380-389, 491-504, 510-517
739-742: Use a portable kernelspec display_name. “.venv” is machine-specific and can break opening the notebook on other machines. Prefer “Python 3” or similar.
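For instance, a small script along these lines could rewrite the metadata (a sketch only; the notebook path is taken from this PR, and the replacement kernelspec values are the common Jupyter defaults rather than anything defined in the notebook):

# Sketch: normalize the notebook's kernelspec so it opens on any machine.
import json

path = "examples/notebooks/osft_dataset_scaling_guide.ipynb"
with open(path, "r", encoding="utf-8") as f:
    nb = json.load(f)

# Replace the machine-specific ".venv" kernelspec with the standard defaults.
nb.setdefault("metadata", {})["kernelspec"] = {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3",
}

with open(path, "w", encoding="utf-8") as f:
    json.dump(nb, f, indent=1, ensure_ascii=False)
    f.write("\n")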
109-116: Clarify effective_batch_size definition (global vs per-device). Readers may confuse the per-GPU micro-batch with the global “effective_batch_size”. Add a short note and the formula: effective = per_device_batch_size × grad_accum × num_gpus. Optionally include helper fields for per_device and grad_accum.
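A minimal sketch of that relationship (the specific values below are illustrative, not taken from the notebook):

# Illustrative only: how the global effective batch size relates to per-device settings.
per_device_batch_size = 8        # micro-batch processed by each GPU per forward/backward pass
gradient_accumulation_steps = 2  # micro-batches accumulated before each optimizer step
num_gpus = 8                     # total GPUs, i.e. NPROC_PER_NODE * NNODES

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128 for these example values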
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/notebooks/osft_dataset_scaling_guide.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
examples/notebooks/osft_dataset_scaling_guide.ipynb
68-68: f-string without any placeholders
Remove extraneous f prefix
(F541)
69-69: f-string without any placeholders
Remove extraneous f prefix
(F541)
72-72: f-string without any placeholders
Remove extraneous f prefix
(F541)
73-73: f-string without any placeholders
Remove extraneous f prefix
(F541)
74-74: f-string without any placeholders
Remove extraneous f prefix
(F541)
76-76: f-string without any placeholders
Remove extraneous f prefix
(F541)
77-77: f-string without any placeholders
Remove extraneous f prefix
(F541)
79-79: f-string without any placeholders
Remove extraneous f prefix
(F541)
80-80: f-string without any placeholders
Remove extraneous f prefix
(F541)
86-86: f-string without any placeholders
Remove extraneous f prefix
(F541)
87-87: f-string without any placeholders
Remove extraneous f prefix
(F541)
93-93: f-string without any placeholders
Remove extraneous f prefix
(F541)
126-126: f-string without any placeholders
Remove extraneous f prefix
(F541)
127-127: f-string without any placeholders
Remove extraneous f prefix
(F541)
130-130: f-string without any placeholders
Remove extraneous f prefix
(F541)
131-131: f-string without any placeholders
Remove extraneous f prefix
(F541)
132-132: f-string without any placeholders
Remove extraneous f prefix
(F541)
134-134: f-string without any placeholders
Remove extraneous f prefix
(F541)
135-135: f-string without any placeholders
Remove extraneous f prefix
(F541)
137-137: f-string without any placeholders
Remove extraneous f prefix
(F541)
138-138: f-string without any placeholders
Remove extraneous f prefix
(F541)
144-144: f-string without any placeholders
Remove extraneous f prefix
(F541)
145-145: f-string without any placeholders
Remove extraneous f prefix
(F541)
151-151: f-string without any placeholders
Remove extraneous f prefix
(F541)
184-184: f-string without any placeholders
Remove extraneous f prefix
(F541)
185-185: f-string without any placeholders
Remove extraneous f prefix
(F541)
188-188: f-string without any placeholders
Remove extraneous f prefix
(F541)
189-189: f-string without any placeholders
Remove extraneous f prefix
(F541)
190-190: f-string without any placeholders
Remove extraneous f prefix
(F541)
192-192: f-string without any placeholders
Remove extraneous f prefix
(F541)
193-193: f-string without any placeholders
Remove extraneous f prefix
(F541)
195-195: f-string without any placeholders
Remove extraneous f prefix
(F541)
196-196: f-string without any placeholders
Remove extraneous f prefix
(F541)
202-202: f-string without any placeholders
Remove extraneous f prefix
(F541)
203-203: f-string without any placeholders
Remove extraneous f prefix
(F541)
209-209: f-string without any placeholders
Remove extraneous f prefix
(F541)
| "execution_count": 1, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "/mnt/7TB-a/osilkin/non-critical-changes/training_hub/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", | ||
| " from .autonotebook import tqdm as notebook_tqdm\n" | ||
| ] | ||
| } | ||
| ], |
💡 Verification agent
🧩 Analysis chain
Clear notebook outputs and warnings before committing.
Committed outputs include environment-specific warnings (e.g., tqdm/ipywidgets) and make diffs noisy.
Run this read-only check:
To clear locally (not run here): jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace examples/notebooks/osft_dataset_scaling_guide.ipynb
Add a pre-commit hook like nbstripout for future commits.
Also applies to: 147-160, 191-226, 275-289, 321-355, 405-418, 449-483, 533-551, 651-676
🏁 Script executed:
#!/bin/bash
# Show cells with non-empty outputs and non-null execution_count
jq -r '
.cells[]
| {ec: .execution_count, out: (.outputs|length)}
| select(.out>0 or (.ec != null))
' examples/notebooks/osft_dataset_scaling_guide.ipynb | head -n 20
Length of output: 189
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nb="examples/notebooks/osft_dataset_scaling_guide.ipynb"
if [ ! -f "$nb" ]; then
echo "MISSING: $nb"
exit 0
fi
python3 - <<'PY'
import json,sys
with open("examples/notebooks/osft_dataset_scaling_guide.ipynb","r",encoding="utf-8") as f:
nb=json.load(f)
cells=nb.get("cells",[])
results=[]
for idx,cell in enumerate(cells):
ec=cell.get("execution_count", None)
outs=cell.get("outputs", [])
if (outs and len(outs)>0) or ec is not None:
sample=None
if outs:
o=outs[0]
if isinstance(o.get("text"), list):
sample="".join(o.get("text"))[:200]
else:
sample=str(o.get("text") or o.get("ename") or "")[:200]
results.append({"cell_index": idx, "execution_count": ec, "outputs": len(outs), "sample_excerpt": sample})
print(json.dumps(results, indent=2))
PY
Length of output: 3105
Clear notebook outputs and remove environment-specific warnings before committing.
examples/notebooks/osft_dataset_scaling_guide.ipynb has executed outputs in cells 3, 5, 7, 8, 10, 11, 13, 14, 16, 20 — cell 3 contains a TqdmWarning about ipywidgets. Run locally: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace examples/notebooks/osft_dataset_scaling_guide.ipynb and add nbstripout (or similar) as a pre-commit hook.
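If a scripted alternative to the nbconvert command is preferred, a minimal nbformat-based sketch (same notebook path; assumes nbformat is installed) could clear outputs programmatically:

# Sketch: strip outputs and execution counts so only source cells are committed.
import nbformat

path = "examples/notebooks/osft_dataset_scaling_guide.ipynb"
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)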
🤖 Prompt for AI Agents
In examples/notebooks/osft_dataset_scaling_guide.ipynb around lines 55 to 66 the
notebook contains executed outputs (including a TqdmWarning about missing
ipywidgets) and environment-specific stderr output; clear all cell outputs in
the notebook (e.g., run locally: jupyter nbconvert
--ClearOutputPreprocessor.enabled=True --inplace
examples/notebooks/osft_dataset_scaling_guide.ipynb) and save the cleaned file,
then add/enable a notebook output cleaner such as nbstripout (or a pre-commit
hook that runs nbconvert or nbstripout) to prevent committing execution outputs
and environment-specific warnings in the future.
| "# Import training_hub for OSFT training\n", | ||
| "from training_hub import osft\n", | ||
| "\n", | ||
| "# Standard library imports\n", | ||
| "import os\n", | ||
| "from datetime import datetime\n" | ||
| ] |
💡 Verification agent
🧩 Analysis chain
Warmup steps can exceed total steps; compute dynamically and fix step math (ceil).
- Large dataset example sets warmup_steps=500 while total_steps=291; that’s invalid in most trainers. Small/medium also use unusually high warmups vs your 5–10% guidance.
- Also, steps/epoch currently use floor division and undercount when dataset_size % batch_size != 0.
Apply this diff to:
- add math,
- compute steps via ceil,
- derive warmup as 10% of total (bounded),
- print total GPUs (nodes×gpus),
- drop hardcoded warmups in configs.
@@
-# Import training_hub for OSFT training
-from training_hub import osft
-
-# Standard library imports
-import os
-from datetime import datetime
+# Import training_hub for OSFT training (optional)
+try:
+ from training_hub import osft # noqa: F401
+except Exception:
+ osft = None
+ print("Note: training_hub is not installed; showing printed configs only.")
+
+# Standard library imports
+import math
@@
-print(f" GPUs: {NPROC_PER_NODE}")
+print(f" GPUs: {NPROC_PER_NODE * NNODES}")
@@
small_dataset_config = {
"dataset_size": "1K samples",
"data_path": "/path/to/your/small_dataset_1k_samples.jsonl", # Replace with your path
"effective_batch_size": 16, # Small batch size for more gradient updates
- "warmup_steps": 50, # Quick warmup for small dataset
"use_case": "Domain-specific terminology or specialized knowledge"
}
@@
-steps_per_epoch_1k = 1000 // small_dataset_config["effective_batch_size"]
-total_steps_1k = steps_per_epoch_1k * NUM_EPOCHS
+steps_per_epoch_1k = math.ceil(1000 / small_dataset_config["effective_batch_size"])
+total_steps_1k = steps_per_epoch_1k * NUM_EPOCHS
+warmup_steps_1k = max(1, int(0.10 * total_steps_1k))
@@
-print(f" warmup_steps={small_dataset_config['warmup_steps']},")
+print(f" warmup_steps={warmup_steps_1k},")
@@
medium_dataset_config = {
"dataset_size": "10K samples",
"data_path": "/path/to/your/medium_dataset_10k_samples.jsonl", # Replace with your path
"effective_batch_size": 128, # Moderate batch size for efficiency
- "warmup_steps": 100, # Standard warmup
"use_case": "Domain adaptation or moderate-scale instruction tuning"
}
@@
-steps_per_epoch_10k = 10000 // medium_dataset_config["effective_batch_size"]
-total_steps_10k = steps_per_epoch_10k * NUM_EPOCHS
+steps_per_epoch_10k = math.ceil(10000 / medium_dataset_config["effective_batch_size"])
+total_steps_10k = steps_per_epoch_10k * NUM_EPOCHS
+warmup_steps_10k = max(1, int(0.10 * total_steps_10k))
@@
-print(f" warmup_steps={medium_dataset_config['warmup_steps']},")
+print(f" warmup_steps={warmup_steps_10k},")
@@
large_dataset_config = {
"dataset_size": "100K samples",
"data_path": "/path/to/your/large_dataset_100k_samples.jsonl", # Replace with your path
"effective_batch_size": 1024, # Large batch size for efficiency
- "warmup_steps": 500, # Extended warmup for large batch
"use_case": "Large-scale instruction tuning or comprehensive domain coverage"
}
@@
-steps_per_epoch_100k = 100000 // large_dataset_config["effective_batch_size"]
-total_steps_100k = steps_per_epoch_100k * NUM_EPOCHS
+steps_per_epoch_100k = math.ceil(100000 / large_dataset_config["effective_batch_size"])
+total_steps_100k = steps_per_epoch_100k * NUM_EPOCHS
+warmup_steps_100k = max(1, int(0.10 * total_steps_100k))
@@
-print(f" warmup_steps={large_dataset_config['warmup_steps']},")
+print(f" warmup_steps={warmup_steps_100k},")Also applies to: 124-131, 163-171, 172-175, 229-260, 292-300, 301-304, 358-389, 421-428, 430-433, 487-517, 554-566
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nb="examples/notebooks/osft_dataset_scaling_guide.ipynb"
echo "CHECK: repository root files (top-level)"
ls -la | sed -n '1,120p' || true
echo
if [ ! -f "$nb" ]; then
echo "FILE_MISSING: $nb not found"
rg -n --hidden --no-ignore -S "osft_dataset_scaling_guide|osft_dataset_scaling" || true
exit 0
fi
echo "FOUND: $nb"
echo
echo "Searching for relevant keywords in examples/ ..."
rg -n -S -C2 "training_hub|warmup_steps|steps_per_epoch|effective_batch_size|NPROC_PER_NODE|NNODES|math\.ceil|math|ceil|warmup_ratio|warmup_steps" examples || true
echo
echo "Extracting notebook code cells containing matches:"
python - <<'PY'
import json,sys,os
f="examples/notebooks/osft_dataset_scaling_guide.ipynb"
nb=json.load(open(f,'r',encoding='utf-8'))
cells=nb.get('cells',[])
for i,c in enumerate(cells):
if c.get('cell_type')!='code':
continue
src=''.join(c.get('source',[]))
if any(k in src for k in ['training_hub','warmup_steps','steps_per_epoch','effective_batch_size','NPROC_PER_NODE','NNODES','math.ceil','math','ceil','warmup_ratio']):
print("----- CELL",i,"-----")
for ln,line in enumerate(src.splitlines(),1):
print(f"{ln:03d}: {line}")
PY
Length of output: 82911
Warmup steps can exceed total steps; compute dynamically and fix step math (ceil).
- Confirmed: examples/notebooks/osft_dataset_scaling_guide.ipynb uses hardcoded warmup_steps (50 / 100 / 500), computes steps_per_epoch with floor division (//) and prints GPUs as NPROC_PER_NODE — this undercounts steps and allows warmup > total_steps.
- Action: apply the suggested changes — guard the training_hub import, import math, compute steps_per_epoch with math.ceil(dataset_size / effective_batch_size), total_steps = steps_per_epoch * NUM_EPOCHS, derive warmup_steps = max(1, int(0.10 * total_steps)) (optionally cap), print total GPUs as NPROC_PER_NODE * NNODES, and remove hardcoded warmup values.
- Locations to fix: examples/notebooks/osft_dataset_scaling_guide.ipynb (68-74, 118-131, 163-171, 172-175, 229-260, 292-300, 301-304, 358-389, 421-428, 430-433, 487-517, 554-566). Search the repo for other occurrences of literal warmup_steps and '//' step math and apply the same change.
🧰 Tools
🪛 Ruff (0.12.2)
68-68: f-string without any placeholders
Remove extraneous f prefix
(F541)
69-69: f-string without any placeholders
Remove extraneous f prefix
(F541)
72-72: f-string without any placeholders
Remove extraneous f prefix
(F541)
73-73: f-string without any placeholders
Remove extraneous f prefix
(F541)
74-74: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In examples/notebooks/osft_dataset_scaling_guide.ipynb (affecting lines ~68-74,
118-131, 163-171, 172-175, 229-260, 292-300, 301-304, 358-389, 421-428, 430-433,
487-517, 554-566), guard the training_hub import with a try/except or
conditional import, add "import math", replace any floor division used to
compute steps_per_epoch with steps_per_epoch = math.ceil(dataset_size /
effective_batch_size), compute total_steps = steps_per_epoch * NUM_EPOCHS,
derive warmup_steps = max(1, int(0.10 * total_steps)) (optionally cap if
desired) and remove hardcoded warmup values, change printed GPU count to
NPROC_PER_NODE * NNODES, and globally search the repo for literal warmup_steps
and '//' step math to apply the same ceil-based calculation and removal of
hardcoded warmup entries.
Force-pushed from f3b8a26 to 836dadb (Compare)
This PR adds a notebook that showcases how the batch size should scale as a function of the dataset size.
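Roughly, the scaling idea mirrors the notebook's Small/Medium/Large tiers (the helper below is illustrative only and not part of training_hub; the 16/128/1024 values come from the example configs in this PR):

# Hypothetical helper: pick an effective batch size tier from the dataset size.
def suggested_effective_batch_size(dataset_size: int) -> int:
    if dataset_size <= 1_000:    # Small: favor more gradient updates
        return 16
    if dataset_size <= 10_000:   # Medium: balance updates and throughput
        return 128
    return 1024                  # Large: favor throughput

for n in (1_000, 10_000, 100_000):
    print(n, suggested_effective_batch_size(n))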