
Commit c69d30a: add docs for OSFT
1 parent c64abd5

3 files changed (+298 -3 lines)


examples/README.md

Lines changed: 24 additions & 1 deletion
@@ -41,8 +41,31 @@ result = sft(
)
```

### Orthogonal Subspace Fine-Tuning (OSFT)

The OSFT algorithm supports continual training of pre-trained or instruction-tuned models without requiring supplementary datasets to maintain the original model distribution. Based on [Nayak et al. (2025)](https://arxiv.org/abs/2504.07097), it enables efficient customization while preventing catastrophic forgetting.

**Documentation:**
- [OSFT Usage Guide](docs/osft_usage.md) - Comprehensive usage documentation with parameter reference and examples

**Quick Example:**
```python
from training_hub import osft

result = osft(
    model_path="/path/to/model",
    data_path="/path/to/data.jsonl",
    output_dir="/path/to/outputs",
    unfreeze_rank_ratio=0.3,
    batch_size=8,
    max_tokens_per_gpu=2048,
    max_seq_len=2048,
    learning_rate=2e-5
)
```

## Getting Started

1. **For detailed parameter documentation**: Check the relevant guide in `docs/`
2. **For hands-on learning**: Open the interactive notebooks in `notebooks/`
3. **For automation scripts**: Refer to examples in `scripts/`

examples/docs/osft_usage.md

Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
# OSFT Algorithm Usage Examples

This document shows how to use the OSFT (Orthogonal Subspace Fine-Tuning) algorithm in training_hub.

## Overview

The OSFT algorithm implements Orthogonal Subspace Fine-Tuning based on Nayak et al. (2025), arXiv:2504.07097. It enables continual training of pre-trained or instruction-tuned models without the need for a supplementary dataset to preserve the distribution of the original model and the data it was trained on.

**Key Benefits:**
- Enables continual learning without catastrophic forgetting
- No need for supplementary datasets to maintain the original model distribution
- Significantly reduces data requirements for customizing instruction-tuned models
- Memory requirements similar to standard SFT

## Data Format Requirements

Training Hub's OSFT algorithm supports both **processed** and **unprocessed** data formats via the mini-trainer backend.

### Option 1: Standard Messages Format (Recommended)

Your training data should be a **JSON Lines (.jsonl)** file containing messages data:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there! How can I help you?"}]}
{"messages": [{"role": "user", "content": "What is OSFT?"}, {"role": "assistant", "content": "OSFT stands for Orthogonal Subspace Fine-Tuning..."}]}
```

### Message Structure
- **`role`**: One of `"system"`, `"user"`, `"assistant"`, or `"pretraining"`
- **`content`**: The text content of the message
- **`reasoning_content`** (optional): Additional reasoning traces
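For illustration, a record that uses the optional `reasoning_content` field might look like the following. The structure mirrors the format above; placing the field on the assistant message is an assumption, and how the backend consumes reasoning traces is not specified in this guide:

```json
{"messages": [{"role": "user", "content": "Is 97 prime?"}, {"role": "assistant", "content": "Yes, 97 is prime.", "reasoning_content": "97 is not divisible by 2, 3, 5, or 7, and 10^2 > 97, so it is prime."}]}
```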
### Masking Control with `unmask_messages` Parameter

Control training behavior during data processing:

**Standard instruction tuning (default):**
```python
osft(..., unmask_messages=False)  # Only assistant responses used for loss
```

**Pretraining mode:**
```python
osft(..., unmask_messages=True)  # All content except system messages used for loss
```

### Option 2: Pre-processed Dataset

If you have pre-processed data with `input_ids` and `labels` fields:

```json
{"input_ids": [1, 2, 3, ...], "labels": [1, 2, 3, ...]}
{"input_ids": [4, 5, 6, ...], "labels": [4, 5, 6, ...]}
```

Use with:
```python
osft(..., use_processed_dataset=True)
```

## Simple Usage with Convenience Function

The easiest way to run OSFT training is using the convenience function:

```python
from training_hub import osft

# Basic OSFT training
result = osft(
    model_path="/path/to/your/model",
    data_path="/path/to/your/training/data.jsonl",
    output_dir="/path/to/save/outputs",
    unfreeze_rank_ratio=0.3,
    batch_size=8,
    max_tokens_per_gpu=2048,
    max_seq_len=2048,
    learning_rate=2e-5
)

# OSFT training with custom parameters
result = osft(
    model_path="/path/to/your/model",
    data_path="/path/to/your/training/data.jsonl",
    output_dir="/path/to/save/outputs",
    unfreeze_rank_ratio=0.2,
    batch_size=4,
    max_tokens_per_gpu=4096,
    max_seq_len=4096,
    learning_rate=1e-5,
    epochs=3,
    warmup_steps=100,
    use_liger=True,
    seed=42
)
```

## Using the Factory Pattern

For more control over the algorithm instance:

```python
from training_hub import create_algorithm

# Create an OSFT algorithm instance
osft_algo = create_algorithm('osft', 'mini-trainer')

# Run training
result = osft_algo.train(
    model_path="/path/to/your/model",
    data_path="/path/to/your/training/data.jsonl",
    output_dir="/path/to/save/outputs",
    unfreeze_rank_ratio=0.25,
    batch_size=6,
    max_tokens_per_gpu=3072,
    max_seq_len=2048,
    learning_rate=1.5e-5,
    epochs=2
)

# Check required parameters
required_params = osft_algo.get_required_params()
print("Required parameters:", list(required_params.keys()))
```

## Algorithm and Backend Discovery

Explore available algorithms and backends:

```python
from training_hub import AlgorithmRegistry

# List all available algorithms
algorithms = AlgorithmRegistry.list_algorithms()
print("Available algorithms:", algorithms)  # ['sft', 'osft']

# List backends for OSFT
osft_backends = AlgorithmRegistry.list_backends('osft')
print("OSFT backends:", osft_backends)  # ['mini-trainer']

# Get algorithm class directly
OSFTAlgorithm = AlgorithmRegistry.get_algorithm('osft')
```

## Parameter Reference

### Required Parameters

- `model_path` (str): Local path or HuggingFace model ID to be used for fine-tuning
- `data_path` (str): Path to the training data (processed or unprocessed)
- `output_dir` (str): Directory where outputs from training will be saved
- `unfreeze_rank_ratio` (float): Controls how much of each matrix is unfrozen during OSFT (0.0-1.0)
- `batch_size` (int): Batch size for training
- `max_tokens_per_gpu` (int): Maximum number of tokens placed on a single GPU
- `max_seq_len` (int): Maximum sequence length (in tokens) for training samples
- `learning_rate` (float): Learning rate used for model updates

### Optional Training Parameters

**OSFT-Specific Parameters:**
- `target_patterns` (list[str]): Patterns to match when selecting modules for OSFT
- `unfreeze_rank_ratio` (float): Valid values are between 0.0 and 1.0; values above 0.5 are seldom needed (see the sketch below)
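As a rough intuition for these two knobs, the minimal sketch below assumes that `unfreeze_rank_ratio` scales the number of singular-value directions left trainable in each targeted matrix and that `target_patterns` matches module names by substring. The module names and the selection logic shown are illustrative assumptions, not training_hub's or mini-trainer's implementation:

```python
# Illustrative only: intuition for unfreeze_rank_ratio and target_patterns.
# The actual module selection and decomposition logic lives in the backend.
from fnmatch import fnmatch

def matches_target(module_name: str, target_patterns: list[str]) -> bool:
    # Assumption: patterns act as simple wildcard/substring matches on module names.
    return any(fnmatch(module_name, f"*{p}*") for p in target_patterns)

def trainable_directions(rows: int, cols: int, unfreeze_rank_ratio: float) -> int:
    # Assumption: the ratio is applied to the full rank (min dimension) of the matrix.
    full_rank = min(rows, cols)
    return int(full_rank * unfreeze_rank_ratio)

# A hypothetical 4096x4096 attention projection with ratio 0.3 would keep
# roughly 1228 of its 4096 directions trainable.
print(matches_target("model.layers.0.self_attn.q_proj", ["q_proj", "v_proj"]))  # True
print(trainable_directions(4096, 4096, 0.3))  # 1228
```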
**Data Processing Parameters:**
- `use_processed_dataset` (bool): Whether to use the pre-processed dataset format
- `unmask_messages` (bool): Whether to unmask messages during data processing

**Core Training Parameters:**
- `epochs` (int): Number of epochs to train for
- `seed` (int): Random seed for training
- `use_liger` (bool): Whether to use Liger kernels for training

**Learning Rate Scheduler:**
- `lr_scheduler` (str): Name of the PyTorch learning rate scheduler to use
- `warmup_steps` (int): Number of warmup steps for the learning rate scheduler
- `lr_scheduler_kwargs` (dict[str, str]): Additional scheduler parameters

**Checkpointing:**
- `checkpoint_at_epoch` (bool): Whether to checkpoint at each epoch
- `save_final_checkpoint` (bool): Whether to save the final checkpoint

**Multi-Node Parameters:**
- `nproc_per_node` (int): Number of processes (GPUs) per node
- `nnodes` (int): Total number of nodes in the cluster
- `node_rank` (int): Rank of this node (0 to nnodes-1)
- `rdzv_id` (int): Unique job ID for rendezvous
- `rdzv_endpoint` (str): Master node endpoint (format: "host:port")

### Backend Selection

- `backend` (str, default="mini-trainer"): Backend implementation to use; a combined example follows below
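Putting several of the optional parameters together, a sketch of a fuller call might look like the following. The parameter names come from the reference above; the scheduler name and its kwargs are assumptions (any PyTorch LR scheduler identifier accepted by the backend should follow the same pattern) and the values are illustrative starting points, not verified defaults:

```python
from training_hub import osft

# Illustrative combination of required and optional parameters.
result = osft(
    model_path="/path/to/model",
    data_path="/path/to/data.jsonl",
    output_dir="/path/to/outputs",
    unfreeze_rank_ratio=0.3,
    batch_size=8,
    max_tokens_per_gpu=2048,
    max_seq_len=2048,
    learning_rate=2e-5,
    # Core training
    epochs=2,
    seed=42,
    use_liger=True,
    # Learning rate schedule (scheduler name and kwargs are assumptions)
    lr_scheduler="CosineAnnealingLR",
    warmup_steps=50,
    lr_scheduler_kwargs={"T_max": "1000"},  # string values per the dict[str, str] type
    # Checkpointing
    checkpoint_at_epoch=True,
    save_final_checkpoint=True,
    # Backend selection (default shown explicitly)
    backend="mini-trainer",
)
```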
## Error Handling

```python
from training_hub import osft, AlgorithmRegistry

try:
    result = osft(
        model_path="/valid/model/path",
        data_path="/valid/data/path",
        output_dir="/valid/output/path",
        unfreeze_rank_ratio=0.3,
        batch_size=8,
        max_tokens_per_gpu=2048,
        max_seq_len=2048,
        learning_rate=2e-5
    )
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Training error: {e}")

# Check if the algorithm exists before using it
if 'osft' in AlgorithmRegistry.list_algorithms():
    print("OSFT algorithm is available")

# Check if the backend exists
if 'mini-trainer' in AlgorithmRegistry.list_backends('osft'):
    print("Mini-trainer backend is available")
```

## Multi-Node Training

The OSFT algorithm supports multi-node distributed training through torchrun parameters:

```python
from training_hub import osft

# Single-node, multi-GPU training (2 GPUs)
result = osft(
    model_path="/path/to/model",
    data_path="/path/to/data.jsonl",
    output_dir="/path/to/outputs",
    unfreeze_rank_ratio=0.3,
    batch_size=4,
    max_tokens_per_gpu=2048,
    max_seq_len=2048,
    learning_rate=2e-5,
    nproc_per_node=2,       # Number of GPUs per node
    nnodes=1,               # Single node
    node_rank=0,            # This node's rank
    rdzv_id=12345,          # Rendezvous ID
    rdzv_endpoint=""        # Empty for single node
)

# Multi-node training (2 nodes, 4 GPUs each)
# Run this on the first node (rank 0):
result = osft(
    model_path="/path/to/model",
    data_path="/path/to/data.jsonl",
    output_dir="/path/to/outputs",
    unfreeze_rank_ratio=0.25,
    batch_size=2,
    max_tokens_per_gpu=1024,
    max_seq_len=2048,
    learning_rate=1e-5,
    nproc_per_node=4,               # 4 GPUs per node
    nnodes=2,                       # 2 total nodes
    node_rank=0,                    # This is node 0
    rdzv_id=12345,                  # Shared rendezvous ID
    rdzv_endpoint="node0:29500"     # Master node endpoint
)
```

## Best Practices

1. **unfreeze_rank_ratio**: Start with values between 0.1 and 0.5. Values above 0.5 are rarely needed for general continual-learning regimes.

2. **Memory Management**: OSFT does not reduce memory requirements compared to SFT, so adjust `max_tokens_per_gpu` accordingly.

3. **Data Processing**: The algorithm handles data processing automatically. Use `use_processed_dataset=True` only if you have pre-tokenized data.

4. **Continual Learning**: OSFT is particularly effective for adapting instruction-tuned models to new domains without catastrophic forgetting.
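
As a sketch that applies these recommendations together, a starting-point configuration might look like the following. The paths are placeholders and the specific values are only reasonable defaults to tune from, not verified recommendations:

```python
from training_hub import osft

# Starting point following the practices above: a moderate unfreeze_rank_ratio,
# max_tokens_per_gpu sized as for standard SFT on the same hardware, and raw
# .jsonl messages data left to the built-in processing.
result = osft(
    model_path="/path/to/instruction-tuned-model",  # placeholder path
    data_path="/path/to/new_domain.jsonl",          # placeholder path
    output_dir="/path/to/outputs",
    unfreeze_rank_ratio=0.3,     # within the recommended 0.1-0.5 range
    batch_size=8,
    max_tokens_per_gpu=2048,     # tune to your GPU memory, as with SFT
    max_seq_len=2048,
    learning_rate=2e-5,
    epochs=2,
    checkpoint_at_epoch=True,
)
```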

examples/docs/sft_usage.md

Lines changed: 1 addition & 2 deletions
@@ -241,11 +241,10 @@ This architecture supports adding new algorithms and backends:
 # Future algorithms might include:
 # - DPO (Direct Preference Optimization)
 # - LoRA (Low-Rank Adaptation)
-# - OSFT (Continual Learning via OSFT)

 # Example of what future usage might look like:
 # from training_hub import dpo, lora
 #
 # dpo_result = dpo(model_path="...", data_path="...", ckpt_output_dir="...")
 # lora_result = lora(model_path="...", data_path="...", rank=16)
-```
+```
