README.md: 35 changes (34 additions, 1 deletion)
@@ -7,8 +7,8 @@ An algorithm-focused interface for common llm training, continual learning, and
|-----------|---------------------|---------------|------|------|--------|
| **Supervised Fine-tuning (SFT)** | ✅ | - | - | - | Implemented |
| Continual Learning (OSFT) | 🔄 | ✅ | 🔄 | - | Implemented |
| **Low-Rank Adaptation (LoRA) + SFT** | - | - | ✅ | - | Implemented |
| Direct Preference Optimization (DPO) | - | - | - | 🔄 | Planned |
| Low-Rank Adaptation (LoRA) | 🔄 | - | 🔄 | - | Planned |
| Group Relative Policy Optimization (GRPO) | - | - | - | 🔄 | Planned |

**Legend:**
@@ -63,6 +63,29 @@ result = osft(
)
```

### [Low-Rank Adaptation (LoRA) + SFT](examples/docs/lora_usage.md)

Parameter-efficient fine-tuning that combines LoRA adapters with supervised fine-tuning. Features:
- Memory-efficient training with significantly reduced VRAM requirements
- Single-GPU and multi-GPU distributed training support
- Unsloth backend for 2x faster training and 70% less memory usage
- Support for QLoRA (4-bit quantization) for even lower memory usage
- Compatible with messages and Alpaca dataset formats

```python
from training_hub import lora_sft

result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="/path/to/data.jsonl",
ckpt_output_dir="/path/to/outputs",
lora_r=16,
lora_alpha=32,
num_epochs=3,
learning_rate=2e-4
)
```

## Installation

### Basic Installation
@@ -85,6 +108,16 @@ pip install training-hub[cuda]
pip install -e .[cuda]
Collaborator review comment:

@Maxusmusti Would you mind adding this flag to the README? I think in most cases this flag will need to be present, unless the users get lucky with the prebuilt wheels.

Suggested change:
pip install -e .[cuda]
pip install -e .[cuda] --no-build-isolation

I can't suggest it, but the same change needs to be made on line 106.

```

### LoRA Support
For LoRA training with optimized dependencies:
```bash
pip install training-hub[lora]
# or for development
pip install -e .[lora]
```

**Note:** The LoRA extras include Unsloth optimizations and PyTorch-optimized xformers for better performance and compatibility.

**Note:** If you encounter build issues with flash-attn, install the base package first:
```bash
# Install base package (provides torch, packaging, wheel, ninja)
examples/README.md: 29 changes (29 additions, 0 deletions)
@@ -81,6 +81,35 @@ result = osft(
)
```

### Low-Rank Adaptation (LoRA) + SFT

LoRA provides parameter-efficient fine-tuning with significantly reduced memory requirements by training low-rank adaptation matrices instead of the full model weights. Training hub implements LoRA with supervised fine-tuning using the optimized Unsloth backend.

**Documentation:**
- [LoRA Usage Guide](docs/lora_usage.md) - Comprehensive usage documentation with parameter reference and examples

**Scripts:**
- [LoRA Example](scripts/lora_example.py) - Basic LoRA training examples with different configurations and dataset formats

**Launch Requirements:**
- **Single-GPU**: Standard Python launch: `python my_script.py`
- **Multi-GPU**: Unlike other algorithms, LoRA requires torchrun: `torchrun --nproc-per-node=4 my_script.py`

**Quick Example:**
```python
from training_hub import lora_sft

result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="/path/to/data.jsonl",
ckpt_output_dir="/path/to/outputs",
lora_r=16,
lora_alpha=32,
num_epochs=3,
learning_rate=2e-4
)
```

### Memory Estimation (Experimental / In-Development)

training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are built off of formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
examples/docs/lora_usage.md: 190 changes (190 additions, 0 deletions)
@@ -0,0 +1,190 @@
# LoRA + SFT Usage Guide

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that allows you to fine-tune large language models with significantly reduced memory requirements. Training hub implements LoRA combined with supervised fine-tuning (SFT) using the optimized Unsloth backend.

## Quick Start

### Basic LoRA Training

```python
from training_hub import lora_sft

result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="./training_data.jsonl",
ckpt_output_dir="./outputs",
lora_r=16, # LoRA rank
lora_alpha=32, # LoRA scaling parameter
num_epochs=3,
learning_rate=2e-4
)
```

### Single-GPU Launch

For standard single-GPU training, run your script directly with Python (same as other algorithms):

```bash
python my_training_script.py
```

### Multi-GPU Launch

**Important:** Unlike other algorithms in training-hub which handle distributed setup internally, LoRA training requires `torchrun` for multi-GPU setups due to Unsloth's distributed training requirements:

```bash
# For 4 GPUs
torchrun --nproc-per-node=4 my_training_script.py

# For 8 GPUs
torchrun --nproc-per-node=8 my_training_script.py
```

## Installation

```bash
pip install training-hub[lora]
```

This includes:
- Unsloth optimizations for 2x faster training and 70% less VRAM
- PyTorch-optimized xformers for better performance
- TRL for advanced training features

## LoRA Parameters

### Core LoRA Settings
- **`lora_r`**: LoRA rank (default: 16) - Higher values capture more information but use more memory
- **`lora_alpha`**: LoRA scaling parameter (default: 32) - Controls the magnitude of LoRA updates
- **`lora_dropout`**: Dropout rate for LoRA layers (default: 0.0) - The 0.0 default matches Unsloth's optimized code path
- **`target_modules`**: List of modules to apply LoRA to (default: auto-detect)
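
One relationship worth keeping in mind (it comes from the standard LoRA formulation rather than anything training_hub-specific): the learned low-rank update is scaled by `lora_alpha / lora_r`, so raising the rank without also raising alpha shrinks the effective magnitude of the adaptation. A tiny illustration:

```python
# Effective scaling of the LoRA update (scale = alpha / r in the standard
# LoRA formulation). The defaults in this guide keep the ratio at 2.0, and
# the QLoRA example below preserves it at a higher rank.
for r, alpha in [(16, 32), (32, 64), (64, 128)]:
    print(f"r={r:<3} alpha={alpha:<4} scale={alpha / r}")
```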

### Quantization (QLoRA)
For even lower memory usage, enable 4-bit quantization:

```python
result = lora_sft(
model_path="meta-llama/Llama-2-13b-hf",
data_path="./data.jsonl",
ckpt_output_dir="./outputs",
lora_r=64, # Higher rank for quantized model
lora_alpha=128,
load_in_4bit=True, # Enable QLoRA
learning_rate=1e-4 # Lower LR for quantized training
)
```

## Dataset Formats

LoRA training supports the same dataset formats as SFT:

### Messages Format (Recommended)
```json
{
"messages": [
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
}
```

### Alpaca Format
```json
{
"instruction": "Explain machine learning",
"input": "",
"output": "Machine learning is..."
}
```
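
If you need to assemble such a dataset yourself, a minimal sketch (standard library only, with toy placeholder records and an arbitrary filename) might look like this; one JSON object per line (JSONL) matches the `.jsonl` paths used throughout this guide:

```python
import json

# Toy records in the messages format shown above; replace with real data.
records = [
    {
        "messages": [
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is..."},
        ]
    },
]

# Write one JSON object per line (JSONL).
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```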

## Memory Benefits

LoRA provides significant memory savings compared to full fine-tuning by only training low-rank adaptation matrices instead of the full model weights. The exact memory reduction depends on your specific model, LoRA configuration, and batch size settings.
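
To make the saving concrete: for a single weight matrix of shape d × k, full fine-tuning updates d·k parameters, while LoRA trains only two low-rank factors totalling r·(d + k). A back-of-the-envelope sketch for a hypothetical 4096×4096 projection (this ignores optimizer state, activations, and the frozen base weights, so it is not a full VRAM estimate):

```python
# Rough trainable-parameter comparison for one weight matrix.
d, k = 4096, 4096          # hypothetical projection shape
r = 16                     # LoRA rank

full_params = d * k        # full fine-tuning: 16,777,216
lora_params = r * (d + k)  # LoRA factors: 131,072

print(f"trainable fraction: {lora_params / full_params:.4%}")  # ~0.78%
```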

## Multi-GPU Training

For distributed training across multiple GPUs:

```python
result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="./large_dataset.jsonl",
ckpt_output_dir="./outputs",

# LoRA settings
lora_r=32,
lora_alpha=64,

# Distributed training
effective_batch_size=128, # Total across all GPUs
micro_batch_size=2, # Per GPU

# Training settings
num_epochs=3,
learning_rate=2e-4
)
```

Launch with torchrun:
```bash
torchrun --nproc-per-node=4 my_script.py
```

## Performance Tips

1. **Use Unsloth optimizations** (included by default)
2. **Enable BF16** for better performance: `bf16=True`
3. **Use sample packing**: `sample_packing=True`
4. **Optimize batch sizes**: Start with `micro_batch_size=2` and adjust
5. **For large models**: Use `load_in_4bit=True` for QLoRA
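
Putting several of these tips together, the sketch below is only a starting configuration; the parameter names are the ones listed in the tips above, and the values are starting points rather than tuned recommendations:

```python
from training_hub import lora_sft

result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    bf16=True,             # tip 2: BF16 for better performance
    sample_packing=True,   # tip 3: pack short samples together
    micro_batch_size=2,    # tip 4: starting point, adjust to your GPU
    num_epochs=3,
    learning_rate=2e-4
)
```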

## Advanced Configuration

### Custom Target Modules
```python
result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="./data.jsonl",
ckpt_output_dir="./outputs",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Attention only
lora_r=16,
lora_alpha=32
)
```

Collaborator review comment:

Does wandb need to be installed separately or does it come with one of the new dependencies? Iirc, wandb currently isn't installed automatically and usually we instead recommend users to install it separately.

### Weights & Biases Integration
```python
result = lora_sft(
model_path="meta-llama/Llama-2-7b-hf",
data_path="./data.jsonl",
ckpt_output_dir="./outputs",
lora_r=16,
lora_alpha=32,
wandb_project="my-lora-project",
wandb_entity="my-team"
)
```

## Examples

See [lora_example.py](../scripts/lora_example.py) for complete working examples including:
- Basic LoRA training
- QLoRA with 4-bit quantization
- Multi-GPU distributed training
- Different dataset format handling

## Troubleshooting

### Memory Issues
- Reduce `micro_batch_size`
- Enable `load_in_4bit=True` for QLoRA
- Lower the `lora_r` value

### Multi-GPU Issues
- Ensure you're using `torchrun` for multi-GPU (not direct Python execution)
- Check that `effective_batch_size` is divisible by `nproc_per_node * micro_batch_size`
- For very large models, try `enable_model_splitting=True`
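
The divisibility rule above implies a gradient-accumulation factor of `effective_batch_size / (nproc_per_node * micro_batch_size)`; that formula is inferred from the rule rather than taken from the training_hub API, but the quick check below is a cheap thing to run before launching with torchrun:

```python
# Sanity-check the batch-size relationship described above.
effective_batch_size = 128
micro_batch_size = 2
nproc_per_node = 4   # must match torchrun --nproc-per-node

samples_per_step = nproc_per_node * micro_batch_size
assert effective_batch_size % samples_per_step == 0, (
    "effective_batch_size must be divisible by nproc_per_node * micro_batch_size"
)
print(f"gradient accumulation steps: {effective_batch_size // samples_per_step}")  # 16
```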

### Installation Issues
- If xformers conflicts occur, note that the LoRA extras already install PyTorch-optimized xformers builds
- For CUDA version issues, try the appropriate extra: `[lora-cu129]` or `[lora-cu130]`