This repository contains examples of fine-tuning language models using various reinforcement learning and preference optimization techniques. All examples are designed to run efficiently on CPU.
**Installation**

```bash
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

**Examples**

The repository includes several examples of different optimization techniques:
**DPO (Direct Preference Optimization)**

```bash
python examples/dpo_example.py
```

- Standard DPO implementation for preference learning
- Uses paired comparison data from IMDB reviews
- Optimizes for positive sentiment in movie reviews
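For reference, a minimal sketch of the standard pairwise DPO loss (illustrative only; function and tensor names are not taken from the repository's code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss: push the policy's log-ratio for the chosen
    response above that of the rejected response, measured relative
    to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```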
**ORPO (Odds Ratio Preference Optimization)**

```bash
python examples/orpo_example.py
```

- Enhanced version of DPO using odds ratio optimization
- More stable training through improved loss function
- Better handling of preference intensity
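A sketch of ORPO's odds-ratio term (again illustrative; in the published formulation this term is added to a standard NLL loss on the chosen responses, and no reference model is needed):

```python
import torch
import torch.nn.functional as F

def orpo_odds_ratio_loss(chosen_logps, rejected_logps, lam=0.1):
    """Odds-ratio penalty from ORPO. Inputs are length-normalized
    sequence log-probabilities under the policy, so each lies in
    (-inf, 0) and exp() maps it back to a probability."""
    # log odds(y|x) = log p - log(1 - p), computed stably in log space
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    ratio = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return -lam * ratio.mean()  # combined with NLL on chosen responses
```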
**GRPO (Group Relative Policy Optimization)**

```bash
python examples/grpo_example.py
```

- Extends PPO with group-relative advantages
- Compares responses within groups
- Improved handling of complex preference structures
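The core idea of group-relative advantages can be sketched as follows (names are illustrative, not from the repository):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each response's reward is normalized
    against the other responses sampled for the same prompt.
    `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```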
**EPO**

```bash
python examples/epo_example.py
```

- Combines DPO with exploration bonuses
- Uses policy entropy to maintain output diversity
- Prevents preference collapse and mode collapse
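Based on the description above, the objective can be sketched as a DPO loss minus an entropy bonus (a hedged sketch; the example script's exact formulation may differ):

```python
import torch.nn.functional as F

def epo_loss(dpo_loss_value, policy_logits, exploration_weight=0.01):
    """DPO loss plus an entropy bonus that rewards diverse token
    distributions. `policy_logits` has shape (batch, seq_len, vocab)."""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    probs = log_probs.exp()
    # mean per-token entropy of the policy's next-token distribution
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    # subtracting the bonus: higher entropy lowers the loss
    return dpo_loss_value - exploration_weight * entropy
```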
**KTO (Kahneman-Tversky Optimization)**

```bash
python examples/kto_example.py
```

- Implements prospect theory in preference learning
- Models human risk preferences and loss aversion
- Asymmetric treatment of gains and losses
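A hedged sketch of a KTO-style loss (the published formulation estimates the reference point from mismatched pairs; this sketch simplifies it to a detached batch mean, and all names are illustrative):

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss on unpaired examples. `is_desirable` is a bool
    tensor; gains (desirable outputs) and losses (undesirable outputs)
    are weighted asymmetrically, echoing prospect theory's loss aversion."""
    log_ratio = policy_logps - ref_logps
    z = log_ratio.mean().detach()  # crude reference-point estimate
    gains = lambda_d * (1 - torch.sigmoid(beta * (log_ratio - z)))
    losses = lambda_u * (1 - torch.sigmoid(beta * (z - log_ratio)))
    return torch.where(is_desirable, gains, losses).mean()
```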
**ILQL (Implicit Q-Learning)**

```bash
python examples/ilql_example.py
```

**PPO (Proximal Policy Optimization)**

```bash
python examples/ppo_example.py
```
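As a quick reference, PPO's core update is the standard clipped surrogate objective (the textbook form; the script's exact hyperparameters and code may differ):

```python
import torch

def ppo_clipped_loss(logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate: limit how far the new policy can move from
    the policy that generated the rollouts."""
    ratio = torch.exp(logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```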
**Data Management**

- Automatic dataset preparation and splitting
- Save training and test data for reproducibility
- Support for various data formats (JSON, CSV)
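For illustration, preparation and splitting might look like the sketch below, assuming scikit-learn for the split; the repository's actual helpers may differ, though the output file names match the layout shown later:

```python
import json
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_pairs(pairs, out_dir):
    """Split preference pairs into train/test sets and persist them,
    mirroring the data/ layout of a training run."""
    df = pd.DataFrame(pairs)  # columns: prompt, chosen, rejected
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    train.to_csv(f"{out_dir}/train_pairs.csv", index=False)
    test.to_csv(f"{out_dir}/test_pairs.csv", index=False)
    stats = {"num_train": len(train), "num_test": len(test)}
    with open(f"{out_dir}/dataset_stats.json", "w") as f:
        json.dump(stats, f, indent=2)
```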
**Visualization and Metrics**

- Real-time training metrics visualization
- Method-specific metric tracking:
  - Odds ratios for ORPO
  - Group advantages for GRPO
  - Policy entropy for EPO
  - Prospect values for KTO
- Dataset statistics and distributions
**Model Management**

- Save original and fine-tuned models
- Regular checkpointing during training
- Test generation capabilities
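Checkpointing with Hugging Face Transformers typically reduces to `save_pretrained`; a sketch assuming the directory layout shown below (the `step_<N>` subdirectory name is hypothetical):

```python
from transformers import PreTrainedModel, PreTrainedTokenizerBase

def save_checkpoint(model: PreTrainedModel,
                    tokenizer: PreTrainedTokenizerBase,
                    run_dir: str, step: int) -> None:
    """Persist a mid-training checkpoint under models/checkpoints/."""
    path = f"{run_dir}/models/checkpoints/step_{step}"
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)
```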
Each training run creates a timestamped directory with:
```
test_results/[method]_test_YYYYMMDD_HHMMSS/
├── data/                     # Training and test data
│   ├── train_pairs.csv
│   ├── test_pairs.csv
│   └── dataset_stats.json
├── models/                   # Model checkpoints
│   ├── original_model/
│   ├── checkpoints/
│   └── final_model/
├── visualizations/           # Training visualizations
│   ├── loss.png
│   ├── rewards.png
│   ├── method_specific_metrics.png
│   └── text_lengths.png
└── test_generation.txt       # Sample model output
```
All examples support the following configuration options:
- Model selection (default: facebook/opt-125m)
- Dataset size (default: 200 samples)
- Training parameters (epochs, batch size, etc.)
- Method-specific parameters (e.g., exploration weight for EPO)
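As a hypothetical illustration of these options (the scripts' real interface may differ, and values other than the stated defaults are assumptions):

```python
# Hypothetical configuration block; the actual scripts may expose
# these options differently (e.g., via argparse or a config file).
config = {
    "model_name": "facebook/opt-125m",  # model selection (default)
    "dataset_size": 200,                # number of samples (default)
    "num_epochs": 3,                    # training parameters (assumed)
    "batch_size": 4,                    # (assumed)
    "exploration_weight": 0.01,         # EPO-specific parameter (assumed)
}
```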
**Requirements**

- Python 3.8+
- PyTorch
- Transformers
- TRL (Transformer Reinforcement Learning)
- Other dependencies in requirements.txt
**Notes**

- All examples are configured for CPU training by default
- Adjust batch sizes and model sizes based on your hardware
- Visualization requires matplotlib and seaborn
- Training metrics are saved in TensorBoard format
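Assuming the standard PyTorch `SummaryWriter` is used to produce that format (tag names and paths below are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

# Log scalar metrics in TensorBoard format; view them with:
#   tensorboard --logdir runs/
writer = SummaryWriter(log_dir="runs/dpo_example")  # path is illustrative
for step, loss in enumerate([0.69, 0.52, 0.41]):    # dummy values
    writer.add_scalar("train/loss", loss, step)
writer.close()
```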