Reinforcement Learning Fine-tuning Examples

This repository contains examples of fine-tuning language models using various reinforcement learning and preference optimization techniques. All examples are designed to run efficiently on CPU.

Installation

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Examples

The repository includes several examples of different optimization techniques:

Direct Preference Optimization (DPO)

python examples/dpo_example.py
  • Standard DPO implementation for preference learning
  • Uses paired comparison data from IMDB reviews
  • Optimizes for positive sentiment in movie reviews
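
For orientation, a minimal sketch of a DPO run with TRL's DPOTrainer is shown below. It assumes a recent TRL release (DPOConfig, processing_class) and uses an invented toy dataset instead of the IMDB pairs the actual script builds; it is not a copy of dpo_example.py.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny illustrative preference dataset in the prompt/chosen/rejected format DPOTrainer expects.
train_dataset = Dataset.from_dict({
    "prompt": ["The movie was", "Overall the film felt"],
    "chosen": [" an absolute delight from start to finish.", " warm, funny, and beautifully acted."],
    "rejected": [" a tedious waste of two hours.", " flat and instantly forgettable."],
})

args = DPOConfig(
    output_dir="test_results/dpo_sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    beta=0.1,          # strength of the implicit KL penalty against the reference model
    report_to="none",
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases call this argument `tokenizer`
)
trainer.train()
```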

Odds Ratio Preference Optimization (ORPO)

python examples/orpo_example.py
  • Reference-free alternative to DPO based on an odds-ratio penalty
  • Combines a standard language-modelling loss with the preference term, which keeps training simple and stable
  • Better handling of preference intensity
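
ORPOTrainer consumes the same prompt/chosen/rejected format but fits without a reference model. A hedged sketch, again assuming a recent TRL release and reusing the names from the DPO sketch above:

```python
from trl import ORPOConfig, ORPOTrainer

# Reuses `model`, `tokenizer`, and `train_dataset` from the DPO sketch above.
args = ORPOConfig(
    output_dir="test_results/orpo_sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    beta=0.1,          # weight of the odds-ratio term relative to the language-modelling loss
    report_to="none",
)
trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```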

Group Relative Policy Optimization (GRPO)

python examples/grpo_example.py
  • PPO-style objective that replaces the learned value function with group-relative advantages
  • Samples several responses per prompt and normalizes their rewards within each group
  • Improved handling of complex preference structures
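
TRL's GRPOTrainer only ships with recent releases, so the sketch below assumes that API; the repo's grpo_example.py may implement the method differently, and the reward function here is a deliberately crude, invented positivity check.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative reward: 1.0 if the completion contains an obviously positive word, else 0.0.
def positivity_reward(completions, **kwargs):
    positive_words = ("great", "wonderful", "excellent", "delight")
    return [float(any(w in c.lower() for w in positive_words)) for c in completions]

train_dataset = Dataset.from_dict({
    "prompt": ["The movie was", "I thought the film was",
               "In the end the story felt", "Overall the acting was"],
})

args = GRPOConfig(
    output_dir="test_results/grpo_sketch",
    num_generations=4,               # group size: responses sampled per prompt
    per_device_train_batch_size=4,   # must be a multiple of num_generations
    report_to="none",
)
trainer = GRPOTrainer(
    model="facebook/opt-125m",
    reward_funcs=positivity_reward,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```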

Exploratory Preference Optimization (EPO)

python examples/epo_example.py
  • Combines DPO with exploration bonuses
  • Uses policy entropy to maintain output diversity
  • Prevents preference collapse and mode collapse
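
TRL has no built-in EPO trainer. The function below is only a sketch of the idea the bullets describe (a DPO objective plus a weighted entropy bonus); it is neither this repo's implementation nor the published EPO objective, and the weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_entropy_bonus(
    policy_chosen_logps,    # log p_theta(chosen | prompt), shape (batch,)
    policy_rejected_logps,  # log p_theta(rejected | prompt), shape (batch,)
    ref_chosen_logps,       # same quantities under the frozen reference model
    ref_rejected_logps,
    policy_logits,          # policy logits over the vocabulary, shape (batch, seq_len, vocab)
    beta=0.1,
    exploration_weight=0.01,
):
    # Standard DPO term: prefer the chosen response's log-ratio over the rejected one's.
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    dpo_term = -F.logsigmoid(beta * margin).mean()

    # Exploration bonus: mean token-level entropy of the policy distribution.
    probs = policy_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

    # Subtracting the weighted entropy rewards diverse outputs and discourages mode collapse.
    return dpo_term - exploration_weight * entropy
```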

Kahneman-Tversky Optimization (KTO)

python examples/kto_example.py
  • Implements prospect theory in preference learning
  • Models human risk preferences and loss aversion
  • Asymmetric treatment of gains and losses
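
KTOTrainer works from unpaired, thumbs-up/thumbs-down style feedback rather than preference pairs. A minimal sketch assuming a recent TRL release; the dataset and weights here are illustrative, not the values used by kto_example.py.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Unpaired feedback: each completion is simply labelled desirable (True) or undesirable (False).
train_dataset = Dataset.from_dict({
    "prompt": ["The movie was", "The movie was"],
    "completion": [" a genuine joy to watch.", " painfully dull and overlong."],
    "label": [True, False],
})

args = KTOConfig(
    output_dir="test_results/kto_sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    desirable_weight=1.0,    # asymmetric weights mirror loss aversion:
    undesirable_weight=1.5,  # undesirable examples are penalized more heavily than gains are rewarded
    report_to="none",
)
trainer = KTOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer` in older TRL releases
)
trainer.train()
```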

Other Examples

  • ILQL (Implicit Q-Learning): python examples/ilql_example.py
  • PPO (Proximal Policy Optimization): python examples/ppo_example.py
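
The PPO loop differs from the offline preference trainers: the model generates, an external reward scores the output, and PPOTrainer.step updates the policy. The sketch below assumes the classic TRL 0.x PPO API (PPOConfig with model_name, AutoModelForCausalLMWithValueHead); newer TRL releases restructured PPOTrainer, and the keyword-based reward is an invented stand-in for a real reward model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "facebook/opt-125m"
config = PPOConfig(model_name=model_name, batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: generate a continuation, score it, and update the policy.
query = tokenizer("The movie was", return_tensors="pt").input_ids[0]
generated = model.generate(query.unsqueeze(0), max_new_tokens=16, do_sample=True,
                           pad_token_id=tokenizer.eos_token_id)
response = generated[0, query.shape[0]:]

# Invented reward: crude positive-sentiment keyword check.
text = tokenizer.decode(response, skip_special_tokens=True)
reward = torch.tensor(1.0 if "great" in text.lower() else -0.1)

stats = ppo_trainer.step([query], [response], [reward])
```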

Features

Data Management

  • Automatic dataset preparation and splitting
  • Save training and test data for reproducibility
  • Support for various data formats (JSON, CSV)
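
Those bullets correspond roughly to a preparation step like the sketch below, which is invented here for illustration: it builds sentiment preference pairs from IMDB and writes the train/test CSVs and stats file named under Directory Structure further down.

```python
import json
import os
import pandas as pd
from datasets import load_dataset

# Build paired comparison data: positive IMDB reviews as "chosen", negative ones as "rejected".
imdb = load_dataset("imdb", split="train").shuffle(seed=42)
chosen = imdb.filter(lambda r: r["label"] == 1).select(range(100))["text"]
rejected = imdb.filter(lambda r: r["label"] == 0).select(range(100))["text"]

pairs = pd.DataFrame({
    "prompt": ["Write a movie review:"] * 100,
    "chosen": chosen,
    "rejected": rejected,
})

# 80/20 split, saved to disk so a run can be reproduced exactly.
os.makedirs("data", exist_ok=True)
train, test = pairs.iloc[:80], pairs.iloc[80:]
train.to_csv("data/train_pairs.csv", index=False)
test.to_csv("data/test_pairs.csv", index=False)

with open("data/dataset_stats.json", "w") as f:
    json.dump({"n_train": len(train), "n_test": len(test)}, f, indent=2)
```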

Training Visualization

  • Real-time training metrics visualization
  • Method-specific metric tracking:
    • Odds ratios for ORPO
    • Group advantages for GRPO
    • Policy entropy for EPO
    • Prospect values for KTO
  • Dataset statistics and distributions
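
For the generic curves (loss.png, rewards.png), a plotting helper along these lines is enough. The log_history pull at the end assumes a Hugging Face Trainer-style object named `trainer`; the helper itself is an illustration, not the repo's plotting code.

```python
import matplotlib.pyplot as plt

def plot_metric(values, name, out_path):
    # values: one number per logging step, e.g. training loss or mean reward.
    plt.figure(figsize=(6, 4))
    plt.plot(values, label=name)
    plt.xlabel("logging step")
    plt.ylabel(name)
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

# Example: pull per-step losses out of a finished trainer run and plot them.
losses = [entry["loss"] for entry in trainer.state.log_history if "loss" in entry]
plot_metric(losses, "training loss", "visualizations/loss.png")
```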

Model Management

  • Save original and fine-tuned models
  • Regular checkpointing during training
  • Test generation capabilities
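
In Transformers/TRL terms these three bullets reduce to save_pretrained, checkpointing via the training arguments, and a quick generation pass. The paths and prompt below are illustrative, and `model`, `tokenizer`, and `trainer` are assumed from one of the sketches above.

```python
# Keep an untouched copy of the base model so fine-tuned outputs can be compared against it.
model.save_pretrained("models/original_model")
tokenizer.save_pretrained("models/original_model")

# Regular checkpoints are handled by the training arguments, e.g. DPOConfig(save_steps=50, ...).
trainer.train()
trainer.save_model("models/final_model")

# Quick qualitative test: generate from a prompt and store the sample output.
inputs = tokenizer("The movie was", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
with open("test_generation.txt", "w") as f:
    f.write(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```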

Directory Structure

Each training run creates a timestamped directory with:

test_results/[method]_test_YYYYMMDD_HHMMSS/
├── data/                    # Training and test data
│   ├── train_pairs.csv
│   ├── test_pairs.csv
│   └── dataset_stats.json
├── models/                  # Model checkpoints
│   ├── original_model/
│   ├── checkpoints/
│   └── final_model/
├── visualizations/          # Training visualizations
│   ├── loss.png
│   ├── rewards.png
│   ├── method_specific_metrics.png
│   └── text_lengths.png
└── test_generation.txt      # Sample model output

Configuration

All examples support the following configuration options:

  • Model selection (default: facebook/opt-125m)
  • Dataset size (default: 200 samples)
  • Training parameters (epochs, batch size, etc.)
  • Method-specific parameters (e.g., exploration weight for EPO)
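
A command-line interface mirroring those options might look like the sketch below; the flag names and defaults (other than the model and dataset size noted above) are assumptions, not the scripts' actual arguments.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="RL fine-tuning example")
    parser.add_argument("--model_name", default="facebook/opt-125m")
    parser.add_argument("--dataset_size", type=int, default=200)
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=4)
    # Method-specific knob, only read by the EPO example.
    parser.add_argument("--exploration_weight", type=float, default=0.01)
    return parser.parse_args()
```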

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • TRL (Transformer Reinforcement Learning)
  • Other dependencies in requirements.txt

Notes

  • All examples are configured for CPU training by default
  • Adjust batch sizes and model sizes based on your hardware
  • Visualization requires matplotlib and seaborn
  • Training metrics are saved in TensorBoard format
