Releases · Red-Hat-AI-Innovation-Team/training_hub

14 Oct 21:56

v0.3.0

e6c8cca

v0.3.0 - Granite 4, Mamba, Env var support, and Memory Estimation Latest

Latest

This release introduces memory profiling capabilities, enhanced distributed training orchestration, and support for Granite 4 and Mamba models. Backend implementations have been updated to instructlab-training v0.12.1 and mini-trainer v0.3.0.

What's New

Memory Profiling API (Experimental)

New memory estimation tool for fine-tuning workloads
Reports per-GPU VRAM requirements (parameters, optimizer state, gradients, activations, outputs)
Supports both SFT and OSFT algorithms
Returns low/expected/high memory bounds for better resource planning
Includes Liger-kernel-aware adjustments
Example notebook and documentation included

Enhanced Distributed Training

Automatic torchrun configuration from environment variables
Full compatibility with Kubeflow and other orchestration systems
Support for auto and gpu process count specifications
Centralized launch parameter handling with hierarchical priority
Improved validation with clear conflict warnings and error messages
Flexible argument types (string or integer) for multi-node parameters
Explicit master address and port configuration options

Model Support Expansion

Granite 4 support (transformers>=4.57.0)
Mamba model support with optional CUDA acceleration (mamba-ssm[causal-conv1d]>=2.2.5)
Enhanced compatibility through dependency updates

Infrastructure Improvements

Uncapped NumPy for better forward compatibility
Minimum Numba version raised to 0.62.0
Liger kernel pinned to >=0.5.10 for stability
Updated backend implementations (instructlab-training>=0.12.1, rhai-innovation-mini-trainer>=0.3.0)

What's Changed

Pinning liger-kernal version by @Fiona-Waters in #9
Adding min dependencies for Granite 4 / Mamba support by @Maxusmusti in #14
uncap numpy and raise minimum numba version by @RobotSail in #15
Adding basic API for memory profiling (src/training_hub/profiling) by @mazam-lab in #11
feat(traininghub): Use torchrun environment variables for default configuration by @szaher in #13
Update backend implementation dep versions in pyproject.toml by @Maxusmusti in #19