This release introduces memory profiling capabilities, enhanced distributed training orchestration, and support for Granite 4 and Mamba models. Backend implementations have been updated to instructlab-training v0.12.1 and mini-trainer v0.3.0.
## What's New
### Memory Profiling API (Experimental)
- New memory estimation tool for fine-tuning workloads
- Reports per-GPU VRAM requirements (parameters, optimizer state, gradients, activations, outputs)
- Supports both SFT and OSFT algorithms
- Returns low/expected/high memory bounds for better resource planning
- Includes Liger-kernel-aware adjustments
- Example notebook and documentation included
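The quantities the profiler reports can be illustrated with a back-of-envelope calculation. The function below is a hedged sketch, not the library's actual API: the name `estimate_vram_gib`, the per-parameter byte counts, and the activation-overhead factor are all illustrative assumptions chosen for a bf16 + Adam full fine-tuning setup.

```python
# Illustrative sketch of a per-GPU VRAM estimate with low/expected/high bounds.
# All constants are assumptions, not values from the training_hub profiler.

def estimate_vram_gib(num_params: float, num_gpus: int,
                      bytes_per_param: int = 2,             # bf16 weights
                      optimizer_bytes_per_param: int = 8,   # Adam: two fp32 moments
                      grad_bytes_per_param: int = 2,        # bf16 gradients
                      activation_overhead: float = 0.3) -> dict:
    """Return rough low/expected/high per-GPU VRAM bounds in GiB."""
    # Fixed cost: weights + optimizer state + gradients, sharded across GPUs.
    fixed = num_params * (bytes_per_param + optimizer_bytes_per_param
                          + grad_bytes_per_param)
    # Expected adds an activation/output overhead; high pads for fragmentation.
    expected = fixed * (1 + activation_overhead) / num_gpus
    gib = 1024 ** 3
    return {
        "low": round(fixed / num_gpus / gib, 1),
        "expected": round(expected / gib, 1),
        "high": round(expected * 1.25 / gib, 1),
    }

# Example: an 8B-parameter model fully sharded across 8 GPUs.
print(estimate_vram_gib(8e9, 8))
```

The low/expected/high spread mirrors the bounds the profiling API returns; real estimates also account for sequence length, batch size, and Liger-kernel savings.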
### Enhanced Distributed Training
- Automatic torchrun configuration from environment variables
- Full compatibility with Kubeflow and other orchestration systems
- Support for auto and gpu process-count specifications (matching torchrun's --nproc-per-node values)
- Centralized launch parameter handling with hierarchical priority
- Improved validation with clear conflict warnings and error messages
- Flexible argument types (string or integer) for multi-node parameters
- Explicit master address and port configuration options
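The configuration flow described above can be sketched as follows. The environment variables (RANK, WORLD_SIZE, LOCAL_WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are the standard ones torchrun and Kubeflow-style launchers set; the helper name `resolve_launch_config` and its priority scheme are illustrative assumptions, not training_hub's actual interface.

```python
# Hedged sketch: derive default launch parameters from torchrun-style
# environment variables, with explicit arguments taking priority
# (hypothetical helper, not the library's API).
import os

def resolve_launch_config(nproc_per_node=None, master_addr=None, master_port=None):
    env = os.environ
    local = max(int(env.get("LOCAL_WORLD_SIZE", 1)), 1)
    return {
        # Explicit argument wins; otherwise fall back to the environment.
        "nproc_per_node": nproc_per_node or env.get("LOCAL_WORLD_SIZE", "auto"),
        # Node count inferred from total vs. per-node process counts.
        "nnodes": int(env.get("WORLD_SIZE", 1)) // local,
        "master_addr": master_addr or env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(master_port or env.get("MASTER_PORT", 29500)),
    }
```

Letting explicit arguments override the environment keeps single-node local runs simple while remaining fully driven by the orchestrator's environment under Kubeflow.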
### Model Support Expansion
- Granite 4 support (transformers>=4.57.0)
- Mamba model support with optional CUDA acceleration (mamba-ssm[causal-conv1d]>=2.2.5)
- Enhanced compatibility through dependency updates
### Infrastructure Improvements
- Uncapped NumPy for better forward compatibility
- Minimum Numba version raised to 0.62.0
- Liger kernel pinned to >=0.5.10 for stability
- Updated backend implementations (instructlab-training>=0.12.1, rhai-innovation-mini-trainer>=0.3.0)
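Taken together, the new dependency floors correspond to pyproject.toml constraints along these lines (an illustrative fragment; the actual file's layout, extras names, and environment markers may differ):

```toml
[project]
dependencies = [
    "numpy",                                # upper cap removed
    "numba>=0.62.0",
    "liger-kernel>=0.5.10",
    "transformers>=4.57.0",                 # Granite 4 support
    "instructlab-training>=0.12.1",
    "rhai-innovation-mini-trainer>=0.3.0",
]

[project.optional-dependencies]
# "cuda" is an assumed extras name for the optional Mamba acceleration path.
cuda = ["mamba-ssm[causal-conv1d]>=2.2.5"]
```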
## What's Changed
- Pinning liger-kernel version by @Fiona-Waters in #9
- Adding min dependencies for Granite 4 / Mamba support by @Maxusmusti in #14
- uncap numpy and raise minimum numba version by @RobotSail in #15
- Adding basic API for memory profiling (src/training_hub/profiling) by @mazam-lab in #11
- feat(traininghub): Use torchrun environment variables for default configuration by @szaher in #13
- Update backend implementation dep versions in pyproject.toml by @Maxusmusti in #19
## New Contributors
- @Fiona-Waters made their first contribution in #9
- @mazam-lab made their first contribution in #11
- @szaher made their first contribution in #13
**Full Changelog**: v0.2.0...v0.3.0