Conversation

@mtake mtake commented Nov 4, 2025

This PR adds an SFT example for the Granite 4.0 models. It also applies minor non-functional updates to the existing SFT/OSFT examples for Granite 3.3. No OSFT example is added for Granite 4.0 because Granite-4.0-H-Small (32B total / 9B active) is too large for single-node training, even with CPU offloading enabled.

Summary by CodeRabbit

  • New Features

    • Added a Granite 4.0 SFT example script for single-node multi‑GPU training with CLI-configurable params, multiple Granite 4.0 presets, checkpoint discovery, post‑training reporting, and optional model interpolation.
  • Documentation

    • Updated examples README to reference the new Granite 4.0 SFT script.
  • Refactor

    • Introduced example_* defaults (including example_min_nproc_per_node) and switched to local data paths; unified CLI defaults to those example values.
  • Behavior

    • Relaxed nproc validation to emit informational hints and commented out distributed-launch options for simplified single-node usage.

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds a new SFT example for Granite 4.0, updates the examples README, and refactors existing Granite example scripts to use module-level example_* defaults, relax nproc validation, switch data output to local paths, and simplify launcher/distributed options for single-node multi-GPU runs.

Changes

Cohort / File(s) Summary
Documentation
examples/README.md
Added entry for "SFT with Granite 4.0" under SFT Scripts.
New Granite 4.0 SFT example
examples/scripts/sft_granite4_example.py
New single-node multi-GPU SFT script: multiple Granite 4.0 presets, CLI argument parsing, checkpoint discovery (find_most_recent_checkpoint), training orchestration (main -> sft()), post-training reporting, optional model interpolation, and several new public example configuration variables.
OSFT example refactor
examples/scripts/osft_granite_example.py
Introduced example_* module defaults (including example_min_nproc_per_node), switched CLI defaults to use these variables, changed data_output_dir to a local path (RAM-disk option commented), relaxed nproc_per_node fatal validation to an informational hint, and commented out distributed launcher parameters.
SFT example refactor
examples/scripts/sft_granite_example.py
Added example_min_nproc_per_node and other example_* defaults, updated CLI defaults to use them, moved data output to local path (RAM-disk commented), and relaxed launcher/distributed parameter usage and nproc validation.
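The example_* module-default pattern that these refactors introduce can be sketched as below. The specific variable names and values here (example_model_path, example_nproc_per_node) are illustrative stand-ins, not the scripts' actual definitions:

```python
import argparse

# Hypothetical module-level defaults mirroring the example_* pattern
# described above; the real scripts define their own names and values.
example_model_path = "ibm-granite/granite-4.0-h-small"
example_nproc_per_node = 8
example_min_nproc_per_node = 2


def build_parser() -> argparse.ArgumentParser:
    """Wire the module-level example_* values in as CLI defaults,
    so users can override any of them from the command line."""
    parser = argparse.ArgumentParser(description="SFT example (sketch)")
    parser.add_argument("--model-path", default=example_model_path)
    parser.add_argument("--nproc-per-node", type=int,
                        default=example_nproc_per_node)
    return parser
```

Running the script with no arguments then uses the example_* values, while any flag supplied on the command line takes precedence.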

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Trainer
    participant Checkpoint
    participant Interpolator

    User->>CLI: run `sft_granite4_example.py --args`
    CLI->>CLI: parse args & apply `example_*` defaults
    CLI->>Trainer: invoke `sft(...)` with config
    Trainer->>Checkpoint: emit checkpoints (hf_format/samples_*)
    Trainer-->>CLI: training finished
    CLI->>Checkpoint: call `find_most_recent_checkpoint(output_dir)`
    Checkpoint-->>CLI: return checkpoint_path
    alt interpolation_weight > 0
        CLI->>Interpolator: `interpolate_models(base, checkpoint, weight)`
        Interpolator-->>CLI: return interpolated model path
    end
    CLI->>User: print duration and result
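The flow in the diagram can be sketched as plain Python. Here sft, find_most_recent_checkpoint, and interpolate_models are passed in as callables standing in for the real functions (which live in training_hub and the example scripts); the orchestration shape is the point, not the signatures:

```python
import time


def run_example(sft, find_most_recent_checkpoint, interpolate_models,
                output_dir, base_model, interpolation_weight=0.0,
                **sft_kwargs):
    """Orchestrate the flow from the sequence diagram above:
    train, locate the newest checkpoint, optionally interpolate."""
    start = time.time()
    sft(ckpt_output_dir=output_dir, **sft_kwargs)        # training run
    checkpoint = find_most_recent_checkpoint(output_dir)  # newest checkpoint
    result = checkpoint
    if 0.0 < interpolation_weight < 1.0:                  # skip trivial weights
        result = interpolate_models(base_model, checkpoint,
                                    interpolation_weight)
    duration = time.time() - start
    print(f"Finished in {duration:.1f}s -> {result}")
    return result
```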

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to:
    • examples/scripts/sft_granite4_example.py (new, sizeable script: CLI semantics, checkpoint discovery, optional interpolation)
    • Consistent propagation and use of example_* defaults across modified example scripts
    • Changes to nproc_per_node handling and commented distributed-launch settings

Possibly related PRs

  • Add granite training example #16 — Adds/extends Granite SFT/OSFT example scripts and interpolation tooling; strong overlap with the new SFT Granite 4.0 example and interpolation usage.

Suggested reviewers

  • Maxusmusti
  • RobotSail

Poem

🐰 I hopped in code with a tiny cheer,
Granite four now trains quite near,
Checkpoints stacked and models blend,
Single-node GPUs race to the end,
Hop—examples ready, shaped and clear!

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed. The title accurately describes the main addition in the changeset: a new SFT example for Granite 4.0 models (sft_granite4_example.py), which is the primary focus of the PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/scripts/sft_granite4_example.py (1)

178-209: Drop the unused result binding

result is never read, so the assignment only triggers lint noise. Calling sft directly keeps the control flow identical while eliminating the warning.

-        result = sft(
+        sft(
             # Model and data
             model_path=args.model_path,
             data_path=args.data_path,
             ckpt_output_dir=args.ckpt_output_dir,
@@
             # Additional parameters to the backend
             **kwargs
         )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23bd7e6 and 6ec3167.

📒 Files selected for processing (4)
  • examples/README.md (1 hunks)
  • examples/scripts/osft_granite_example.py (5 hunks)
  • examples/scripts/sft_granite4_example.py (1 hunks)
  • examples/scripts/sft_granite_example.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/scripts/sft_granite4_example.py (3)
src/training_hub/algorithms/sft.py (1)
  • sft (169-248)
examples/scripts/sft_granite_example.py (2)
  • find_most_recent_checkpoint (70-93)
  • main (96-207)
examples/scripts/interpolator.py (1)
  • interpolate_models (19-78)
🪛 Ruff (0.14.3)
examples/scripts/sft_granite4_example.py
  • 1-1: Shebang is present but file is not executable (EXE001)
  • 119-119: Avoid specifying long messages outside the exception class (TRY003)
  • 178-178: Local variable result is assigned to but never used; remove the assignment (F841)
  • 232-232: Do not catch blind exception: Exception (BLE001)

🔇 Additional comments (1)
examples/README.md (1)

31-31: Granite 4.0 entry is clear

Thanks for surfacing the new Granite 4.0 SFT example so users can find it quickly.

@Maxusmusti Maxusmusti left a comment

Just one small nit, but otherwise looks good!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/scripts/sft_granite4_example.py (2)

93-96: Consider moving timestamp generation into main().

The timestamp is captured at module import time rather than when main() executes, which could cause confusion if the module is imported but training is started later.

Move the timestamp and data_output_dir generation into main():

-timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-full_experiment_name = f"{experiment_name}_{timestamp}"
-
-data_output_dir=f"data/{full_experiment_name}"  # Directory for processed data
-# data_output_dir=f"/dev/shm/data/{full_experiment_name}"  # Directory for processed data (RAM disk for speed)

Then in main() after parsing args:

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    full_experiment_name = f"{experiment_name}_{timestamp}"
    data_output_dir = f"data/{full_experiment_name}"

100-124: Consider extracting to a shared utility module.

This function is duplicated from osft_continual_learning_example.py. To follow DRY principles, consider extracting it to a shared utilities module that both scripts can import.
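One way to realize this suggestion is a hypothetical shared module (say, examples/scripts/checkpoint_utils.py) that both scripts import. The sketch below assumes the checkpoint layout visible elsewhere in this review (samples_* directories under <output_dir>/hf_format); it mirrors, but does not reproduce, the duplicated helper:

```python
import os


def find_most_recent_checkpoint(output_dir: str) -> str:
    """Return the newest samples_* checkpoint under output_dir/hf_format.

    Layout assumption: the trainer writes checkpoints as
    <output_dir>/hf_format/samples_<N>; we pick the most recently
    modified one.
    """
    hf_dir = os.path.join(output_dir, "hf_format")
    candidates = [
        os.path.join(hf_dir, name)
        for name in os.listdir(hf_dir)
        if name.startswith("samples_")
    ]
    if not candidates:
        raise FileNotFoundError(f"no checkpoints found under {hf_dir}")
    return max(candidates, key=os.path.getmtime)
```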

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4719491 and a0b472b.

📒 Files selected for processing (1)
  • examples/scripts/sft_granite4_example.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/scripts/sft_granite4_example.py (2)
src/training_hub/algorithms/sft.py (1)
  • sft (169-248)
examples/scripts/interpolator.py (1)
  • interpolate_models (19-78)
🪛 Ruff (0.14.3)
examples/scripts/sft_granite4_example.py
  • 1-1: Shebang is present but file is not executable (EXE001)
  • 119-119: Avoid specifying long messages outside the exception class (TRY003)
  • 232-232: Do not catch blind exception: Exception (BLE001)

🔇 Additional comments (4)
examples/scripts/sft_granite4_example.py (4)

127-172: LGTM!

The CLI argument parsing and configuration validation are well-structured. The soft validation approach (printing tips instead of hard enforcement) gives users flexibility while providing helpful guidance.
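The soft-validation approach being praised here can be sketched as follows; the threshold name and message wording are illustrative, not the script's exact output:

```python
def check_nproc(nproc_per_node: int, min_nproc: int = 2) -> bool:
    """Soft validation: print a hint instead of raising when the GPU
    count looks too low for the chosen model, and let the run proceed."""
    if nproc_per_node < min_nproc:
        print(f"Tip: {nproc_per_node} process(es) may be too few; "
              f"at least {min_nproc} are recommended for this model.")
        return False
    return True
```

The caller ignores the return value and continues either way; the boolean exists only so the behavior is easy to test.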


194-194: Good choice for checkpoint strategy.

Setting save_samples=0 disables sample-based checkpointing and relies on epoch-based checkpoints only, which addresses the storage concern mentioned in previous reviews for large models like Granite 4 Small.


211-231: LGTM!

The post-training reporting and conditional interpolation logic are well-implemented. The condition on line 223 correctly excludes edge cases where interpolation would be trivial (weights of 0.0 or 1.0).
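The guard on trivial weights amounts to the check sketched below. This is a toy linear blend over weight dicts, not the real interpolate_models from examples/scripts/interpolator.py; only the `0.0 < weight < 1.0` condition is taken from the review:

```python
def maybe_interpolate(base_weights, tuned_weights, weight: float):
    """Linearly blend two weight dicts, skipping the trivial endpoints.

    weight == 0.0 would return the base model unchanged and
    weight == 1.0 the tuned checkpoint unchanged, so interpolation
    only runs strictly in between.
    """
    if not 0.0 < weight < 1.0:
        return None  # trivial: nothing to blend
    return {
        name: (1.0 - weight) * base_weights[name] + weight * tuned_weights[name]
        for name in base_weights
    }
```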


232-241: Broad exception handling is acceptable here.

While static analysis warns about the broad Exception catch, this is appropriate for a top-level CLI script where you want to provide user-friendly error messages for any failure. The error handling correctly reports duration and provides helpful troubleshooting tips.
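A top-level handler of the kind being approved might look like this; the messages and exit codes are illustrative, not the script's actual text:

```python
import time


def run_with_reporting(train_fn) -> int:
    """Top-level CLI guard: catch any failure, report elapsed time,
    and offer a troubleshooting hint instead of a bare traceback."""
    start = time.time()
    try:
        train_fn()
        return 0
    except Exception as exc:  # broad on purpose at the CLI boundary
        elapsed = time.time() - start
        print(f"Training failed after {elapsed:.1f}s: {exc}")
        print("Tip: check GPU memory and nproc_per_node settings.")
        return 1
```

Catching bare Exception inside library code would hide bugs, but at the outermost CLI layer the alternative is an unformatted traceback, so the trade-off usually favors the friendly message plus a nonzero exit code.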

@Maxusmusti Maxusmusti merged commit ce5903a into Red-Hat-AI-Innovation-Team:main Nov 5, 2025
4 checks passed
@mtake mtake deleted the granite-training-examples branch November 6, 2025 02:43