Conversation

@mtake mtake commented Nov 4, 2025

This PR adds an SFT example for the Granite 4.0 models. It also applies minor non-functional updates to the existing SFT/OSFT examples for Granite 3.3. No OSFT example is added for Granite 4.0 because Granite-4.0-H-Small (32B total / 9B active) is too large for single-node training, even with CPU offloading enabled.

Summary by CodeRabbit

  • New Features

    • Added a Granite 4.0 SFT example script for single-node multi‑GPU training with CLI-configurable params, multiple Granite 4.0 presets, checkpoint discovery, post‑training reporting, and optional model interpolation.
  • Documentation

    • Updated examples README to reference the new Granite 4.0 SFT script.
  • Refactor

    • Introduced example_* defaults (including example_min_nproc_per_node) and switched to local data paths; unified CLI defaults to those example values.
  • Behavior

    • Relaxed nproc validation to emit informational hints and commented out distributed-launch options for simplified single-node usage.

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds a new SFT example for Granite 4.0, updates the examples README, and refactors existing Granite example scripts to use module-level example_* defaults, relax nproc validation, switch data output to local paths, and simplify launcher/distributed options for single-node multi-GPU runs.

Changes

Cohort / File(s) Summary
Documentation
examples/README.md
Added entry for "SFT with Granite 4.0" under SFT Scripts.
New Granite 4.0 SFT example
examples/scripts/sft_granite4_example.py
New single-node multi-GPU SFT script: multiple Granite 4.0 presets, CLI argument parsing, checkpoint discovery (find_most_recent_checkpoint), training orchestration (main -> sft()), post-training reporting, optional model interpolation, and several new public example configuration variables.
OSFT example refactor
examples/scripts/osft_granite_example.py
Introduced example_* module defaults (including example_min_nproc_per_node), switched CLI defaults to use these variables, changed data_output_dir to a local path (RAM-disk option commented), relaxed nproc_per_node fatal validation to an informational hint, and commented out distributed launcher parameters.
SFT example refactor
examples/scripts/sft_granite_example.py
Added example_min_nproc_per_node and other example_* defaults, updated CLI defaults to use them, moved data output to local path (RAM-disk commented), and relaxed launcher/distributed parameter usage and nproc validation.
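The example_* module-default pattern that these refactors introduce can be sketched as below. The specific variable names and values here (example_model_path, example_nproc_per_node) are illustrative stand-ins, not the scripts' actual definitions:

```python
import argparse

# Hypothetical module-level defaults mirroring the example_* pattern
# described above; the real scripts define their own names and values.
example_model_path = "ibm-granite/granite-4.0-h-small"
example_nproc_per_node = 8
example_min_nproc_per_node = 2


def build_parser() -> argparse.ArgumentParser:
    """Wire the module-level example_* values in as CLI defaults,
    so users can override any of them from the command line."""
    parser = argparse.ArgumentParser(description="SFT example (sketch)")
    parser.add_argument("--model-path", default=example_model_path)
    parser.add_argument("--nproc-per-node", type=int,
                        default=example_nproc_per_node)
    return parser
```

Running the script with no arguments then uses the example_* values, while any flag supplied on the command line takes precedence.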

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Trainer
    participant Checkpoint
    participant Interpolator

    User->>CLI: run `sft_granite4_example.py --args`
    CLI->>CLI: parse args & apply `example_*` defaults
    CLI->>Trainer: invoke `sft(...)` with config
    Trainer->>Checkpoint: emit checkpoints (hf_format/samples_*)
    Trainer-->>CLI: training finished
    CLI->>Checkpoint: call `find_most_recent_checkpoint(output_dir)`
    Checkpoint-->>CLI: return checkpoint_path
    alt interpolation_weight > 0
        CLI->>Interpolator: `interpolate_models(base, checkpoint, weight)`
        Interpolator-->>CLI: return interpolated model path
    end
    CLI->>User: print duration and result
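The flow in the diagram can be sketched as plain Python. Here sft, find_most_recent_checkpoint, and interpolate_models are passed in as callables standing in for the real functions (which live in training_hub and the example scripts); the orchestration shape is the point, not the signatures:

```python
import time


def run_example(sft, find_most_recent_checkpoint, interpolate_models,
                output_dir, base_model, interpolation_weight=0.0,
                **sft_kwargs):
    """Orchestrate the flow from the sequence diagram above:
    train, locate the newest checkpoint, optionally interpolate."""
    start = time.time()
    sft(ckpt_output_dir=output_dir, **sft_kwargs)        # training run
    checkpoint = find_most_recent_checkpoint(output_dir)  # newest checkpoint
    result = checkpoint
    if 0.0 < interpolation_weight < 1.0:                  # skip trivial weights
        result = interpolate_models(base_model, checkpoint,
                                    interpolation_weight)
    duration = time.time() - start
    print(f"Finished in {duration:.1f}s -> {result}")
    return result
```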

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to:
    • examples/scripts/sft_granite4_example.py (new, sizeable script: CLI semantics, checkpoint discovery, optional interpolation)
    • Consistent propagation and use of example_* defaults across modified example scripts
    • Changes to nproc_per_node handling and commented distributed-launch settings

Possibly related PRs

  • Add granite training example #16 — Adds/extends Granite SFT/OSFT example scripts and interpolation tooling; strong overlap with the new SFT Granite 4.0 example and interpolation usage.

Suggested reviewers

  • Maxusmusti
  • RobotSail

Poem

🐰 I hopped in code with a tiny cheer,
Granite four now trains quite near,
Checkpoints stacked and models blend,
Single-node GPUs race to the end,
Hop—examples ready, shaped and clear!

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed. The title accurately describes the main addition in the changeset: a new SFT example for Granite 4.0 models (sft_granite4_example.py), which is the primary focus of the PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/scripts/sft_granite4_example.py (1)

178-209: Drop the unused result binding

result is never read, so the assignment only triggers lint noise. Calling sft directly keeps the control flow identical while eliminating the warning.

-        result = sft(
+        sft(
             # Model and data
             model_path=args.model_path,
             data_path=args.data_path,
             ckpt_output_dir=args.ckpt_output_dir,
@@
             # Additional parameters to the backend
             **kwargs
         )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23bd7e6 and 6ec3167.

📒 Files selected for processing (4)
  • examples/README.md (1 hunks)
  • examples/scripts/osft_granite_example.py (5 hunks)
  • examples/scripts/sft_granite4_example.py (1 hunks)
  • examples/scripts/sft_granite_example.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/scripts/sft_granite4_example.py (3)
src/training_hub/algorithms/sft.py (1)
  • sft (169-248)
examples/scripts/sft_granite_example.py (2)
  • find_most_recent_checkpoint (70-93)
  • main (96-207)
examples/scripts/interpolator.py (1)
  • interpolate_models (19-78)
🪛 Ruff (0.14.3)
examples/scripts/sft_granite4_example.py
  • 1-1: Shebang is present but file is not executable (EXE001)
  • 119-119: Avoid specifying long messages outside the exception class (TRY003)
  • 178-178: Local variable result is assigned to but never used; remove the assignment (F841)
  • 232-232: Do not catch blind exception: Exception (BLE001)

🔇 Additional comments (1)
examples/README.md (1)

31-31: Granite 4.0 entry is clear

Thanks for surfacing the new Granite 4.0 SFT example so users can find it quickly.

@Maxusmusti Maxusmusti left a comment

Just one small nit, but otherwise looks good!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/scripts/sft_granite4_example.py (2)

93-96: Consider moving timestamp generation into main().

The timestamp is captured at module import time rather than when main() executes, which could cause confusion if the module is imported but training is started later.

Move the timestamp and data_output_dir generation into main():

-timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-full_experiment_name = f"{experiment_name}_{timestamp}"
-
-data_output_dir=f"data/{full_experiment_name}"  # Directory for processed data
-# data_output_dir=f"/dev/shm/data/{full_experiment_name}"  # Directory for processed data (RAM disk for speed)

Then in main() after parsing args:

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    full_experiment_name = f"{experiment_name}_{timestamp}"
    data_output_dir = f"data/{full_experiment_name}"

100-124: Consider extracting to a shared utility module.

This function is duplicated from osft_continual_learning_example.py. To follow DRY principles, consider extracting it to a shared utilities module that both scripts can import.
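One way to realize this suggestion is a hypothetical shared module (say, examples/scripts/checkpoint_utils.py) that both scripts import. The sketch below assumes the checkpoint layout visible elsewhere in this review (samples_* directories under <output_dir>/hf_format); it mirrors, but does not reproduce, the duplicated helper:

```python
import os


def find_most_recent_checkpoint(output_dir: str) -> str:
    """Return the newest samples_* checkpoint under output_dir/hf_format.

    Layout assumption: the trainer writes checkpoints as
    <output_dir>/hf_format/samples_<N>; we pick the most recently
    modified one.
    """
    hf_dir = os.path.join(output_dir, "hf_format")
    candidates = [
        os.path.join(hf_dir, name)
        for name in os.listdir(hf_dir)
        if name.startswith("samples_")
    ]
    if not candidates:
        raise FileNotFoundError(f"no checkpoints found under {hf_dir}")
    return max(candidates, key=os.path.getmtime)
```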

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4719491 and a0b472b.

📒 Files selected for processing (1)
  • examples/scripts/sft_granite4_example.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/scripts/sft_granite4_example.py (2)
src/training_hub/algorithms/sft.py (1)
  • sft (169-248)
examples/scripts/interpolator.py (1)
  • interpolate_models (19-78)
🪛 Ruff (0.14.3)
examples/scripts/sft_granite4_example.py
  • 1-1: Shebang is present but file is not executable (EXE001)
  • 119-119: Avoid specifying long messages outside the exception class (TRY003)
  • 232-232: Do not catch blind exception: Exception (BLE001)

🔇 Additional comments (4)
examples/scripts/sft_granite4_example.py (4)

127-172: LGTM!

The CLI argument parsing and configuration validation are well-structured. The soft validation approach (printing tips instead of hard enforcement) gives users flexibility while providing helpful guidance.
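The soft-validation approach being praised here can be sketched as follows; the threshold name and message wording are illustrative, not the script's exact output:

```python
def check_nproc(nproc_per_node: int, min_nproc: int = 2) -> bool:
    """Soft validation: print a hint instead of raising when the GPU
    count looks too low for the chosen model, and let the run proceed."""
    if nproc_per_node < min_nproc:
        print(f"Tip: {nproc_per_node} process(es) may be too few; "
              f"at least {min_nproc} are recommended for this model.")
        return False
    return True
```

The caller ignores the return value and continues either way; the boolean exists only so the behavior is easy to test.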


194-194: Good choice for checkpoint strategy.

Setting save_samples=0 disables sample-based checkpointing and relies on epoch-based checkpoints only, which addresses the storage concern mentioned in previous reviews for large models like Granite 4 Small.


211-231: LGTM!

The post-training reporting and conditional interpolation logic are well-implemented. The condition on line 223 correctly excludes edge cases where interpolation would be trivial (weights of 0.0 or 1.0).
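The guard on trivial weights amounts to the check sketched below. This is a toy linear blend over weight dicts, not the real interpolate_models from examples/scripts/interpolator.py; only the `0.0 < weight < 1.0` condition is taken from the review:

```python
def maybe_interpolate(base_weights, tuned_weights, weight: float):
    """Linearly blend two weight dicts, skipping the trivial endpoints.

    weight == 0.0 would return the base model unchanged and
    weight == 1.0 the tuned checkpoint unchanged, so interpolation
    only runs strictly in between.
    """
    if not 0.0 < weight < 1.0:
        return None  # trivial: nothing to blend
    return {
        name: (1.0 - weight) * base_weights[name] + weight * tuned_weights[name]
        for name in base_weights
    }
```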


232-241: Broad exception handling is acceptable here.

While static analysis warns about the broad Exception catch, this is appropriate for a top-level CLI script where you want to provide user-friendly error messages for any failure. The error handling correctly reports duration and provides helpful troubleshooting tips.
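A top-level handler of the kind being approved might look like this; the messages and exit codes are illustrative, not the script's actual text:

```python
import time


def run_with_reporting(train_fn) -> int:
    """Top-level CLI guard: catch any failure, report elapsed time,
    and offer a troubleshooting hint instead of a bare traceback."""
    start = time.time()
    try:
        train_fn()
        return 0
    except Exception as exc:  # broad on purpose at the CLI boundary
        elapsed = time.time() - start
        print(f"Training failed after {elapsed:.1f}s: {exc}")
        print("Tip: check GPU memory and nproc_per_node settings.")
        return 1
```

Catching bare Exception inside library code would hide bugs, but at the outermost CLI layer the alternative is an unformatted traceback, so the trade-off usually favors the friendly message plus a nonzero exit code.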

@Maxusmusti Maxusmusti merged commit ce5903a into Red-Hat-AI-Innovation-Team:main Nov 5, 2025
4 checks passed
@mtake mtake deleted the granite-training-examples branch November 6, 2025 02:43