
Conversation

@cdoern cdoern commented May 27, 2025

Introduce a new design for key components of main_ds.py, splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving into classes. This commit introduces the Model class.

NOTE: a follow-up to this work will introduce classes/structure for the DataLoader, Sampler, etc. These were left out of this PR given the already large scope of change.

The Model class wraps the various AutoModel classes we support and aims to be a lightweight wrapper that makes the library easier to use with different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.

These classes are one of a few steps needed to "SDK-ify" the training library.

Adding structure to code via classes can be either someone's favorite or least favorite thing, so I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code, especially code meant to be a publicly consumable SDK, and allow you to associate related objects and methods with one another.

Grouping functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing attributes like self.distributed_framework and self.lora_config so that they are accessible from every method of the class drastically reduces the number of arguments each method takes, as well as the complexity of return values. Simpler methods, arguments, and return values make the code easier to test.

class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"
    DOLOMITE = "Dolomite"
@RobotSail (Member) commented:

We've dropped dolomite, no need to include this.

Contributor:

@RobotSail Interesting! What does it mean exactly? If I grep through the code, I still see hits for dolomite, including the mandatory dependency on instructlab-dolomite. Was some decision made to drop it? Should we clean these remnants from the tree then?

Collaborator:

Being worked on in #589

mergify bot commented May 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@booxter booxter left a comment

I haven't reviewed tests or Accelerator class in detail. I need to step off this PR. Posting questions and concerns I have collected so far.

parser.add_argument(
    "--model-class",
    type=str,
    default=ModelTypes.CAUSALLM.value,
Contributor:

nit: you can use choices=[x.value for x in ModelTypes] to avoid listing them below
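The suggestion above can be sketched as follows; the ModelTypes enum mirrors the one in the diff, and the parse_args call at the end exists only to demonstrate the validation:

```python
import argparse
from enum import Enum

class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-class",
    type=str,
    default=ModelTypes.CAUSALLM.value,
    # derive the valid values from the enum instead of listing them by hand
    choices=[m.value for m in ModelTypes],
)

args = parser.parse_args(["--model-class", "Liger"])
print(args.model_class)  # Liger
```

With this shape, adding a new model type to the enum automatically makes it a valid CLI value.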

    sharding_strategy: ShardingStrategies = ShardingStrategies.HYBRID_SHARD


class Optimizers(Enum):
Contributor:

(No action required, observation) I think it's more common to name enums in the singular, not the plural. But it's a matter of habit, of course.

Contributor Author:

changed to singular

    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
local_rank = int(os.getenv("LOCAL_RANK", "0"))
Contributor:

(No action required) I know it was done in main_ds, so you are not introducing anything new here, but consider not running code or issuing warnings when importing the module. An import should not, generally, produce side effects of this sort, especially in a library. Consider warning later, when the missing class is actually referenced or used.
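One hedged sketch of the "warn at use time" pattern being suggested; the get_cpu_adam helper name is illustrative, not code from this PR:

```python
# Import-time: swallow the ImportError quietly and record the absence.
try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None

def get_cpu_adam():
    """Fail only when the optimizer is actually requested, not at import."""
    if DeepSpeedCPUAdam is None:
        raise RuntimeError(
            "DeepSpeedCPUAdam requires the optional 'deepspeed' package; "
            "install it to use the CPU Adam optimizer."
        )
    return DeepSpeedCPUAdam
```

Importing the module then stays silent; users who never select the DeepSpeed optimizer never see the warning or error.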

    output_dir: str,
    distributed_framework: DistributedBackend,
    model_type: ModelTypes,
    noise_alpha: Optional[float],
Contributor:

nit: use type | None instead of Optional[type]

)
self.model.config.eos_token_id = self.tokenizer.eos_token_id

if "ForCausalLM" not in self.model.__class__.__name__:
Contributor:

this is fragile; can you think of a more robust way of checking it? If not, maybe the Model class could have a helper method to hide the check?

Contributor Author:

this is inherited from main:

if "ForCausalLM" not in model.__class__.__name__:

I will refactor into a helper and we can investigate a better solution if there is one
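A minimal sketch of that helper; the Model shape here is illustrative, and the stand-in classes only mimic transformers class names:

```python
class Model:
    """Illustrative subset of the PR's Model class."""

    def __init__(self, model):
        self.model = model

    def is_causal_lm(self) -> bool:
        # Same class-name heuristic as before, but hidden behind one method
        # so a more robust check can later be swapped in at a single site.
        return "ForCausalLM" in type(self.model).__name__

class LlamaForCausalLM:  # stand-in for a real transformers class
    pass

class T5ForConditionalGeneration:  # stand-in; not a causal LM
    pass

print(Model(LlamaForCausalLM()).is_causal_lm())            # True
print(Model(T5ForConditionalGeneration()).is_causal_lm())  # False
```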

from .utils import add_noisy_embeddings, convert_loss_to_reduce_sum

self.model = convert_loss_to_reduce_sum(
    self.model, use_dolomite=(self.model_type == "dolomite")
Contributor:

incorrect enum == str check: an Enum member never compares equal to its value string

Contributor Author:

fixed with the child classes I created, I think
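The bug this exchange points at is easy to reproduce in isolation; an Enum member never compares equal to its value string:

```python
from enum import Enum

class ModelTypes(Enum):
    DOLOMITE = "Dolomite"

model_type = ModelTypes.DOLOMITE

print(model_type == "dolomite")           # False: Enum vs str is never equal
print(model_type == "Dolomite")           # still False, even with matching case
print(model_type == ModelTypes.DOLOMITE)  # True: compare against the member
print(model_type.value == "Dolomite")     # True: or compare .value explicitly
```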

"""Check if a GPU supports FlashAttention."""
major, minor = torch.cuda.get_device_capability(device_id)
# Check if the GPU architecture is Ampere (SM 8.x) or newer (SM 9.0)
is_sm8x = major == 8 and minor >= 0
Contributor:

(No action required) Could be:

if ...:
    return True
if ...:
    return True
if ...:
    return True
return False
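Applied to the capability check quoted above, the early-return shape might look like this; the function name and the plain-int signature are illustrative, and the SM thresholds are the ones named in the diff's comment (Ampere SM 8.x, SM 9.0):

```python
def supports_flash_attention(major: int, minor: int) -> bool:
    """True when the device capability is Ampere (SM 8.x) or SM 9.0."""
    if major == 8:                 # Ampere, any minor revision
        return True
    if major == 9 and minor == 0:  # SM 9.0 (Hopper)
        return True
    return False

print(supports_flash_attention(8, 6))  # True
print(supports_flash_attention(7, 5))  # False
```

In the real code the (major, minor) pair would come from torch.cuda.get_device_capability(device_id), as in the excerpt.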

cdoern commented May 30, 2025

@booxter thanks for the review. I actually meant to remove Accelerator in this PR, which is why the class confusingly goes unused. I intend to introduce it in a 2/n PR, just for clarity.

As for most of the other comments, a lot of them are inherited from the existing code or are missteps from splitting out my mega PR (I forgot to take my changes from utils.py, for example). Will take another pass here. Thanks!

mergify bot commented Jun 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions bot commented Jun 3, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented Jun 3, 2025

e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented Jun 3, 2025

e2e workflow succeeded on this PR: View run, congrats!

@booxter booxter left a comment

The bnb (bitsandbytes) question should be addressed before merging. Do we need it? Is it OK to drop it here?

base_model_args = {
    "pretrained_model_name_or_path": args.model_name_or_path,
    "torch_dtype": torch.bfloat16,
    "quantization_config": bnb_config,
Contributor:

Do you have an answer to this? Should the drop be included here?


self.reconcile_tokenizer()
if self.lora_config:
    # First Party
Contributor:

bump

@cdoern cdoern requested a review from booxter June 4, 2025 12:51
github-actions bot commented Jun 4, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

cdoern commented Jun 4, 2025

I changed model.parameters to a property, so I need to remove the .parameters() refs. Tests should pass now.

@booxter booxter left a comment

This looks reasonable. It's hard to review a large patch line by line through multiple iterations, so this follow-up review focused on the high-level question of whether my prior feedback was addressed. I think it was (bnb restored; logging module used; duplicate functions cleaned up; Accelerator class removed; etc.)

github-actions bot commented Jun 4, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

cdoern commented Jun 4, 2025

model.parameters cannot be a property because accelerate expects it to be a method: https://github.com/instructlab/training/actions/runs/15443149090/job/43465937222
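A toy reproduction of why the property breaks Accelerate, which (like most torch code) invokes model.parameters() as a call; both classes below are illustrative stand-ins, not code from this PR:

```python
class PropModel:
    @property
    def parameters(self):           # property returns the iterator directly
        return iter([1, 2, 3])

class MethodModel:
    def parameters(self):           # torch.nn.Module-style method
        return iter([1, 2, 3])

try:
    PropModel().parameters()        # callers write model.parameters()
except TypeError as exc:
    # the property already returned an iterator, which is not callable
    print("property breaks callers:", exc)

print(list(MethodModel().parameters()))  # [1, 2, 3]
```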

github-actions bot commented Jun 4, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

    lora_config: Optional[LoraConfig] = None,
    lora_quant_bits: int = 0,
):
    self.lora_config = lora_config
Collaborator:

I think lora_config should not be put inside the Model class; LoRA should act as a wrapper around our model. We can deliberate this in a future issue/PR.

@cdoern cdoern added the hold label Jun 4, 2025
cdoern commented Jun 4, 2025

Holding for the L40S test to pass.

@fynnsu fynnsu left a comment

I support moving quickly with these PRs so that we can start to refine the final shape of the new SDK-style codebase.

This is reasonable for now, pending future PRs to update the other components.

github-actions bot commented Jun 4, 2025

e2e workflow succeeded on this PR: View run, congrats!

@cdoern cdoern removed the hold label Jun 4, 2025
@mergify mergify bot merged commit e78908c into instructlab:main Jun 4, 2025
18 checks passed
@JamesKunstle JamesKunstle mentioned this pull request Jun 4, 2025
Labels: testing (Relates to testing)
5 participants