fix(torchrun): Omit empty arguments and correct nproc_per_node type #661

szaher · 2025-10-03T12:10:06Z

The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, so torchrun can correctly inherit its configuration. An exception is made for integer arguments, where 0 is a valid value.

Additionally, the nproc_per_node argument type has been changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88

TorchrunArgs(
    nproc_per_node=4,
    nnodes=2,
    node_rank=0,
    rdzv_id="abc",
    master_addr="localhost",
    master_port=29500
)  # ✅ OK

TorchrunArgs(
    nproc_per_node="gpu",
    nnodes=2,
    node_rank=1,
    rdzv_id=123,
    rdzv_endpoint="localhost:1234"
)  # ✅ OK

TorchrunArgs(
    nproc_per_node=4,
    nnodes=2,
    node_rank=0,
    rdzv_id="xyz"
)  # ❌ Raises ValueError
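
The three examples above suggest a validation rule: construction succeeds when either `master_addr` or `rdzv_endpoint` is supplied and fails when neither is. A minimal sketch of such a check follows, assuming a plain dataclass rather than the project's pydantic model. The class name `RendezvousSketch` and the error message are hypothetical, and the commits listed later in this PR indicate the final behaviour moved to a standalone fallback instead of an error.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class RendezvousSketch:
    """Hypothetical stand-in for TorchrunArgs; not the project's model."""

    nproc_per_node: Union[int, str]
    nnodes: int
    node_rank: int
    rdzv_id: Union[int, str]
    master_addr: Optional[str] = None
    master_port: Optional[int] = None
    rdzv_endpoint: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed rule: at least one way of locating the rendezvous host
        # must be given, otherwise construction fails.
        if not self.master_addr and not self.rdzv_endpoint:
            raise ValueError("provide either master_addr or rdzv_endpoint")
```

With this sketch, the first two argument sets from the description construct normally and the third raises `ValueError`.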

RobotSail · 2025-10-03T12:32:24Z

src/instructlab/training/config.py

    """

-    nproc_per_node: int
+    nproc_per_node: str


Did you mean to make this change?

RobotSail · 2025-10-03T12:33:13Z

src/instructlab/training/main_ds.py

+    # build args for this file. Ignore empty or unset values except int values
+    for key, value in train_args.model_dump(exclude_none=True).items():
+        # avoid ignoring int attrs with value = 0
+        if not isinstance(value, int) and (not value or value == ""):


How would this handle booleans?

I have updated this one to only check for string types.
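
The boolean question matters because `bool` is a subclass of `int` in Python, so the `isinstance(value, int)` guard in the diff also exempts `True` and `False`, while other falsy non-integer values (such as an empty list) are dropped. The sketch below contrasts that condition with a string-only check of the kind the reply describes. Both function names are hypothetical, and the string-only version is an assumption about the updated code, not a copy of it.

```python
def skipped_by_original_check(value: object) -> bool:
    """Condition from the diff: skip any falsy value that is not an int."""
    return not isinstance(value, int) and (not value or value == "")


def skipped_by_string_check(value: object) -> bool:
    """Assumed replacement: skip only empty strings."""
    return isinstance(value, str) and value == ""


# bool is a subclass of int, so False passes the int guard and is kept.
samples = {"node_rank": 0, "some_flag": False, "rdzv_endpoint": "", "extras": []}
original = {k: skipped_by_original_check(v) for k, v in samples.items()}
string_only = {k: skipped_by_string_check(v) for k, v in samples.items()}
```

Under the original condition `False` would be emitted as a `--key=False` argument rather than skipped. The string-only check leaves every non-string value alone, which makes the behaviour for booleans and integers explicit.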

RobotSail · 2025-10-03T12:36:34Z

src/instructlab/training/main_ds.py

+        # avoid ignoring int attrs with value = 0
+        if not isinstance(value, int) and (not value or value == ""):
+            continue
+        command.append(f"--{key}={value}")


Have you verified that all of our CLI arguments are perfectly 1:1 with the variable names we're using here?

I have updated this one to only process torchrun args and leave the scripts args as they're not perfectly 1:1 mapped.
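
As a rough illustration of restricting the dynamic step to torchrun arguments, the sketch below turns a dictionary of torchrun settings into command-line flags, skipping unset values and converting underscores to hyphens (one of the later commits in this PR is titled "replace _ with - when passing torchrun args"). The function name `build_torchrun_flags` and the exact skip rule are assumptions for illustration, not the code that was merged.

```python
def build_torchrun_flags(torchrun_args: dict) -> list[str]:
    """Build a torchrun command prefix from a dict of settings (hypothetical helper)."""
    command = ["torchrun"]
    for key, value in torchrun_args.items():
        # Skip unset values so torchrun can fall back to its environment
        # variables; integers such as 0 are kept because they are valid.
        if value is None or (isinstance(value, str) and value == ""):
            continue
        # Field names use underscores; the emitted flags use hyphens.
        command.append(f"--{key.replace('_', '-')}={value}")
    return command


flags = build_torchrun_flags(
    {"nproc_per_node": "gpu", "nnodes": 2, "node_rank": 0, "rdzv_endpoint": ""}
)
```

The training script's own arguments would still be appended by hand after this prefix, since their names do not map 1:1 onto the config fields.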

RobotSail

Thank you for the PR, @szaher. I think the changes here are reasonable; I had a few questions about the implementation.

RobotSail · 2025-10-03T16:14:40Z

src/instructlab/training/config.py

+    # this will tell the model construct to ignore
+    # extra arguments that aren't part of this model
+    class Config:
+        extra = "ignore"


@szaher Do you know when this would be the case? If our goal here is to dynamically build the torchrun command using the defined interface, this seems like it now opens the floor up for users to pass invalid arguments through torchrun. This means that any incorrect interface usage wouldn't be detected until runtime.

In fact, this drops any additionally provided arguments and keeps only the torchrun ones:

    torchrun_defaults = {
        'nnodes': 1,
        'node_rank': 0,
        'rdzv_id': 0,
        'rdzv_endpoint': '',
        'nproc_per_node': 2,
        "fake_arg": "what",
    }
    y = TorchrunArgs(**torchrun_defaults)
    print(y)
    # prints: TorchrunArgs(nproc_per_node=2, nnodes=1, node_rank=0, rdzv_id=0, rdzv_endpoint='')

I see, that's fine then.

RobotSail

Thanks for the PR @szaher , LGTM!

RobotSail

A few comments but otherwise great work! LGTM !!


RobotSail · 2025-10-14T03:27:45Z

@szaher Looks like you will also need to rebase this PR.


RobotSail · 2025-10-14T15:01:19Z

LGTM, will merge once tests pass.


mergify bot added the ci-failure label Oct 3, 2025

RobotSail reviewed Oct 3, 2025


szaher mentioned this pull request Oct 3, 2025

feat(traininghub): Use torchrun environment variables for default configuration Red-Hat-AI-Innovation-Team/training_hub#13

Merged

mergify bot added the testing label and removed the ci-failure label Oct 3, 2025

RobotSail reviewed Oct 3, 2025


mergify bot added the ci-failure label Oct 13, 2025

RobotSail approved these changes Oct 13, 2025


mergify bot added the one-approval label and removed the ci-failure label Oct 13, 2025

RobotSail mentioned this pull request Oct 14, 2025

adds hierarchical priority, handles edge cases, surface warnings and … szaher/training_hub#1

Merged

RobotSail approved these changes Oct 14, 2025


mergify bot added the ci-failure and dependencies labels and removed the ci-failure label Oct 14, 2025

szaher added 9 commits October 14, 2025 10:49

only dynamically add torchrun args & change rdzv_id type to str

4ea5d4f

Signed-off-by: Saad Zaher <[email protected]>

fix smoke tests

38fa614

Signed-off-by: Saad Zaher <[email protected]>

Enable both dtypes str, int for nproc_per_node, rdzv_id

f70eb32

Signed-off-by: Saad Zaher <[email protected]>

Use python3.11 style for pydatnic model

782f1b3

Signed-off-by: Saad Zaher <[email protected]>

add all torchrun args and validate them

325ec0d

Signed-off-by: Saad Zaher <[email protected]>

Remove non-required dependencies

e512996

Signed-off-by: Saad Zaher <[email protected]>

update datatypes only

d501cce

Signed-off-by: Saad Zaher <[email protected]>

replace _ with - when passing torchrun args

f2b48c0

Signed-off-by: Saad Zaher <[email protected]>

szaher and others added 9 commits October 14, 2025 10:50

make nproc_per_node to only accept gpu or int

e44b797

Signed-off-by: Saad Zaher <[email protected]>

add master_{addr, port} validate args

3a17017

Signed-off-by: Saad Zaher <[email protected]>

check for not set or empty rdzv endpoint

22a2dc5

Signed-off-by: Saad Zaher <[email protected]>

fix formatting error

1ebce99

Signed-off-by: Saad Zaher <[email protected]>

Update src/instructlab/training/config.py

bfcf485

Signed-off-by: Saad Zaher <[email protected]>

Update tests/smoke/test_train.py

3dfe98d

Signed-off-by: Saad Zaher <[email protected]>

Update src/instructlab/training/main_ds.py

36e8f42

Signed-off-by: Saad Zaher <[email protected]>

fixes indentation

6cfdf0b

Signed-off-by: Oleg Silkin <[email protected]>

formatting

27ff594

RobotSail force-pushed the pytorch-env-vars branch 3 times, most recently from ffff971 to 27ff594 Compare October 14, 2025 15:00

RobotSail and others added 2 commits October 14, 2025 13:13

add standalone as the fallback when neither master_addr nor rdzv_endp…

8588028

…oint are provided Signed-off-by: Oleg Silkin <[email protected]>

clarify rdzv-backend arg

f0825d8

RobotSail merged commit 637afae into instructlab:main Oct 14, 2025
17 checks passed
