Contributor

@szaher szaher commented Oct 3, 2025

The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is made for integer arguments where 0 is a valid value.

Additionally, the nproc_per_node argument type has been changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88

TorchrunArgs(
    nproc_per_node=4,
    nnodes=2,
    node_rank=0,
    rdzv_id="abc",
    master_addr="localhost",
    master_port=29500
)  # ✅ OK

TorchrunArgs(
    nproc_per_node="gpu",
    nnodes=2,
    node_rank=1,
    rdzv_id=123,
    rdzv_endpoint="localhost:1234"
)  # ✅ OK

TorchrunArgs(
    nproc_per_node=4,
    nnodes=2,
    node_rank=0,
    rdzv_id="xyz"
)  # ❌ Raises ValueError
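
A minimal sketch of the command-building behavior described in the summary above, assuming the torchrun arguments are available as a plain dictionary. The helper name `build_torchrun_command`, the dict input, and the `train.py` script name are hypothetical and used only for illustration; this is not the code from the PR.

```python
from typing import Any


def build_torchrun_command(torchrun_args: dict[str, Any], script: str) -> list[str]:
    """Build a torchrun command line, skipping values that are not set."""
    command = ["torchrun"]
    for key, value in torchrun_args.items():
        # None and empty strings are treated as "not set", so no flag is
        # emitted and torchrun can fall back to its environment variables.
        if value is None or (isinstance(value, str) and value == ""):
            continue
        # Integers are always kept: 0 is a valid value (e.g. node_rank=0).
        command.append(f"--{key}={value}")
    command.append(script)
    return command


args = {
    "nproc_per_node": "gpu",  # special string value accepted by torchrun
    "nnodes": 2,
    "node_rank": 0,
    "rdzv_id": 123,
    "rdzv_endpoint": "",      # empty: omitted from the command
    "master_addr": None,      # unset: omitted from the command
}
cmd = build_torchrun_command(args, "train.py")
print(cmd)
```

With these inputs, `--node_rank=0` stays in the command while `--rdzv_endpoint` and `--master_addr` are left out entirely.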

@mergify mergify bot added the ci-failure label Oct 3, 2025
     """

-    nproc_per_node: int
+    nproc_per_node: str
Member

Did you mean to make this change?

Contributor Author

yes

# build args for this file. Ignore empty or unset values except int values
for key, value in train_args.model_dump(exclude_none=True).items():
    # avoid ignoring int attrs with value = 0
    if not isinstance(value, int) and (not value or value == ""):
Member

How would this handle booleans?

Contributor Author

I have updated this one to only check for string types.
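
To illustrate the boolean question, here is a small sketch comparing the condition from the reviewed hunk with a string-only check. The function names are hypothetical, and `skip_strings_only` is an assumption about what a string-only check could look like, since the updated code is not shown in this thread. In Python, `bool` is a subclass of `int`, so the original condition never skips `True` or `False`, but it does skip other falsy non-int values such as `0.0` or an empty list.

```python
def skip_original(value):
    # Condition as it appears in the reviewed diff hunk.
    return not isinstance(value, int) and (not value or value == "")


def skip_strings_only(value):
    # Hypothetical string-only variant: only empty strings are skipped.
    return isinstance(value, str) and value == ""


samples = [0, False, True, "", "gpu", [], 0.0]
for value in samples:
    # Print each sample with the decision of both predicates (True = skipped).
    print(repr(value), skip_original(value), skip_strings_only(value))
```

Under the string-only variant, every non-string value is passed through unchanged, so `False` would still be rendered as `--flag=False` unless booleans are handled separately.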

    # avoid ignoring int attrs with value = 0
    if not isinstance(value, int) and (not value or value == ""):
        continue
    command.append(f"--{key}={value}")
Member

Have you verified that all of our CLI arguments are perfectly 1:1 with the variable names we're using here?

Contributor Author

I have updated this one to only process the torchrun args and leave the script args alone, as they're not mapped perfectly 1:1.

Member

@RobotSail RobotSail left a comment

Thank you for the PR, @szaher. I think the changes here are reasonable; I had a few questions about the implementation.

# this will tell the model construct to ignore
# extra arguments that aren't part of this model
class Config:
    extra = "ignore"
Member

@szaher Do you know when this would be the case? If our goal here is to dynamically build the torchrun command using the defined interface, this seems like it now opens the floor up for users to pass invalid arguments through torchrun. This means that any incorrect interface usage wouldn't be detected until runtime.

Contributor Author

@szaher szaher Oct 3, 2025

In fact, this drops any additional arguments that are provided and keeps only the torchrun ones:

torchrun_defaults = {
'nnodes': 1, 'node_rank': 0, 'rdzv_id': 0, 'rdzv_endpoint': '', 
'nproc_per_node': 2, "fake_arg": "what"
}
y = TorchrunArgs(**torchrun_defaults)
print(y)
TorchrunArgs(nproc_per_node=2, nnodes=1, node_rank=0, rdzv_id=0, rdzv_endpoint='')

Member

I see, that's fine then.

@mergify mergify bot added the ci-failure label Oct 13, 2025
Member

@RobotSail RobotSail left a comment

Thanks for the PR @szaher , LGTM!

Member

@RobotSail RobotSail left a comment

A few comments but otherwise great work! LGTM !!

@RobotSail
Member

@szaher Looks like you will also need to rebase this PR.

@mergify mergify bot added ci-failure dependencies and removed ci-failure labels Oct 14, 2025
Signed-off-by: Saad Zaher <[email protected]>
@RobotSail RobotSail force-pushed the pytorch-env-vars branch 3 times, most recently from ffff971 to 27ff594 Compare October 14, 2025 15:00
@RobotSail
Member

LGTM, will merge once tests pass.

@RobotSail RobotSail merged commit 637afae into instructlab:main Oct 14, 2025
17 checks passed