feat(runtimes): add support for launcher resource allocation in MPI jobs #2653

Merged
google-oss-prow[bot] merged 8 commits into kubeflow:master from jskswamy:fix-resource-allocation on Oct 20, 2025

@jskswamy
Contributor

The Trainer method has been updated to apply resources appropriately to both the launcher and node containers based on the runLauncherAsNode flag.

Key changes include:

  • Added the isRunLauncherAsNode method to determine if the launcher should be run as a node.
  • Updated the Trainer method to conditionally apply resource configurations to the launcher container based on the runLauncherAsNode value.
  • Enhanced test cases to cover scenarios for resource application to both launcher and node pods based on the MPI policy settings.
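
A minimal sketch of the behaviour described in the list above, using simplified stand-in types rather than the real JobSet apply configurations. The `Resources` and `Container` types and the `applyTrainerResources` function are hypothetical, and the container names `node` and `launcher` are assumed values for `constants.Node` and `constants.Launcher`; only the `runLauncherAsNode` flag itself comes from the PR.

```go
package main

// Resources is a simplified stand-in for corev1.ResourceRequirements.
type Resources struct {
	Limits   map[string]string
	Requests map[string]string
}

// Container is a simplified stand-in for a pod container spec.
type Container struct {
	Name      string
	Resources *Resources
}

const (
	nodeContainer     = "node"     // assumed value of constants.Node
	launcherContainer = "launcher" // assumed value of constants.Launcher
)

// applyTrainerResources sets res on every node container, and on the
// launcher container only when runLauncherAsNode is true. Containers with
// any other name are left untouched.
func applyTrainerResources(containers []Container, res *Resources, runLauncherAsNode bool) {
	// Nothing to apply when neither limits nor requests are set.
	if res == nil || (res.Limits == nil && res.Requests == nil) {
		return
	}
	for i := range containers {
		switch containers[i].Name {
		case nodeContainer:
			containers[i].Resources = res
		case launcherContainer:
			if runLauncherAsNode {
				containers[i].Resources = res
			}
		}
	}
}
```

With `runLauncherAsNode` set to false, only containers named `node` would receive the resources; setting it to true would extend that to the launcher as well.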

Which issue(s) this PR fixes: Fixes #2650

@coveralls

coveralls commented May 30, 2025

Pull Request Test Coverage Report for Build 18651664516

Details

  • 13 of 62 (20.97%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.7%) to 51.477%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/runtime/framework/plugins/jobset/builder.go | 0 | 16 | 0.0% |
| pkg/runtime/runtime.go | 0 | 33 | 0.0% |

Totals (Coverage Status)
  • Change from base Build 18612435311: -0.7%
  • Covered Lines: 1255
  • Relevant Lines: 2438

💛 - Coveralls

@jskswamy jskswamy force-pushed the fix-resource-allocation branch 2 times, most recently from c0d40e8 to 6925f41 Compare June 5, 2025 08:21
Member

@andreyvelich andreyvelich left a comment

Sorry for the late reply @jskswamy!
Could you please rebase your PR so we can take a look?
/assign @tenzen-y @Electronic-Waste @astefanutti

@jskswamy
Contributor Author

I've rebased the changes; kindly have a look.

Member

@andreyvelich andreyvelich left a comment

Thank you for this @jskswamy!
/assign @tenzen-y @astefanutti Appreciate your review!

@andreyvelich
Member

@jskswamy Can you also update the title to align with the conventions, please?

// Update values from the TrainJob trainer.
if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
if image := jobTrainer.Image; image != nil {
b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image
Member

But that will also override the image for the MPI workers, won't it?

Contributor Author

Yes, you are right, this will override the image. I have pushed a fix for it; kindly have a look.

Comment on lines +131 to +139
// Skip if container is neither node nor launcher
if *container.Name != constants.Node && *container.Name != constants.Launcher {
continue
}

// Skip launcher container if runLauncherAsNode is false
if *container.Name == constants.Launcher && !b.isRunLauncherAsNode(info) {
continue
}
Member

@tenzen-y @astefanutti What do you think of this approach?
Should we check whether resources need to be updated in the JobSet plugin or in the MPI plugin?
E.g. we can use the PodSet in the Info object.

Contributor

Yes, it seems like updating the launcher resources in the MPI plugin would be a better separation of concerns.

I think the current PR can be merged as is to fix the current behavior, with a follow-up for a proper design.

@jskswamy jskswamy force-pushed the fix-resource-allocation branch from 0859508 to 9052324 Compare September 2, 2025 12:14
@google-oss-prow google-oss-prow Bot added size/XL and removed size/L labels Sep 2, 2025
@jskswamy jskswamy force-pushed the fix-resource-allocation branch from 9052324 to 1122a5c Compare September 2, 2025 12:22
@jskswamy
Contributor Author

jskswamy commented Sep 2, 2025

> @jskswamy Can you also update the title please to align with conventions ?

I've updated the commit message and the title according to the conventions.

@jskswamy jskswamy changed the title from "Apply resources appropriately to both launcher and node containers" to "feat(runtime): add support for launcher resource allocation in MPI jobs" Sep 2, 2025
@andreyvelich
Member

/milestone v2.1

@google-oss-prow google-oss-prow Bot added this to the v2.1 milestone Sep 24, 2025
// Apply resources to both Node and Launcher containers (when launcher is included)
if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
requirements := corev1ac.ResourceRequirements()
Contributor

How does that impact the resource requirements in the RuntimeInfo, which Kueue relies on, for example?

}

// Update the Parallelism and Completions values for the Trainer Job.
if ancestor, ok := jobMetadata.Labels[constants.LabelTrainJobAncestor]; ok && ancestor == constants.AncestorTrainer {
Contributor

Just to make sure: why not add the trainer.kubeflow.org/trainjob-ancestor-step: trainer label to the node ReplicatedJob as well?

Member

We can't do that since it will also override the container spec using .trainer parameters: #2653 (comment)
However, we should always run OpenSSH server there.

Contributor

Can't that be controlled with the container name?

Member

Currently, we don't have such a capability. We use this label to mark the ReplicatedJob to which we should apply values from the Trainer, Model Initializer, and Dataset Initializer spec. The container name must always be equal to node for the Trainer values to be applied.

@astefanutti Any thoughts on how we can enhance it?
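
The selection rule described in this comment can be sketched roughly as follows. The label key and value are taken from the thread above; the function name `receivesTrainerSpec` and the constant names are hypothetical, and `node` is assumed to be the value of `constants.Node`.

```go
package main

const (
	// Label key and value as quoted earlier in this thread.
	labelTrainJobAncestor = "trainer.kubeflow.org/trainjob-ancestor-step"
	ancestorTrainer       = "trainer"
	// Assumed value of constants.Node.
	nodeContainerName = "node"
)

// receivesTrainerSpec reports whether a container inside a ReplicatedJob
// would receive values from the TrainJob trainer spec: the ReplicatedJob
// must carry the trainer ancestor label and the container must be named node.
func receivesTrainerSpec(replicatedJobLabels map[string]string, containerName string) bool {
	// Reading from a nil map is safe in Go and yields the zero value.
	return replicatedJobLabels[labelTrainJobAncestor] == ancestorTrainer &&
		containerName == nodeContainerName
}
```

Under this rule, labelling the node ReplicatedJob with the trainer ancestor would make its `node` container pick up the image, command, and args as well, which is the side effect mentioned above.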

@andreyvelich
Member

As we discussed on today's call, @astefanutti will help to finalize this PR.
/assign @astefanutti

@astefanutti
Contributor

/retitle feat(runtimes): add support for launcher resource allocation in MPI jobs

@google-oss-prow google-oss-prow Bot changed the title from "feat(runtime): add support for launcher resource allocation in MPI jobs" to "feat(runtimes): add support for launcher resource allocation in MPI jobs" Oct 17, 2025
@google-oss-prow google-oss-prow Bot added the lgtm label Oct 17, 2025
@andreyvelich
Member


if trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumProcPerNode != nil {
info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.
Member

@tenzen-y @astefanutti @Electronic-Waste I automatically set the number of slots for the MPI plugin equal to the number of GPUs if the TrainJob doesn't set NumProcPerNode and the runtime's NumProcPerNode = 1 (which is the default value in our MPI runtimes).

This will help users use the DeepSpeed runtime more easily without modifying numProcPerNode.
Let me know if that sounds good to you.
/assign @tenzen-y @astefanutti @Electronic-Waste

Member

This logic SGTM

Contributor

If the TrainJob doesn't set NumProcPerNode, would it make sense to set it to the number of GPUs if NumProcPerNode < num GPUs?

Member

@astefanutti I would suggest that we always set NumProcPerNode == num GPUs if NumProcPerNode is set to the default value of 1.
If users manually override this value in the Runtime or in the TrainJob, we won't override it.
WDYT @astefanutti?

Contributor

@andreyvelich right, better not override user-defined values.
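
The rule agreed on in this thread can be sketched as a small standalone function. This is an illustration under stated assumptions, not the PR's implementation: `resolveNumProcPerNode` and `gpusPerNode` are hypothetical names, resources are modelled as a plain map of resource name to integer quantity, and treating any resource whose name contains "gpu" as a GPU is an assumption made for this example.

```go
package main

import "strings"

// gpusPerNode sums the quantities of all resources whose name contains
// "gpu" (for example "nvidia.com/gpu"). Hypothetical helper; the matching
// rule is an assumption for this sketch.
func gpusPerNode(resources map[string]int) int {
	total := 0
	for name, qty := range resources {
		if strings.Contains(strings.ToLower(name), "gpu") {
			total += qty
		}
	}
	return total
}

// resolveNumProcPerNode returns the effective number of processes per node:
//   - a value set on the TrainJob always wins;
//   - otherwise, if the runtime value is the default of 1 and more than one
//     GPU is requested per node, the GPU count is used;
//   - otherwise the runtime value is kept as is.
func resolveNumProcPerNode(trainJobValue *int, runtimeValue int, resources map[string]int) int {
	if trainJobValue != nil {
		return *trainJobValue
	}
	if runtimeValue == 1 {
		if gpus := gpusPerNode(resources); gpus > 1 {
			return gpus
		}
	}
	return runtimeValue
}
```

For a runtime left at the default of 1 and a request of four GPUs per node, this sketch would yield 4, while a runtime explicitly set to 3 would stay at 3.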

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for this. Some comments for you


info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.
} else if *info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode == 1 {
resourcesPerNode := ptr.Deref(torch.ExtractResourcePerNodeFromRuntime(info), corev1.ResourceRequirements{})
Member

It looks odd to reference code from the torch plugin here. Can we move it somewhere shared?

if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil && jobTrainer.ResourcesPerNode != nil {
resourcesPerNode = ptr.Deref(jobTrainer.ResourcesPerNode, corev1.ResourceRequirements{})
}
gpuQ := torch.GetNumGPUPerNode(&resourcesPerNode)
Member

Same as above. We can also combine the assignment and the if clause:

if gpuQ := torch.GetNumGPUPerNode(&resourcesPerNode); gpuQ > 1 {
    info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(gpuQ))
}

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Comment on lines +115 to +140
// TODO (andreyvelich): For MPI we should apply container resources to the Node ReplicatedJob also.
// Eventually, we should find better way to propagate resources from TrainJob to JobSet.
if b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
	for j, container := range rJob.Template.Spec.Template.Spec.Containers {
		if *container.Name == constants.Node {
			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
				if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
					(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
					requirements := corev1ac.ResourceRequirements()
					if limits := resourcesPerNode.Limits; limits != nil {
						requirements.WithLimits(limits)
					}
					if requests := resourcesPerNode.Requests; requests != nil {
						requirements.WithRequests(requests)
					}
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
						WithResources(requirements)
				}
				apply.UpsertEnvVars(
					&b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Env,
					apply.EnvVars(jobTrainer.Env...)...,
				)
			}
		}
	}
}
Contributor

Maybe something like this to avoid duplicating the resources logic:

ancestor := ""

jobMetadata := rJob.Template.ObjectMetaApplyConfiguration
if jobMetadata != nil && jobMetadata.Labels != nil {
	ancestor, _ = jobMetadata.Labels[constants.LabelTrainJobAncestor]
}

if ancestor == constants.AncestorTrainer {
	// TODO: Support multiple replicas ('.template.spec.replicatedJobs[*].replicas') for replicated Jobs.
	// REF: https://github.com/kubeflow/trainer/issues/2318
	b.Spec.ReplicatedJobs[i].Replicas = ptr.To[int32](1)
	// Update the Parallelism and Completions values for the Trainer Job.
	b.Spec.ReplicatedJobs[i].Template.Spec.Parallelism = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
	b.Spec.ReplicatedJobs[i].Template.Spec.Completions = info.FindPodSetByAncestor(constants.AncestorTrainer).Count

	// Update values for the Trainer container.
	for j, container := range rJob.Template.Spec.Template.Spec.Containers {
		if *container.Name == constants.Node {
			// Update values from the TrainJob trainer.
			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
				if image := jobTrainer.Image; image != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image
				}
				if command := jobTrainer.Command; command != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Command = command
				}
				if args := jobTrainer.Args; args != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Args = args
				}
				apply.UpsertEnvVars(
					&b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Env,
					apply.EnvVars(jobTrainer.Env...)...,
				)
			}
		}
	}
}

// Apply trainer configuration to node containers.
if ancestor == constants.AncestorTrainer ||
	(b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node) {
	for j, container := range rJob.Template.Spec.Template.Spec.Containers {
		if *container.Name == constants.Node {
			// Update values from the TrainJob trainer.
			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
				if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
					(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
					requirements := corev1ac.ResourceRequirements()
					if limits := resourcesPerNode.Limits; limits != nil {
						requirements.WithLimits(limits)
					}
					if requests := resourcesPerNode.Requests; requests != nil {
						requirements.WithRequests(requests)
					}
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
						WithResources(requirements)
				}
			}
		}
	}
}

Member

Looks good!

b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
WithResources(requirements)
}
apply.UpsertEnvVars(
Contributor

Do we want to propagate the environment variables as well?

Member

@andreyvelich andreyvelich Oct 19, 2025

I was thinking that env should be propagated as well, for now.
We need to investigate whether mpirun can read env from the Worker nodes.
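
For readers unfamiliar with the upsert step being discussed, here is a rough sketch of what propagating the trainer environment into a container could look like. It assumes "upsert" means replace-by-name-or-append, which this thread does not confirm; `EnvVar` is a simplified stand-in for `corev1.EnvVar`, and `upsertEnvVars` is a hypothetical helper, not the `apply.UpsertEnvVars` function from the repository.

```go
package main

// EnvVar is a simplified stand-in for corev1.EnvVar.
type EnvVar struct {
	Name  string
	Value string
}

// upsertEnvVars overwrites entries in *envs that share a name with an entry
// in updates and appends the remaining updates, preserving existing order.
// Assumed semantics for illustration only.
func upsertEnvVars(envs *[]EnvVar, updates ...EnvVar) {
	for _, u := range updates {
		replaced := false
		for i := range *envs {
			if (*envs)[i].Name == u.Name {
				(*envs)[i] = u
				replaced = true
				break
			}
		}
		if !replaced {
			*envs = append(*envs, u)
		}
	}
}
```

Under these assumed semantics, variables already defined in the runtime template would keep their position, and values from the TrainJob would take precedence on name collisions.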


ancestor = jobMetadata.Labels[constants.LabelTrainJobAncestor]
}
if ancestor, ok := jobMetadata.Labels[constants.LabelTrainJobAncestor]; ok && ancestor == constants.AncestorTrainer {
if ancestor == constants.AncestorTrainer {
Member

Need to explore why the unit tests are failing.
/hold

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich force-pushed the fix-resource-allocation branch from 303be89 to 2b83fae Compare October 19, 2025 14:35
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Comment on lines -115 to -117
// Update the Parallelism and Completions values for the Trainer Job.
b.Spec.ReplicatedJobs[i].Template.Spec.Parallelism = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
b.Spec.ReplicatedJobs[i].Template.Spec.Completions = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
Member

We assign the Parallelism and Completions spec when we sync pod sets to the JobSet template in the plugins, so we can remove these lines from the Builder.

If the changes look good, we can move this forward.
/assign @tenzen-y @astefanutti @Electronic-Waste

/hold cancel

Contributor

I think the syncPodSets part will be simplified by #2877.

Comment on lines +693 to +695
WithLabels(map[string]string{
constants.LabelTrainJobAncestor: "invalid",
}).
Contributor

Can this be removed?

},
wantObjs: []apiruntime.Object{
testingutil.MakeJobSetWrapper(metav1.NamespaceDefault, "test-job").
// This is needed to override default label in MakeJobSetWrapper() for Node rJob.
Contributor

Ah, I see why adding the "invalid" label is needed. Could we add a TODO to clean this up in a follow-up?

Member

@astefanutti Added TODO statement.

@astefanutti
Contributor

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Oct 20, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow Bot removed the lgtm label Oct 20, 2025
@andreyvelich
Member

/approve

@astefanutti
Contributor

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Oct 20, 2025
Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for this great work!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [Electronic-Waste,andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 3c062ac into kubeflow:master Oct 20, 2025
36 checks passed
alexxfan pushed a commit to red-hat-data-services/trainer that referenced this pull request Nov 24, 2025
…obs (kubeflow#2653)

* feat(runtime): add support for launcher resource allocation in MPI jobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Set numProcPerNode for MPI plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move util func to runtime package

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchtune plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Inline if for GPU check

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Assign container resources once

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add todo for test wrappers

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Dec 29, 2025
Linked issue: Support for ResourcesPerNode in DeepSpeed Training Job Containers

6 participants