feat(runtimes): add support for launcher resource allocation in MPI jobs by jskswamy · Pull Request #2653 · kubeflow/trainer

jskswamy · 2025-05-30T03:46:14Z

The Trainer method has been updated to apply resources appropriately to both the launcher and node containers based on this flag.

Key changes include:

Added the isRunLauncherAsNode method to determine if the launcher should be run as a node.
Updated the Trainer method to conditionally apply resource configurations to the launcher container based on the runLauncherAsNode value.
Enhanced test cases to cover scenarios for resource application to both launcher and node pods based on the MPI policy settings.
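
As a rough illustration of the flag check described above, here is a minimal sketch. The `MPIPolicy` struct is a hypothetical, trimmed-down stand-in and not the project's actual type, and treating a nil policy or unset flag as `false` is an assumption.

```go
package main

import "fmt"

// MPIPolicy is a hypothetical stand-in for the runtime's MPI policy;
// only the field relevant to this PR is modelled.
type MPIPolicy struct {
	RunLauncherAsNode *bool
}

// isRunLauncherAsNode reports whether the launcher should also act as a
// training node. A nil policy or an unset flag is treated as false
// (an assumption made for this sketch).
func isRunLauncherAsNode(policy *MPIPolicy) bool {
	return policy != nil && policy.RunLauncherAsNode != nil && *policy.RunLauncherAsNode
}

func main() {
	enabled := true
	fmt.Println(isRunLauncherAsNode(nil))                                     // false
	fmt.Println(isRunLauncherAsNode(&MPIPolicy{}))                            // false
	fmt.Println(isRunLauncherAsNode(&MPIPolicy{RunLauncherAsNode: &enabled})) // true
}
```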

Which issue(s) this PR fixes: Fixes #2650

coveralls · 2025-05-30T15:18:50Z

Pull Request Test Coverage Report for Build 18651664516

Details

13 of 62 (20.97%) changed or added relevant lines in 5 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.7%) to 51.477%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/runtime/framework/plugins/jobset/builder.go	0	16	0.0%
pkg/runtime/runtime.go	0	33	0.0%

Totals
Change from base Build 18612435311:	-0.7%
Covered Lines:	1255
Relevant Lines:	2438

💛 - Coveralls

andreyvelich

Sorry for the late reply @jskswamy!
Could you please rebase your PR so we can take a look?
/assign @tenzen-y @Electronic-Waste @astefanutti

jskswamy · 2025-08-17T23:13:38Z

I've rebased the changes; kindly have a look.

andreyvelich

Thank you for this @jskswamy!
/assign @tenzen-y @astefanutti Appreciate your review!

andreyvelich · 2025-08-21T12:16:59Z

@jskswamy Can you also update the title to align with the conventions, please?

andreyvelich · 2025-08-21T13:02:09Z

+			// Update values from the TrainJob trainer.
+			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
+				if image := jobTrainer.Image; image != nil {
+					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image


But that will also override the image for the MPI workers, won't it?

Yes, you are right, this will override the image. I have provided a fix for it; kindly have a look.

andreyvelich · 2025-08-21T13:05:41Z

+			// Skip if container is neither node nor launcher
+			if *container.Name != constants.Node && *container.Name != constants.Launcher {
+				continue
+			}
+
+			// Skip launcher container if runLauncherAsNode is false
+			if *container.Name == constants.Launcher && !b.isRunLauncherAsNode(info) {
+				continue
+			}


@tenzen-y @astefanutti How do you like this approach?
Should we check whether the resources need to be updated in the JobSet plugin or in the MPI plugin?
E.g. we can use the PodSet in the Info object.

Yes, it seems like it would be a better separation of concerns to update the launcher resources in the MPI plugin.

I think the current PR can be merged as is to fix the current behavior, with a follow-up for a proper design.
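
The container filter from the diff above can be sketched in isolation as follows. The string values `"node"` and `"launcher"` are assumed stand-ins for `constants.Node` and `constants.Launcher`, and `selectResourceTargets` is a hypothetical helper name used only for this illustration.

```go
package main

import "fmt"

const (
	nodeContainer     = "node"     // assumed value of constants.Node
	launcherContainer = "launcher" // assumed value of constants.Launcher
)

// selectResourceTargets returns the indices of the containers that should
// receive the trainer's resources, mirroring the two skip conditions in the
// reviewed diff.
func selectResourceTargets(names []string, runLauncherAsNode bool) []int {
	var targets []int
	for i, name := range names {
		// Skip if the container is neither node nor launcher.
		if name != nodeContainer && name != launcherContainer {
			continue
		}
		// Skip the launcher container if runLauncherAsNode is false.
		if name == launcherContainer && !runLauncherAsNode {
			continue
		}
		targets = append(targets, i)
	}
	return targets
}

func main() {
	names := []string{"sidecar", "launcher", "node"}
	fmt.Println(selectResourceTargets(names, false)) // [2]
	fmt.Println(selectResourceTargets(names, true))  // [1 2]
}
```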

jskswamy · 2025-09-02T12:23:03Z

@jskswamy Can you also update the title to align with the conventions, please?

I've updated the commit message and the title according to the conventions.

andreyvelich · 2025-09-24T09:27:12Z

/milestone v2.1

astefanutti · 2025-10-03T13:06:05Z

+				// Apply resources to both Node and Launcher containers (when launcher is included)
+				if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
+					(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
+					requirements := corev1ac.ResourceRequirements()


How does that impact the resource requirements in the RuntimeInfo, which Kueue relies on, for example?

astefanutti · 2025-10-03T13:13:08Z

 		}
+
+		// Update the Parallelism and Completions values for the Trainer Job.
 		if ancestor, ok := jobMetadata.Labels[constants.LabelTrainJobAncestor]; ok && ancestor == constants.AncestorTrainer {


To make sure, why not add the trainer.kubeflow.org/trainjob-ancestor-step: trainer label on the node ReplicatedJob as well?

We can't do that since it will also override the container spec using .trainer parameters: #2653 (comment)
However, we should always run OpenSSH server there.

Can't that be controlled with the container name?

Currently, we don't have such a capability. We use this label to mark the ReplicatedJob to which we should apply values from the Trainer, Model Initializer, and Dataset Initializer spec. The container name must always be equal to node for the Trainer values to be applied.

@astefanutti Any thoughts on how we can enhance it?

andreyvelich · 2025-10-15T18:06:40Z

As we discussed on today's call, @astefanutti will help to finalize this PR.
/assign @astefanutti

astefanutti · 2025-10-17T13:57:18Z

/retitle feat(runtimes): add support for launcher resource allocation in MPI jobs

andreyvelich · 2025-10-17T17:55:31Z

/hold since we should fix the bug identified in Slack: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1760719119325289?thread_ts=1760462777.817729&cid=C0742LDFZ4K

review-notebook-app · 2025-10-18T00:39:02Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2025-10-18T00:41:30Z


 	if trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumProcPerNode != nil {
 		info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
+		// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.


@tenzen-y @astefanutti @Electronic-Waste I automatically set the number of slots for the MPI plugin to the number of GPUs if the TrainJob doesn't set NumProcPerNode and the runtime's NumProcPerNode is 1 (which is the default value in our MPI runtimes).

This will help users to use DeepSpeed runtime more easily without modifying the numProcPerNode.
Let me know if that sounds good to you.
/assign @tenzen-y @astefanutti @Electronic-Waste

This logic SGTM

If the TrainJob doesn't set NumProcPerNode, would it make sense to set it to the number of GPUs if NumProcPerNode < num GPUs?

@astefanutti I would suggest that we always set NumProcPerNode to the number of GPUs when NumProcPerNode has the default value of 1.
If users manually override this value in the Runtime or in the TrainJob, we won't override it.
WDYT @astefanutti?

@andreyvelich right, better not override user-defined values.
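
The rule agreed on in this thread can be sketched as a small pure function. `resolveNumProcPerNode` and its parameters are hypothetical names for this illustration; the real logic lives in the MPI plugin and operates on the runtime Info object.

```go
package main

import "fmt"

// resolveNumProcPerNode is a sketch of the defaulting rule discussed above:
//   - a value set on the TrainJob always wins;
//   - otherwise, a runtime value of 1 (the default in the MPI runtimes) is
//     replaced by the GPU count when more than one GPU is requested;
//   - any other runtime value is kept as is.
func resolveNumProcPerNode(trainJobValue *int32, runtimeValue int32, gpusPerNode int) int32 {
	if trainJobValue != nil {
		return *trainJobValue
	}
	if runtimeValue == 1 && gpusPerNode > 1 {
		return int32(gpusPerNode)
	}
	return runtimeValue
}

func main() {
	explicit := int32(2)
	fmt.Println(resolveNumProcPerNode(&explicit, 1, 8)) // 2: TrainJob value wins
	fmt.Println(resolveNumProcPerNode(nil, 1, 8))       // 8: default of 1 becomes the GPU count
	fmt.Println(resolveNumProcPerNode(nil, 4, 8))       // 4: runtime override is kept
	fmt.Println(resolveNumProcPerNode(nil, 1, 0))       // 1: no GPUs, nothing to derive
}
```

One consequence of this sketch is that a runtime that explicitly sets the value to 1 cannot be told apart from one that relies on the default.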

Electronic-Waste

@andreyvelich Thanks for this. Some comments for you

Electronic-Waste · 2025-10-18T01:46:47Z


 	if trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumProcPerNode != nil {
 		info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
+		// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.


This logic SGTM

Electronic-Waste · 2025-10-18T01:47:19Z

 		info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
+		// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.
+	} else if *info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode == 1 {
+		resourcesPerNode := ptr.Deref(torch.ExtractResourcePerNodeFromRuntime(info), corev1.ResourceRequirements{})


It looks weird to reference code from the torch plugin. Can we make it global somewhere?

Electronic-Waste · 2025-10-18T01:48:26Z

+		if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil && jobTrainer.ResourcesPerNode != nil {
+			resourcesPerNode = ptr.Deref(jobTrainer.ResourcesPerNode, corev1.ResourceRequirements{})
+		}
+		gpuQ := torch.GetNumGPUPerNode(&resourcesPerNode)


Same as above. We can also combine the assignment and the if clause:

if gpuQ := torch.GetNumGPUPerNode(&resourcesPerNode); gpuQ > 1 {
	info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(gpuQ))
}
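
To make the inlined check concrete, here is a self-contained sketch that uses plain maps instead of `corev1.ResourceRequirements`. The matching rule (any resource name containing "gpu") and the order of checking limits before requests are assumptions for illustration, not the confirmed behavior of `GetNumGPUPerNode`.

```go
package main

import (
	"fmt"
	"strings"
)

// ResourceList and ResourceRequirements are simplified, hypothetical
// stand-ins for the Kubernetes core types.
type ResourceList map[string]int64

type ResourceRequirements struct {
	Limits   ResourceList
	Requests ResourceList
}

// gpuCount returns the quantity of the first resource whose name contains
// "gpu" (assumed matching rule). It expects at most one GPU resource.
func gpuCount(list ResourceList) int64 {
	for name, qty := range list {
		if strings.Contains(strings.ToLower(name), "gpu") {
			return qty
		}
	}
	return 0
}

// numGPUPerNode checks limits first and falls back to requests.
func numGPUPerNode(res *ResourceRequirements) int64 {
	if res == nil {
		return 0
	}
	if n := gpuCount(res.Limits); n > 0 {
		return n
	}
	return gpuCount(res.Requests)
}

func main() {
	res := &ResourceRequirements{Limits: ResourceList{"cpu": 8, "nvidia.com/gpu": 4}}
	numProcPerNode := int64(1)
	// Assignment and condition combined in one if statement, as suggested.
	if gpus := numGPUPerNode(res); gpus > 1 {
		numProcPerNode = gpus
	}
	fmt.Println(numProcPerNode) // 4
}
```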

astefanutti · 2025-10-18T09:39:31Z

+		// TODO (andreyvelich): For MPI we should apply container resources to the Node ReplicatedJob also.
+		// Eventually, we should find better way to propagate resources from TrainJob to JobSet.
+		if b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
+			for j, container := range rJob.Template.Spec.Template.Spec.Containers {
+				if *container.Name == constants.Node {
+					if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
+						if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
+							(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
+							requirements := corev1ac.ResourceRequirements()
+							if limits := resourcesPerNode.Limits; limits != nil {
+								requirements.WithLimits(limits)
+							}
+							if requests := resourcesPerNode.Requests; requests != nil {
+								requirements.WithRequests(requests)
+							}
+							b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
+								WithResources(requirements)
+						}
+						apply.UpsertEnvVars(
+							&b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Env,
+							apply.EnvVars(jobTrainer.Env...)...,
+						)
+					}
+				}
+			}
+		}


Maybe something like this to avoid duplicating the resources logic:

ancestor := ""
jobMetadata := rJob.Template.ObjectMetaApplyConfiguration
if jobMetadata != nil && jobMetadata.Labels != nil {
	ancestor, _ = jobMetadata.Labels[constants.LabelTrainJobAncestor]
}
if ancestor == constants.AncestorTrainer {
	// TODO: Support multiple replicas ('.template.spec.replicatedJobs[*].replicas') for replicated Jobs.
	// REF: https://github.com/kubeflow/trainer/issues/2318
	b.Spec.ReplicatedJobs[i].Replicas = ptr.To[int32](1)
	// Update the Parallelism and Completions values for the Trainer Job.
	b.Spec.ReplicatedJobs[i].Template.Spec.Parallelism = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
	b.Spec.ReplicatedJobs[i].Template.Spec.Completions = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
	// Update values for the Trainer container.
	for j, container := range rJob.Template.Spec.Template.Spec.Containers {
		if *container.Name == constants.Node {
			// Update values from the TrainJob trainer.
			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
				if image := jobTrainer.Image; image != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image
				}
				if command := jobTrainer.Command; command != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Command = command
				}
				if args := jobTrainer.Args; args != nil {
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Args = args
				}
				apply.UpsertEnvVars(
					&b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Env,
					apply.EnvVars(jobTrainer.Env...)...,
				)
			}
		}
	}
}
// Apply trainer configuration to node containers.
if ancestor == constants.AncestorTrainer || b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
	for j, container := range rJob.Template.Spec.Template.Spec.Containers {
		if *container.Name == constants.Node {
			// Update values from the TrainJob trainer.
			if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
				if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
					(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
					requirements := corev1ac.ResourceRequirements()
					if limits := resourcesPerNode.Limits; limits != nil {
						requirements.WithLimits(limits)
					}
					if requests := resourcesPerNode.Requests; requests != nil {
						requirements.WithRequests(requests)
					}
					b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
						WithResources(requirements)
				}
			}
		}
	}
}

Looks good!
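
The idea of assigning container resources in a single place can be sketched with simplified types. `Resources`, `Container`, and `applyResources` below are hypothetical stand-ins for the apply-configuration types used in the builder, not the project's API.

```go
package main

import "fmt"

// ResourceList, Resources and Container are simplified, hypothetical
// stand-ins for the Kubernetes apply-configuration types.
type ResourceList map[string]string

type Resources struct {
	Limits   ResourceList
	Requests ResourceList
}

type Container struct {
	Name      string
	Resources *Resources
}

// applyResources copies the per-node resources onto the container when at
// least one of limits or requests is set, and reports whether it did so.
// Keeping this in one helper avoids repeating the nil checks per code path.
func applyResources(c *Container, perNode *Resources) bool {
	if perNode == nil || (perNode.Limits == nil && perNode.Requests == nil) {
		return false
	}
	requirements := &Resources{}
	if perNode.Limits != nil {
		requirements.Limits = perNode.Limits
	}
	if perNode.Requests != nil {
		requirements.Requests = perNode.Requests
	}
	c.Resources = requirements
	return true
}

func main() {
	node := &Container{Name: "node"}
	perNode := &Resources{Limits: ResourceList{"nvidia.com/gpu": "2"}}
	fmt.Println(applyResources(node, perNode), node.Resources.Limits["nvidia.com/gpu"]) // true 2
	fmt.Println(applyResources(&Container{Name: "node"}, nil))                          // false
}
```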

astefanutti · 2025-10-18T09:40:05Z

+							b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
+								WithResources(requirements)
+						}
+						apply.UpsertEnvVars(


Do we want to propagate the environment variables as well?

I was thinking that env should be propagated as well, for now.
We need to investigate whether mpirun can read env from the Worker nodes.
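
For readers unfamiliar with the upsert pattern used for env propagation, the sketch below shows one plausible behavior: replace an existing variable by name, otherwise append. This is an assumption about the semantics of `apply.UpsertEnvVars`, written against a hypothetical `EnvVar` type rather than the real apply-configuration type.

```go
package main

import "fmt"

// EnvVar is a hypothetical, simplified environment variable type.
type EnvVar struct {
	Name  string
	Value string
}

// upsertEnvVars replaces variables in dst that share a name with an incoming
// variable and appends the ones that are new (assumed upsert semantics).
func upsertEnvVars(dst *[]EnvVar, vars ...EnvVar) {
	for _, v := range vars {
		replaced := false
		for i := range *dst {
			if (*dst)[i].Name == v.Name {
				(*dst)[i] = v
				replaced = true
				break
			}
		}
		if !replaced {
			*dst = append(*dst, v)
		}
	}
}

func main() {
	env := []EnvVar{{Name: "LOG_LEVEL", Value: "info"}}
	upsertEnvVars(&env,
		EnvVar{Name: "LOG_LEVEL", Value: "debug"}, // overrides the runtime value
		EnvVar{Name: "EPOCHS", Value: "3"},        // new variable is appended
	)
	fmt.Println(env) // [{LOG_LEVEL debug} {EPOCHS 3}]
}
```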

astefanutti · 2025-10-18T09:44:17Z


 	if trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumProcPerNode != nil {
 		info.RuntimePolicy.MLPolicySource.MPI.NumProcPerNode = ptr.To(int32(trainJob.Spec.Trainer.NumProcPerNode.IntValue()))
+		// If numProcPerNode is set to 1 in runtime, we make it equal to number of GPUs.


If the TrainJob doesn't set NumProcPerNode, would it make sense to set it to the number of GPUs if NumProcPerNode < num GPUs?

andreyvelich · 2025-10-19T14:34:53Z

+			ancestor = jobMetadata.Labels[constants.LabelTrainJobAncestor]
 		}
-		if ancestor, ok := jobMetadata.Labels[constants.LabelTrainJobAncestor]; ok && ancestor == constants.AncestorTrainer {
+		if ancestor == constants.AncestorTrainer {


Need to explore why the unit tests are failing.
/hold

andreyvelich · 2025-10-20T03:04:00Z

-			// Update the Parallelism and Completions values for the Trainer Job.
-			b.Spec.ReplicatedJobs[i].Template.Spec.Parallelism = info.FindPodSetByAncestor(constants.AncestorTrainer).Count
-			b.Spec.ReplicatedJobs[i].Template.Spec.Completions = info.FindPodSetByAncestor(constants.AncestorTrainer).Count


We already assign the Parallelism and Completions spec when we sync the pod sets to the JobSet template in the plugins:

trainer/pkg/runtime/framework/plugins/mpi/mpi.go

Line 223 in f02f3e4

info.SyncPodSetsToTemplateSpec()

trainer/pkg/runtime/core/trainingruntime.go

Line 240 in f02f3e4

func syncPodSets(info *runtime.Info) {

So we can remove these lines from the Builder.
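
A rough sketch of why the Builder lines become redundant: once pod set counts are synced into the job templates in one place, both fields are already populated. `PodSet`, `JobTemplate`, and `syncPodSetCounts` are hypothetical simplifications and do not reproduce the real `syncPodSets` implementation.

```go
package main

import "fmt"

// PodSet and JobTemplate are hypothetical, simplified stand-ins for the
// runtime Info pod sets and the replicated Job template spec.
type PodSet struct {
	Name  string
	Count int32
}

type JobTemplate struct {
	Parallelism *int32
	Completions *int32
}

// int32Ptr returns a pointer to a copy of v.
func int32Ptr(v int32) *int32 { return &v }

// syncPodSetCounts writes each pod set's count into the matching template's
// Parallelism and Completions fields; templates without a pod set are left
// untouched.
func syncPodSetCounts(podSets []PodSet, templates map[string]*JobTemplate) {
	for _, ps := range podSets {
		tmpl, ok := templates[ps.Name]
		if !ok {
			continue
		}
		tmpl.Parallelism = int32Ptr(ps.Count)
		tmpl.Completions = int32Ptr(ps.Count)
	}
}

func main() {
	templates := map[string]*JobTemplate{"launcher": {}, "node": {}}
	syncPodSetCounts([]PodSet{{Name: "launcher", Count: 1}, {Name: "node", Count: 3}}, templates)
	fmt.Println(*templates["node"].Parallelism, *templates["node"].Completions) // 3 3
}
```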

If the changes look good, we can move this forward.
/assign @tenzen-y @astefanutti @Electronic-Waste

/hold cancel

I think the part about syncPodSets will be simplified by #2877.

astefanutti · 2025-10-20T10:21:54Z

+									WithLabels(map[string]string{
+										constants.LabelTrainJobAncestor: "invalid",
+									}).


Can this be removed?

astefanutti · 2025-10-20T10:28:32Z

+			},
+			wantObjs: []apiruntime.Object{
+				testingutil.MakeJobSetWrapper(metav1.NamespaceDefault, "test-job").
+					// This is needed to override default label in MakeJobSetWrapper() for Node rJob.


Ah, I see why the "invalid" label is needed. Could we add a TODO to clean this up in a follow-up?

@astefanutti Added TODO statement.

astefanutti · 2025-10-20T11:18:53Z

/lgtm

andreyvelich · 2025-10-20T12:15:50Z

/approve

astefanutti · 2025-10-20T12:33:05Z

/lgtm

Electronic-Waste

@andreyvelich Thanks for this great work!

/lgtm
/approve

google-oss-prow · 2025-10-20T13:10:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Electronic-Waste,andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…obs (kubeflow#2653)

* feat(runtime): add support for launcher resource allocation in MPI jobs
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add unit tests
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Set numProcPerNode for MPI plugin
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Move util func to runtime package
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix torchtune plugin
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Inline if for GPU check
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Assign container resources once
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add todo for test wrappers
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow Bot requested review from jinchihe and kuizhiqing May 30, 2025 03:46

google-oss-prow Bot added the size/L label May 30, 2025

jskswamy force-pushed the fix-resource-allocation branch 2 times, most recently from c0d40e8 to 6925f41 Compare June 5, 2025 08:21

andreyvelich reviewed Aug 11, 2025

google-oss-prow Bot assigned astefanutti, Electronic-Waste and tenzen-y Aug 11, 2025

andreyvelich mentioned this pull request Aug 11, 2025

Support for ResourcesPerNode in DeepSpeed Training Job Containers #2650

Closed

jskswamy force-pushed the fix-resource-allocation branch from 6925f41 to 0859508 Compare August 17, 2025 23:12

andreyvelich reviewed Aug 18, 2025

andreyvelich reviewed Aug 21, 2025

jskswamy force-pushed the fix-resource-allocation branch from 0859508 to 9052324 Compare September 2, 2025 12:14

google-oss-prow Bot added size/XL and removed size/L labels Sep 2, 2025

jskswamy force-pushed the fix-resource-allocation branch from 9052324 to 1122a5c Compare September 2, 2025 12:22

jskswamy changed the title ~~Apply resources appropriately to both launcher and node containers~~ feat(runtime): add support for launcher resource allocation in MPI jobs Sep 2, 2025

google-oss-prow Bot added this to the v2.1 milestone Sep 24, 2025

astefanutti reviewed Oct 3, 2025

google-oss-prow Bot changed the title ~~feat(runtime): add support for launcher resource allocation in MPI jobs~~ feat(runtimes): add support for launcher resource allocation in MPI jobs Oct 17, 2025

google-oss-prow Bot added the lgtm label Oct 17, 2025

google-oss-prow Bot added size/XXL and removed size/XL labels Oct 18, 2025

andreyvelich reviewed Oct 18, 2025

Electronic-Waste reviewed Oct 18, 2025

andreyvelich added 2 commits October 18, 2025 03:40

Move util func to runtime package

999c9b8

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Fix torchtune plugin

0ebdef0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

astefanutti reviewed Oct 18, 2025

andreyvelich reviewed Oct 19, 2025

google-oss-prow Bot added the do-not-merge/hold label Oct 19, 2025

Inline if for GPU check

2b83fae

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich force-pushed the fix-resource-allocation branch from 303be89 to 2b83fae Compare October 19, 2025 14:35

Assign container resources once

f02f3e4

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich reviewed Oct 20, 2025

google-oss-prow Bot removed the do-not-merge/hold label Oct 20, 2025

astefanutti reviewed Oct 20, 2025

google-oss-prow Bot added the lgtm label Oct 20, 2025

Add todo for test wrappers

8ab0ab0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow Bot removed the lgtm label Oct 20, 2025

google-oss-prow Bot added the approved label Oct 20, 2025

andreyvelich added the ok-to-test-gpu-runner label Oct 20, 2025

google-oss-prow Bot added the lgtm label Oct 20, 2025

Electronic-Waste approved these changes Oct 20, 2025

google-oss-prow Bot merged commit 3c062ac into kubeflow:master Oct 20, 2025
36 checks passed