fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime #2774

Merged: google-oss-prow[bot] merged 1 commit into kubeflow:master from andreyvelich:fix-deepspeed-npoc on Aug 5, 2025

@andreyvelich (Member) commented Aug 5, 2025

We should set numProcPerNode: 1 in the DeepSpeed runtime by default for now.

Users have to manually configure resources in MPI-based runtimes if they want to override it.

We have an open issue to enhance resources for MPI-based runtimes: #2751
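
For illustration, the default described above might look like the following fragment of a DeepSpeed ClusterTrainingRuntime. This is a sketch, not the diff from this PR: the runtime name deepspeed-distributed and the placement of numProcPerNode under spec.mlPolicy.mpi are assumptions based on the Kubeflow Trainer v1alpha1 API, and the rest of the manifest is omitted.

```yaml
# Sketch only: the name and field layout are assumptions, not copied from this PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: deepspeed-distributed   # assumed runtime name
spec:
  mlPolicy:
    mpi:
      numProcPerNode: 1         # default set by this PR: one process per node
  # template (job and container definitions) omitted
```

With a default like this, running more than one process per node would require overriding the value and configuring the container resources to match, which is the manual step the description refers to.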

/assign @kubeflow/kubeflow-trainer-team @astefanutti

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls

Pull Request Test Coverage Report for Build 16757019652

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 47.949%

Totals

  • Change from base Build 16749934417: 0.0%
  • Covered Lines: 947
  • Relevant Lines: 1975

@astefanutti (Contributor)

/lgtm

Thanks!

@kramaranya (Contributor)

Thank you!
/lgtm

@andreyvelich (Member, Author) left a comment

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

google-oss-prow[bot] merged commit 9011ad7 into kubeflow:master on Aug 5, 2025 (18 checks passed)
google-oss-prow[bot] added this to the v2.1 milestone on Aug 5, 2025
andreyvelich deleted the fix-deepspeed-npoc branch on August 5, 2025 20:21

@andreyvelich (Member, Author)

/cherry-pick release-2.0

@google-oss-robot

@andreyvelich: new pull request created: #2863

alexxfan pushed a commit to red-hat-data-services/trainer that referenced this pull request Nov 24, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>