[Cluster launcher] [vSphere] Fix multiple worker_types and timeout issue and support GPU nodes #40667

Merged
vitsai merged 4 commits into ray-project:releases/2.8.0 from architkulkarni:vsphere-2.8-cherrypick
Oct 26, 2023

@architkulkarni
Contributor

Why are these changes needed?

Cherry-picks the following PRs to the 2.8 release branch:

The changes are localized to the vSphere cluster launcher, so they will not affect any other Ray component.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

JingChen23 and others added 4 commits October 25, 2023 11:56
…oesn't work (ray-project#40487)

Currently, the code assumes that there is only one worker node type.
This change fixes the bug so that multiple worker node types are supported.

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
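
The multiple-worker-type fix described above can be illustrated with a minimal sketch. It is not the actual patch: it only assumes the standard cluster config keys `head_node_type` and `available_node_types`, and the helper `worker_node_types` and the node type names in the example config are hypothetical.

```python
from typing import Any, Dict, List


def worker_node_types(config: Dict[str, Any]) -> List[str]:
    """Return the name of every node type except the head node type.

    Hypothetical helper: it iterates over all configured node types
    instead of assuming that exactly one worker type exists.
    """
    head = config["head_node_type"]
    return [name for name in config["available_node_types"] if name != head]


# Illustrative config with two worker types (names are made up).
example_config = {
    "head_node_type": "ray.head.default",
    "available_node_types": {
        "ray.head.default": {"resources": {"CPU": 2}},
        "worker_small": {"resources": {"CPU": 4}},
        "worker_gpu": {"resources": {"CPU": 8, "GPU": 1}},
    },
}

for name in worker_node_types(example_config):
    node_config = example_config["available_node_types"][name]
    print(name, node_config["resources"])
```

Looking up each worker type by name keeps per-type settings (for example, GPU resources) separate rather than applying one worker configuration to all workers.
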
…roject#40516)

Fixed the issue using SessionOrientedStub, a session-oriented stub adapter that logs in to the destination again if a session-oriented exception is thrown.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
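
The relogin-on-session-error idea behind the timeout fix above can be sketched in plain Python. This does not use pyVmomi and is not how SessionOrientedStub is implemented; `SessionExpired`, `ReloginWrapper`, and `FakeServer` are hypothetical names introduced only for illustration.

```python
class SessionExpired(Exception):
    """Hypothetical stand-in for a session-oriented error."""


class ReloginWrapper:
    """Retry a call after logging in again when the session has expired."""

    def __init__(self, call, login, retry_count=1):
        self._call = call
        self._login = login
        self._retry_count = retry_count

    def invoke(self, *args, **kwargs):
        attempts = 0
        while True:
            try:
                return self._call(*args, **kwargs)
            except SessionExpired:
                # Give up once the retry budget is used; otherwise relogin.
                if attempts >= self._retry_count:
                    raise
                attempts += 1
                self._login()


class FakeServer:
    """Toy server whose calls fail until login() has been called."""

    def __init__(self):
        self.logged_in = False
        self.logins = 0

    def login(self):
        self.logged_in = True
        self.logins += 1

    def list_vms(self):
        if not self.logged_in:
            raise SessionExpired("not authenticated")
        return ["vm-1", "vm-2"]


server = FakeServer()
client = ReloginWrapper(server.list_vms, server.login)
result = client.invoke()  # first call fails, wrapper logs in and retries
```

A long-running cluster launcher benefits from this pattern because a session that times out between calls is re-established transparently instead of surfacing as an error.
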
ray-project#40616)

This adds support for passing through GPUs on the vSphere ESXi host to the Ray nodes.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
…hed_nodes (ray-project#40655)

The power on/off status is runtime information about a VM, so it should not be fetched from the cached nodes, which may hold stale data.
It should be queried through pyvmomi_sdk every time.

Signed-off-by: Chen Hui <huchen@vmware.com>
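
The distinction drawn above between cached data and runtime state can be sketched as follows. This is an assumption-laden illustration, not the node provider's code: `NodeStatusReader` and its methods are hypothetical, and the live query is injected as a plain callable instead of a real pyVmomi call.

```python
from typing import Callable, Dict


class NodeStatusReader:
    """Cache static node data, but always query power state live."""

    def __init__(self, query_power_state: Callable[[str], str]):
        # Hypothetical hook standing in for a live SDK query.
        self._query_power_state = query_power_state
        self.cached_nodes: Dict[str, Dict[str, str]] = {}

    def cache_tags(self, node_id: str, tags: Dict[str, str]) -> None:
        # Tags change rarely, so caching them is reasonable.
        self.cached_nodes[node_id] = dict(tags)

    def is_running(self, node_id: str) -> bool:
        # Power state is runtime information: never read it from the cache.
        return self._query_power_state(node_id) == "poweredOn"


# Toy backend: a dict plays the role of the live source of truth.
live_state = {"vm-1": "poweredOn"}
reader = NodeStatusReader(lambda node_id: live_state[node_id])
reader.cache_tags("vm-1", {"ray-node-type": "worker"})

before = reader.is_running("vm-1")
live_state["vm-1"] = "poweredOff"  # the VM is powered off out of band
after = reader.is_running("vm-1")
```

Because `is_running` goes to the live source on every call, a VM that is powered off outside the launcher is reported correctly even though its cached tags are unchanged.
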
Contributor

@zhe-thoughts zhe-thoughts left a comment

All changes are contained within vSphere support, and there's a deadline for this feature. Let's pick.

@vitsai vitsai merged commit dd3e687 into ray-project:releases/2.8.0 Oct 26, 2023

Labels

release-blocker P0 Issue that blocks the release

6 participants