
koord-scheduler: deviceshare plugin skips handling nodes without device cr #1594

Merged
koordinator-bot[bot] merged 1 commit into koordinator-sh:main from lucming:fix-gpu-sched on Aug 30, 2023

Conversation

@lucming
Contributor

@lucming lucming commented Aug 29, 2023

Ⅰ. Describe what this PR does

background:
GPU scheduling was initially implemented through the GPU device plugin. After switching to koord-scheduler, pods could no longer be scheduled; the scheduler reported that there were no device resources.

reason:
The GPU node doesn't run koordlet, but it does report GPU info to node.status via the device plugin, so the default-scheduler can schedule onto it. However, koord-scheduler's deviceshare plugin rejects the pod because the node has no device resources (no Device CR).

resolve:
The deviceshare plugin skips handling nodes that have no Device CR.
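For illustration, a minimal sketch of the skip behavior described above. The types below (`Plugin`, `Device`, the `Filter` signature) are simplified stand-ins invented for this sketch, not the real scheduler framework interfaces or the Device CRD client:

```go
package main

import "fmt"

// Device is a simplified stand-in for the Device CRD object that
// koordlet reports for a node.
type Device struct{ Resources map[string]int }

// Plugin is a toy stand-in for the deviceshare plugin; devices maps
// node name -> Device CR, and a missing entry means no CR was reported.
type Plugin struct {
	devices map[string]*Device
}

// Filter sketches the fix: a node without a Device CR is passed through
// (skipped) instead of being rejected for lacking device resources.
func (p *Plugin) Filter(nodeName string, podRequests map[string]int) (bool, string) {
	device, found := p.devices[nodeName]
	if !found || device == nil {
		// Node never reported a Device CR (e.g. no koordlet installed):
		// skip device-aware filtering and leave the node to NodeResourcesFit.
		return true, "skipped: no Device CR on node"
	}
	for res, want := range podRequests {
		if device.Resources[res] < want {
			return false, fmt.Sprintf("insufficient %s", res)
		}
	}
	return true, "fits"
}

func main() {
	plugin := &Plugin{devices: map[string]*Device{
		"gpu-node": {Resources: map[string]int{"koordinator.sh/gpu-core": 100}},
	}}
	ok, reason := plugin.Filter("plain-node", map[string]int{"koordinator.sh/gpu-core": 50})
	fmt.Println(ok, reason)
}
```

In the real plugin the decision is made against the Device CRD lister inside the scheduler framework's extension points; the sketch only shows that a missing Device object now yields success rather than an unschedulable status.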

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot koordinator-bot bot requested review from buptcozy and eahydra August 29, 2023 15:37
@lucming lucming force-pushed the fix-gpu-sched branch 3 times, most recently from a10c5a4 to e905bcb Compare August 29, 2023 16:20
@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Patch coverage: 66.66% and project coverage change: -0.09% ⚠️

Comparison is base (490dfd6) 65.28% compared to head (7804be6) 65.20%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1594      +/-   ##
==========================================
- Coverage   65.28%   65.20%   -0.09%     
==========================================
  Files         359      359              
  Lines       36718    36783      +65     
==========================================
+ Hits        23971    23983      +12     
- Misses      10995    11048      +53     
  Partials     1752     1752              
Flag Coverage Δ
unittests 65.20% <66.66%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
pkg/scheduler/plugins/deviceshare/plugin.go 60.81% <66.66%> (ø)

... and 9 files with indirect coverage changes


@eahydra
Member

eahydra commented Aug 30, 2023

We've also hit this problem over the last few days, but this modification may not be enough.
For example, a node may run both the nvidia device-plugin and koordlet; in that case the scheduler needs to check whether the Device CRD object exists. But if only the nvidia device-plugin is present, there is no need to check.

So my idea is to add a node-level label, such as node.koordinator.sh/gpu-resource-model, with the following values:

  • Shared indicates that only the Koordinator GPU Shared protocol is supported
  • Vendor indicates that only vendor protocols are supported, such as nvidia.com/gpu
  • Mixed means that both of the above are supported
  • Native indicates that the node is in native mode and does not need to be processed by the device-share plugin of the scheduler.

WDYT? @lucming
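The proposed label could be dispatched on roughly as below. The label name and its values come from the comment above; everything else (the helper, its fallback behavior) is hypothetical and only illustrates the idea:

```go
package main

import "fmt"

// GPUResourceModelLabel is the node label proposed in this thread;
// it does not exist in Koordinator as of this PR.
const GPUResourceModelLabel = "node.koordinator.sh/gpu-resource-model"

// needsDeviceCR reports whether the deviceshare plugin should expect a
// Device CRD object for the node, given the proposed label value.
func needsDeviceCR(nodeLabels map[string]string) bool {
	switch nodeLabels[GPUResourceModelLabel] {
	case "Shared", "Mixed":
		// Koordinator GPU Shared protocol is in play: Device CR required.
		return true
	case "Vendor", "Native":
		// Only vendor resources (e.g. nvidia.com/gpu) or native mode:
		// the deviceshare plugin can ignore the node.
		return false
	default:
		// Unlabeled node: fall back to checking for the Device CR.
		return true
	}
}

func main() {
	fmt.Println(needsDeviceCR(map[string]string{GPUResourceModelLabel: "Native"})) // false
	fmt.Println(needsDeviceCR(map[string]string{}))                                // true
}
```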

Member

And the lifetime of the state is one scheduling cycle, so it cannot be fixed this way.

Contributor Author

Yes, the state lives for the entire scheduling cycle. I'm handling it this way precisely so that the plugin doesn't process nodes without devices at all.

So a node without a device is only affected by the NodeResourcesFit result, and a node with a device is handled by deviceshare.

Member

But state.skip means that the Pod does not request GPU or other devices.
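To make the distinction concrete, a toy sketch (not the real framework CycleState API) separating the pod-scoped state.skip from the per-node skip discussed in this PR:

```go
package main

import "fmt"

// cycleState is a toy stand-in for the framework's CycleState data.
type cycleState struct {
	// skip is pod-scoped: set once in PreFilter, it lives for one
	// scheduling cycle and means the Pod requests no device resources.
	skip bool
}

// preFilter sets the pod-scoped skip flag from the pod's device requests.
func preFilter(podDeviceRequests map[string]int) *cycleState {
	return &cycleState{skip: len(podDeviceRequests) == 0}
}

// filter is node-scoped: even when the pod does request devices
// (state.skip == false), a node without a Device CR is passed through
// rather than rejected.
func filter(state *cycleState, nodeHasDeviceCR, fits bool) bool {
	if state.skip {
		return true // pod needs no devices at all
	}
	if !nodeHasDeviceCR {
		return true // node-side skip: let NodeResourcesFit decide
	}
	return fits
}

func main() {
	state := preFilter(map[string]int{"koordinator.sh/gpu-core": 50})
	fmt.Println(filter(state, false, false)) // true: node has no Device CR
	fmt.Println(filter(state, true, false))  // false: CR exists but does not fit
}
```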

@eahydra
Member

eahydra commented Aug 30, 2023

(quoting the label-proposal comment above)

I think it's too complicated. In fact it is only useful in the Native scenario, and in that scenario, judging whether a node is Native is equivalent to simply not checking the Device CRD.

@lucming
Contributor Author

lucming commented Aug 30, 2023

(quoting the label proposal and eahydra's follow-up above)

agree with you (^^)

@lucming
Contributor Author

lucming commented Aug 30, 2023

We also need to clean up node.status.allocatable after the Device resource has been deleted. That is part of the logic in koord-manager; I'll fix it in another PR.

@lucming lucming force-pushed the fix-gpu-sched branch 2 times, most recently from 1bb05b9 to c69b34b Compare August 30, 2023 09:12
@koordinator-bot koordinator-bot bot added size/S and removed size/XS labels Aug 30, 2023
…ce cr

Signed-off-by: lucming <2876757716@qq.com>
Signed-off-by: liuming6 <liuming6@360.cn>
Member

@eahydra eahydra left a comment

/lgtm
/approve

@koordinator-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit 3142de0 into koordinator-sh:main Aug 30, 2023
Member

@jasonliu747 jasonliu747 left a comment

/lgtm

