koord-scheduler: deviceshare plugin skips handling nodes without device cr #1594
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #1594 +/- ##
==========================================
- Coverage 65.28% 65.20% -0.09%
==========================================
Files 359 359
Lines 36718 36783 +65
==========================================
+ Hits 23971 23983 +12
- Misses 10995 11048 +53
Partials 1752 1752
We've also had this problem for the last few days. But such modifications may not be enough. So I have an idea to add a label on the node dimension, such as
WDYT? @lucming
Also, the lifetime of the state is one scheduling cycle, so it cannot be fixed this way.
Yes, it takes effect for the entire scheduling cycle. I'm handling it this way in the hope that the plugin won't handle nodes without devices.
So, if a node has no device, the result is determined only by NodeResourcesFit; if it has a device, it is handled by deviceshare.
But state.skip means that the Pod does not request a GPU or other devices.
I think it's complicated. In fact, it is only useful in the native scenario, and in that scenario, judging whether a node is native is equivalent to not checking the Device CRD.
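To make the scope difference in this thread concrete, here is a minimal Go sketch. It assumes a per-Pod state computed once per scheduling cycle and a separate per-node check; `preFilterState`, `preFilter`, `filter`, and the node names are hypothetical stand-ins and not the actual koord-scheduler types.

```go
package main

import "fmt"

// preFilterState is a hypothetical per-Pod state that lives for one
// scheduling cycle. skip describes the Pod, not any particular node.
type preFilterState struct {
	skip bool // true when the Pod requests no device resources
}

// preFilter runs once per Pod at the start of the cycle.
func preFilter(podGPURequest int) *preFilterState {
	return &preFilterState{skip: podGPURequest == 0}
}

// filter runs once per candidate node within the same cycle.
// The Device CR check is a per-node decision, so it is made here
// instead of being stored in the shared per-Pod state.
func filter(state *preFilterState, nodesWithDeviceCR map[string]bool, node string) string {
	if state.skip {
		return "skip: pod requests no devices"
	}
	if !nodesWithDeviceCR[node] {
		return "skip: node has no Device CR"
	}
	return "check device capacity"
}

func main() {
	state := preFilter(1) // Pod requests one GPU
	nodes := map[string]bool{"node-with-koordlet": true}
	for _, n := range []string{"node-with-koordlet", "native-gpu-node"} {
		fmt.Printf("%s -> %s\n", n, filter(state, nodes, n))
	}
}
```

In this sketch the same `state` is reused for every node in the cycle, which is why a per-node fact cannot be encoded in `state.skip`.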
Agree with you (^^)
We also need to clean up node.status.allocatable after the device resource has been deleted. This is part of the logic in koord-manager; I'll fix it in another PR.
…ce cr Signed-off-by: lucming <2876757716@qq.com> Signed-off-by: liuming6 <liuming6@360.cn>
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: eahydra
Ⅰ. Describe what this PR does
background:
Initially, GPU scheduling was implemented through the GPU device plugin. After switching to koord-scheduler, the pod could not be scheduled, with a message that there are no device resources.
reason:
The GPU node doesn't run a koordlet, but it does report GPU info to node.status via the device plugin, so pods can be scheduled to it by the default-scheduler. However, koord-scheduler's deviceshare plugin rejects the node when scheduling the pod because it finds no device resources for it.
resolve:
deviceshare skips handling nodes without a Device CR
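The resolution above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `deviceCache`, `filterResult`, and `filterNode` are hypothetical names, and the cache simply maps a node name to the number of free GPUs reported by its Device CR.

```go
package main

import "fmt"

// deviceCache is a hypothetical stand-in for the scheduler's view of
// Device CRs: node name -> free GPUs reported by that node's Device CR.
type deviceCache map[string]int

type filterResult string

const (
	resultSkip          filterResult = "skip" // plugin abstains; other plugins decide
	resultSchedulable   filterResult = "schedulable"
	resultUnschedulable filterResult = "unschedulable"
)

// filterNode sketches the per-node decision for a Pod requesting
// `requested` GPUs.
func filterNode(cache deviceCache, nodeName string, requested int) filterResult {
	if requested == 0 {
		// The Pod does not ask for devices; nothing to check.
		return resultSkip
	}
	free, ok := cache[nodeName]
	if !ok {
		// No Device CR for this node (for example, no koordlet is running).
		// Abstain so that node.status.allocatable, as checked by
		// NodeResourcesFit, decides whether the Pod fits.
		return resultSkip
	}
	if free < requested {
		return resultUnschedulable
	}
	return resultSchedulable
}

func main() {
	cache := deviceCache{"node-with-koordlet": 2}
	fmt.Println(filterNode(cache, "node-with-koordlet", 1))
	fmt.Println(filterNode(cache, "node-with-koordlet", 4))
	fmt.Println(filterNode(cache, "native-gpu-node", 1))
}
```

The key design choice in the sketch is that a missing Device CR yields an abstention instead of a rejection, so a node served only by the native device plugin remains a candidate.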
Ⅱ. Does this pull request fix one issue?
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
Ⅴ. Checklist
make test