
koord-scheduler: fix duplicate allocate GPU after leader selection changed #1289

Merged
koordinator-bot[bot] merged 1 commit into koordinator-sh:main from eahydra:fix_duplicate_allocate_gpu
May 12, 2023

@eahydra eahydra commented May 12, 2023

Ⅰ. Describe what this PR does

The master and slave instances of koord-scheduler both sync pods from the api-server via an informer, and both build nodeDeviceCache from add events. However, the slave instance ignores update events, so it misses the pods bound by the master instance and never records those pods' deviceAllocations. If leadership then changes, the slave instance starts serving with stale state and can assign a new Pod GPUs that are already allocated to another Pod.
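The failure mode described above can be illustrated with a minimal sketch. This is not the patch in pod_handler.go: `Pod`, `deviceCache`, `record`, and `firstFreeGPU` are hypothetical, simplified stand-ins, and the sketch assumes the bound pod carries its GPU allocation when the update event arrives. It only shows why a standby cache has to process update events as well as add events.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type Pod struct {
	Name     string
	NodeName string // empty until the pod is bound to a node
	GPUs     []int  // GPU minors recorded by the instance that bound the pod
}

// deviceCache is a hypothetical stand-in for nodeDeviceCache:
// node name -> GPU minor -> name of the pod holding it.
type deviceCache struct {
	used map[string]map[int]string
}

func newDeviceCache() *deviceCache {
	return &deviceCache{used: map[string]map[int]string{}}
}

// record stores the pod's allocation once the pod is bound.
// It is idempotent, so calling it from both handlers is safe.
func (c *deviceCache) record(p *Pod) {
	if p.NodeName == "" || len(p.GPUs) == 0 {
		return
	}
	node := c.used[p.NodeName]
	if node == nil {
		node = map[int]string{}
		c.used[p.NodeName] = node
	}
	for _, minor := range p.GPUs {
		node[minor] = p.Name
	}
}

// OnAdd handles informer add events.
func (c *deviceCache) OnAdd(p *Pod) { c.record(p) }

// OnUpdate handles informer update events. Without this step, a pod that
// was added while pending and later bound by the leader is never recorded.
func (c *deviceCache) OnUpdate(oldPod, newPod *Pod) { c.record(newPod) }

// firstFreeGPU returns the lowest unallocated minor on node, or -1 if none.
func (c *deviceCache) firstFreeGPU(node string, total int) int {
	for minor := 0; minor < total; minor++ {
		if _, taken := c.used[node][minor]; !taken {
			return minor
		}
	}
	return -1
}

func main() {
	standby := newDeviceCache()

	pending := &Pod{Name: "pod-a"}
	standby.OnAdd(pending) // pod is still unbound, nothing to record

	// The leader binds pod-a to node-1 with GPU 0; the standby sees an update.
	bound := &Pod{Name: "pod-a", NodeName: "node-1", GPUs: []int{0}}
	standby.OnUpdate(pending, bound)

	fmt.Println(standby.firstFreeGPU("node-1", 2)) // prints 1
}
```

If `OnUpdate` were a no-op, the last call would return 0, which is the GPU already held by pod-a: the duplicate allocation this PR addresses.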

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@eahydra eahydra added this to the v1.3 milestone May 12, 2023
@koordinator-bot koordinator-bot bot requested review from buptcozy and hormes May 12, 2023 09:57
@eahydra eahydra requested a review from jasonliu747 May 12, 2023 09:58

@jasonliu747 jasonliu747 left a comment

/lgtm

codecov bot commented May 12, 2023

Codecov Report

Patch coverage: 66.66% and no project coverage change.

Comparison is base (66c5e90) 64.71% compared to head (cff9862) 64.72%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1289   +/-   ##
=======================================
  Coverage   64.71%   64.72%           
=======================================
  Files         313      313           
  Lines       32978    33011   +33     
=======================================
+ Hits        21341    21365   +24     
- Misses      10074    10078    +4     
- Partials     1563     1568    +5     
Flag Coverage Δ
unittests 64.72% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/scheduler/plugins/deviceshare/pod_handler.go 75.65% <66.66%> (-12.16%) ⬇️

... and 1 file with indirect coverage changes

…anged

Signed-off-by: Joseph <joseph.t.lee@outlook.com>
@eahydra eahydra force-pushed the fix_duplicate_allocate_gpu branch from e0913f5 to cff9862 Compare May 12, 2023 10:07
@koordinator-bot koordinator-bot bot removed the lgtm label May 12, 2023
@jasonliu747

/lgtm

@koordinator-bot

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: jasonliu747

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit f60b243 into koordinator-sh:main May 12, 2023