scheduler: fine-grained device scheduling support Cambricon dynamic sMLU #2624

Merged
koordinator-bot[bot] merged 1 commit into koordinator-sh:main from zqzten:cambricon on Sep 19, 2025

zqzten (Member) commented Sep 18, 2025

Ⅰ. Describe what this PR does

This PR adds Cambricon dynamic sMLU support to fine-grained device scheduling. The core scheduling logic is the same as for NVIDIA GPUs, so this change only covers device plugin (DP) adaptation.

Note that the Cambricon DP does not support external scheduling for full-card MLU allocation, so only dynamic sMLU (GPU sharing) is supported for now.

How to use:

koordinator.sh/gpu.shared: "1"
koordinator.sh/gpu-core: "5"
koordinator.sh/gpu-memory: 1Gi
# the resources below are only used to trigger the Cambricon DP
cambricon.com/mlu.smlu.vcore: "5"
cambricon.com/mlu.smlu.vmemory: "4"
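
For context, the resources above could appear in a complete Pod manifest roughly as in the following sketch. The Pod name, container name, and image are hypothetical, and `schedulerName: koord-scheduler` assumes the default Koordinator scheduler name; only the resource names and values come from the snippet above. The resources are placed under `limits`, so Kubernetes defaults the matching requests to the same values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: smlu-demo                     # hypothetical Pod name
spec:
  schedulerName: koord-scheduler      # assumption: default Koordinator scheduler name
  containers:
    - name: main                      # hypothetical container name
      image: example.com/mlu-workload:latest   # hypothetical image
      resources:
        limits:
          # resources consumed by Koordinator fine-grained device scheduling
          koordinator.sh/gpu.shared: "1"
          koordinator.sh/gpu-core: "5"
          koordinator.sh/gpu-memory: 1Gi
          # only used to trigger the Cambricon DP
          cambricon.com/mlu.smlu.vcore: "5"
          cambricon.com/mlu.smlu.vmemory: "4"
```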

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 63.10680% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.48%. Comparing base (441eb39) to head (0cbedd6).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...duler/plugins/deviceshare/device_plugin_adapter.go | 72.22% | 12 Missing and 8 partials ⚠️ |
| pkg/util/utils.go | 43.33% | 12 Missing and 5 partials ⚠️ |
| pkg/scheduler/plugins/deviceshare/plugin.go | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2624      +/-   ##
==========================================
- Coverage   66.49%   66.48%   -0.02%     
==========================================
  Files         491      491              
  Lines       59008    59103      +95     
==========================================
+ Hits        39240    39295      +55     
- Misses      16914    16945      +31     
- Partials     2854     2863       +9     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 66.48% <63.10%> (-0.02%) ⬇️ |

ZiMengSheng (Member) commented

/lgtm
/approve

koordinator-bot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ZiMengSheng

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

koordinator-bot bot merged commit 1845a44 into koordinator-sh:main on Sep 19, 2025
22 checks passed
zqzten deleted the cambricon branch on September 19, 2025, 08:22
qinfustu added a commit to qinfustu/koordinator that referenced this pull request Sep 22, 2025
…GPU sharing is enabled

Signed-off-by: qinfustu <fu_qin_stu@163.com>

manager: synchronize node GPU resources while distinguishing whether GPU sharing is enabled

Signed-off-by: qinfustu <30459241+qinfustu@users.noreply.github.com>

koordlet: fix path for sched_idle_saver_wmark (koordinator-sh#2611)

Signed-off-by: saintube <saintube@foxmail.com>
Co-authored-by: shenxin <rougang.hrg@alibaba-inc.com>

scheduler: takeover nominatingInfo when waitingPod rejected (koordinator-sh#2613)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

koordlet: avoid create multi nri connection (koordinator-sh#2617)

Signed-off-by: zhengj5 <zhengj5@trip.com>

scheduler: make customized workflow (koordinator-sh#2618)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

koordlet: fix typo and update xpu condition/partition (koordinator-sh#2619)

Signed-off-by: ZhuZhezz <zzhuzju@163.com>

scheduler: add diagnosis api (koordinator-sh#2607)

Signed-off-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>
Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>

scheduler: fine-grained device scheduling support Cambricon dynamic sMLU (koordinator-sh#2624)

Signed-off-by: Zach Zhu <zzqshu@126.com>

scheduler: remove unused apis/util (koordinator-sh#2627)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

manager: synchronize node GPU resources while distinguishing whether GPU sharing is enabled

Signed-off-by: qinfustu <30459241+qinfustu@users.noreply.github.com>
qinfustu added a commit to qinfustu/koordinator that referenced this pull request Sep 23, 2025