
koordlet: avoid create multi nri connection #2617

Merged
koordinator-bot[bot] merged 1 commit into koordinator-sh:main from yyrdl:main on Sep 12, 2025


@yyrdl
Contributor

@yyrdl yyrdl commented Sep 12, 2025

Ⅰ. Describe what this PR does

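Fixes an issue where reconnecting to the NRI server can, with some probability, create a large number of NRI connections. This poses two risks:

  • containerd treats the multiple NRI connections as multiple plugins, detects several plugins modifying the same field, and fails container creation
  • the design of the onClose callback causes a large number of goroutines to be created; in production use at Ctrip, more than 10 million goroutines were observed, which in turn led to an OOM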
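The failure mode described above (every close event starting its own reconnect, so connections and goroutines pile up) can be illustrated with a small guard that allows at most one reconnect attempt in flight. This is only a minimal sketch under stated assumptions, not the code merged in this PR: `reconnector`, `newReconnector`, `OnClose`, `Wait`, and the `dial` function are hypothetical names, and `dial` stands in for whatever establishes a single NRI connection.

```go
package main

import "sync"

// reconnector is a hypothetical guard, not the koordlet implementation.
// It ensures that close events never start more than one reconnect
// goroutine at a time.
type reconnector struct {
	mu      sync.Mutex
	pending bool           // true while a reconnect goroutine is running
	wg      sync.WaitGroup // tracks the in-flight reconnect goroutine
	dial    func() error   // assumption: opens exactly one new connection
}

// newReconnector wraps a dial function with the single-flight guard.
func newReconnector(dial func() error) *reconnector {
	return &reconnector{dial: dial}
}

// OnClose is meant to be called from a connection's close callback.
// It starts a reconnect goroutine only if none is in flight and
// reports whether it started one.
func (r *reconnector) OnClose() bool {
	r.mu.Lock()
	if r.pending {
		// A reconnect is already running: drop this event instead of
		// spawning another goroutine and another connection.
		r.mu.Unlock()
		return false
	}
	r.pending = true
	r.wg.Add(1)
	r.mu.Unlock()

	go func() {
		defer r.wg.Done()
		defer func() {
			// Allow future close events to trigger a new attempt.
			r.mu.Lock()
			r.pending = false
			r.mu.Unlock()
		}()
		// A real implementation would likely retry with backoff here.
		_ = r.dial()
	}()
	return true
}

// Wait blocks until the in-flight reconnect attempt, if any, finishes.
func (r *reconnector) Wait() { r.wg.Wait() }
```

With a guard like this, a burst of close callbacks arriving while a reconnect is still in progress results in a single dial, so the runtime would see one plugin connection rather than many. A later close event, arriving after that attempt has finished, can still trigger a fresh reconnect.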

Ⅱ. Does this pull request fix one issue?

fixes #2334

Ⅲ. Describe how to verify it

The production scenario is hard to reproduce; please check the unit tests submitted with this PR.

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@codecov

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 70.04049% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.46%. Comparing base (614ee0d) to head (c4674e3).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/koordlet/runtimehooks/nri/server.go | 72.80% | 26 Missing and 5 partials ⚠️ |
| pkg/koordlet/runtimehooks/nri/handlers.go | 68.67% | 20 Missing and 6 partials ⚠️ |
| pkg/koordlet/runtimehooks/nri/mock_nri.go | 60.46% | 17 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2617      +/-   ##
==========================================
+ Coverage   66.43%   66.46%   +0.02%     
==========================================
  Files         487      490       +3     
  Lines       58812    58914     +102     
==========================================
+ Hits        39074    39158      +84     
- Misses      16892    16903      +11     
- Partials     2846     2853       +7     
| Flag | Coverage Δ |
|---|---|
| unittests | 66.46% <70.04%> (+0.02%) ⬆️ |


Signed-off-by: zhengj5 <zhengj5@trip.com>
@saintube
Member

/lgtm
/approve

@koordinator-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saintube

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit 175111e into koordinator-sh:main Sep 12, 2025
22 checks passed
qinfustu added a commit to qinfustu/koordinator that referenced this pull request Sep 22, 2025
…GPU sharing is enabled

Signed-off-by: qinfustu <fu_qin_stu@163.com>

manager: synchronize node GPU resources while distinguishing whether GPU sharing is enabled

Signed-off-by: qinfustu <30459241+qinfustu@users.noreply.github.com>

koordlet: fix path for sched_idle_saver_wmark (koordinator-sh#2611)

Signed-off-by: saintube <saintube@foxmail.com>
Co-authored-by: shenxin <rougang.hrg@alibaba-inc.com>

scheduler: takeover nominatingInfo when waitingPod rejected (koordinator-sh#2613)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

koordlet: avoid create multi nri connection (koordinator-sh#2617)

Signed-off-by: zhengj5 <zhengj5@trip.com>

scheduler: make customized workflow (koordinator-sh#2618)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

koordlet: fix typo and update xpu condition/partition (koordinator-sh#2619)

Signed-off-by: ZhuZhezz <zzhuzju@163.com>

scheduler: add diagnosis api (koordinator-sh#2607)

Signed-off-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>
Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>

scheduler: fine-grained device scheduling support Cambricon dynamic sMLU (koordinator-sh#2624)

Signed-off-by: Zach Zhu <zzqshu@126.com>

scheduler: remove unused apis/util (koordinator-sh#2627)

Signed-off-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>
Co-authored-by: 乔普 <wangjianyu.wjy@alibaba-inc.com>

manager: synchronize node GPU resources while distinguishing whether GPU sharing is enabled

Signed-off-by: qinfustu <30459241+qinfustu@users.noreply.github.com>
qinfustu added a commit to qinfustu/koordinator that referenced this pull request Sep 23, 2025
…GPU sharing is enabled


Development

Successfully merging this pull request may close these issues.

[question] NRI Plugin Conflict: Duplicate CPU Pinning Attempt Detected
