GPU: make health monitoring optional by byako · Pull Request #67 · intel/intel-resource-drivers-for-kubernetes

byako · 2026-06-25T09:05:15Z

Could help remediate intel/xpumanager#129 (comment)

@rouke-broersma, @niklasfrick this could help. WDYT?

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>

rouke-broersma · 2026-06-25T09:18:53Z

I think it could be interesting to have stats about how many GPUs fail to have health stats as well, maybe something for xpumd to expose. I don't personally have a need for it, though; this solution solves my problem from the GPU operator side 👌👌

eero-t

LGTM, but the naming is marginally misleading because XPUMD also provides GPU capability info (memory amount), not just health info.

(In the legacy GPU case, they are all iGPUs, i.e. they have no device memory anyway.)

niklasfrick · 2026-06-25T15:25:30Z

Nice, this fixes the crashloop on the Gen9.5 boxes from xpumanager#129. With the flag on, the plugin keeps serving allocations while xpumd is uninitialized, which is what I needed. Two things before merge:

1. It still panics on shutdown. waitForXPUMDStream only leaves the loop via err == nil || d.stopXPUMDListener, and returns the last err as-is. In infinite mode against an unavailable xpumd, err is always set, so when the context is cancelled (SIGTERM, drain, rollout) the loop breaks with a non-nil err and the caller hits:

if err != nil {
    panic("xpumd-client: failed to connect to xpumd within expected time, exiting")
}

So every shutdown on an affected node still prints a panic and stacktrace, which is the exact case this PR is meant to avoid. Easiest fix: set err = nil when breaking out because d.stopXPUMDListener is set, or check it (or ctx.Err()) before the panic. Same applies to the reconnect call in the Recv loop.

2. Correct me if I'm wrong, but the log probably gets stuck on "attempt 1/30". With attemptStep = 0, attempt never changes, so the log prints attempt 1/30 no matter how many retries happen. For the case this targets, that's the one line you'd check, and it looks stuck on the first try. A normal loop fixes both this and the step-0 trick:

for attempt := 0; infiniteWait || attempt < ConnectAttemptsMax; attempt++ {
    ...
}

The approach is good and it solves the crashloop. I'd just want (1) fixed before merge; (2), if I am right, is an easy win to do at the same time.

GPU: make health monitoring optional

7fdf739

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>

byako requested a review from eero-t June 25, 2026 09:06

byako mentioned this pull request Jun 25, 2026

xpumd v2 crashloops with ERROR_UNINITIALIZED on Gen9.5 (Coffee Lake) iGPUs intel/xpumanager#129

Closed

eero-t approved these changes Jun 25, 2026

byako wants to merge 1 commit into main from gpu-optional-hc


4 participants
