Skip to content

GPU: make health monitoring optional#67

Open
byako wants to merge 1 commit into
mainfrom
gpu-optional-hc
Open

GPU: make health monitoring optional#67
byako wants to merge 1 commit into
mainfrom
gpu-optional-hc

Conversation

@byako

@byako byako commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Could help remediating intel/xpumanager#129 (comment)

@rouke-broersma , @niklasfrick this could help. WDYT ?

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
@rouke-broersma

Copy link
Copy Markdown
Contributor

I think it could be interesting to have stats about how many gpus fail to have health stats as well, maybe something for xumpd to expose. Don't personally have a need for it this solution solves my problem from the gpu operator side 👌👌

@eero-t eero-t left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but naming is marginally misleading because XPUMD provides also GPU capability (mem amount), not just health info.

(In legacy GPU case, they are all iGPUs i.e. have no device memory anyway.)

@niklasfrick

Copy link
Copy Markdown

Nice, this fixes the crashloop on the Gen9.5 boxes from xpumanager#129. With the flag on, the plugin keeps serving allocations while xpumd is uninitialized, which is what I needed. Two things before merge:

1. It still panics on shutdown. waitForXPUMDStream only leaves the loop via err == nil || d.stopXPUMDListener, and returns the last err as-is. In infinite mode against an unavailable xpumd, err is always set, so when the context is cancelled (SIGTERM, drain, rollout) the loop breaks with a non-nil err and the caller hits:

if err != nil {
    panic("xpumd-client: failed to connect to xpumd within expected time, exiting")
}

So every shutdown on an affected node still prints a panic and stacktrace, which is the exact case this PR is meant to avoid. Easiest fix: set err = nil when breaking out because d.stopXPUMDListener is set, or check it (or ctx.Err()) before the panic. Same applies to the reconnect call in the Recv loop.

2. Correct me if I'm wrong but the log probably gets stuck on "attempt 1/30". With attemptStep = 0, attempt never changes, so the log always prints attempt 1/30 no matter how many retries happen. For the case this targets, that's the one line you'd check, and it looks stuck on the first try. A normal loop fixes both this and the step-0 trick:

for attempt := 0; infiniteWait || attempt < ConnectAttemptsMax; attempt++ {
    ...
}

Approach is good and it solves the crashloop. I'd just want (1) fixed before merge; (2) if I am right is an easy win to do at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants