GPU: make health monitoring optional #67
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
|
I think it could be interesting to have stats on how many GPUs fail to provide health stats as well, maybe something for xpumd to expose. I don't personally have a need for it; this solution solves my problem from the GPU operator side 👌👌 |
eero-t
left a comment
LGTM, but the naming is marginally misleading because XPUMD also provides GPU capability info (memory amount), not just health info.
(In the legacy GPU case, they are all iGPUs, i.e. they have no device memory anyway.)
|
Nice, this fixes the crashloop on the Gen9.5 boxes from xpumanager#129. With the flag on, the plugin keeps serving allocations while xpumd is uninitialized, which is what I needed. Two things before merge:
1. It still panics on shutdown. So every shutdown on an affected node still prints a panic and stacktrace, which is the exact case this PR is meant to avoid. Easiest fix: set
2. Correct me if I'm wrong but the log probably gets stuck on "attempt 1/30". With
Approach is good and it solves the crashloop. I'd just want (1) fixed before merge; (2) if I am right is an easy win to do at the same time. |
Could help remediate intel/xpumanager#129 (comment)
@rouke-broersma, @niklasfrick this could help. WDYT?