feat: Add dcgm diagnostics as a preflight check #772
lalitadithya merged 33 commits into NVIDIA:main
Conversation
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
📝 Walkthrough
Replaces the preflight
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Pod as Pod
    participant Webhook as Webhook Injector
    participant Init as preflight-dcgm-diag Init Container
    participant NVML as NVML
    participant DCGM as DCGM Hostengine
    participant Reporter as HealthReporter
    participant Connector as Platform Connector (gRPC)
    Pod->>Webhook: creation webhook invoked
    Webhook->>Webhook: evaluate DCGM config & connector socket
    Webhook->>Pod: inject init container preflight-dcgm-diag + env vars
    Webhook->>Pod: add nvsentinel-socket hostPath volume (if configured)
    Init->>NVML: discover GPUs
    NVML-->>Init: GPU indices & UUIDs
    Init->>DCGM: connect and run diagnostics
    DCGM-->>Init: diagnostic results
    Init->>Init: parse results -> DiagResult list
    Init->>Reporter: build HealthEvents
    Reporter->>Connector: send HealthEvents via unix-socket gRPC
    Connector-->>Reporter: ack
    Init->>Pod: exit (0 success / 1 failure)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
preflight/go.mod (1)
3-3: ⚠️ Potential issue | 🟡 Minor
Update Go version to the latest stable release. Go 1.25.0 is outdated; the current stable release is Go 1.25.6. Update the `go` directive to `go 1.25.6` to stay current with security updates and bug fixes, aligning with the guidance to keep Go versions up-to-date.
🤖 Fix all issues with AI agents
In
`@distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/_helpers.tpl`:
- Around line 45-51: The helper gpu-health-monitor.dcgmEnabled uses a partial
nil check (.Values.global) but then accesses .Values.global.dcgm.enabled
directly which can panic if global.dcgm is missing; update the conditional to
mirror the dcgmEndpoint/dcgmPort pattern by checking the full chain (e.g., if
.Values.global and .Values.global.dcgm) and then use .Values.global.dcgm.enabled
| default .Values.dcgm.dcgmK8sServiceEnabled so the fallback
(.Values.dcgm.dcgmK8sServiceEnabled) is used safely when global.dcgm or its
enabled key is absent.
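Since Helm charts render through Go's `text/template`, the guarded chain this comment asks for can be exercised outside Helm. A minimal sketch, assuming illustrative value maps (the `renderEnabled` helper is not from the chart; note that, like sprig's `default`, the `else` fallback here also triggers when `enabled` is explicitly `false`):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderEnabled mirrors the suggested gpu-health-monitor.dcgmEnabled
// helper: dereference global.dcgm.enabled only when the whole chain
// exists (relying on short-circuit `and`, Go 1.18+), otherwise fall
// back to dcgm.dcgmK8sServiceEnabled.
func renderEnabled(values map[string]interface{}) string {
	const tpl = `{{- if and .Values.global .Values.global.dcgm .Values.global.dcgm.enabled -}}` +
		`true{{- else -}}{{ .Values.dcgm.dcgmK8sServiceEnabled }}{{- end -}}`
	t := template.Must(template.New("dcgmEnabled").Parse(tpl))

	var buf bytes.Buffer
	if err := t.Execute(&buf, map[string]interface{}{"Values": values}); err != nil {
		return "error: " + err.Error()
	}
	return buf.String()
}

func main() {
	// global.dcgm missing entirely: no panic, fallback wins.
	fmt.Println(renderEnabled(map[string]interface{}{
		"dcgm": map[string]interface{}{"dcgmK8sServiceEnabled": true},
	})) // true

	// Full chain present and enabled.
	fmt.Println(renderEnabled(map[string]interface{}{
		"global": map[string]interface{}{"dcgm": map[string]interface{}{"enabled": true}},
		"dcgm":   map[string]interface{}{"dcgmK8sServiceEnabled": false},
	})) // true
}
```

The unguarded form `.Values.global.dcgm.enabled` would fail at render time when `global.dcgm` is absent; the `and` chain makes each dereference conditional on the previous one.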
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/_helpers.tpl`:
- Around line 102-111: The helper preflight.dcgmHostengineAddr currently returns
an empty string when .Values.global.dcgm.service is missing; change it to mirror
gpu-health-monitor.dcgmAddr by falling back to the local .Values.dcgm.service
values: if .Values.global.dcgm.service exists use
.Values.global.dcgm.service.endpoint and .port, otherwise use
.Values.dcgm.service.endpoint and .Values.dcgm.service.port; update the
conditional in the define "preflight.dcgmHostengineAddr" to check
.Values.global.dcgm.service and select the appropriate endpoint:port string
accordingly so it never returns an empty string when local dcgm.service is
provided.
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 29-32: Replace the development image and undocumented fields: set
image.repository to the official NVIDIA registry
(ghcr.io/nvidia/nvsentinel/preflight) instead of xrfxlp/preflight, change
image.tag to use the chart’s release/version pattern (e.g., the same value used
by other charts such as the global image tag or Chart.AppVersion) instead of
10420202, and keep pullPolicy as IfNotPresent; also add short inline comments
for the image.repository, image.pullPolicy, and image.tag fields mirroring the
style and content used in the global configuration block above so each field is
documented.
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 44-49: The Dockerfile copies only go.mod for the dcgm-diag module
but then runs RUN go mod download in WORKDIR
/workspace/preflight-checks/dcgm-diag; update the COPY instruction that
references the dcgm-diag module so it copies both go.mod and go.sum (similar to
how data-models copies go.mod and go.sum) to ensure go mod download can verify
checksums and produce reproducible builds—locate the COPY line that currently
says COPY preflight-checks/dcgm-diag/go.mod ./preflight-checks/dcgm-diag/ and
modify it to include go.sum as well.
- Line 19: Update the Dockerfile build ARG for Go to a current patched 1.25
release: change the ARG named GOLANG_VERSION used in the Dockerfile from 1.25.0
to 1.25.6 (or a newer 1.25.x patch) so the image uses Go 1.25.6+; ensure any
related build/test scripts or CI that reference GOLANG_VERSION are updated to
the same value to keep versions consistent.
In `@preflight-checks/dcgm-diag/gpu.go`:
- Line 15: Add a package-level godoc comment describing the purpose of this
package immediately above the package declaration for package main; create a
concise sentence or two that explains what the dcgm-diag/gpu package does (e.g.,
GPU preflight diagnostics using DCGM) so the package has proper documentation
and meets the coding guidelines.
In `@scripts/buildko.sh`:
- Line 55: The script currently only includes ./preflight in the go work/use and
ko build targets; add the missing dcgm-diag module by including
./preflight-checks/dcgm-diag alongside ./preflight in the `go work use`
directive (replace or extend the existing ./preflight token) and add
./preflight-checks/dcgm-diag to the list of targets passed to `ko build` (the
same place where ./preflight is listed) so the dcgm-diag binary is included in
the container build.
🧹 Nitpick comments (8)
preflight/pkg/config/config.go (2)
39-43: Add godoc comment for exported `DCGMConfig` type. Per Go conventions, exported types should have documentation comments describing their purpose and fields.

📝 Proposed documentation

```diff
+// DCGMConfig holds DCGM (Data Center GPU Manager) diagnostic configuration
+// for preflight health checks.
 type DCGMConfig struct {
+	// HostengineAddr is the address of the DCGM hostengine service (e.g., "nvidia-dcgm.gpu-operator.svc:5555")
 	HostengineAddr string `yaml:"hostengineAddr"`
+	// DiagLevel specifies the DCGM diagnostic level (1-4), defaults to 1
 	DiagLevel int `yaml:"diagLevel"`
+	// Timeout specifies the diagnostic timeout duration (e.g., "5m"), defaults to "5m"
 	Timeout string `yaml:"timeout"`
 }
```
60-66: Consider validating `Timeout` as a parseable duration. The `Timeout` field is set as a string but will likely be parsed as a `time.Duration` downstream. Validating it early in `Load` would surface configuration errors at startup rather than at runtime.

🛡️ Proposed validation

```diff
 if fileConfig.DCGM.Timeout == "" {
 	fileConfig.DCGM.Timeout = "5m"
 }
+if _, err := time.ParseDuration(fileConfig.DCGM.Timeout); err != nil {
+	return nil, fmt.Errorf("invalid DCGM timeout duration %q: %w", fileConfig.DCGM.Timeout, err)
+}
 return &Config{FileConfig: fileConfig}, nil
```

You'll also need to add `"time"` to the imports.

preflight-checks/dcgm-diag/main.go (2)
32-43: Remove redundant `os.Exit(0)` call. Go programs exit with status 0 by default when `main()` returns normally. The explicit `os.Exit(0)` is redundant and prevents deferred functions from executing (though none exist here currently).

♻️ Suggested simplification

```diff
 func main() {
 	slog.Info("Starting preflight dcgm-diag check", "version", version, "commit", commit, "date", date)

 	if err := run(); err != nil {
 		slog.Error("DCGM diagnostic check failed", "error", err)
 		os.Exit(1)
 	}

 	slog.Info("DCGM diagnostic check passed")
-	os.Exit(0)
 }
```
94-99: Consider moving logger configuration out of `parseConfig`. Setting the global logger as a side effect within `parseConfig` couples configuration parsing with logging setup. For better separation of concerns, consider returning the verbose flag and configuring the logger in `run()` or `main()`.
51-74: Goroutine may continue running after context timeout. When the context times out, `runDCGMDiag` returns but the goroutine calling `dcgm.RunDiag` continues executing. This is likely acceptable since DCGM diagnostics cannot be cancelled mid-execution, but be aware that the diagnostic process will complete in the background.

preflight-checks/dcgm-diag/Makefile (1)
41-41: Platform inconsistency between local and publish builds. `PLATFORMS` is set to `linux/amd64,linux/arm64` (line 41), but `docker-build` (line 85) hardcodes `--platform linux/amd64`. This is likely intentional for faster local builds, but consider adding a comment to clarify this design choice.

📝 Add clarifying comment

```diff
 .PHONY: docker-build
 docker-build: setup-buildx ## Build Docker image locally
 	@echo "Building Docker image for preflight-dcgm-diag (local)..."
 	$(if $(filter true,$(DISABLE_REGISTRY_CACHE)),@echo "Registry cache disabled for this build")
 	cd $(REPO_ROOT) && docker buildx build \
+		# Local builds use single platform for speed; docker-publish uses PLATFORMS
 		--platform linux/amd64 \
```

Also applies to: 85-85
preflight-checks/dcgm-diag/health.go (1)
47-61: Consider deduplicating GPU entities. If multiple test failures occur on the same GPU, the `entities` slice will contain duplicate UUIDs. Depending on the reporting API's expectations, this may cause issues or redundant data.

♻️ Deduplicate entities using a map

```diff
 func reportHealthEvent(connectorSocket string, results []dcgm.DiagResult, isFatal bool, message string) error {
 	reporter := newReporter(connectorSocket)

-	entities := make([]reporting.Entity, 0, len(results))
+	seen := make(map[uint]bool)
+	entities := make([]reporting.Entity, 0, len(results))

 	for _, r := range results {
+		if seen[r.EntityID] {
+			continue
+		}
+		seen[r.EntityID] = true
+
 		uuid, err := GetGPUUUID(r.EntityID)
 		if err != nil {
 			slog.Error("Failed to get GPU UUID for health event", "gpuIndex", r.EntityID, "error", err)
 			return err
 		}

 		entities = append(entities, reporting.Entity{
 			Type:  "GPU_UUID",
 			Value: uuid,
 		})
 	}
```
26-27: Global map lacks synchronization and clearing logic. The `gpuIndexToUUID` map is written in `getAllocatedGPUs` and read in `GetGPUUUID` without synchronization. If these are called concurrently, there's a data race. Additionally, the map is never cleared before repopulating, so repeated calls to `getAllocatedGPUs` could leave stale entries.

If this code path is guaranteed to be single-threaded and called only once, this is acceptable. Otherwise, consider using a `sync.RWMutex` and clearing the map at the start of `getAllocatedGPUs`.

♻️ Suggested fix with mutex protection

```diff
+import "sync"
+
+var gpuIndexToUUIDMu sync.RWMutex
 var gpuIndexToUUID = make(map[uint]string)
```

Then in `getAllocatedGPUs` (around line 66):

```go
gpuIndexToUUIDMu.Lock()
gpuIndexToUUID = make(map[uint]string) // clear stale entries
defer gpuIndexToUUIDMu.Unlock()
```

And in `GetGPUUUID`:

```go
gpuIndexToUUIDMu.RLock()
defer gpuIndexToUUIDMu.RUnlock()
```
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
20-83: ⚠️ Potential issue | 🟠 Major
Add inline docs for newly added values (and examples where non-obvious). Several new blocks lack inline comments (e.g., `serviceAccount`, `securityContext`, `service`, `resources`, probes, `nodeSelector`/`tolerations`/`affinity`, and `global.dcgm.service.port`). The chart guidelines require every value to be documented and examples for non-obvious settings.

Example style (apply broadly)

```diff
 global:
   dcgm:
     service:
       # DCGM hostengine service endpoint (GPU Operator's DCGM service)
       endpoint: "nvidia-dcgm.gpu-operator.svc"
+      # DCGM hostengine service port
       port: 5555
 @@
 serviceAccount:
+  # Create a dedicated ServiceAccount for the preflight webhook
   create: true
```

preflight/pkg/webhook/injector.go (1)
17-24: ⚠️ Potential issue | 🟠 Major
Derive socket hostPath from `ConnectorSocket` instead of hardcoding. Hardcoding `/var/run/nvsentinel` won't include the default socket (`/var/run/nvsentinel.sock`) or custom paths, so the init container may not see the socket. Parse the configured socket URI and mount its directory.

Suggested fix

```diff
 import (
 	"fmt"
 	"log/slog"
+	"path/filepath"
+	"strings"

 	"github.com/nvidia/nvsentinel/preflight/pkg/config"
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/resource"
 )
 @@
-	// Only inject socket volume if connector socket is configured
-	if i.cfg.DCGM.ConnectorSocket == "" {
+	// Only inject socket volume if connector socket is configured
+	socketPath := strings.TrimPrefix(i.cfg.DCGM.ConnectorSocket, "unix://")
+	if socketPath == "" {
 		return patches
 	}
 @@
-	hostPathType := corev1.HostPathDirectory
+	hostPathType := corev1.HostPathDirectoryOrCreate
 	socketVolume := corev1.Volume{
 		Name: nvsentinelSocketVolumeName,
 		VolumeSource: corev1.VolumeSource{
 			HostPath: &corev1.HostPathVolumeSource{
-				Path: "/var/run/nvsentinel",
+				Path: filepath.Dir(socketPath),
 				Type: &hostPathType,
 			},
 		},
 	}
```

Also applies to: 164-184
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 18-35: The Dockerfile downloads the Go tarball without verifying
integrity; add an ARG GOLANG_SHA256 and replace the current RUN that uses
wget+tar with a download-and-verify sequence that checks the SHA256 before
extraction: declare ARG GOLANG_SHA256, download
go${GOLANG_VERSION}.linux-amd64.tar.gz (using wget or curl) and then verify it
with sha256sum -c (e.g. echo "$GOLANG_SHA256
go${GOLANG_VERSION}.linux-amd64.tar.gz" | sha256sum -c -) and fail the build on
mismatch, only then tar -C /usr/local -xzf the verified archive and remove the
artifacts; update the RUN that currently contains wget/tar to perform these
steps and reference ARG GOLANG_VERSION and the new ARG GOLANG_SHA256.
In `@preflight-checks/dcgm-diag/pkg/gpu/gpu.go`:
- Around line 18-25: The indexToUUID map is written during discovery and read by
GetUUID without synchronization, causing races and stale data; protect all
accesses by adding a package-level sync.RWMutex (or mutex) and use a write lock
when populating/clearing indexToUUID in the discovery code that populates the
map and a read lock inside GetUUID (and any other readers) to ensure consistent
reads; also ensure the discovery path clears or replaces the map while holding
the write lock so stale entries are removed atomically.
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 66-77: The loop building gpuUUIDs from results using gpu.GetUUID
may append the same UUID multiple times when multiple tests fail on one GPU;
update the logic in health.go (the block that iterates over results and
populates gpuUUIDs) to deduplicate UUIDs before returning/reporting by using a
temporary set/map keyed by UUID (or checking for existence before append) so
only unique GPU UUIDs are added to the gpuUUIDs slice; ensure the rest of the
code that consumes gpuUUIDs continues to receive a slice of unique UUID strings.
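The deduplication being asked for can be sketched with a set-style map; the `uniqueUUIDs` helper below is illustrative, not code from the PR:

```go
package main

import "fmt"

// uniqueUUIDs removes duplicate UUIDs while preserving first-seen
// order, so one GPU failing several tests is reported only once.
func uniqueUUIDs(uuids []string) []string {
	seen := make(map[string]bool, len(uuids))
	out := make([]string, 0, len(uuids))
	for _, u := range uuids {
		if seen[u] {
			continue
		}
		seen[u] = true
		out = append(out, u)
	}
	return out
}

func main() {
	fmt.Println(uniqueUUIDs([]string{"GPU-a", "GPU-b", "GPU-a"})) // [GPU-a GPU-b]
}
```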
- Around line 115-128: The HealthEvent currently sets IsHealthy using IsHealthy:
!isFatal which wrongly marks warnings as healthy; change the constructor to
accept an explicit isHealthy bool (or pass an existing isHealthy flag) and set
IsHealthy: isHealthy in the pb.HealthEvent literal (keeping IsFatal: isFatal),
and update all callers that construct this event (locations invoking the
function/section that sets event := &pb.HealthEvent{...}, using
agentName/isFatal) to supply the correct isHealthy value so warnings become
IsFatal=false, IsHealthy=false while healthy remains IsFatal=false,
IsHealthy=true.
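A minimal sketch of the explicit-flag constructor the prompt describes, using a stand-in struct rather than the generated `pb.HealthEvent`:

```go
package main

import "fmt"

// HealthEvent is a stand-in for the generated pb.HealthEvent, modeling
// only the two flags under discussion.
type HealthEvent struct {
	IsFatal   bool
	IsHealthy bool
}

// newHealthEvent takes both flags explicitly instead of deriving
// IsHealthy from !isFatal, so a warning can be unhealthy yet non-fatal.
func newHealthEvent(isFatal, isHealthy bool) *HealthEvent {
	return &HealthEvent{IsFatal: isFatal, IsHealthy: isHealthy}
}

func main() {
	warning := newHealthEvent(false, false) // warning: non-fatal, not healthy
	fatal := newHealthEvent(true, false)    // failure: fatal, not healthy
	fmt.Printf("warning: fatal=%v healthy=%v\n", warning.IsFatal, warning.IsHealthy)
	fmt.Printf("fatal:   fatal=%v healthy=%v\n", fatal.IsFatal, fatal.IsHealthy)
}
```

With `IsHealthy: !isFatal`, the warning case above would incorrectly come out as `healthy=true`.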
🧹 Nitpick comments (7)
preflight/pkg/config/config.go (1)
39-44: Add godoc for exported `DCGMConfig`. golint expects exported identifiers to have a leading comment; this new type is missing one.

Suggested fix

```diff
+// DCGMConfig holds DCGM diagnostics configuration loaded from YAML.
 type DCGMConfig struct {
 	HostengineAddr  string `yaml:"hostengineAddr"`
 	DiagLevel       int    `yaml:"diagLevel"`
 	Timeout         string `yaml:"timeout"`
 	ConnectorSocket string `yaml:"connectorSocket"`
 }
```

preflight-checks/dcgm-diag/main.go (3)
35-46: Redundant `os.Exit(0)` call. After a successful `run()` returns `nil`, the program will naturally exit with code 0. The explicit `os.Exit(0)` on line 45 is unnecessary and prevents any deferred functions in `main()` from executing (though there are none currently).

🧹 Proposed cleanup

```diff
 func main() {
 	slog.Info("Starting preflight dcgm-diag check", "version", version, "commit", commit, "date", date)

 	if err := run(); err != nil {
 		slog.Error("DCGM diagnostic check failed", "error", err)
 		os.Exit(1)
 	}

 	slog.Info("DCGM diagnostic check passed")
-	os.Exit(0)
 }
```
104-113: Consider using `strconv.Atoi` and logging parse failures. `fmt.Sscanf` is unconventional for simple integer parsing. The silent fallback on parse errors could mask configuration issues in production.

♻️ Proposed improvement

```diff
+import "strconv"
+
 func getEnvInt(key string, defaultValue int) int {
 	if value := os.Getenv(key); value != "" {
-		var result int
-		if _, err := fmt.Sscanf(value, "%d", &result); err == nil {
+		result, err := strconv.Atoi(value)
+		if err == nil {
 			return result
 		}
+		slog.Warn("Invalid integer value for environment variable, using default",
+			"key", key, "value", value, "default", defaultValue, "error", err)
 	}

 	return defaultValue
 }
```
115-123: Consider logging parse failures for duration environment variables. Similar to integer parsing, silently falling back on invalid duration strings could hide misconfiguration.

♻️ Proposed improvement

```diff
 func getEnvDuration(key string, defaultValue time.Duration) time.Duration {
 	if value := os.Getenv(key); value != "" {
 		if d, err := time.ParseDuration(value); err == nil {
 			return d
 		}
+		slog.Warn("Invalid duration value for environment variable, using default",
+			"key", key, "value", value, "default", defaultValue)
 	}

 	return defaultValue
 }
```

preflight-checks/dcgm-diag/pkg/diag/diag.go (2)
34-79: Goroutine may leak on context cancellation. When the context times out (line 70-71), the goroutine running `dcgm.RunDiag` continues executing and will eventually send to `resultCh` or `errCh`. Since these are buffered channels of size 1, this won't block, but the DCGM resources may not be properly cleaned up if the diagnostic is still running.

Consider documenting this behavior or investigating whether `dcgm.RunDiag` can be cancelled. The current implementation is acceptable for a preflight check that exits immediately after.

📝 Suggested documentation improvement

```diff
 // Run executes DCGM diagnostics using the go-dcgm bindings.
 //
 // Note: go-dcgm requires CGO and links against libdcgm.so at compile time.
 // The binary must be built with DCGM 4.2.3+ which introduced dcgmDiagResponse_version12.
+//
+// On context timeout, the diagnostic goroutine may continue running until completion.
+// The process is expected to exit shortly after, releasing all resources.
 func Run(ctx context.Context, level int, hostengineAddr string) (*dcgm.DiagResults, error) {
```
182-210: Minor code duplication between `formatFailures` and `formatWarnings`. These functions are nearly identical, differing only in the fallback suffix. Consider extracting a shared helper.

♻️ Optional refactor to reduce duplication

```diff
-func formatFailures(failures []dcgm.DiagResult) string {
+func formatResults(results []dcgm.DiagResult, fallbackSuffix string) string {
 	var parts []string

-	for _, f := range failures {
-		msg := fmt.Sprintf("%s (GPU %d): %s", f.TestName, f.EntityID, f.ErrorMessage)
-		if f.ErrorMessage == "" {
-			msg = fmt.Sprintf("%s (GPU %d): failed", f.TestName, f.EntityID)
+	for _, r := range results {
+		msg := fmt.Sprintf("%s (GPU %d): %s", r.TestName, r.EntityID, r.ErrorMessage)
+		if r.ErrorMessage == "" {
+			msg = fmt.Sprintf("%s (GPU %d): %s", r.TestName, r.EntityID, fallbackSuffix)
 		}

 		parts = append(parts, msg)
 	}

 	return strings.Join(parts, "; ")
 }
-
-func formatWarnings(warnings []dcgm.DiagResult) string {
-	var parts []string
-
-	for _, w := range warnings {
-		msg := fmt.Sprintf("%s (GPU %d): %s", w.TestName, w.EntityID, w.ErrorMessage)
-		if w.ErrorMessage == "" {
-			msg = fmt.Sprintf("%s (GPU %d): warning", w.TestName, w.EntityID)
-		}
-
-		parts = append(parts, msg)
-	}
-
-	return strings.Join(parts, "; ")
-}
```

Then update callers:

```go
msg := formatResults(failures, "failed")
msg := formatResults(warnings, "warning")
```

preflight-checks/dcgm-diag/pkg/health/health.go (1)
42-56: Consider returning an error from `ReportError` for consistency. `ReportError` silently logs and discards the error from `sendHealthEvent`, while `ReportEvent` returns the error. This inconsistency may confuse maintainers. Even if the caller ignores the error, returning it provides a consistent API.

♻️ Proposed consistency improvement

```diff
 // ReportError reports a fatal error without specific GPU entities.
-func ReportError(connectorSocket, message string) {
+func ReportError(connectorSocket, message string) error {
 	if connectorSocket == "" {
 		slog.Info("Skipping health event reporting (no connector socket configured)")
-		return
+		return nil
 	}

 	slog.Info("Reporting error health event", "socket", connectorSocket, "message", message)

 	if err := sendHealthEvent(connectorSocket, nil, true, message); err != nil {
 		slog.Warn("Failed to report health event", "error", err)
+		return err
 	}
+
+	return nil
 }
```

Then in `main.go`, the caller can choose to ignore the returned error:

```go
_ = health.ReportError(cfg.connectorSocket, err.Error())
```
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/pkg/diag/diag.go`:
- Around line 30-78: The select using ctx.Done() races with deferred cleanup
because dcgm.RunDiag (called inside the goroutine) is not cancellable; update
Run to either remove the ctx timeout handling entirely or ensure the goroutine
finishes before returning: for example, replace the select with a synchronous
receive from resultCh/errCh (removing ctx.Done) or add a done/wait mechanism
(sync.WaitGroup or a separate done channel that the goroutine closes after
sending to resultCh/errCh) so that when ctx.Done() fires you still wait for the
goroutine to complete before executing deferred cleanup (refer to Run, initDCGM,
dcgm.RunDiag, resultCh, errCh, groupCleanup, and cleanup).
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 73-90: The code compares errors directly with io.EOF (e.g., err ==
io.EOF) which fails for wrapped errors; update that comparison to use
errors.Is(err, io.EOF) so wrapped io.EOF is detected, and add the errors import
if missing. Locate the logic that handles the sendToConnector result inside the
wait.ExponentialBackoff retry block (and the subsequent status.FromError(err)
handling) and replace the direct equality check with errors.Is to correctly
unwrap wrapped errors.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/_helpers.tpl`:
- Around line 42-51: The helper template "gpu-health-monitor.dcgmEnabled"
currently uses default which treats false as empty and re-enables DCGM; update
the inner logic to check for the presence of the explicit key instead of using
default: when .Values.global and .Values.global.dcgm exist use hasKey on
.Values.global.dcgm for "enabled" and if the key exists return
.Values.global.dcgm.enabled, otherwise fall back to
.Values.dcgm.dcgmK8sServiceEnabled; ensure this change targets the define
"gpu-health-monitor.dcgmEnabled" and the keys .Values.global.dcgm.enabled and
.Values.dcgm.dcgmK8sServiceEnabled so explicit false is preserved.
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/_helpers.tpl`:
- Around line 142-145: The helper preflight.connectorSocket currently accesses
.Values.global.socketPath directly which can panic; update the template to first
check whether .Values.global and .Values.global.socketPath are defined (e.g.,
use an if with and to test .Values.global and .Values.global.socketPath), and
only then use that value; otherwise use the default "/var/run/nvsentinel.sock"
and return the formatted "unix://%s" string—mirror the defensive pattern used in
preflight.dcgmEndpoint/preflight.dcgmPort so nested property access is guarded
before applying defaults.
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 170-172: The current check in health.go only tests whether err
implements interface{ Temporary() bool } but never calls Temporary(), so
non-temporary errors that implement it will be retried; update the logic to
perform the type assertion to a variable (e.g., temp, ok := err.(interface{
Temporary() bool })) and then call temp.Temporary() and only return true (or
trigger a retry) when ok is true AND temp.Temporary() == true; ensure you
reference the existing err variable and replace the current if block that uses
the anonymous type assertion.
In `@preflight/pkg/webhook/injector.go`:
- Around line 159-200: The injectVolumes function currently hardcodes
"/var/run/nvsentinel"; update it to derive the host path from
i.cfg.DCGM.ConnectorSocket (strip "unix://" if present, use filepath.Dir to get
the directory) and set hostPathType to corev1.HostPathDirectoryOrCreate so the
directory is created if missing; adjust the socketVolume construction in
injectVolumes to use that derived path and type, keep the existing checks for
empty ConnectorSocket and existing volume, and add imports for "strings" and
"path/filepath".
🧹 Nitpick comments (1)
preflight/pkg/webhook/injector.go (1)
203-207: Avoid hard-coding the DCGM init-container name. If users override the init container name in config/values, env injection will silently skip. Consider making the target name configurable or validating it during config load.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/__main__.py`:
- Around line 76-86: The code logs "DCGM diagnostic check passed"
unconditionally even when the warnings branch ran; update the control flow so
that when warnings is non-empty you emit a warning-level completion log (e.g.,
use log.warning with a message like "DCGM diagnostic completed with warnings")
and only log the "passed" info-level message in the else branch (no warnings).
Locate the warnings handling block (the list comprehension for uuids, the
message variable and reporter.send_event calls) and change the final log call
there from log.info to a warning-level log; keep the existing log.info("DCGM
diagnostic check passed") only in the no-warnings (else) branch so logs
accurately reflect health.
- Around line 51-56: The exception handler for diag.run currently reports the
diagnostic failure as non-fatal; update the handler so that when diag.run raises
an Exception you log the error and call reporter.send_event with
is_healthy=False and is_fatal=True (preserving the message from the exception),
then exit with sys.exit(1); this change affects the except block that catches
Exception as e, the log.error call, and the reporter.send_event invocation.
In `@preflight-checks/dcgm-diag/dcgm_diag/config.py`:
- Line 31: The code currently does diag_level = int(os.getenv("DCGM_DIAG_LEVEL",
"1")) which will raise an unhandled ValueError if DCGM_DIAG_LEVEL is
non-numeric; wrap the int conversion in a try/except that catches ValueError,
read the raw env value via os.getenv("DCGM_DIAG_LEVEL"), and either fallback to
a safe default (e.g. 1) or raise a new ValueError with a clear message including
the offending value (and reference to DCGM_DIAG_LEVEL) so callers see what
failed; update the diag_level assignment in config.py accordingly and use the
same variable name diag_level in the fix.
In `@preflight-checks/dcgm-diag/dcgm_diag/health.py`:
- Around line 61-62: The send_event path currently logs on failure but silently
continues; change send_event so that if self._send_with_retries(health_events)
returns False it raises a specific exception (e.g., HealthEventDeliveryError)
instead of only calling log.error, and define that exception class
(HealthEventDeliveryError) in the module; update callers of send_event (or
document callers of send_event/_send_with_retries) to catch or propagate this
exception so preflight can fail fast. Ensure the raised exception message
includes MAX_RETRIES and context about the health_events for debugging.
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 43-54: Remove the line that writes "Acquire::https::Verify-Peer
\"false\";" to /etc/apt/apt.conf.d/99disable-cert-check and ensure HTTPS
verification remains enabled; instead run apt-get update first and then install
ca-certificates (and other packages) so apt uses system CAs for TLS.
Specifically delete the RUN that creates 99disable-cert-check and reorder the
block around the apt-get update + apt-get install sequence in the same
Dockerfile RUN (referencing the echo to /etc/apt/apt.conf.d/99disable-cert-check
and the apt-get install invocation) so ca-certificates is installed/used
normally rather than disabling verification.
In `@preflight-checks/dcgm-diag/Makefile`:
- Around line 24-27: Add an explicit all target to the Makefile to satisfy
checkmake by creating an alias target named all that depends on the existing
lint-test (so default make behavior is preserved); update the Makefile near the
existing .PHONY and lint-test target (reference: .PHONY and lint-test) to
include an all target that simply delegates to lint-test.
In `@preflight-checks/dcgm-diag/pyproject.toml`:
- Around line 10-11: The pyproject dependency version for grpcio is too low
compared to the generated module's runtime check in health_event_pb2_grpc.py;
update the grpc-related pins in pyproject.toml (grpcio and grpcio-tools) to at
least 1.75.1 (or match the minimum checked by health_event_pb2_grpc.py) so the
import-time version check passes, and run a quick install and test import of
health_event_pb2_grpc to verify resolution.
🧹 Nitpick comments (5)
preflight-checks/dcgm-diag/dcgm_diag/gpu.py (2)
30-31: Consider raising `KeyError` or returning `Optional[str]` for invalid GPU index. `get_uuid` silently returns an empty string for an invalid index, which could mask bugs if callers don't check for empty strings. Consider either raising `KeyError` for invalid indices or changing the return type to `str | None` to make the failure case explicit.

Option A: Raise on invalid index

```diff
 def get_uuid(self, index: int) -> str:
-    return self._index_to_uuid.get(index, "")
+    if index not in self._index_to_uuid:
+        raise KeyError(f"GPU index {index} not found")
+    return self._index_to_uuid[index]
```

Option B: Return Optional

```diff
-def get_uuid(self, index: int) -> str:
-    return self._index_to_uuid.get(index, "")
+def get_uuid(self, index: int) -> str | None:
+    return self._index_to_uuid.get(index)
```
47-49: Error message mismatch between log and exception. The log says "Failed to discover GPUs" but the raised exception says "NVML initialization failed". This could be confusing during debugging since discovery failure could happen after successful init (e.g., during `nvmlDeviceGetCount` or `nvmlDeviceGetHandleByIndex`).

Align error messages

```diff
 except pynvml.NVMLError as e:
-    log.error(f"Failed to discover GPUs: {e}")
-    raise RuntimeError(f"NVML initialization failed: {e}")
+    log.error(f"GPU discovery failed: {e}")
+    raise RuntimeError(f"GPU discovery failed: {e}")
```

preflight-checks/dcgm-diag/dcgm_diag/health.py (1)
96-107: Consider adding a timeout to the gRPC call.

The `stub.HealthEventOccurredV1` call has no timeout, which could cause the preflight check to hang indefinitely if the platform connector is unresponsive. gRPC calls support a `timeout` parameter.

Add timeout to gRPC call

```diff
+GRPC_TIMEOUT = 30.0  # seconds
+
+# ... in _send_with_retries:
 with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
     stub = pb_grpc.PlatformConnectorStub(channel)
-    stub.HealthEventOccurredV1(health_events)
+    stub.HealthEventOccurredV1(health_events, timeout=GRPC_TIMEOUT)
 log.info("Health event sent successfully")
```

preflight-checks/dcgm-diag/dcgm_diag/diag.py (2)
86-86: Silent fallback to `DCGM_DIAG_LVL_SHORT` for invalid levels.

If an invalid level is passed (outside 1-4), the code silently defaults to `DCGM_DIAG_LVL_SHORT`. This could mask configuration errors. Since `Config.from_env` already validates the level, this is defensive, but a warning log would help debugging.

Add warning for fallback

```diff
-diag_level = self.DIAG_LEVELS.get(level, dcgm_structs.DCGM_DIAG_LVL_SHORT)
+diag_level = self.DIAG_LEVELS.get(level)
+if diag_level is None:
+    log.warning(f"Invalid diagnostic level {level}, defaulting to SHORT")
+    diag_level = dcgm_structs.DCGM_DIAG_LVL_SHORT
```
102-102: Add type hints for `response` and `value` parameters.

Per coding guidelines, type hints are required for all functions. The `response` parameter in `_parse_response` and the `value` parameter in `_decode_string` are missing type annotations.

Add type hints

```diff
-    def _parse_response(self, response, gpu_indices: list[int]) -> list[DiagResult]:
+    def _parse_response(self, response: dcgm_structs.c_dcgmDiagResponse_v12, gpu_indices: list[int]) -> list[DiagResult]:

     @staticmethod
-    def _decode_string(value) -> str:
+    def _decode_string(value: bytes | str | None) -> str:
```

Also applies to: 144-145
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/diag.py`:
- Around line 43-46: Add an explicit return-type annotation "-> None" to the
constructor: update the __init__ definition in diag.py (the __init__ that sets
self._hostengine_addr, self._handle and self._gpu_discovery) to declare a None
return type (e.g., def __init__(self, hostengine_addr: str = "") -> None:) so it
satisfies the project’s type-hint requirement.
- Around line 146-152: The _decode_string method lacks a type hint for its
parameter; update the signature of _decode_string to accept Any (def
_decode_string(value: Any) -> str) and ensure typing.Any is imported (add "from
typing import Any" if missing) so the function handles bytes, str, or other
values with proper type annotations.
- Around line 113-132: error_lookup is keyed by err.testId but the lookup uses
the loop index test_idx; change the lookup to use the actual test id from the
result (entity_result.testId) so errors map correctly to results. Specifically,
in the block iterating results (where entity_result is read and gpu_idx set)
replace the key (test_idx, gpu_idx) with (entity_result.testId, gpu_idx) when
calling error_lookup.get so error_msg is retrieved using the true test
identifier.
- Around line 102-103: The _parse_response function is missing a type annotation
for the response parameter; update its signature to include an appropriate type
(e.g., response: dict[str, Any] or response: Mapping[str, Any]) and add the
necessary typing import (Any or Mapping) at the top of the module so
_parse_response(self, response: dict[str, Any], gpu_indices: list[int]) ->
list[DiagResult] (or with Mapping) is fully typed and satisfies the project's
type-hint rule.
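The `error_lookup` keying bug called out above can be reduced to a self-contained sketch; the dataclasses below are simplified stand-ins for the DCGM response structs, not the real types:

```python
from dataclasses import dataclass


@dataclass
class DiagError:
    """Stand-in for a DCGM diagnostic error entry."""
    testId: int
    gpuId: int
    msg: str


@dataclass
class EntityResult:
    """Stand-in for a per-GPU, per-test diagnostic result."""
    testId: int
    gpuId: int


def attach_errors(results: list[EntityResult], errors: list[DiagError]) -> list[str]:
    """Map each result to its error message, keyed by the result's own testId."""
    error_lookup = {(e.testId, e.gpuId): e.msg for e in errors}
    # Key by the result's testId, not the enumeration index: test ids are
    # not guaranteed to line up with loop positions in the response.
    return [error_lookup.get((r.testId, r.gpuId), "") for r in results]
```

If the lookup were keyed by the loop index instead, any result whose `testId` differs from its position would silently get an empty error message.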
🧹 Nitpick comments (3)
preflight-checks/dcgm-diag/Makefile (1)
61-69: Add `.PHONY` declarations for `docker-build` and `docker-publish`. These targets don't produce files with those names, so they should be declared as phony to ensure they always run when invoked.
Suggested fix
```diff
 .PHONY: setup
 setup:
 	@echo "Setting up Poetry environment for $(MODULE_NAME)..."
 	poetry config virtualenvs.in-project true
 	poetry install

+.PHONY: docker-build
 docker-build: setup-buildx
 	@echo "Building Docker image for $(MODULE_NAME)..."
```

And similarly for `docker-publish`:

```diff
+.PHONY: docker-publish
 docker-publish: setup-buildx
 	@echo "Publishing Docker image for $(MODULE_NAME)..."
```

preflight-checks/dcgm-diag/dcgm_diag/diag.py (2)
35-41: Annotate `DIAG_LEVELS` as `ClassVar` to clarify intent.

This avoids instance-level mutation ambiguity and aligns with typing best practices.

♻️ Suggested update

```diff
+from typing import ClassVar
 ...
 class DCGMDiagnostic:
-    DIAG_LEVELS = {
+    DIAG_LEVELS: ClassVar[dict[int, int]] = {
     }
```
69-75: Avoid catching broad `Exception` on shutdown.
Please narrow this to the specific DCGM/pydcgm exception type your version emits to avoid masking unexpected errors.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/health.py`:
- Around line 36-41: The __init__ method of HealthReporter is missing an
explicit return type; update the constructor signature for
HealthReporter.__init__ to include "-> None" (i.e., make it def __init__(...,
processing_strategy: pb.ProcessingStrategy) -> None:) so it matches the return
type annotations used by other methods in the class and satisfies the project's
typing consistency.
- Around line 26-28: The retry loop uses stub.HealthEventOccurredV1() without a
deadline so hung RPCs block retries; add a bounded timeout (e.g., RPC_TIMEOUT)
and pass it to the gRPC call (deadline/timeout param) and make it configurable
alongside MAX_RETRIES/INITIAL_DELAY/BACKOFF_FACTOR; update the call site
(stub.HealthEventOccurredV1(..., timeout=RPC_TIMEOUT) or the appropriate gRPC
deadline API used in this codebase) so the RPC raises on timeout and the
existing exception handling and backoff logic can proceed.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/conftest.py`:
- Around line 20-32: Add explicit type hints for the pytest fixtures: annotate
the monkeypatch parameter as pytest.MonkeyPatch and the fixture return types as
None for both clean_env and valid_env, and annotate the clean_env parameter on
valid_env as None; ensure pytest is imported for the MonkeyPatch type (e.g.,
import pytest) so functions read def clean_env(monkeypatch: pytest.MonkeyPatch)
-> None: and def valid_env(monkeypatch: pytest.MonkeyPatch, clean_env: None) ->
None:.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/test_config.py`:
- Around line 15-18: Add missing type hints to every test function/method in
test_config.py: annotate all parameters (e.g., fixtures) with their expected
types and give each test a return type of None; update method signatures inside
the test class (and top-level test functions) such as any test_* functions in
test_config.py to include parameter and return annotations, and import required
typing names (e.g., Callable, Any) if needed. Ensure fixtures referenced in
signatures use their proper types (or typing.Any) and that all test methods end
with -> None so the module conforms to the coding guidelines.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/test_health.py`:
- Around line 24-95: Tests and fixture lack type annotations; add appropriate
typing to the reporter fixture and each test function (including methods in
TestSendWithRetries and TestSendEvent) to satisfy project requirements: annotate
the reporter fixture to return HealthReporter, annotate test functions'
parameters (e.g., mock objects like mock_channel, mock_sleep, mock_send and the
reporter param) with types from unittest.mock (MagicMock/Mock) or grpc/types
where applicable and mark return types as None; update signatures for
reporter(), test_success_first_attempt(...), test_retries_on_failure(...),
test_fails_after_max_retries(...), test_exponential_backoff(...), and
TestSendEvent.test_raises_on_failure(...) accordingly, importing typing names
(e.g., Mock, MagicMock) if needed.
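The bounded-timeout retry the first comment asks for can be sketched without a gRPC dependency; `send` stands in for `stub.HealthEventOccurredV1`, and the constants mirror the names the comment mentions (values here are illustrative):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 3
INITIAL_DELAY = 0.01   # seconds (shortened for illustration)
BACKOFF_FACTOR = 2.0
RPC_TIMEOUT = 30.0     # per-attempt deadline passed to the RPC


def send_with_retries(send: Callable[[float], T]) -> T:
    """Call send(timeout); on failure, sleep with exponential backoff and retry."""
    delay = INITIAL_DELAY
    last_exc: Exception | None = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # Each attempt carries its own deadline, so a hung RPC cannot
            # block the retry loop indefinitely.
            return send(RPC_TIMEOUT)
        except Exception as exc:  # in real code, narrow to grpc.RpcError
            last_exc = exc
            if attempt < MAX_RETRIES:
                time.sleep(delay)
                delay *= BACKOFF_FACTOR
    raise RuntimeError(f"all {MAX_RETRIES} attempts failed") from last_exc
```

Passing the timeout through the callable keeps the backoff logic testable with a plain function in place of the gRPC stub.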
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 4d8056a
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/errors.py`:
- Around line 49-52: Strip CSV fields before validating and using them: assign
stripped_name = row[0].strip() and stripped_action = row[1].strip(), then check
that stripped_name is non-empty and stripped_action is a valid key of
pb.RecommendedAction (e.g., stripped_action in pb.RecommendedAction.keys())
before adding mapping[stripped_name] =
pb.RecommendedAction.Value(stripped_action); this prevents creating empty keys
and preserves mappings where only surrounding whitespace was present.
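The suggested sanitization can be sketched as follows; a plain dict stands in for the `pb.RecommendedAction` protobuf enum, and all names here are assumptions:

```python
import csv
import io

# Stand-in for pb.RecommendedAction's name -> value mapping.
RECOMMENDED_ACTIONS = {"RESTART_POD": 1, "REBOOT_NODE": 2}


def load_name_to_action(csv_text: str) -> dict[str, int]:
    """Parse error-name/action rows, stripping whitespace before validating."""
    mapping: dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        name = row[0].strip()
        action = row[1].strip()
        # Skip empty names and unknown actions instead of inserting bad keys,
        # but keep rows where only surrounding whitespace was present.
        if name and action in RECOMMENDED_ACTIONS:
            mapping[name] = RECOMMENDED_ACTIONS[action]
    return mapping
```

Stripping before validation means a row like `" DCGM_FR_ECC_ERROR , RESTART_POD "` still maps correctly, while a row with an empty name or unrecognized action is dropped.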
🧹 Nitpick comments (1)
preflight-checks/dcgm-diag/dcgm_diag/errors.py (1)
30-42: Cache the loader functions to avoid repeated file I/O and reflection. Both loader functions are called transitively by public APIs; caching keeps repeated calls fast and avoids rereading the CSV.
♻️ Proposed refactor
```diff
+@lru_cache(maxsize=1)
 def _load_code_to_name() -> dict[int, str]:
     """Load DCGM error code → name mapping from dcgm_errors module."""
     try:
         import dcgm_errors
@@
         return {}

+@lru_cache(maxsize=1)
 def _load_name_to_action() -> dict[str, int]:
     """Load DCGM error name → action mapping from CSV file."""
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/lint-test.yml:
- Around line 153-154: The CI step currently invokes "make -C
preflight-checks/${{ matrix.component }} lint" which only runs the Black
formatter check; update this to run the full "lint-test" target (i.e., replace
the make target with "lint-test") so linting plus tests and coverage run in CI,
or alternatively add a separate job that runs "make -C preflight-checks/${{
matrix.component }} lint-test" (or a dedicated test job) to ensure the dcgm-diag
tests and coverage artifacts are executed; modify the job that currently has the
name "Run lint" (or create a new job) accordingly so it calls the "lint-test"
target instead of "lint".
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 132ba37
/ok to test 07e6ae8
/ok to test 416ea9c
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 8eba746
/ok to test c7e40b6
/ok to test 5463918
/ok to test 4c712c2
Pull Request is not mergeable
@lalitadithya @XRFXLP I would appreciate not merging code against this branch without consulting with @pteranodan and myself first.
Summary
Adds DCGM diagnostic tests as part of the preflight test suite. The overall flow is:
Package structure:
Testing
Tested with DCGM 4.3.1-1-ubuntu22.04 and 4.2.3-1-ubuntu22.04
Created pod:
Verified that init container was injected:
Healthy node
Platform connector logs (per GPU, per check event)
Unhealthy Node
From platform connector logs:
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Chores
Tests