Release v0.9.0
This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.
Major New Features
End-to-End GPU Reset
GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:
- GPU Reset Controller in Janitor (#797): New controller that consumes GPUReset CRDs and orchestrates the full reset lifecycle: tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
- GPU Reset Container Image (#788): Dedicated gpu-reset container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
- E2E and UAT Test Coverage (#768): Enables GPU reset across fault-remediation (mapping COMPONENT_RESET to GPUReset), node-drainer (partial drain for GPU-scoped events), and health monitors (fallback to RESTART_VM when UUID discovery fails). Includes comprehensive end-to-end and UAT tests validating the full reset workflow.
This provides a lightweight recovery mechanism that resolves many GPU issues without full node reboots — resetting only the affected GPU while keeping healthy workloads running via partial drain.
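The action selection described in the list above can be sketched roughly as follows. This is an illustrative sketch only, not NVSentinel's code: the action names COMPONENT_RESET and RESTART_VM and the GPUReset target come from these notes, while the function names, the lookup table, and the way a missing UUID is represented are assumptions.

```python
from typing import Optional

# Mapping stated in the release notes for fault-remediation.
# Mappings for other actions are not given there, so they are left out.
REMEDIATION_CRD_BY_ACTION = {
    "COMPONENT_RESET": "GPUReset",
}


def recommend_action(gpu_uuid: Optional[str]) -> str:
    """Hypothetical health-monitor helper.

    A GPU-scoped reset needs to know which GPU to target, so fall back
    to RESTART_VM when UUID discovery produced nothing.
    """
    if gpu_uuid:
        return "COMPONENT_RESET"
    return "RESTART_VM"


def remediation_crd(action: str) -> Optional[str]:
    """Hypothetical fault-remediation helper: look up the CRD kind, if any."""
    return REMEDIATION_CRD_BY_ACTION.get(action)
```

Under these assumptions, an event carrying a GPU UUID resolves to a GPUReset, while an event without one keeps the coarser RESTART_VM recommendation.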
Preflight Check Framework Expansion
The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:
- DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
- NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
- Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.
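The severity rule in the DCGM Diagnostics item (fatal for failures, non-fatal for warnings, healthy for passes) can be sketched as a small mapping. The HealthEvent shape, the field names, and the result strings "pass", "warn", and "fail" are assumptions made for illustration; they are not the pydcgm or NVSentinel types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical per-GPU, per-test event; field names are illustrative."""
    gpu_uuid: str
    test_name: str
    is_healthy: bool
    is_fatal: bool


def to_health_event(gpu_uuid: str, test_name: str, result: str) -> HealthEvent:
    """Translate one diagnostic result into one health event."""
    normalized = result.strip().lower()
    if normalized == "pass":
        return HealthEvent(gpu_uuid, test_name, is_healthy=True, is_fatal=False)
    if normalized == "warn":
        return HealthEvent(gpu_uuid, test_name, is_healthy=False, is_fatal=False)
    if normalized == "fail":
        return HealthEvent(gpu_uuid, test_name, is_healthy=False, is_fatal=True)
    # Unknown results are rejected rather than silently treated as healthy.
    raise ValueError(f"unrecognized diagnostic result: {result!r}")
```

Emitting one event per GPU and per test, as the notes describe, keeps a failure on one GPU from masking a pass on another.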
Kubernetes Operator Health Monitoring
- GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in the gpu-operator and network-operator namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.
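The timeout rule described above amounts to comparing each pod's age against a deadline while it is not yet Running. The sketch below assumes pods are plain dictionaries with hypothetical keys name, phase, and created; the real monitor works with Kubernetes objects and its own configuration, which these notes do not show.

```python
from datetime import datetime, timedelta, timezone


def pods_past_deadline(pods, timeout, now):
    """Return names of pods that are not Running and older than `timeout`.

    `pods` is a list of dicts with hypothetical keys:
    "name" (str), "phase" (str), "created" (timezone-aware datetime).
    """
    return [
        pod["name"]
        for pod in pods
        if pod["phase"] != "Running" and now - pod["created"] > timeout
    ]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "driver-a", "phase": "Running", "created": now - timedelta(minutes=30)},
    {"name": "driver-b", "phase": "Pending", "created": now - timedelta(minutes=30)},
    {"name": "driver-c", "phase": "Pending", "created": now - timedelta(minutes=1)},
]
stuck = pods_past_deadline(pods, timedelta(minutes=10), now)
```

Here only driver-b would be reported: driver-a is already Running and driver-c is still inside the timeout window.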
Performance & Observability
Histogram Bucket Cardinality Reduction
- 96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.
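To illustrate why exponential buckets need far fewer boundaries than linear ones for the same range, here is a minimal sketch. The start, width, and factor values are assumptions chosen for the example; only the bucket counts (500 and 12) come from these notes.

```python
def linear_buckets(start, width, count):
    """Evenly spaced upper bounds: start, start + width, ..."""
    if count < 1 or width <= 0:
        raise ValueError("count must be >= 1 and width must be > 0")
    return [start + width * i for i in range(count)]


def exponential_buckets(start, factor, count):
    """Geometrically spaced upper bounds: start, start * factor, ..."""
    if count < 1 or start <= 0 or factor <= 1:
        raise ValueError("count >= 1, start > 0 and factor > 1 are required")
    return [start * factor ** i for i in range(count)]


# Assumed parameters: both layouts reach into the hundreds of units.
linear = linear_buckets(start=1, width=1, count=500)
exponential = exponential_buckets(start=1, factor=2, count=12)

# Each bucket becomes one time series per label combination, so the
# per-histogram series count scales with the number of buckets.
bucket_ratio = len(exponential) / len(linear)
```

Since every histogram bucket is exported as its own series for each label combination, cutting the bucket count is multiplied across all labeled instances in the cluster. The overall percentage reported above also depends on series this sketch does not model.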
Configurable Network Policy
- Optional Metrics Network Policy (#789): The metrics-access network policy can now be disabled via networkPolicy.enabled: false. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.
Bug Fixes & Reliability
- Nolint Directive Cleanup (#828, #831): Cleaned up nolint directives previously marked as TODO across the codebase, improving lint compliance and code quality.
- E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
- Demo Script Fix (#809): Fixed demo script to display correct node conditions.
- SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
- CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.
Build & Infrastructure
- Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
- Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
- Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.
Documentation
- K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.
Dependency Updates
- Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
- Multiple dependency updates via dependabot (#803, #806, #829)
Acknowledgments
This release includes contributions from:
- @natherz97
- @XRFXLP
- @deesharma24
- @tanishagoyal2
- @ksaur
- @jtschelling
- @cbumb
- @yuanchen8911
- @yavinash007
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.9.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.8.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.9.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.