Release v0.9.0

This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.

Major New Features

End-to-End GPU Reset

GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:

GPU Reset Controller in Janitor (#797): New controller that consumes GPUReset CRDs and orchestrates the full reset lifecycle — tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
GPU Reset Container Image (#788): Dedicated gpu-reset container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
E2E and UAT Test Coverage (#768): Wires GPU reset through fault-remediation (mapping COMPONENT_RESET to GPUReset), node-drainer (partial drain for GPU-scoped events), and the health monitors (falling back to RESTART_VM when UUID discovery fails), and adds end-to-end and UAT tests that validate the full reset workflow.

Together, these changes provide a lightweight recovery path that resolves many GPU issues without a full node reboot: only the affected GPU is reset, and a partial drain keeps healthy workloads running.
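
The decision flow described in the items above can be sketched roughly as follows. This is an illustrative Python model, not NVSentinel code: the HealthEvent type, the function names, the "RebootNode" placeholder, and the assumption that non-reset actions use a full drain are all hypothetical. Only COMPONENT_RESET, RESTART_VM, GPUReset, and the partial-drain behavior come from the notes.

```python
from dataclasses import dataclass
from typing import Optional

# Action names as written in the release notes; the actual enum spelling
# inside NVSentinel may differ.
COMPONENT_RESET = "COMPONENT_RESET"
RESTART_VM = "RESTART_VM"


@dataclass
class HealthEvent:
    """Hypothetical, simplified health event."""
    node: str
    gpu_uuid: Optional[str]  # None when UUID discovery failed


def recommended_action(event: HealthEvent) -> str:
    """Health-monitor side: ask for a GPU-scoped reset only if the GPU is identified."""
    return COMPONENT_RESET if event.gpu_uuid else RESTART_VM


def remediation_resource(action: str) -> str:
    """Fault-remediation side: map the action to the kind of resource to create."""
    # "RebootNode" is a placeholder for the node-level path, not a confirmed kind.
    return {COMPONENT_RESET: "GPUReset"}.get(action, "RebootNode")


def drain_scope(action: str) -> str:
    """Node-drainer side: GPU-scoped events need only a partial drain (full drain assumed otherwise)."""
    return "partial" if action == COMPONENT_RESET else "full"


if __name__ == "__main__":
    for ev in (HealthEvent("node-a", "GPU-1234"), HealthEvent("node-b", None)):
        action = recommended_action(ev)
        print(ev.node, action, remediation_resource(action), drain_scope(action))
```

The point of the sketch is that the fallback is decided where the GPU identity is known (or not), so downstream components only need to look at the action.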

Preflight Check Framework Expansion

The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:

DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.
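
As a rough illustration of the per-GPU, per-test reporting described for the DCGM diagnostics check, the sketch below flattens a nested result map into one event per (GPU, test) pair. The result labels ("pass", "warn", "fail"), the input shape, and the function name are assumptions made for this example; they do not reproduce the pydcgm API or the actual event schema.

```python
from typing import Dict, List, Tuple

# Assumed result labels mapped to the severities named in the notes.
SEVERITY = {"fail": "fatal", "warn": "non-fatal", "pass": "healthy"}


def to_health_events(results: Dict[str, Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Turn {gpu_uuid: {test_name: result}} into (gpu_uuid, test_name, severity) tuples."""
    events = []
    for gpu in sorted(results):
        for test in sorted(results[gpu]):
            events.append((gpu, test, SEVERITY[results[gpu][test]]))
    return events


if __name__ == "__main__":
    sample = {
        "GPU-0": {"memory": "pass", "pcie": "warn"},
        "GPU-1": {"memory": "fail"},
    }
    for event in to_health_events(sample):
        print(event)
```

Emitting one event per pair, rather than one per node, lets a consumer see which GPU and which test tripped.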

Kubernetes Operator Health Monitoring

GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in gpu-operator and network-operator namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.
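
The timeout rule above can be modeled in a few lines. This is a minimal sketch under stated assumptions: classify_pod and its parameters are hypothetical names, the pod phase is passed in as a plain string, and a real monitor would publish events only on state transitions rather than on every evaluation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_pod(phase: str, created_at: datetime, now: datetime,
                 timeout: timedelta) -> Optional[str]:
    """Return "healthy", "fatal", or None while the pod is still within its grace period."""
    if phase == "Running":
        return "healthy"
    if now - created_at > timeout:
        return "fatal"  # not Running and past the configured timeout
    return None  # not Running yet, but still within the timeout


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)  # arbitrary example timestamp
    limit = timedelta(minutes=5)                       # assumed timeout value
    print(classify_pod("Pending", start, start + timedelta(minutes=2), limit))
    print(classify_pod("Pending", start, start + timedelta(minutes=6), limit))
    print(classify_pod("Running", start, start + timedelta(minutes=6), limit))
```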

Performance & Observability

Histogram Bucket Cardinality Reduction

96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.
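
To see why the bucket layout matters, recall that a classic Prometheus histogram exposes one _bucket series per upper bound, one for le="+Inf", plus _sum and _count, for every label combination. The sketch below compares the two layouts; the start, width, and factor values are assumptions chosen for illustration, since the notes give only the bucket counts (500 and 12).

```python
from typing import List


def linear_buckets(start: float, width: float, count: int) -> List[float]:
    """Evenly spaced upper bounds: start, start + width, ..."""
    return [start + i * width for i in range(count)]


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Geometrically spaced upper bounds: start, start * factor, ..."""
    return [start * factor ** i for i in range(count)]


def series_per_label_set(buckets: List[float]) -> int:
    """One _bucket series per bound, plus le="+Inf", plus _sum and _count."""
    return len(buckets) + 3


if __name__ == "__main__":
    old = linear_buckets(0.01, 0.01, 500)     # assumed start and width
    new = exponential_buckets(0.01, 2.0, 12)  # assumed start and factor
    print("linear:", series_per_label_set(old), "series per label set")
    print("exponential:", series_per_label_set(new), "series per label set")
    print("widest exponential bound:", new[-1])
```

The sketch counts series for a single label combination only; cluster-wide totals also depend on how many label combinations and other metrics exist.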

Configurable Network Policy

Optional Metrics Network Policy (#789): The metrics-access network policy can now be disabled via networkPolicy.enabled: false. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.

Bug Fixes & Reliability

Nolint Directive Cleanup (#828, #831): Cleaned up nolint directives previously marked as TODO across the codebase, improving lint compliance and code quality.
E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
Demo Script Fix (#809): Fixed demo script to display correct node conditions.
SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.

Build & Infrastructure

Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.

Documentation

K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.

Dependency Updates

Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
Multiple dependency updates via Dependabot (#803, #806, #829)

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.8.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
