feat: enable GPU reset with e2e and UAT tests #768

Merged

lalitadithya merged 2 commits into NVIDIA:main from natherz97:enable-gpu-reset on Feb 12, 2026

Conversation

natherz97 (Contributor) commented Jan 30, 2026

Summary

Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.

This PR enables GPU reset in NVSentinel and adds e2e and UAT tests for this functionality. To enable GPU reset, we need to:

  1. Set partialDrainEnabled to true in node-drainer.
  2. Map the COMPONENT_RESET recommended action to GPUReset rather than RebootNode in fault-remediation:
maintenance:
  actions:
    "COMPONENT_RESET":  # Action 2
      apiGroup: "janitor.dgxc.nvidia.com"
      version: "v1alpha1"
      kind: "GPUReset"
      scope: "Cluster"
      completeConditionType: "Complete"
      templateFileName: "gpureset-template.yaml"
      equivalenceGroup: "reset"
      supersedingEquivalenceGroups: ["restart"]
      impactedEntityScope: "GPU_UUID"

Note that the default chart values set partialDrainEnabled to false and do not map COMPONENT_RESET to GPUReset. The Tilt and UAT clusters have both settings enabled for testing. If partial drain is enabled but COMPONENT_RESET still maps to RebootNode, nodes with running pods will be rebooted; this could happen if someone consumes the default node-drainer values once we flip partialDrainEnabled to true by default while still overriding the maintenance options themselves. The reverse is also possible: someone sets partialDrainEnabled to false but maps COMPONENT_RESET -> GPUReset, which isn't dangerous but is sub-optimal.

Supporting component resets

  • [devices]: the component's corresponding device name must be exposed through a device plugin or DRA
  • [metadata-collector]: must be configured to expose the pod-to-device mapping for the given component's device plugin or DRA resource name as a pod object annotation. This is currently hard-coded but we could choose to expose this as a Helm variable if we anticipate components other than GPUs needing reset.
	EntityTypeToResourceNames = map[string][]string{
		"GPU_UUID": {
			"nvidia.com/gpu",
			"nvidia.com/pgpu",
		},
	}
  • [node-drainer]: for COMPONENT_RESET HealthEvents, the node-drainer checks that a mapping exists from the given impacted entity to its component's device plugin resource names. If a COMPONENT_RESET HealthEvent arrives where this mapping doesn't exist, the drain fails. For COMPONENT_RESET events that include a supported impacted entity, the node-drainer drains only the pods using that entity, based on the pod annotation written by the metadata-collector (a sketch of this selection follows this list).
    • the device resource names referenced by the metadata-collector and the entity-to-resource-name mapping used by the node-drainer are defined in a single object in pod_device_annotation.go
  • [health-monitors]: must be configured to only write COMPONENT_RESET events that include the GPU_UUID. COMPONENT_RESET events that are missing this entity will result in failed drains (or failed remediation). Currently, health-monitors will fall back to RESTART_VM if a HealthEvent for COMPONENT_RESET cannot discover the UUID (today this only happens when the metadata-collector is unhealthy).
  • [fault-remediation]: the COMPONENT_RESET action must map to a custom resource that targets only the given impacted entity. Any COMPONENT_RESET event that reaches the fault-remediation module without a GPU_UUID will fail remediation (in practice this shouldn't happen, since such an event would first result in a failed drain). The fault-remediation module will ensure that any remediation actions in its config which include an impacted entity scope are configured for partial draining (by checking the same pod_device_annotation.go file), or else it will fail to start. Note that the impacted entity scope for GPU_UUID will be included in the rendered maintenance custom resource.
    • We will map COMPONENT_RESET to the GPUReset custom resource. If we ever support other forms of component resets, we will need to make sure that a single custom resource supports all these impacted entities, or we will need to map from remediation action + entity to custom resources (rather than just remediation action to custom resource, as today).
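
To make this concrete, here is a minimal sketch of the partial-drain selection and the related startup check described above. The pod annotation format and the function names are assumptions for illustration only; the real mapping lives in pod_device_annotation.go and the real logic lives in the node-drainer and fault-remediation modules.

    // Sketch only: annotation format and function names are hypothetical.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    // EntityTypeToResourceNames mirrors the mapping shown above.
    var EntityTypeToResourceNames = map[string][]string{
        "GPU_UUID": {"nvidia.com/gpu", "nvidia.com/pgpu"},
    }

    // podUsesImpactedEntity reports whether a pod's device annotation (assumed JSON
    // format: resource name -> device IDs) references the impacted entity, so the
    // node-drainer evicts only pods that actually use the unhealthy GPU.
    func podUsesImpactedEntity(annotationJSON, entityType, entityValue string) (bool, error) {
        resources, ok := EntityTypeToResourceNames[entityType]
        if !ok {
            // No mapping for this entity type: fail the drain rather than guess.
            return false, fmt.Errorf("no device resource mapping for entity type %q", entityType)
        }
        var devicesByResource map[string][]string
        if err := json.Unmarshal([]byte(annotationJSON), &devicesByResource); err != nil {
            return false, err
        }
        for _, res := range resources {
            for _, id := range devicesByResource[res] {
                if id == entityValue {
                    return true, nil
                }
            }
        }
        return false, nil
    }

    // validateActionScopes models the fault-remediation startup check: every action
    // that declares an impactedEntityScope must have a known entity-to-resource mapping.
    func validateActionScopes(scopesByAction map[string]string) error {
        for action, scope := range scopesByAction {
            if scope == "" {
                continue
            }
            if _, ok := EntityTypeToResourceNames[scope]; !ok {
                return fmt.Errorf("action %s declares unsupported impactedEntityScope %q", action, scope)
            }
        }
        return nil
    }

    func main() {
        ann := `{"nvidia.com/gpu":["GPU-455d8f70-2051-db6c-0430-ffc457bff834"]}`
        uses, _ := podUsesImpactedEntity(ann, "GPU_UUID", "GPU-455d8f70-2051-db6c-0430-ffc457bff834")
        fmt.Println(uses)                                                                   // true
        fmt.Println(validateActionScopes(map[string]string{"COMPONENT_RESET": "GPU_UUID"})) // <nil>
    }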

Testing

  1. e2e tests succeeded on a local Kind cluster leveraging a version of Janitor which supports GPUReset. We will need to use this config for Janitor in our Tilt cluster (a sketch of how the e2e flow checks GPUReset completion follows this list):
    gpuResetController:
      enabled: true
      timeout: 20m
      resetJobImage: "alpine:latest"
      serviceManager:
        name: "gpu-operator"
        spec:
          namespace: "gpu-operator"
          teardownTimeout: 5m
          restoreTimeout: 10m
          managerSelector:
            app.kubernetes.io/managed-by: "tilt"
          apps:
          - appSelector:
              app: "nvidia-dcgm"
            nodeLabel: "nvidia.com/gpu.deploy.dcgm"
            enabledValue: "true"
            disabledValue: "false"
  2. UAT tests succeeded when running manually with the following config for Janitor:
    gpuResetController:
      enabled: true
      timeout: 20m
      serviceManager:
        name: "gpu-operator"

The existing UAT confirms that a reboot occurred by checking the bootID and is agnostic of whichever fault-remediation implementation executed the reboot. For GPU reset we don't have the same benefit of detecting the reset independently of our GPUReset solution. For the GPUReset UAT, we detect that a reset occurred by checking for the syslog line written by the GPUReset job. Later we can change this to a generic reset line written by the driver once that's available (rather than detecting K8s objects for the GPUReset, such as the CR, job, or privileged pod). Logs for the test execution (the reboot tests timed out due to high reboot latency on OCI):

nherz@DP66VX7CLX uat % ./tests.sh
[2026-01-29 12:45:06] Starting NVSentinel UAT tests...
[2026-01-29 12:45:06] =========================================
[2026-01-29 12:45:06] Test 1: GPU monitoring via DCGM
[2026-01-29 12:45:06] =========================================
[2026-01-29 12:45:07] Selected GPU node: 10.0.6.34
[2026-01-29 12:45:09] Original boot ID: 29398bd5-3390-444d-b88f-e64f26b8d1bd
Defaulted container "nvidia-dcgm-ctr" out of: nvidia-dcgm-ctr, toolkit-validation (init)
Successfully injected field info.
[2026-01-29 12:45:11] Waiting for node events to appear...
[2026-01-29 12:45:27] Found power event
[2026-01-29 12:45:27] Verifying node events are populated (non-fatal errors appear here)
GpuPowerWatchIsNotHealthy Message=ErrorCode:DCGM_FR_CLOCK_THROTTLE_POWER GPU:0 PCI:0000:0f:00.0 GPU_UUID:GPU-8598879c-4839-1709-231e-36a2b2844bca Detected clocks event due to power violation in GPU 0. Monitor the power conditions. This GPU can still perform workload. Recommended Action=NONE;
[2026-01-29 12:45:29] Node event verified: GpuPowerWatch is non-fatal, appears in events ✓
Defaulted container "nvidia-dcgm-ctr" out of: nvidia-dcgm-ctr, toolkit-validation (init)
Successfully injected field info.
[2026-01-29 12:45:31] Waiting for node condition 'GpuMemWatch' to appear on node 10.0.6.34...
[2026-01-29 12:45:42] Node condition 'GpuMemWatch' found ✓
  Status=True Reason=GpuMemWatchIsNotHealthy
[2026-01-29 12:45:43] Waiting for node 10.0.6.34 to be quarantined (cordoned)...
[2026-01-29 12:45:45] Node 10.0.6.34 is quarantined (cordoned) ✓
[2026-01-29 12:45:45] Waiting for node to reboot and recover...
[2026-01-29 12:45:45] Waiting for node 10.0.6.34 to reboot (boot ID to change)...
[2026-01-29 12:58:22] ERROR: Timeout waiting for node 10.0.6.34 to reboot
...
[2026-01-29 12:59:51] ======================================================
[2026-01-29 12:59:51] Test 2: XID monitoring via syslog triggers RESTART_VM
[2026-01-29 12:59:52] ======================================================
[2026-01-29 12:59:53] Selected GPU node: 10.0.6.34
[2026-01-29 12:59:54] Original boot ID: 29398bd5-3390-444d-b88f-e64f26b8d1bd
[2026-01-29 12:59:55] Injecting XID 79 message via logger on pod: nvidia-driver-daemonset-k5c5d
[2026-01-29 12:59:57] Waiting for node condition 'SysLogsXIDError' to appear on node 10.0.6.34...
[2026-01-29 13:00:12] Node condition 'SysLogsXIDError' found ✓
  Status=True Reason=SysLogsXIDErrorIsNotHealthy
[2026-01-29 13:00:13] Waiting for node 10.0.6.34 to be quarantined (cordoned)...
[2026-01-29 13:00:15] Node 10.0.6.34 is quarantined (cordoned) ✓
[2026-01-29 13:00:15] Waiting for node to reboot and recover...
[2026-01-29 13:00:15] Waiting for node 10.0.6.34 to reboot (boot ID to change)...
[2026-01-29 13:13:22] ERROR: Timeout waiting for node 10.0.6.34 to reboot
...
[2026-01-29 13:11:12] ==========================================================
[2026-01-29 13:11:12] Test 3: XID monitoring via syslog triggers COMPONENT_RESET
[2026-01-29 13:11:12] ==========================================================
[2026-01-29 13:11:16] Selected GPU node: 10.0.6.34 (has healthy syslog-health-monitor)
[2026-01-29 13:11:17] Fetching GPU UUID and PCI from nvidia-smi in driver pod to construct syslog message
[2026-01-29 13:11:19] Resetting GPU UUID GPU-8598879c-4839-1709-231e-36a2b2844bca on PCI 0000:0f:00
[2026-01-29 13:11:19] Injecting XID 119 message on GPU GPU-8598879c-4839-1709-231e-36a2b2844bca via logger on pod: nvidia-driver-daemonset-k5c5d
[2026-01-29 13:11:21] Waiting for node condition 'SysLogsXIDError' to appear on node 10.0.6.34...
[2026-01-29 13:11:22] Node condition 'SysLogsXIDError' found ✓
  Status=True Reason=SysLogsXIDErrorIsNotHealthy
[2026-01-29 13:11:23] Waiting for node 10.0.6.34 to be quarantined (cordoned)...
[2026-01-29 13:11:31] Node 10.0.6.34 is quarantined (cordoned) ✓
[2026-01-29 13:11:31] Waiting for node to GPU reset and recover...
[2026-01-29 13:11:31] Waiting for GPU reset for GPU-8598879c-4839-1709-231e-36a2b2844bca on 10.0.6.34 (GPU reset syslog message from Janitor)...
command terminated with exit code 1
...
Jan 29 21:12:33 inst-q0xyj-dgxce-dgxc-k8s-oci-lhr-dev1-gpu root: GPU reset executed: GPU-8598879c-4839-1709-231e-36a2b2844bca
[2026-01-29 13:12:39] GPU GPU-8598879c-4839-1709-231e-36a2b2844bca reset successfully
[2026-01-29 13:12:39] Waiting for node 10.0.6.34 to be uncordoned...
[2026-01-29 13:12:40] Node 10.0.6.34 is uncordoned and ready ✓
[2026-01-29 13:12:40] Waiting for node 10.0.6.34 to be uncordoned...
[2026-01-29 13:12:42] Node 10.0.6.34 is uncordoned and ready ✓
[2026-01-29 13:12:42] Test 3 PASSED ✓
[2026-01-29 13:12:42] =========================================
[2026-01-29 13:12:42] All tests PASSED ✓
[2026-01-29 13:12:42] =========================================

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added GPU reset remediation capability with new action configuration and templates.
    • Introduced partial drain enablement option for workload management.
    • Expanded fault remediation actions to support component reset, VM restart, and bare-metal restart with per-action configuration.
    • Enhanced entity scoping for GPU-based remediation workflows.
  • Documentation

    • Updated fault remediation configuration documentation with new completion condition types.

coderabbitai bot (Contributor) commented Jan 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR refactors fault remediation configuration from a single-template model to a per-action structure with multiple templates, introduces generic custom resource helpers parameterized by GroupVersionKind, adds GPU reset remediation capabilities alongside RebootNode remediation, and expands test coverage with GPU reset and node-locking validation.

Changes

  • Configuration Restructuring
    Files: distros/kubernetes/nvsentinel/values-full.yaml, distros/kubernetes/nvsentinel/values-tilt.yaml, tests/uat/aws/nvsentinel-values.yaml, tests/uat/gcp/nvsentinel-values.yaml, tests/uat/kind/nvsentinel-values.yaml
    Summary: Replaced monolithic maintenance configuration with per-action remediation definitions (COMPONENT_RESET, RESTART_VM, RESTART_BM) specifying apiGroup, version, kind, scope, completeConditionType, templateFileName, and equivalenceGroup. Introduced templates mapping for reusable YAML content and updateRetry settings. Added GPU reset action with GPUReset kind, Complete condition, and GPU_UUID scoping.
  • Kubernetes Template Updates
    Files: distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml, distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
    Summary: Extended configmap template to render impactedEntityScope and supersedingEquivalenceGroups conditionally within remediationActions. Minor whitespace normalization in values file.
  • Generic CR Helper Refactoring
    Files: tests/helpers/kube.go
    Summary: Introduced RebootNodeGVK and GPUResetGVK constants, generalized CR operations (ListAllCRs, WaitForNoCR, WaitForCR, DeleteAllCRs, DeleteCR) with GroupVersionKind parameters, added CreateGPUResetCR for GPU reset operations. Replaced hard-coded RebootNode logic with parameterized GVK-based handling (a sketch of this GVK-parameterized pattern follows this list).
  • Test Helper Updates
    Files: tests/helpers/fault_remediation.go, tests/helpers/health_events_analyzer.go
    Summary: Updated cleanup and CR wait calls to use generalized helpers with RebootNodeGVK parameter, replacing RebootNode-specific functions.
  • Test Data Files
    Files: tests/data/busybox-pod-with-devices.yaml, tests/data/fatal-health-event-component-reset.json, tests/data/healthy-event-component-reset.json
    Summary: Updated GPU annotation to concrete UUID (GPU-455d8f70-2051-db6c-0430-ffc457bff834). Added two new health event payloads for GPU component reset scenarios (fatal and healthy states).
  • Test File Updates (Helper Usage)
    Files: tests/csp_health_monitor_test.go, tests/fault_management_test.go, tests/fault_remediation_test.go, tests/log_collector_test.go, tests/node_drainer_test.go, tests/scale_test.go, tests/smoke_test.go
    Summary: Updated test calls from RebootNode-specific helpers to generic CR helpers with RebootNodeGVK parameter. Updated health event data paths to use fault-health-event-restart-vm.json and updated GPU UUID references.
  • New Test Additions
    Files: tests/gpu_reset_test.go, tests/janitor_test.go
    Summary: Added TestGPUReset exercising full GPU reset workflow with pod deployment, health events, drain/cordon operations, and CR completion validation. Added TestJanitorNodeLocking verifying non-overlapping CR execution on the same node with a helper function for status time extraction.
  • Documentation & Configuration
    Files: docs/configuration/fault-remediation.md, .ctlptl.yaml
    Summary: Updated completeConditionType from "Completed" to "Complete" in COMPONENT_RESET documentation. Corrected trailing newline formatting in ctlptl configuration.
  • UAT Test Orchestration
    Files: tests/uat/tests.sh
    Summary: Added wait_for_node_unquarantine() and wait_for_gpu_reset() helper functions, introduced test_xid_monitoring_syslog_gpu_reset() test case exercising GPU reset after syslog XID events with dynamic GPU UUID/PCI extraction.
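
For orientation, a GVK-parameterized list helper of the kind described for tests/helpers/kube.go might look roughly like the sketch below. The actual helpers are built on the e2e-framework client and their signatures differ; a controller-runtime client is used here only to keep the sketch short.

    package sketch

    import (
        "context"

        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // gpuResetGVK identifies the Janitor GPUReset custom resource (values taken from
    // the chart configuration shown in the PR description).
    var gpuResetGVK = schema.GroupVersionKind{
        Group:   "janitor.dgxc.nvidia.com",
        Version: "v1alpha1",
        Kind:    "GPUReset",
    }

    // listAllCRs lists every custom resource of the given kind; callers pass a GVK
    // such as gpuResetGVK (or a RebootNode GVK) instead of relying on hard-coded
    // RebootNode logic.
    func listAllCRs(ctx context.Context, c client.Client, gvk schema.GroupVersionKind) ([]unstructured.Unstructured, error) {
        list := &unstructured.UnstructuredList{}
        // List calls need the list kind ("<Kind>List") set on the unstructured object.
        list.SetGroupVersionKind(schema.GroupVersionKind{
            Group:   gvk.Group,
            Version: gvk.Version,
            Kind:    gvk.Kind + "List",
        })
        if err := c.List(ctx, list); err != nil {
            return nil, err
        }
        return list.Items, nil
    }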

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop! Hop! The remediations now take action,
Each GPU reset flows with generalized traction,
From nodes rebooting to components anew,
Generic helpers bind every CR through!
Tests validate locking, GPU resets shine,
A remediation garden, precisely designed! 🌱

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 17.07%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat: enable GPU reset with e2e and UAT tests' accurately reflects the main change: introducing GPU reset functionality and comprehensive testing. It is concise, specific, and clearly communicates the primary objective from the developer's perspective.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/csp_health_monitor_test.go (1)

512-518: ⚠️ Potential issue | 🟠 Major

Fix context key mismatch to avoid teardown panic.

originalArgsContextKey is used to store the value (Line 436), but teardown reads keyOriginalArgsContextKey, which no longer exists. This will return nil and panic on the ([]string) type assertion.

🩹 Proposed fix
-        originalArgs := ctx.Value(keyOriginalArgsContextKey).([]string)
+        originalArgs := ctx.Value(originalArgsContextKey).([]string)
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/remediation/remediation_test.go`:
- Around line 331-339: The remediation config entry for
protos.RecommendedAction_COMPONENT_RESET sets Kind: "RebootNode" but the test
"Successful GPUReset creation" expects gpuResetGVK (Kind "GPUReset"); update the
remediationConfig entry for COMPONENT_RESET to use Kind: "GPUReset" (or align
gpuResetGVK to "RebootNode") so the Kind in the remediationConfig and the test's
gpuResetGVK match; locate the entry keyed by
protos.RecommendedAction_COMPONENT_RESET and the test case "Successful GPUReset
creation" to ensure both refer to the same GVK symbol (gpuResetGVK) and adjust
the Kind accordingly.
- Around line 300-320: The templates use inconsistent variable access for the
node name; update rebootNodeTemplate to match gpuResetTemplate's pattern by
replacing instances of {{.NodeName}} and {{.HealthEventID}} in the
rebootNodeTemplate Parse string with the dotted/space-padded forms {{
.HealthEvent.NodeName }} and {{ .HealthEventID }} (so rebootNodeTemplate and
gpuResetTemplate consistently access fields via .HealthEvent), keeping the same
surrounding template structure.

In `@tests/fault_management_test.go`:
- Around line 250-251: Replace the incorrect GVK constant when listing CRs: in
this file update all calls that pass helpers.RebootNodeGVK into
helpers.DeleteAllCRs, helpers.WaitForCR, and helpers.WaitForNoCR to instead pass
helpers.RebootNodeGVKList (the List-kind constant); locate usages of
DeleteAllCRs, WaitForCR, and WaitForNoCR in this test and change the argument
from RebootNodeGVK to RebootNodeGVKList so the listing uses the correct List
kind.

In `@tests/gpu_reset_test.go`:
- Around line 175-185: The loop over conditions uses an unchecked type assertion
condMap := c.(map[string]interface{}) which can panic; change it to a checked
assertion (e.g., condMap, ok := c.(map[string]interface{})) and skip or fail
gracefully when ok is false, then continue to check
condMap["type"],["reason"],["status"] as before; keep setting
foundCompleteCondition and the final assert.True using gpuReset.GetName() so
non-map entries don't cause a test panic.

In `@tests/helpers/health_events_analyzer.go`:
- Line 199: Replace the incorrect GVK used when waiting for the reboot CR: in
the call to WaitForCR (function WaitForCR) change the fourth argument from
RebootNodeGVKList to RebootNodeGVK so the code queries the singular RebootNode
custom resource (replace RebootNodeGVKList with RebootNodeGVK in the statement
that assigns rebootNodeCR).

In `@tests/scale_test.go`:
- Around line 327-328: The call to helpers.DeleteAllCRs uses the singular
RebootNodeGVK but DeleteAllCRs (and its ListAllCRs helper) expects a list-kind
GVK; change the argument at the helpers.DeleteAllCRs call to use
RebootNodeGVKList (the list-kind GVK) so ListAllCRs can create the correct
UnstructuredList and successfully list/delete RebootNode CRs.

In `@tests/uat/tests.sh`:
- Around line 218-249: In wait_for_gpu_reset, the code mistakenly references the
non-local variable $gpu_node; change those references to the function parameter
$node: update the kubectl jsonpath selector used when assigning driver_pod to
use $node (so the pod lookup targets the passed-in node) and update the error
message that currently says "No driver pod found on node $gpu_node" to reference
$node; ensure no other occurrences in wait_for_gpu_reset still use $gpu_node
(keep driver_pod and the grep/uuid logic unchanged).
🧹 Nitpick comments (10)
tests/csp_health_monitor_test.go (1)

498-507: Remove duplicate quarantine assertion.

The same helpers.AssertQuarantineState(...) call is executed twice back-to-back; it adds noise without extra coverage.

🧹 Proposed cleanup
         helpers.AssertQuarantineState(ctx, t, client, testCtx.NodeName, helpers.QuarantineAssertion{
             ExpectCordoned:   false,
             ExpectAnnotation: false,
         })
 
         t.Logf("Verified: node %s was not cordoned when processing STORE_ONLY strategy", testCtx.NodeName)
-        helpers.AssertQuarantineState(ctx, t, client, testCtx.NodeName, helpers.QuarantineAssertion{
-            ExpectCordoned:   false,
-            ExpectAnnotation: false,
-        })
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml (1)

30-31: LGTM! Good documentation for the new configuration.

The inline comment correctly documents the omission behavior. The nvidia runtime class is appropriate for NVSentinel's GPU-related functionality.

Consider adding a brief note about why this runtime class is needed (e.g., for GPU device access) to help operators understand the requirement. As per coding guidelines, examples for non-obvious configurations are recommended.

-# Runtime class name for the pod. If empty or not set, the field will be omitted.
+# Runtime class name for the pod. Required for GPU device access (e.g., "nvidia" for NVIDIA GPU Operator).
+# If empty or not set, the field will be omitted.
 runtimeClassName: "nvidia"
tests/node_drainer_test.go (1)

234-243: Centralize the GPU UUID to avoid drift across test steps.

The same UUID is duplicated in multiple event payloads; a single constant keeps fixtures consistent and eases future updates.

♻️ Suggested refactor
 func TestNodeDrainerPartialDrain(t *testing.T) {
+	const impactedGPUUUID = "GPU-455d8f70-2051-db6c-0430-ffc457bff834"
 	feature := features.New("TestNodeDrainerPartialDrain").
 		WithLabel("suite", "node-drainer")
@@
 			WithEntitiesImpacted([]helpers.EntityImpacted{
 				{
 					EntityType:  "GPU_UUID",
-					EntityValue: "GPU-455d8f70-2051-db6c-0430-ffc457bff834",
+					EntityValue: impactedGPUUUID,
 				},
 			})
@@
 			WithEntitiesImpacted([]helpers.EntityImpacted{
 				{
 					EntityType:  "GPU_UUID",
-					EntityValue: "GPU-455d8f70-2051-db6c-0430-ffc457bff834",
+					EntityValue: impactedGPUUUID,
 				},
 			})

Also applies to: 287-296

distros/kubernetes/nvsentinel/values-tilt.yaml (1)

265-276: Minor template inconsistency: whitespace in API version template.

The gpureset-template.yaml uses {{.ApiGroup}}/{{.Version}} (no spaces), while rebootnode-template.yaml uses {{ .ApiGroup }}/{{ .Version }} (with spaces). While both are valid Go template syntax, consider using consistent formatting across templates for maintainability.

♻️ Suggested fix for consistency
       "gpureset-template.yaml": |
-        apiVersion: {{.ApiGroup}}/{{.Version}}
+        apiVersion: {{ .ApiGroup }}/{{ .Version }}
         kind: GPUReset
tests/janitor_test.go (2)

185-188: Rename the test to follow the required naming convention.

TestJanitorNodeLocking doesn’t include the scenario/expected behavior suffix, which makes it harder to scan among other tests.

✏️ Rename to match the naming convention
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_SameNodeSequential_DifferentNodesOverlap(t *testing.T) {
As per coding guidelines: Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.

214-216: Use test metadata fixtures for GPU UUIDs instead of arbitrary hard-coded values.

Other integration tests (e.g., gpu_health_monitor_test.go) derive GPU UUIDs from tests/helpers/metadata.go, which provides test fixtures like GPU-00000000-0000-0000-0000-000000000000. Align this test with that pattern by passing one of the predefined test metadata UUIDs to CreateGPUResetCR, or document why a different UUID is intentional here.

tests/gpu_reset_test.go (2)

44-49: Consider using require instead of assert for setup failures.

In the Setup function, failures like creating a Kubernetes client should use require.NoError since the test cannot continue meaningfully without a client. assert.NoError will log the error but continue execution, potentially causing confusing downstream failures.

♻️ Suggested change
 	feature.Setup(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 		workloadNamespace := "immediate-test"
 
 		client, err := c.NewClient()
-		assert.NoError(t, err, "failed to create kubernetes client")
+		require.NoError(t, err, "failed to create kubernetes client")

280-292: Helper function should use require for critical assertions.

getDCGMPodOnNode uses assert.NoError and assert.Fail, which log errors but allow the test to continue with an empty string return value. This could lead to confusing downstream failures. Consider using require to fail fast.

♻️ Suggested change
 func getDCGMPodOnNode(ctx context.Context, t *testing.T, client klient.Client, nodeName string) string {
+	t.Helper()
 	var initialDCGMPod string
 	pods, err := helpers.GetPodsOnNode(ctx, client.Resources(), nodeName)
-	assert.NoError(t, err, "failed to get pods on node %s", nodeName)
+	require.NoError(t, err, "failed to get pods on node %s", nodeName)
 	for _, pod := range pods {
 		if strings.Contains(pod.Name, "nvidia-dcgm") {
 			initialDCGMPod = pod.Name
 		}
 	}
 	if len(initialDCGMPod) == 0 {
-		assert.Fail(t, "failed to find nvidia-dcgm pod on node %s", nodeName)
+		require.Fail(t, "failed to find nvidia-dcgm pod on node %s", nodeName)
 	}
 	return initialDCGMPod
 }

As per coding guidelines, helper functions should also include t.Helper() to improve test failure location reporting.

tests/uat/tests.sh (2)

233-248: Consider removing the unnecessary elapsed=0 assignment.

Line 237 sets elapsed=0 before break, but since the loop exits immediately after, this assignment serves no purpose. The timeout check on line 244 will correctly evaluate to false (0 < timeout) after a successful match.

♻️ Suggested simplification
     while [[ $elapsed -lt $timeout ]]; do
         # Exec in a subshell to prevent grep from occurring in client shell
         if kubectl exec -n gpu-operator "$driver_pod" -- sh -c  "tail -n 10000 /var/log/syslog | grep \"GPU reset executed: $uuid\" | grep -v \"RuntimeService\""; then
             log "GPU $uuid reset successfully"
-            elapsed=0
             break
         fi
         sleep 5
         elapsed=$((elapsed + 5))
     done

-    if [[ $elapsed -ge $timeout ]]; then
+    if [[ $elapsed -ge $timeout ]]; then  # Only true if loop exited without break
         error "Timeout waiting for GPU $uuid to reset"
     fi

386-389: Redundant call to wait_for_node_unquarantine.

Line 387 calls wait_for_gpu_reset which already calls wait_for_node_unquarantine at line 248. The second call on line 389 is redundant.

♻️ Proposed fix: remove duplicate call
     log "Waiting for node to GPU reset and recover..."
     wait_for_gpu_reset "$gpu_node" "$uuid"

-    wait_for_node_unquarantine "$gpu_node"
-
     log "Test 3 PASSED ✓"
 }

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml (1)

70-78: ⚠️ Potential issue | 🟡 Minor

Add an inline example and explicit boolean guidance for partialDrainEnabled.

This is a non-obvious, cross-component toggle and the default now flips to true. Please add a short example plus a “use boolean true/false (unquoted)” note to meet Helm values documentation requirements.

📄 Proposed inline doc update
 # HealthEvents with the COMPONENT_RESET remediation action must include an impacted entity for the
 # unhealthy GPU_UUID or else the drain will fail. IMPORTANT: If this setting is enabled, the COMPONENT_RESET
 # action in fault-remediation must map to a custom resource which takes action only against the GPU_UUID.
 # If partial drain was enabled in node-drainer but fault-remediation mapped COMPONENT_RESET to a reboot
 # action, pods which weren't drained would be restarted as part of the reboot.
+# Example (GPU reset partial drain):
+# partialDrainEnabled: true
+# NOTE: use boolean true/false (do not quote)
 partialDrainEnabled: true

As per coding guidelines "Include examples for non-obvious configurations in Helm chart documentation" and "Note truthy value requirements in Helm chart documentation where applicable".

distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml (1)

50-97: ⚠️ Potential issue | 🟡 Minor

Add inline comments for the new GPUReset config keys.

This values file requires inline documentation for all values; the newly added GPUReset fields and template key are currently undocumented. Suggested minimal inline comments:

💡 Suggested inline comments
     "COMPONENT_RESET":  # Action 2
       apiGroup: "janitor.dgxc.nvidia.com"
       version: "v1alpha1"
-      kind: "GPUReset"
+      kind: "GPUReset" # GPUReset CRD for component reset
       scope: "Cluster"
-      completeConditionType: "Complete"
-      templateFileName: "gpureset-template.yaml"
-      equivalenceGroup: "reset"
-      supersedingEquivalenceGroups: ["restart"]
-      impactedEntityScope: "GPU_UUID"
+      completeConditionType: "Complete" # Condition that marks GPUReset completion
+      templateFileName: "gpureset-template.yaml" # Template used to render GPUReset CR
+      equivalenceGroup: "reset" # Remediation equivalence group for reset
+      supersedingEquivalenceGroups: ["restart"] # Reset supersedes restart
+      impactedEntityScope: "GPU_UUID" # Target entity from health event
   templates:
-    "gpureset-template.yaml": |
+    "gpureset-template.yaml": | # Template for GPUReset CR

As per coding guidelines: Document all values in Helm chart values.yaml with inline comments.

tests/uat/tests.sh (1)

500-525: ⚠️ Potential issue | 🟡 Minor

Address misleading test results when most tests are disabled.

Only test_xid_monitoring_syslog_gpu_reset runs while test_gpu_monitoring_dcgm, test_xid_monitoring_syslog, and test_sxid_monitoring_syslog are commented out (as noted in README "Test Scenarios"). The final "All tests PASSED ✓" message will be misleading when disabled. Either:

  • Gate test selection with an environment variable (e.g., RUN_ALL_TESTS=true)
  • Update the final summary to reflect which tests actually ran
🤖 Fix all issues with AI agents
In `@tests/gpu_reset_test.go`:
- Around line 253-259: The non-blocking select in the feature.Assess block that
reads from nodeLabelSequenceObserved can flake; replace the default branch with
a bounded wait (e.g., select between receiving from nodeLabelSequenceObserved
and a timeout via time.After or context.WithTimeout) so the test waits briefly
for a published value instead of failing immediately; update the Assessment
closure (the lambda passed to feature.Assess) to perform a timed receive from
nodeLabelSequenceObserved and assert on the received value or fail with a clear
timeout message if the wait expires.
- Around line 280-291: The function getDCGMPodOnNode uses assert.Fail which
marks failure but continues execution; change the failure to an immediate stop
by replacing assert.Fail with assert.FailNow (or use t.Fatalf) so the test halts
when no nvidia-dcgm pod is found. Update the failure path in getDCGMPodOnNode to
call assert.FailNow(t, "failed to find nvidia-dcgm pod on node %s", nodeName) or
t.Fatalf("failed to find nvidia-dcgm pod on node %s", nodeName) and ensure
callers no longer receive an empty pod name.

In `@tests/helpers/kube.go`:
- Around line 68-79: Add proper GoDoc comments for the exported identifiers by
placing a short descriptive comment immediately above each exported symbol that
starts with the symbol name: RebootNodeGVK, GPUResetGVK, ListAllCRs, WaitForCR,
DeleteAllCRs, DeleteCR, and CreateGPUResetCR; describe what each
GroupVersionKind represents and what each function does, its important
parameters/return behavior and any side effects. Ensure comments follow GoDoc
style (start with the exact exported name) and apply the same pattern to other
exported symbols in the file range referenced (lines ~524-903) so all exported
functions/types have appropriate documentation.

In `@tests/janitor_test.go`:
- Around line 237-240: The strict cross-node overlap assertion is flaky; keep
the same-node non-overlap check using periodOverlapsOnNode1
(startTimeReboot/completionTimeReset) but replace the hard assert.True on
periodOverlapsOnNode1And2 with a weaker, time-bounded condition: either use
assert.Eventually (or a small polling loop) to wait a short timeout for the two
reboot intervals (startTimeReboot/completionTimeReboot and
startTimeReboot2/completionTimeReboot2) to overlap, or assert that their start
times are within a small tolerance (e.g., < 200–500ms); update the assertion
around periodOverlapsOnNode1And2 accordingly instead of requiring immediate
True.
- Around line 214-216: The test uses a hard-coded GPU UUID when calling
helpers.CreateGPUResetCR which can fail if that UUID doesn't exist on the chosen
node; instead, query the node's GPU/device metadata to discover a valid UUID for
nodeName (e.g., via an existing helper that lists GPU UUIDs or by reading the
node/device status) and replace the literal
"GPU-455d8f70-2051-db6c-0430-ffc457bff834" with the discovered UUID before
creating gpuResetCRName with CreateGPUResetCR to make the test deterministic and
environment-independent.
- Around line 195-197: The test uses assert.NoError when calling
helpers.GetRealNodeName(ctx, client), which lets the test continue with an empty
nodeName and causes confusing failures later (e.g., CreateRebootNodeCR); change
the assertion to require.NoError to fail fast if GetRealNodeName returns an
error and ensure nodeName is valid before proceeding, updating the assertion
that checks the call to helpers.GetRealNodeName and any related uses of nodeName
in CreateRebootNodeCR.

In `@tests/uat/tests.sh`:
- Around line 149-174: The wait_for_node_unquarantine function uses a too-short
default UAT_QUARANTINE_TIMEOUT and only checks .spec.unschedulable while
claiming the node is "ready"; increase the default timeout (e.g., to 600s or
make UAT_QUARANTINE_TIMEOUT configurable) and change the readiness check to
verify both that .spec.unschedulable is not "true" and that the node's Ready
condition is True (query .status.conditions where type=Ready and status=True via
kubectl jsonpath) before logging "ready"; also adjust the progress logging to
reflect "uncordoned" vs "ready" states so messages are accurate (reference
symbols: wait_for_node_unquarantine, UAT_QUARANTINE_TIMEOUT, kubectl get node
jsonpath {.spec.unschedulable} and .status.conditions).
- Around line 359-399: The UUID/PCI parsed from nvidia-smi in
test_xid_monitoring_syslog_gpu_reset may be empty and must be validated before
proceeding; after computing uuid_pci, uuid, and pci, add a guard that checks if
either "$uuid" or "$pci" is empty and if so call error (or return/fail the test)
with a message including the raw "$uuid_pci" output so the test stops rather
than sending the logger and relying on wait_for_gpu_reset to match everything;
ensure this validation occurs before the logger exec and before calling
wait_for_node_condition/wait_for_gpu_reset so wait_for_gpu_reset sees a real
uuid argument.
- Around line 218-257: In wait_for_gpu_reset, the UUID check uses echo
$exec_output | grep "$uuid" which treats the UUID as a regex and can mis-handle
content; replace that line to use printf '%s\n' "$exec_output" piped into
fixed-string grep, e.g. change the check to: printf '%s\n' "$exec_output" | grep
-F -- "$uuid" (use the -- to guard against UUIDs starting with -); this
preserves output exactly and ensures fixed-string matching of the UUID.
🧹 Nitpick comments (1)
tests/janitor_test.go (1)

185-188: Add a doc comment for the new exported test.
This keeps exported test functions compliant with package documentation conventions.

As per coding guidelines: Function comments required for all exported Go functions.

@natherz97 natherz97 force-pushed the enable-gpu-reset branch 2 times, most recently from 4744977 to f5af3ae on January 30, 2026 at 21:36
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/values-full.yaml`:
- Around line 636-643: The production mapping for COMPONENT_RESET currently
points to kind: "RebootNode" with equivalenceGroup "restart"; change it so
COMPONENT_RESET uses apiGroup "janitor.dgxc.nvidia.com", version "v1alpha1",
kind "GPUReset", scope "Cluster", completeConditionType "NodeReady",
templateFileName "nvidia-reboot.yaml", set equivalenceGroup to "reset" and add
supersedingEquivalenceGroups: ["restart"] so GPU-scoped resets align with the
design and do not cause full node reboots when partial drain is enabled.

In `@tests/gpu_reset_test.go`:
- Around line 135-139: The test currently hard-codes a GPU UUID in the
nodeCondition.Message assertion, making it non-portable; update the test to
obtain the real GPU UUID at runtime (e.g., add a helper like getGPUUUID or call
a test utility that shells out to nvidia-smi) and use that dynamic value when
constructing the injected health event and when asserting nodeCondition.Message
(and any downstream GPUReset-related expectations). Locate the assertion that
compares nodeCondition.Message and replace the literal UUID with the
helper-returned UUID so the injected event and GPUReset validation use the same
dynamically retrieved GPU UUID.
- Around line 47-56: The test uses assert.NoError for critical setup steps which
can allow execution to continue with invalid state; replace assert.NoError(t,
err, ...) calls for creating the Kubernetes client (c.NewClient()), and for
getting the real node (helpers.GetRealNodeName(ctx, client)) with
require.NoError so the test fails immediately on these setup errors; update the
two calls referencing client and nodeName (and any other similar critical
setup/assertions in this file) from assert to require to ensure fast-fail on
setup failures.
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values-full.yaml (1)

670-682: Consider adding a GPUReset template example.

The templates section only includes nvidia-reboot.yaml for RebootNode. Consider adding a commented-out GPUReset template example to help users configure GPU reset functionality:

      # "gpureset-template.yaml": |
      #   apiVersion: janitor.dgxc.nvidia.com/v1alpha1
      #   kind: GPUReset
      #   metadata:
      #     name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
      #     labels:
      #       app.kubernetes.io/managed-by: nvsentinel
      #   spec:
      #     nodeName: {{ .HealthEvent.NodeName }}
      #     selector:
      #       uuids:
      #         - {{ .ImpactedEntityScopeValue }}
tests/janitor_test.go (1)

185-188: Rename test to include scenario and expected behavior.

♻️ Suggested rename
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_RebootAndGPUReset_EnforcesNodeLock(t *testing.T) {

As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

tests/gpu_reset_test.go (1)

37-40: Rename test to follow the required naming convention.

♻️ Suggested rename
-func TestGPUReset(t *testing.T) {
+func TestGPUReset_EndToEnd_ComponentResetCompletes(t *testing.T) {

As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/csp_health_monitor_test.go (1)

516-516: ⚠️ Potential issue | 🔴 Critical

Compilation error: reference to undefined constant keyOriginalArgsContextKey.

Line 516 still uses the old constant name keyOriginalArgsContextKey, but it was renamed to originalArgsContextKey on line 36. This will fail to compile.

🐛 Proposed fix
-		originalArgs := ctx.Value(keyOriginalArgsContextKey).([]string)
+		originalArgs := ctx.Value(originalArgsContextKey).([]string)
🤖 Fix all issues with AI agents
In `@tests/gpu_reset_test.go`:
- Around line 167-174: The test currently uses assert.Fail after calling
unstructured.NestedMap and unstructured.NestedSlice which does not stop
execution and can lead to nil dereferences; update the test to import the
require package and replace the assert.Fail checks for the "status" and
"conditions" extraction with require.NoError/require.True (or require.NotNil) so
the test stops immediately on failure—specifically change the checks around
unstructured.NestedMap(gpuReset.Object, "status") and
unstructured.NestedSlice(status, "conditions") to use require assertions that
halt execution.

In `@tests/uat/tests.sh`:
- Around line 382-392: The check for empty nvidia-smi output uses the literal
string instead of the variable—change the conditional that tests uuid_pci so it
references the variable (uuid_pci) with the $ and quotes; update the if
condition that currently reads the literal "uuid_pci" to use [[ -z "$uuid_pci"
]] so the error function is invoked when kubectl returns no output, leaving the
subsequent parsing of uuid and pci (variables uuid and pci) unchanged.
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values-tilt.yaml (1)

265-276: GPUReset template structure looks correct.

The template properly includes nodeName and a selector.uuids array populated from .ImpactedEntityScopeValue, which aligns with the GPU reset flow targeting a specific GPU UUID.

Minor style nit: The template variable syntax uses {{.ApiGroup}} (no spaces) while the existing rebootnode-template.yaml uses {{ .ApiGroup }} (with spaces). Consider aligning for consistency.

♻️ Optional: Align template variable spacing
      "gpureset-template.yaml": |
-        apiVersion: {{.ApiGroup}}/{{.Version}}
+        apiVersion: {{ .ApiGroup }}/{{ .Version }}
        kind: GPUReset
        metadata:
-          name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
+          name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
          labels:
            app.kubernetes.io/managed-by: nvsentinel
        spec:
-          nodeName: {{ .HealthEvent.NodeName }}
+          nodeName: {{ .HealthEvent.NodeName }}
          selector:
            uuids:
-              - {{ .ImpactedEntityScopeValue }}
+              - {{ .ImpactedEntityScopeValue }}
tests/janitor_test.go (1)

199-203: Consider a more explicit KWOK node selection.

The test assumes the last node in the list is a KWOK node. This ordering assumption may be fragile if node registration order changes.

♻️ Suggested: Explicitly select KWOK node by label or name pattern
 		// use a KWOK node for the second RebootNode
-		nodes, err := helpers.GetAllNodesNames(ctx, client)
-		require.NoError(t, err, "failed to get cluster nodes")
-		require.True(t, len(nodes) > 0, "no nodes found in cluster")
-		kwokNodeName := nodes[len(nodes)-1]
+		kwokNodeName, err := helpers.GetKWOKNodeName(ctx, client)
+		require.NoError(t, err, "failed to get KWOK node")

If a dedicated helper doesn't exist, consider filtering by a label like type=kwok or a name prefix pattern.

tests/gpu_reset_test.go (1)

37-39: Rename test to match TestFunctionName_Scenario_ExpectedBehavior format.
For example: TestGPUReset_EndToEnd_Succeeds. As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

🔧 Suggested rename
-func TestGPUReset(t *testing.T) {
+func TestGPUReset_EndToEnd_Succeeds(t *testing.T) {

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/helpers/kube.go`:
- Around line 537-565: When ListAllCRs returns an error inside WaitForNoCR, fail
fast instead of logging and continuing: inside the require.Never closure in
WaitForNoCR, replace the t.Logf + return false behavior on err != nil with a
call to t.Fatalf (including the error and nodeName) so the test stops
immediately when ListAllCRs fails; keep the rest of the loop and the use of
require.Never, NeverWaitTimeout and WaitInterval unchanged.
🧹 Nitpick comments (3)
tests/gpu_reset_test.go (1)

37-38: Rename the test to include scenario + expected behavior.
Consider something like TestGPUReset_ComponentReset_Completes to encode intent.

As per coding guidelines, name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

tests/janitor_test.go (1)

185-187: Rename the test to follow the required naming format.

Consider a name like TestJanitorNodeLocking_SameNodeSequential_DifferentNodeOverlap to match the expected TestFunctionName_Scenario_ExpectedBehavior convention.

✏️ Suggested rename
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_SameNodeSequential_DifferentNodeOverlap(t *testing.T) {

As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

tests/uat/tests.sh (1)

402-406: wait_for_gpu_reset already waits for uncordon — this is redundant.

You can drop the extra unquarantine wait to reduce test time.

♻️ Suggested cleanup
-    wait_for_node_unquarantine "$gpu_node"
-
     log "Test 3 PASSED ✓"

@natherz97 natherz97 force-pushed the enable-gpu-reset branch 3 times, most recently from 31db2c7 to 9996387 on February 10, 2026 at 23:24
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.74% (+0.03%) 👍
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@natherz97 natherz97 force-pushed the enable-gpu-reset branch 2 times, most recently from 11681d1 to a05f015 on February 11, 2026 at 02:24
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.74% (+0.03%) 👍
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.13% (+0.05%) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.71% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.13% (+0.05%) 1294 (+12) 364 (+4) 930 (+8) 👍
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@natherz97 natherz97 force-pushed the enable-gpu-reset branch 2 times, most recently from c1eee72 to 165f7f2 on February 11, 2026 at 05:50
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.08% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.88% (+0.17%) 👍
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.08% (ø) 1282 360 922
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.61% (+0.08%) 3601 (+6) 670 (+4) 2931 (+2) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 127 (-38) 0 127 (-38)
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@lalitadithya lalitadithya requested a review from XRFXLP February 11, 2026 06:26
@natherz97 natherz97 force-pushed the enable-gpu-reset branch 2 times, most recently from ca200ba to 4c7ad97 on February 11, 2026 at 19:24
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.08% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.88% (+0.13%) 👍
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.08% (ø) 1282 360 922
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.61% (+0.02%) 3601 (+6) 670 (+2) 2931 (+4) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 127 (-38) 0 127 (-38)
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

Signed-off-by: Nathan Herz <nherz@nvidia.com>
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.08% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.88% (+0.13%) 👍
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.08% (ø) 1282 360 922
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.61% (+0.02%) 3601 (+6) 670 (+2) 2931 (+4) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 127 (-38) 0 127 (-38)
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@lalitadithya lalitadithya enabled auto-merge (squash) February 12, 2026 11:41
@lalitadithya lalitadithya merged commit d7b6b85 into NVIDIA:main Feb 12, 2026
62 checks passed
cbumb pushed a commit to cbumb/cbumb that referenced this pull request Feb 12, 2026
Signed-off-by: Nathan Herz <nherz@nvidia.com>