feat: enable GPU reset with e2e and UAT tests (#768)
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
📝 Walkthrough

This PR refactors fault remediation configuration from a single-template model to a per-action structure with multiple templates, introduces generic custom resource helpers parameterized by GroupVersionKind, adds GPU reset remediation alongside RebootNode remediation, and expands test coverage with GPU reset and node-locking validation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/csp_health_monitor_test.go (1)
512-518: ⚠️ Potential issue | 🟠 Major: Fix context key mismatch to avoid teardown panic.

`originalArgsContextKey` is used to store the value (line 436), but teardown reads `keyOriginalArgsContextKey`, which no longer exists. This will return `nil` and panic on the `([]string)` assertion.

🩹 Proposed fix
```diff
- originalArgs := ctx.Value(keyOriginalArgsContextKey).([]string)
+ originalArgs := ctx.Value(originalArgsContextKey).([]string)
```
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/remediation/remediation_test.go`:
- Around line 331-339: The remediation config entry for
protos.RecommendedAction_COMPONENT_RESET sets Kind: "RebootNode" but the test
"Successful GPUReset creation" expects gpuResetGVK (Kind "GPUReset"); update the
remediationConfig entry for COMPONENT_RESET to use Kind: "GPUReset" (or align
gpuResetGVK to "RebootNode") so the Kind in the remediationConfig and the test's
gpuResetGVK match; locate the entry keyed by
protos.RecommendedAction_COMPONENT_RESET and the test case "Successful GPUReset
creation" to ensure both refer to the same GVK symbol (gpuResetGVK) and adjust
the Kind accordingly.
- Around line 300-320: The templates use inconsistent variable access for the
node name; update rebootNodeTemplate to match gpuResetTemplate's pattern by
replacing instances of {{.NodeName}} and {{.HealthEventID}} in the
rebootNodeTemplate Parse string with the dotted/space-padded forms {{
.HealthEvent.NodeName }} and {{ .HealthEventID }} (so rebootNodeTemplate and
gpuResetTemplate consistently access fields via .HealthEvent), keeping the same
surrounding template structure.
In `@tests/fault_management_test.go`:
- Around line 250-251: Replace the incorrect GVK constant when listing CRs: in
this file update all calls that pass helpers.RebootNodeGVK into
helpers.DeleteAllCRs, helpers.WaitForCR, and helpers.WaitForNoCR to instead pass
helpers.RebootNodeGVKList (the List-kind constant); locate usages of
DeleteAllCRs, WaitForCR, and WaitForNoCR in this test and change the argument
from RebootNodeGVK to RebootNodeGVKList so the listing uses the correct List
kind.
In `@tests/gpu_reset_test.go`:
- Around line 175-185: The loop over conditions uses an unchecked type assertion
condMap := c.(map[string]interface{}) which can panic; change it to a checked
assertion (e.g., condMap, ok := c.(map[string]interface{})) and skip or fail
gracefully when ok is false, then continue to check
condMap["type"],["reason"],["status"] as before; keep setting
foundCompleteCondition and the final assert.True using gpuReset.GetName() so
non-map entries don't cause a test panic.
In `@tests/helpers/health_events_analyzer.go`:
- Line 199: Replace the incorrect GVK used when waiting for the reboot CR: in
the call to WaitForCR (function WaitForCR) change the fourth argument from
RebootNodeGVKList to RebootNodeGVK so the code queries the singular RebootNode
custom resource (replace RebootNodeGVKList with RebootNodeGVK in the statement
that assigns rebootNodeCR).
In `@tests/scale_test.go`:
- Around line 327-328: The call to helpers.DeleteAllCRs uses the singular
RebootNodeGVK but DeleteAllCRs (and its ListAllCRs helper) expects a list-kind
GVK; change the argument at the helpers.DeleteAllCRs call to use
RebootNodeGVKList (the list-kind GVK) so ListAllCRs can create the correct
UnstructuredList and successfully list/delete RebootNode CRs.
In `@tests/uat/tests.sh`:
- Around line 218-249: In wait_for_gpu_reset, the code mistakenly references the
non-local variable $gpu_node; change those references to the function parameter
$node: update the kubectl jsonpath selector used when assigning driver_pod to
use $node (so the pod lookup targets the passed-in node) and update the error
message that currently says "No driver pod found on node $gpu_node" to reference
$node; ensure no other occurrences in wait_for_gpu_reset still use $gpu_node
(keep driver_pod and the grep/uuid logic unchanged).
🧹 Nitpick comments (10)
tests/csp_health_monitor_test.go (1)
498-507: Remove duplicate quarantine assertion. The same `helpers.AssertQuarantineState(...)` call is executed twice back-to-back; it adds noise without extra coverage.

🧹 Proposed cleanup

```diff
 helpers.AssertQuarantineState(ctx, t, client, testCtx.NodeName, helpers.QuarantineAssertion{
 	ExpectCordoned:   false,
 	ExpectAnnotation: false,
 })
 t.Logf("Verified: node %s was not cordoned when processing STORE_ONLY strategy", testCtx.NodeName)
-helpers.AssertQuarantineState(ctx, t, client, testCtx.NodeName, helpers.QuarantineAssertion{
-	ExpectCordoned:   false,
-	ExpectAnnotation: false,
-})
```

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml (1)
30-31: LGTM! Good documentation for the new configuration. The inline comment correctly documents the omission behavior. The `nvidia` runtime class is appropriate for NVSentinel's GPU-related functionality. Consider adding a brief note about why this runtime class is needed (e.g., for GPU device access) to help operators understand the requirement. As per coding guidelines, examples for non-obvious configurations are recommended.

```diff
-# Runtime class name for the pod. If empty or not set, the field will be omitted.
+# Runtime class name for the pod. Required for GPU device access (e.g., "nvidia" for NVIDIA GPU Operator).
+# If empty or not set, the field will be omitted.
 runtimeClassName: "nvidia"
```

tests/node_drainer_test.go (1)
234-243: Centralize the GPU UUID to avoid drift across test steps. The same UUID is duplicated in multiple event payloads; a single constant keeps fixtures consistent and eases future updates.
♻️ Suggested refactor
```diff
 func TestNodeDrainerPartialDrain(t *testing.T) {
+	const impactedGPUUUID = "GPU-455d8f70-2051-db6c-0430-ffc457bff834"
 	feature := features.New("TestNodeDrainerPartialDrain").
 		WithLabel("suite", "node-drainer")
@@
 		WithEntitiesImpacted([]helpers.EntityImpacted{
 			{
 				EntityType:  "GPU_UUID",
-				EntityValue: "GPU-455d8f70-2051-db6c-0430-ffc457bff834",
+				EntityValue: impactedGPUUUID,
 			},
 		})
@@
 		WithEntitiesImpacted([]helpers.EntityImpacted{
 			{
 				EntityType:  "GPU_UUID",
-				EntityValue: "GPU-455d8f70-2051-db6c-0430-ffc457bff834",
+				EntityValue: impactedGPUUUID,
 			},
 		})
```

Also applies to: 287-296
distros/kubernetes/nvsentinel/values-tilt.yaml (1)
265-276: Minor template inconsistency: whitespace in API version template. The `gpureset-template.yaml` uses `{{.ApiGroup}}/{{.Version}}` (no spaces), while `rebootnode-template.yaml` uses `{{ .ApiGroup }}/{{ .Version }}` (with spaces). While both are valid Go template syntax, consider using consistent formatting across templates for maintainability.

♻️ Suggested fix for consistency

```diff
 "gpureset-template.yaml": |
-  apiVersion: {{.ApiGroup}}/{{.Version}}
+  apiVersion: {{ .ApiGroup }}/{{ .Version }}
   kind: GPUReset
```

tests/janitor_test.go (2)
185-188: Rename the test to follow the required naming convention. `TestJanitorNodeLocking` doesn't include the scenario/expected behavior suffix, which makes it harder to scan among other tests. As per coding guidelines: Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.

✏️ Rename to match the naming convention

```diff
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_SameNodeSequential_DifferentNodesOverlap(t *testing.T) {
```
214-216: Use test metadata fixtures for GPU UUIDs instead of arbitrary hard-coded values. Other integration tests (e.g., `gpu_health_monitor_test.go`) derive GPU UUIDs from `tests/helpers/metadata.go`, which provides test fixtures like `GPU-00000000-0000-0000-0000-000000000000`. Align this test with that pattern by passing one of the predefined test metadata UUIDs to `CreateGPUResetCR`, or document why a different UUID is intentional here.

tests/gpu_reset_test.go (2)
44-49: Consider using `require` instead of `assert` for setup failures. In the Setup function, failures like creating a Kubernetes client should use `require.NoError`, since the test cannot continue meaningfully without a client. `assert.NoError` will log the error but continue execution, potentially causing confusing downstream failures.

♻️ Suggested change

```diff
 feature.Setup(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 	workloadNamespace := "immediate-test"
 	client, err := c.NewClient()
-	assert.NoError(t, err, "failed to create kubernetes client")
+	require.NoError(t, err, "failed to create kubernetes client")
```
280-292: Helper function should use `require` for critical assertions. `getDCGMPodOnNode` uses `assert.NoError` and `assert.Fail`, which log errors but allow the test to continue with an empty string return value. This could lead to confusing downstream failures. Consider using `require` to fail fast.

♻️ Suggested change

```diff
 func getDCGMPodOnNode(ctx context.Context, t *testing.T, client klient.Client, nodeName string) string {
+	t.Helper()
 	var initialDCGMPod string
 	pods, err := helpers.GetPodsOnNode(ctx, client.Resources(), nodeName)
-	assert.NoError(t, err, "failed to get pods on node %s", nodeName)
+	require.NoError(t, err, "failed to get pods on node %s", nodeName)
 	for _, pod := range pods {
 		if strings.Contains(pod.Name, "nvidia-dcgm") {
 			initialDCGMPod = pod.Name
 		}
 	}
 	if len(initialDCGMPod) == 0 {
-		assert.Fail(t, "failed to find nvidia-dcgm pod on node %s", nodeName)
+		require.Fail(t, "failed to find nvidia-dcgm pod on node %s", nodeName)
 	}
 	return initialDCGMPod
 }
```

As per coding guidelines, helper functions should also include `t.Helper()` to improve test failure location reporting.

tests/uat/tests.sh (2)
233-248: Consider removing the unnecessary `elapsed=0` assignment. Line 237 sets `elapsed=0` before `break`, but since the loop exits immediately after, this assignment serves no purpose. The timeout check on line 244 will correctly evaluate to false (0 < timeout) after a successful match.

♻️ Suggested simplification

```diff
 while [[ $elapsed -lt $timeout ]]; do
     # Exec in a subshell to prevent grep from occurring in client shell
     if kubectl exec -n gpu-operator "$driver_pod" -- sh -c "tail -n 10000 /var/log/syslog | grep \"GPU reset executed: $uuid\" | grep -v \"RuntimeService\""; then
         log "GPU $uuid reset successfully"
-        elapsed=0
         break
     fi
     sleep 5
     elapsed=$((elapsed + 5))
 done
-if [[ $elapsed -ge $timeout ]]; then
+if [[ $elapsed -ge $timeout ]]; then # Only true if loop exited without break
     error "Timeout waiting for GPU $uuid to reset"
 fi
```
386-389: Redundant call to `wait_for_node_unquarantine`. Line 387 calls `wait_for_gpu_reset`, which already calls `wait_for_node_unquarantine` at line 248. The second call on line 389 is redundant.

♻️ Proposed fix: remove duplicate call

```diff
 log "Waiting for node to GPU reset and recover..."
 wait_for_gpu_reset "$gpu_node" "$uuid"
-wait_for_node_unquarantine "$gpu_node"
-
 log "Test 3 PASSED ✓"
 }
```
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml (1)
70-78: ⚠️ Potential issue | 🟡 Minor: Add an inline example and explicit boolean guidance for `partialDrainEnabled`. This is a non-obvious, cross-component toggle and the default now flips to `true`. Please add a short example plus a "use boolean true/false (unquoted)" note to meet Helm values documentation requirements.

📄 Proposed inline doc update

```diff
 # HealthEvents with the COMPONENT_RESET remediation action must include an impacted entity for the
 # unhealthy GPU_UUID or else the drain will fail. IMPORTANT: If this setting is enabled, the COMPONENT_RESET
 # action in fault-remediation must map to a custom resource which takes action only against the GPU_UUID.
 # If partial drain was enabled in node-drainer but fault-remediation mapped COMPONENT_RESET to a reboot
 # action, pods which weren't drained would be restarted as part of the reboot.
+# Example (GPU reset partial drain):
+#   partialDrainEnabled: true
+# NOTE: use boolean true/false (do not quote)
 partialDrainEnabled: true
```

As per coding guidelines "Include examples for non-obvious configurations in Helm chart documentation" and "Note truthy value requirements in Helm chart documentation where applicable".
distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml (1)
50-97: ⚠️ Potential issue | 🟡 Minor: Add inline comments for the new GPUReset config keys. This values file requires inline documentation for all values; the newly added GPUReset fields and template key are currently undocumented. Suggested minimal inline comments:

💡 Suggested inline comments

```diff
 "COMPONENT_RESET": # Action 2
   apiGroup: "janitor.dgxc.nvidia.com"
   version: "v1alpha1"
-  kind: "GPUReset"
+  kind: "GPUReset" # GPUReset CRD for component reset
   scope: "Cluster"
-  completeConditionType: "Complete"
-  templateFileName: "gpureset-template.yaml"
-  equivalenceGroup: "reset"
-  supersedingEquivalenceGroups: ["restart"]
-  impactedEntityScope: "GPU_UUID"
+  completeConditionType: "Complete" # Condition that marks GPUReset completion
+  templateFileName: "gpureset-template.yaml" # Template used to render GPUReset CR
+  equivalenceGroup: "reset" # Remediation equivalence group for reset
+  supersedingEquivalenceGroups: ["restart"] # Reset supersedes restart
+  impactedEntityScope: "GPU_UUID" # Target entity from health event
 templates:
-  "gpureset-template.yaml": |
+  "gpureset-template.yaml": | # Template for GPUReset CR
```

As per coding guidelines: Document all values in Helm chart `values.yaml` with inline comments.

tests/uat/tests.sh (1)
500-525: ⚠️ Potential issue | 🟡 Minor: Address misleading test results when most tests are disabled. Only `test_xid_monitoring_syslog_gpu_reset` runs, while `test_gpu_monitoring_dcgm`, `test_xid_monitoring_syslog`, and `test_sxid_monitoring_syslog` are commented out (as noted in the README "Test Scenarios"). The final "All tests PASSED ✓" message will be misleading while tests are disabled. Either:

- Gate test selection with an environment variable (e.g., `RUN_ALL_TESTS=true`)
- Update the final summary to reflect which tests actually ran
🤖 Fix all issues with AI agents
In `@tests/gpu_reset_test.go`:
- Around line 253-259: The non-blocking select in the feature.Assess block that
reads from nodeLabelSequenceObserved can flake; replace the default branch with
a bounded wait (e.g., select between receiving from nodeLabelSequenceObserved
and a timeout via time.After or context.WithTimeout) so the test waits briefly
for a published value instead of failing immediately; update the Assessment
closure (the lambda passed to feature.Assess) to perform a timed receive from
nodeLabelSequenceObserved and assert on the received value or fail with a clear
timeout message if the wait expires.
- Around line 280-291: The function getDCGMPodOnNode uses assert.Fail which
marks failure but continues execution; change the failure to an immediate stop
by replacing assert.Fail with assert.FailNow (or use t.Fatalf) so the test halts
when no nvidia-dcgm pod is found. Update the failure path in getDCGMPodOnNode to
call assert.FailNow(t, "failed to find nvidia-dcgm pod on node %s", nodeName) or
t.Fatalf("failed to find nvidia-dcgm pod on node %s", nodeName) and ensure
callers no longer receive an empty pod name.
In `@tests/helpers/kube.go`:
- Around line 68-79: Add proper GoDoc comments for the exported identifiers by
placing a short descriptive comment immediately above each exported symbol that
starts with the symbol name: RebootNodeGVK, GPUResetGVK, ListAllCRs, WaitForCR,
DeleteAllCRs, DeleteCR, and CreateGPUResetCR; describe what each
GroupVersionKind represents and what each function does, its important
parameters/return behavior and any side effects. Ensure comments follow GoDoc
style (start with the exact exported name) and apply the same pattern to other
exported symbols in the file range referenced (lines ~524-903) so all exported
functions/types have appropriate documentation.
In `@tests/janitor_test.go`:
- Around line 237-240: The strict cross-node overlap assertion is flaky; keep
the same-node non-overlap check using periodOverlapsOnNode1
(startTimeReboot/completionTimeReset) but replace the hard assert.True on
periodOverlapsOnNode1And2 with a weaker, time-bounded condition: either use
assert.Eventually (or a small polling loop) to wait a short timeout for the two
reboot intervals (startTimeReboot/completionTimeReboot and
startTimeReboot2/completionTimeReboot2) to overlap, or assert that their start
times are within a small tolerance (e.g., < 200–500ms); update the assertion
around periodOverlapsOnNode1And2 accordingly instead of requiring immediate
True.
- Around line 214-216: The test uses a hard-coded GPU UUID when calling
helpers.CreateGPUResetCR which can fail if that UUID doesn't exist on the chosen
node; instead, query the node's GPU/device metadata to discover a valid UUID for
nodeName (e.g., via an existing helper that lists GPU UUIDs or by reading the
node/device status) and replace the literal
"GPU-455d8f70-2051-db6c-0430-ffc457bff834" with the discovered UUID before
creating gpuResetCRName with CreateGPUResetCR to make the test deterministic and
environment-independent.
- Around line 195-197: The test uses assert.NoError when calling
helpers.GetRealNodeName(ctx, client), which lets the test continue with an empty
nodeName and causes confusing failures later (e.g., CreateRebootNodeCR); change
the assertion to require.NoError to fail fast if GetRealNodeName returns an
error and ensure nodeName is valid before proceeding, updating the assertion
that checks the call to helpers.GetRealNodeName and any related uses of nodeName
in CreateRebootNodeCR.
In `@tests/uat/tests.sh`:
- Around line 149-174: The wait_for_node_unquarantine function uses a too-short
default UAT_QUARANTINE_TIMEOUT and only checks .spec.unschedulable while
claiming the node is "ready"; increase the default timeout (e.g., to 600s or
make UAT_QUARANTINE_TIMEOUT configurable) and change the readiness check to
verify both that .spec.unschedulable is not "true" and that the node's Ready
condition is True (query .status.conditions where type=Ready and status=True via
kubectl jsonpath) before logging "ready"; also adjust the progress logging to
reflect "uncordoned" vs "ready" states so messages are accurate (reference
symbols: wait_for_node_unquarantine, UAT_QUARANTINE_TIMEOUT, kubectl get node
jsonpath {.spec.unschedulable} and .status.conditions).
- Around line 359-399: The UUID/PCI parsed from nvidia-smi in
test_xid_monitoring_syslog_gpu_reset may be empty and must be validated before
proceeding; after computing uuid_pci, uuid, and pci, add a guard that checks if
either "$uuid" or "$pci" is empty and if so call error (or return/fail the test)
with a message including the raw "$uuid_pci" output so the test stops rather
than sending the logger and relying on wait_for_gpu_reset to match everything;
ensure this validation occurs before the logger exec and before calling
wait_for_node_condition/wait_for_gpu_reset so wait_for_gpu_reset sees a real
uuid argument.
- Around line 218-257: In wait_for_gpu_reset, the UUID check uses echo
$exec_output | grep "$uuid" which treats the UUID as a regex and can mis-handle
content; replace that line to use printf '%s\n' "$exec_output" piped into
fixed-string grep, e.g. change the check to: printf '%s\n' "$exec_output" | grep
-F -- "$uuid" (use the -- to guard against UUIDs starting with -); this
preserves output exactly and ensures fixed-string matching of the UUID.
🧹 Nitpick comments (1)
tests/janitor_test.go (1)
185-188: Add a doc comment for the new exported test.
This keeps exported test functions compliant with package documentation conventions. As per coding guidelines: Function comments required for all exported Go functions.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/values-full.yaml`:
- Around line 636-643: The production mapping for COMPONENT_RESET currently
points to kind: "RebootNode" with equivalenceGroup "restart"; change it so
COMPONENT_RESET uses apiGroup "janitor.dgxc.nvidia.com", version "v1alpha1",
kind "GPUReset", scope "Cluster", completeConditionType "NodeReady",
templateFileName "nvidia-reboot.yaml", set equivalenceGroup to "reset" and add
supersedingEquivalenceGroups: ["restart"] so GPU-scoped resets align with the
design and do not cause full node reboots when partial drain is enabled.
In `@tests/gpu_reset_test.go`:
- Around line 135-139: The test currently hard-codes a GPU UUID in the
nodeCondition.Message assertion, making it non-portable; update the test to
obtain the real GPU UUID at runtime (e.g., add a helper like getGPUUUID or call
a test utility that shells out to nvidia-smi) and use that dynamic value when
constructing the injected health event and when asserting nodeCondition.Message
(and any downstream GPUReset-related expectations). Locate the assertion that
compares nodeCondition.Message and replace the literal UUID with the
helper-returned UUID so the injected event and GPUReset validation use the same
dynamically retrieved GPU UUID.
- Around line 47-56: The test uses assert.NoError for critical setup steps which
can allow execution to continue with invalid state; replace assert.NoError(t,
err, ...) calls for creating the Kubernetes client (c.NewClient()), and for
getting the real node (helpers.GetRealNodeName(ctx, client)) with
require.NoError so the test fails immediately on these setup errors; update the
two calls referencing client and nodeName (and any other similar critical
setup/assertions in this file) from assert to require to ensure fast-fail on
setup failures.
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values-full.yaml (1)
670-682: Consider adding a GPUReset template example. The templates section only includes `nvidia-reboot.yaml` for RebootNode. Consider adding a commented-out GPUReset template example to help users configure GPU reset functionality:

```yaml
# "gpureset-template.yaml": |
#   apiVersion: janitor.dgxc.nvidia.com/v1alpha1
#   kind: GPUReset
#   metadata:
#     name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
#     labels:
#       app.kubernetes.io/managed-by: nvsentinel
#   spec:
#     nodeName: {{ .HealthEvent.NodeName }}
#     selector:
#       uuids:
#         - {{ .ImpactedEntityScopeValue }}
```

tests/janitor_test.go (1)
185-188: Rename test to include scenario and expected behavior.

♻️ Suggested rename

```diff
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_RebootAndGPUReset_EnforcesNodeLock(t *testing.T) {
```

As per coding guidelines: Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.

tests/gpu_reset_test.go (1)
tests/gpu_reset_test.go (1)
37-40: Rename test to follow the required naming convention.♻️ Suggested rename
-func TestGPUReset(t *testing.T) { +func TestGPUReset_EndToEnd_ComponentResetCompletes(t *testing.T) {As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/csp_health_monitor_test.go (1)
516-516: ⚠️ Potential issue | 🔴 Critical: Compilation error: reference to undefined constant `keyOriginalArgsContextKey`. Line 516 still uses the old constant name `keyOriginalArgsContextKey`, but it was renamed to `originalArgsContextKey` on line 36. This will fail to compile.

🐛 Proposed fix

```diff
- originalArgs := ctx.Value(keyOriginalArgsContextKey).([]string)
+ originalArgs := ctx.Value(originalArgsContextKey).([]string)
```
🤖 Fix all issues with AI agents
In `@tests/gpu_reset_test.go`:
- Around line 167-174: The test currently uses assert.Fail after calling
unstructured.NestedMap and unstructured.NestedSlice which does not stop
execution and can lead to nil dereferences; update the test to import the
require package and replace the assert.Fail checks for the "status" and
"conditions" extraction with require.NoError/require.True (or require.NotNil) so
the test stops immediately on failure—specifically change the checks around
unstructured.NestedMap(gpuReset.Object, "status") and
unstructured.NestedSlice(status, "conditions") to use require assertions that
halt execution.
In `@tests/uat/tests.sh`:
- Around line 382-392: The check for empty nvidia-smi output uses the literal
string instead of the variable—change the conditional that tests uuid_pci so it
references the variable (uuid_pci) with the $ and quotes; update the if
condition that currently reads the literal "uuid_pci" to use [[ -z "$uuid_pci"
]] so the error function is invoked when kubectl returns no output, leaving the
subsequent parsing of uuid and pci (variables uuid and pci) unchanged.
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values-tilt.yaml (1)
265-276: GPUReset template structure looks correct. The template properly includes `nodeName` and a `selector.uuids` array populated from `.ImpactedEntityScopeValue`, which aligns with the GPU reset flow targeting a specific GPU UUID. Minor style nit: the template variable syntax uses `{{.ApiGroup}}` (no spaces) while the existing `rebootnode-template.yaml` uses `{{ .ApiGroup }}` (with spaces). Consider aligning for consistency.

♻️ Optional: Align template variable spacing

```diff
 "gpureset-template.yaml": |
-  apiVersion: {{.ApiGroup}}/{{.Version}}
+  apiVersion: {{ .ApiGroup }}/{{ .Version }}
   kind: GPUReset
   metadata:
     name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
     labels:
       app.kubernetes.io/managed-by: nvsentinel
   spec:
     nodeName: {{ .HealthEvent.NodeName }}
     selector:
       uuids:
         - {{ .ImpactedEntityScopeValue }}
```

tests/janitor_test.go (1)
199-203: Consider a more explicit KWOK node selection. The test assumes the last node in the list is a KWOK node. This ordering assumption may be fragile if node registration order changes.

♻️ Suggested: Explicitly select KWOK node by label or name pattern

```diff
 // use a KWOK node for the second RebootNode
-nodes, err := helpers.GetAllNodesNames(ctx, client)
-require.NoError(t, err, "failed to get cluster nodes")
-require.True(t, len(nodes) > 0, "no nodes found in cluster")
-kwokNodeName := nodes[len(nodes)-1]
+kwokNodeName, err := helpers.GetKWOKNodeName(ctx, client)
+require.NoError(t, err, "failed to get KWOK node")
```

If a dedicated helper doesn't exist, consider filtering by a label like `type=kwok` or a name prefix pattern.

tests/gpu_reset_test.go (1)
37-39: Rename test to match `TestFunctionName_Scenario_ExpectedBehavior` format. For example: `TestGPUReset_EndToEnd_Succeeds`. As per coding guidelines: Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.

🔧 Suggested rename

```diff
-func TestGPUReset(t *testing.T) {
+func TestGPUReset_EndToEnd_Succeeds(t *testing.T) {
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/helpers/kube.go`:
- Around line 537-565: When ListAllCRs returns an error inside WaitForNoCR, fail
fast instead of logging and continuing: inside the require.Never closure in
WaitForNoCR, replace the t.Logf + return false behavior on err != nil with a
call to t.Fatalf (including the error and nodeName) so the test stops
immediately when ListAllCRs fails; keep the rest of the loop and the use of
require.Never, NeverWaitTimeout and WaitInterval unchanged.
🧹 Nitpick comments (3)
tests/gpu_reset_test.go (1)
37-38: Rename the test to include scenario + expected behavior.
Consider something like `TestGPUReset_ComponentReset_Completes` to encode intent. As per coding guidelines, name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.

tests/janitor_test.go (1)
185-187: Rename the test to follow the required naming format. Consider a name like `TestJanitorNodeLocking_SameNodeSequential_DifferentNodeOverlap` to match the expected `TestFunctionName_Scenario_ExpectedBehavior` convention.

✏️ Suggested rename

```diff
-func TestJanitorNodeLocking(t *testing.T) {
+func TestJanitorNodeLocking_SameNodeSequential_DifferentNodeOverlap(t *testing.T) {
```

As per coding guidelines: Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`.
TestFunctionName_Scenario_ExpectedBehavior.tests/uat/tests.sh (1)
402-406: `wait_for_gpu_reset` already waits for uncordon, so this is redundant. You can drop the extra unquarantine wait to reduce test time.
♻️ Suggested cleanup
```diff
-wait_for_node_unquarantine "$gpu_node"
-
 log "Test 3 PASSED ✓"
```
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
Signed-off-by: Nathan Herz <nherz@nvidia.com>
Summary
Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.
This PR enables GPU reset in NVSentinel and adds an e2e test and a UAT test for this functionality. To enable GPU reset, we need to:
Note that the default chart values will set partialDrainEnabled to false and will not map COMPONENT_RESET to GPUReset. The Tilt and UAT clusters enable both settings for testing. If partial drain is enabled but COMPONENT_RESET maps to RebootNode, nodes with running pods will be rebooted; this could occur if someone consumes the node-drainer default (once we set it to true by default) while already overriding the maintenance options. The reverse is also possible: someone sets partialDrainEnabled to false but maps COMPONENT_RESET to GPUReset, which isn't dangerous but is sub-optimal.
Supporting component resets
Testing
The existing UAT confirms that a reboot occurred by checking the bootID and is agnostic of whichever fault-remediation implementation executed the reboot. For GPU reset we don't have the same benefit: a reset can't be detected independently of our GPUReset solution. For the GPUReset UAT, we detect that a reset occurred by checking for the syslog line written by the GPUReset job. Later we can change this to a generic reset line written by the driver once that's available (rather than detecting K8s objects for the GPUReset, like the CR, job, or privileged pod). Logs for the test execution (the reboot tests timed out due to high reboot latency on OCI):
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Documentation