[Bug]: compute-domain plugin stale PrepareResources retry can succeed after UnprepareResources and leave channel allocated #1162

@khushir-nv

Component

compute-domain-kubelet-plugin

Bug Description

We observed a ComputeDomain DRA retry ordering issue where a stale NodePrepareResources retry continued after an authoritative NodeUnprepareResources for the same ResourceClaim had already completed.

The stale prepare retry later succeeded and wrote a PrepareCompleted checkpoint entry for a claim/pod that had already been torn down. This left channel-0 allocated to the old claim UID, causing later workloads on the same node to fail prepare with:

requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289

Node: 10.0.141.6
Plugin pod: nvidia-dra-driver-gpu-kubelet-plugin-7wxlq
Container: compute-domains
Claim: <namespace>-<pod-name>-compute-domain-channel-8gwxw
Claim UID: 4914da61-1a20-42b3-bf26-981025df4289

Timeline from logs:

  22:44:58 Prepare retry attempt 1: ComputeDomain not ready
  22:44:58 Prepare retry attempt 2: ComputeDomain not ready
  22:44:59 Prepare retry attempt 3: ComputeDomain not ready
  22:45:01 Prepare retry attempt 4: ComputeDomain not ready
  22:45:04 Prepare retry attempt 5: ComputeDomain not ready
  22:45:05 Unprepared devices for same claim UID
  22:45:07 Prepare retry attempt 6 still runs after unprepare
  22:45:10 Prepare retry attempt 7 still runs after unprepare
  22:45:13 Prepare retry attempt 8 still runs after unprepare
  22:45:16 Prepared devices for same claim UID

After UnprepareResources completed for claim UID 4914da61-1a20-42b3-bf26-981025df4289 at 22:45:05, later prepare retries for the same claim UID continued and eventually succeeded at 22:45:16.

Actual behavior: stale prepare retry succeeded after unprepare and left channel-0 allocated to the old claim UID.

Impact: later workloads on the same node failed to prepare because the plugin believed channel-0 was still allocated to the stale claim UID.
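
To make the ordering concrete, here is a minimal Go model of the failure mode. It is a sketch under stated assumptions, not the plugin's code: the `checkpoint` type, its `prepare`/`unprepare` methods, and the placeholder claim UIDs are all hypothetical stand-ins. The only assumption about the real plugin is what the logs show, namely that prepare refuses a device already held by a different claim UID.

```go
package main

import "fmt"

// checkpoint is a hypothetical stand-in for the plugin's prepared-claims
// state: it maps a device name to the UID of the claim that holds it.
type checkpoint map[string]string

// prepare allocates device to claimUID unless another claim already holds it.
func (c checkpoint) prepare(claimUID, device string) error {
	if owner, ok := c[device]; ok && owner != claimUID {
		return fmt.Errorf("requested device %s is already allocated to different claim %s", device, owner)
	}
	c[device] = claimUID
	return nil
}

// unprepare releases every device held by claimUID.
func (c checkpoint) unprepare(claimUID string) {
	for device, owner := range c {
		if owner == claimUID {
			delete(c, device)
		}
	}
}

// replayOrdering replays the order of events from the timeline:
// unprepare completes first, a stale prepare retry for the same claim
// succeeds afterwards, and a new claim then asks for the same device.
func replayOrdering() (checkpoint, error) {
	const oldClaim, newClaim = "old-claim-uid", "new-claim-uid"
	cp := checkpoint{}

	cp.unprepare(oldClaim) // 22:45:05: nothing is allocated yet, so nothing is released

	// 22:45:16: the stale retry finally succeeds and records the allocation.
	if err := cp.prepare(oldClaim, "channel-0"); err != nil {
		return cp, err
	}

	// 22:45:43: the next workload on the node requests channel-0.
	return cp, cp.prepare(newClaim, "channel-0")
}

func main() {
	cp, err := replayOrdering()
	fmt.Println("checkpoint:", cp)
	fmt.Println("new claim error:", err)
}
```

In this model nothing ever releases `channel-0` again, because the only unprepare for the old claim ran before the allocation existed.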

Steps to Reproduce

We have not been able to reproduce this issue. It was observed once, in the kubelet plugin logs shown under Relevant Logs.

Expected Behavior

After successful UnprepareResources for a claim UID, no later PrepareResources retry for that same claim UID should mutate plugin state, allocate a channel, or create a PrepareCompleted checkpoint entry.

Stale prepare retries should be cancelled, skipped before execution, or fail without mutating checkpoint/device state.
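
One possible way to get the "fail without mutating state" behavior is a per-claim generation token that unprepare bumps and that a prepare retry must present before it commits. The Go sketch below illustrates the idea only. `claimGuard`, `begin`, `commit`, and `errStale` are hypothetical names, and how such a guard would fit into the plugin's actual work queue and checkpoint code is not something this issue establishes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStale is returned when a prepare retry outlives an unprepare of its claim.
var errStale = errors.New("stale prepare: claim was unprepared after this request was queued")

// claimGuard is a hypothetical guard around allocation state.
type claimGuard struct {
	mu         sync.Mutex
	generation map[string]uint64 // claim UID -> counter, bumped on every unprepare
	allocated  map[string]string // device -> claim UID
}

func newClaimGuard() *claimGuard {
	return &claimGuard{
		generation: map[string]uint64{},
		allocated:  map[string]string{},
	}
}

// begin is called once when a prepare request is queued. It returns the
// token that every retry of that request must present to commit.
func (g *claimGuard) begin(claimUID string) uint64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.generation[claimUID]
}

// commit records the allocation only if the claim has not been unprepared
// since begin. On any error it leaves the state untouched.
func (g *claimGuard) commit(claimUID, device string, token uint64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.generation[claimUID] != token {
		return errStale
	}
	if owner, ok := g.allocated[device]; ok && owner != claimUID {
		return fmt.Errorf("requested device %s is already allocated to different claim %s", device, owner)
	}
	g.allocated[device] = claimUID
	return nil
}

// unprepare releases the claim's devices and invalidates in-flight retries.
func (g *claimGuard) unprepare(claimUID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for device, owner := range g.allocated {
		if owner == claimUID {
			delete(g.allocated, device)
		}
	}
	g.generation[claimUID]++
}

func main() {
	g := newClaimGuard()
	token := g.begin("old-claim-uid") // prepare queued, retries start
	g.unprepare("old-claim-uid")      // unprepare completes mid-retry

	fmt.Println("stale retry:", g.commit("old-claim-uid", "channel-0", token))
	fmt.Println("new claim:", g.commit("new-claim-uid", "channel-0", g.begin("new-claim-uid")))
}
```

A real implementation would also need to bound or persist the generation map, and would probably cancel the retry loop as well, so that a stale request stops consuming queue time instead of failing only at commit.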

DRA Driver Version

v25.12.0

Kubernetes Version

v1.33.1

GPU Model

NVIDIA GB200

NVIDIA Driver Version

580.82.07

OS / Kernel

Ubuntu 24.04.2 LTS, Kernel: 6.8.0-1039-nvidia-64k

Container Runtime

cri-o://1.33.2

Feature Gates (non-default settings)

ComputeDomainCliques (disabled)

Helm Values (non-default)

Driver Helm feature gate:
ComputeDomainCliques: false

Relevant Logs

kubectl -n nvidia-dra-driver-gpu logs <kubelet-plugin-pod> -c compute-domains --since=96h | grep '<claim-uid>'

I0523 22:44:58.336698       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 1)

I0523 22:44:58.852154       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 2)

I0523 22:44:59.860455       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 3)

I0523 22:45:01.867376       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 4)

I0523 22:45:04.873682       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 5)

I0523 22:45:05.074122       1 driver.go:290] Unprepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289'

I0523 22:45:07.894511       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 6)

I0523 22:45:10.901863       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 7)

I0523 22:45:13.908809       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 8)

I0523 22:45:16.918489       1 driver.go:272] Prepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': [{[channel] <node-ip> channel-0 [k8s.compute-domain.nvidia.com/claim=4914da61-1a20-42b3-bf26-981025df4289-channel-0]}]

I0523 22:45:43.452016       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 1)

I0523 22:45:43.953184       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 2)

Debug Information Attached

  • kubectl get pods -n dra-driver-nvidia-gpu -o wide
  • kubectl get resourceclaims -n <namespace>
  • kubectl get resourceslices.resource.k8s.io
  • kubectl describe pod <pod> or kubectl events --for pod/<pod>
  • Kubelet plugin logs: kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
  • nvidia-smi output from the host
  • Kubelet logs

IMEX / ComputeDomain Debug Information (if applicable)

  • Host IMEX service disabled: systemctl status nvidia-imex.service (must be masked/disabled)
  • Node clique labels: kubectl get nodes -L nvidia.com/gpu.clique
  • Per-GPU clique info: nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
  • NVLink/NVSwitch topology: nvidia-smi topo -m
  • ComputeDomain status: kubectl get computedomains.resource.nvidia.com -o yaml
  • ComputeDomainClique status: kubectl get computedomaincliques.resource.nvidia.com -o yaml
  • IMEX daemon pods: kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
  • IMEX daemon logs: kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
  • IMEX domain status: nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N

Projects

Status
In Progress
