Component
compute-domain-kubelet-plugin
Bug Description
We observed a ComputeDomain DRA retry ordering issue where a stale NodePrepareResources retry continued after an authoritative NodeUnprepareResources for the same ResourceClaim had already completed.
The stale prepare retry later succeeded and wrote a PrepareCompleted checkpoint entry for a claim/pod that had already been torn down. This left channel-0 allocated to the old claim UID, causing later workloads on the same node to fail prepare with:
requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289
Node: 10.0.141.6
Plugin pod: nvidia-dra-driver-gpu-kubelet-plugin-7wxlq
Container: compute-domains
Claim: <namespace>-<pod-name>-compute-domain-channel-8gwxw
Claim UID: 4914da61-1a20-42b3-bf26-981025df4289
Timeline from logs:
22:44:58 Prepare retry attempt 1: ComputeDomain not ready
22:44:58 Prepare retry attempt 2: ComputeDomain not ready
22:44:59 Prepare retry attempt 3: ComputeDomain not ready
22:45:01 Prepare retry attempt 4: ComputeDomain not ready
22:45:04 Prepare retry attempt 5: ComputeDomain not ready
22:45:05 Unprepared devices for same claim UID
22:45:07 Prepare retry attempt 6 runs after unprepare: ComputeDomain not ready
22:45:10 Prepare retry attempt 7 runs after unprepare: ComputeDomain not ready
22:45:13 Prepare retry attempt 8 runs after unprepare: ComputeDomain not ready
22:45:16 Prepare succeeds for same claim UID and allocates channel-0
After UnprepareResources completed for claim UID 4914da61-1a20-42b3-bf26-981025df4289 at 22:45:05, later prepare retries for the same claim UID continued and eventually succeeded at 22:45:16.
Actual behavior: stale prepare retry succeeded after unprepare and left channel-0 allocated to the old claim UID.
Impact: later workloads on the same node failed to prepare because the plugin believed channel-0 was still allocated to the stale claim UID.
Steps to Reproduce
We have not been able to reproduce this issue. It occurred once and was identified from the plugin logs.
Expected Behavior
After successful UnprepareResources for a claim UID, no later PrepareResources retry for that same claim UID should mutate plugin state, allocate a channel, or create a PrepareCompleted checkpoint entry.
Stale prepare retries should be cancelled, skipped before execution, or fail without mutating checkpoint/device state.
DRA Driver Version
v25.12.0
Kubernetes Version
v1.33.1
GPU Model
NVIDIA GB200
NVIDIA Driver Version
580.82.07
OS / Kernel
Ubuntu 24.04.2 LTS, Kernel: 6.8.0-1039-nvidia-64k
Container Runtime
cri-o://1.33.2
Feature Gates (non-default settings)
ComputeDomainCliques (disabled)
Helm Values (non-default)
Driver Helm feature gate:
ComputeDomainCliques: false
Relevant Logs
kubectl -n nvidia-dra-driver-gpu logs <kubelet-plugin-pod> -c compute-domains --since=96h | grep '<claim-uid>'
I0523 22:44:58.336698 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 1)
I0523 22:44:58.852154 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 2)
I0523 22:44:59.860455 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 3)
I0523 22:45:01.867376 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 4)
I0523 22:45:04.873682 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 5)
I0523 22:45:05.074122 1 driver.go:290] Unprepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289'
I0523 22:45:07.894511 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 6)
I0523 22:45:10.901863 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 7)
I0523 22:45:13.908809 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 8)
I0523 22:45:16.918489 1 driver.go:272] Prepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': [{[channel] <node-ip> channel-0 [k8s.compute-domain.nvidia.com/claim=4914da61-1a20-42b3-bf26-981025df4289-channel-0]}]
I0523 22:45:43.452016 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 1)
I0523 22:45:43.953184 1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 2)
Debug Information Attached
kubectl get pods -n dra-driver-nvidia-gpu -o wide
kubectl get resourceclaims -n <namespace>
kubectl get resourceslices.resource.k8s.io
kubectl describe pod <pod> or kubectl events --for pod/<pod>
kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
nvidia-smi output from the host
IMEX / ComputeDomain Debug Information (if applicable)
systemctl status nvidia-imex.service (must be masked/disabled)
kubectl get nodes -L nvidia.com/gpu.clique
nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
nvidia-smi topo -m
kubectl get computedomains.resource.nvidia.com -o yaml
kubectl get computedomaincliques.resource.nvidia.com -o yaml
kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N