[Bug]: compute-domain plugin stale PrepareResources retry can succeed after UnprepareResources and leave channel allocated #1162

### Component

compute-domain-kubelet-plugin

### Bug Description

We observed a ComputeDomain DRA retry ordering issue where a stale NodePrepareResources retry continued after an authoritative NodeUnprepareResources for the same ResourceClaim had already completed.

The stale prepare retry later succeeded and wrote a PrepareCompleted checkpoint entry for a claim/pod that had already been torn down. This left channel-0 allocated to the old claim UID, causing later workloads on the same node to fail prepare with:

  `requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289`

- Node: 10.0.141.6
- Plugin pod: nvidia-dra-driver-gpu-kubelet-plugin-7wxlq
- Container: compute-domains
- Claim: `<namespace>-<pod-name>-compute-domain-channel-8gwxw`
- Claim UID: 4914da61-1a20-42b3-bf26-981025df4289

  Timeline from logs:

```
  22:44:58 Prepare retry attempt 1: ComputeDomain not ready
  22:44:58 Prepare retry attempt 2: ComputeDomain not ready
  22:44:59 Prepare retry attempt 3: ComputeDomain not ready
  22:45:01 Prepare retry attempt 4: ComputeDomain not ready
  22:45:04 Prepare retry attempt 5: ComputeDomain not ready
  22:45:05 Unprepared devices for same claim UID
  22:45:07 Prepare retry attempt 6 still runs after unprepare
  22:45:10 Prepare retry attempt 7 still runs after unprepare
  22:45:13 Prepare retry attempt 8 still runs after unprepare
  22:45:16 Prepared devices for same claim UID
```

After UnprepareResources completed for claim UID `4914da61-1a20-42b3-bf26-981025df4289` at `22:45:05`, prepare retries for the same claim UID kept running and eventually succeeded at `22:45:16`.

Actual behavior: stale prepare retry succeeded after unprepare and left channel-0 allocated to the old claim UID.

Impact: later workloads on the same node failed to prepare because the plugin believed channel-0 was still allocated to the stale claim UID.

### Steps to Reproduce

We have not been able to reproduce this issue. It was observed once, in the plugin logs.

### Expected Behavior

After successful UnprepareResources for a claim UID, no later PrepareResources retry for that same claim UID should mutate plugin state, allocate a channel, or create a PrepareCompleted checkpoint entry.

Stale prepare retries should be cancelled, skipped before execution, or fail without mutating checkpoint/device state.
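
One way to get the "fail without mutating state" behavior is a per-claim generation check at commit time. The sketch below is a toy model, not the driver's code: `claimGuard`, `Begin`, `Commit`, and `ErrStalePrepare` are hypothetical names, and the `allocated` map only stands in for the checkpoint. It assumes every retry of a prepare request carries the generation captured when the request was first accepted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStalePrepare marks a prepare attempt that was superseded by an unprepare.
var ErrStalePrepare = errors.New("prepare superseded by unprepare")

// claimGuard is a hypothetical helper (not part of the driver). It tracks a
// per-claim generation and a device-to-claim map standing in for the checkpoint.
type claimGuard struct {
	mu        sync.Mutex
	gen       map[string]uint64 // claim UID -> generation, bumped on every unprepare
	allocated map[string]string // device name -> owning claim UID
}

func newClaimGuard() *claimGuard {
	return &claimGuard{gen: map[string]uint64{}, allocated: map[string]string{}}
}

// Begin returns the generation a prepare request (and all of its retries) must carry.
func (g *claimGuard) Begin(uid string) uint64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.gen[uid]
}

// Commit records the allocation only if no unprepare ran since Begin.
func (g *claimGuard) Commit(uid string, gen uint64, device string) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.gen[uid] != gen {
		return ErrStalePrepare
	}
	if owner, ok := g.allocated[device]; ok && owner != uid {
		return fmt.Errorf("requested device %s is already allocated to different claim %s", device, owner)
	}
	g.allocated[device] = uid
	return nil
}

// Unprepare invalidates in-flight prepares for the claim and frees its devices.
func (g *claimGuard) Unprepare(uid string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.gen[uid]++
	for device, owner := range g.allocated {
		if owner == uid {
			delete(g.allocated, device)
		}
	}
}

func main() {
	g := newClaimGuard()
	oldUID, newUID := "4914da61-1a20-42b3-bf26-981025df4289", "857f3929-869c-400b-8fbd-de146c0f080e"

	gen := g.Begin(oldUID) // prepare starts; early retries fail with "not ready"
	g.Unprepare(oldUID)    // authoritative unprepare arrives while retries are pending
	fmt.Println("stale retry:", g.Commit(oldUID, gen, "channel-0"))
	fmt.Println("next claim:", g.Commit(newUID, g.Begin(newUID), "channel-0"))
}
```

A generation counter is used here rather than a permanent per-UID tombstone so that a genuinely new PrepareResources call for the same claim UID could still succeed after an unprepare. A real fix would also need to bound the `gen` map and could additionally cancel the queued retry instead of only rejecting it at commit.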

### DRA Driver Version

v25.12.0

### Kubernetes Version

v1.33.1

### GPU Model

NVIDIA GB200

### NVIDIA Driver Version

580.82.07

### OS / Kernel

Ubuntu 24.04.2 LTS, Kernel: 6.8.0-1039-nvidia-64k

### Container Runtime

cri-o://1.33.2

### Feature Gates (non-default settings)

ComputeDomainCliques (disabled)

### Helm Values (non-default)

```yaml
# Driver Helm feature gate
ComputeDomainCliques: false
```

### Relevant Logs

```shell
kubectl -n nvidia-dra-driver-gpu logs <kubelet-plugin-pod> -c compute-domains --since=96h | grep '<claim-uid>'

I0523 22:44:58.336698       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 1)

I0523 22:44:58.852154       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 2)

I0523 22:44:59.860455       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 3)

I0523 22:45:01.867376       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 4)

I0523 22:45:04.873682       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 5)

I0523 22:45:05.074122       1 driver.go:290] Unprepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289'

I0523 22:45:07.894511       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 6)

I0523 22:45:10.901863       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 7)

I0523 22:45:13.908809       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': prepare devices failed: error applying config: error asserting ComputeDomain Ready: current node not ready in ComputeDomain (attempt 8)

I0523 22:45:16.918489       1 driver.go:272] Prepared devices for claim '<namespace>/<pod-name>-compute-domain-channel-8gwxw:4914da61-1a20-42b3-bf26-981025df4289': [{[channel] <node-ip> channel-0 [k8s.compute-domain.nvidia.com/claim=4914da61-1a20-42b3-bf26-981025df4289-channel-0]}]

I0523 22:45:43.452016       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 1)

I0523 22:45:43.953184       1 workqueue.go:171] Reconcile: error preparing devices for claim '<namespace>/<pod-name>-compute-domain-channel-dhm8v:857f3929-869c-400b-8fbd-de146c0f080e': unable to prepare claim 857f3929-869c-400b-8fbd-de146c0f080e: requested device channel-0 is already allocated to different claim 4914da61-1a20-42b3-bf26-981025df4289 (attempt 2)
```
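
To check other nodes for the same ordering, the log lines above can be scanned for a `Prepared devices` event that follows an `Unprepared devices` event for the same claim UID. The sketch below is a triage heuristic only: the regular expression is inferred from the two `driver.go` messages shown in this issue, `findPrepareAfterUnprepare` is a hypothetical name, and a hit may be legitimate if the same claim was prepared again on purpose.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// eventRe matches the "Prepared"/"Unprepared devices for claim '<ns>/<name>:<uid>'"
// messages as they appear in the logs quoted in this issue (assumed format).
var eventRe = regexp.MustCompile(`\b(Prepared|Unprepared) devices for claim '[^']*:([0-9a-f-]{36})'`)

// findPrepareAfterUnprepare returns, in first-seen order, each claim UID for
// which a "Prepared" event appears after an earlier "Unprepared" event.
func findPrepareAfterUnprepare(lines []string) []string {
	unprepared := map[string]bool{}
	flagged := map[string]bool{}
	var uids []string
	for _, line := range lines {
		m := eventRe.FindStringSubmatch(line)
		if m == nil {
			continue // retry errors and unrelated lines are ignored
		}
		event, uid := m[1], m[2]
		switch event {
		case "Unprepared":
			unprepared[uid] = true
		case "Prepared":
			if unprepared[uid] && !flagged[uid] {
				flagged[uid] = true
				uids = append(uids, uid)
			}
		}
	}
	return uids
}

func main() {
	// Read plugin logs from stdin; allow long lines (up to 1 MiB).
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	for _, uid := range findPrepareAfterUnprepare(lines) {
		fmt.Println(uid)
	}
}
```

The program reads from standard input, so the output of the `kubectl logs` command above (without the `grep`) can be piped into it. It prints one claim UID per suspicious ordering.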


### Debug Information Attached

- [ ] `kubectl get pods -n dra-driver-nvidia-gpu -o wide`
- [ ] `kubectl get resourceclaims -n <namespace>`
- [ ] `kubectl get resourceslices.resource.k8s.io`
- [ ] `kubectl describe pod <pod>` or `kubectl events --for pod/<pod>`
- [x] Kubelet plugin logs: `kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400`
- [ ] `nvidia-smi` output from the host
- [ ] Kubelet logs

### IMEX / ComputeDomain Debug Information (if applicable)

- [ ] Host IMEX service disabled: `systemctl status nvidia-imex.service` (must be masked/disabled)
- [ ] Node clique labels: `kubectl get nodes -L nvidia.com/gpu.clique`
- [ ] Per-GPU clique info: `nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'`
- [ ] NVLink/NVSwitch topology: `nvidia-smi topo -m`
- [ ] ComputeDomain status: `kubectl get computedomains.resource.nvidia.com -o yaml`
- [ ] ComputeDomainClique status: `kubectl get computedomaincliques.resource.nvidia.com -o yaml`
- [ ] IMEX daemon pods: `kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain`
- [ ] IMEX daemon logs: `kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1`
- [ ] IMEX domain status: `nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N`
