[Bug]: Concurrent ComputeDomain status updates can lose peer entries and leave the domain non-converged

### Component

compute-domain-daemon

### Bug Description

When multiple compute-domain daemon pods update `ComputeDomain.Status.Nodes` concurrently, each pod rewrites the full `status.nodes` list from its own copy of the CD object. A write based on a stale copy can overwrite entries that peers added in the meantime, so node updates are lost and the `ComputeDomain` can stay `NotReady` even after the expected number of nodes has joined.
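
The interleaving described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `store`, `get`, `putStatus`, and the node names are stand-ins, and the store accepts every write unconditionally, so it deliberately ignores the API server's `resourceVersion` check and whatever retry logic the daemon has. It only shows how two full-list writes from stale copies end with one entry missing.

```go
package main

import "fmt"

// NodeInfo and ComputeDomain are simplified stand-ins for the real API types.
type NodeInfo struct {
	Name   string
	Status string
}

type ComputeDomain struct {
	Nodes []NodeInfo
}

// store is a hypothetical in-memory replacement for the API server that
// accepts every write unconditionally (last writer wins).
type store struct {
	cd ComputeDomain
}

// get returns a copy, like a daemon working from its own version of the CD.
func (s *store) get() ComputeDomain {
	return ComputeDomain{Nodes: append([]NodeInfo(nil), s.cd.Nodes...)}
}

// putStatus overwrites the whole node list with the caller's copy.
func (s *store) putStatus(cd ComputeDomain) {
	s.cd = ComputeDomain{Nodes: append([]NodeInfo(nil), cd.Nodes...)}
}

// simulateLostUpdate interleaves two daemons: both read, then both write.
func simulateLostUpdate() []NodeInfo {
	s := &store{}

	// Both daemons read the same (empty) status before either one writes.
	a := s.get()
	b := s.get()

	// Each daemon adds only itself to its own copy.
	a.Nodes = append(a.Nodes, NodeInfo{Name: "node-a", Status: "NotReady"})
	b.Nodes = append(b.Nodes, NodeInfo{Name: "node-b", Status: "NotReady"})

	// Both writes are accepted; the second replaces the first.
	s.putStatus(a)
	s.putStatus(b)

	return s.get().Nodes
}

func main() {
	nodes := simulateLostUpdate()
	fmt.Printf("nodes seen: %d of 2 expected: %+v\n", len(nodes), nodes)
}
```

In this sketch only `node-b` survives, which mirrors the `numNodes: 2, nodes seen: 1` pattern in the daemon logs.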

Snippets from a real cluster that show the issue:

Status:
  Nodes:
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 0
    Ip Address: 10.42.3.138
    Name: cluster1-mgx-00011
    Status: NotReady
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 1
    Ip Address: 10.42.1.100
    Name: cluster1-mgx-00017
    Status: NotReady
  Status: NotReady
Events: <none>

```
I0415 16:14:28.038085 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:28.038098 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.044285 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.044300 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.053413 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.053424 1 cdstatus.go:354] IP set for clique did not change

I0415 16:14:30.033811 1 cdstatus.go:259] CD status does not contain node name 'cluster1-mgx-00011' yet, try to insert myself: &{cluster1-mgx-00011 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady}
I0415 16:14:30.037427 1 round_trippers.go:632] "Response" verb="PUT" url="https://10.43.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=3
I0415 16:14:30.037620 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:30.037656 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.042979 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.042995 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.052870 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.052888 1 cdstatus.go:354] IP set for clique did not change

```

A unit test that reproduces the issue: https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/pull/1049

### Steps to Reproduce

1. Deploy with Helm:
`helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --namespace nvidia-dra-driver-gpu --create-namespace --set nvidiaDriverRoot=/ --set gpuResourcesEnabledOverride=true --version="25.12.0"`

Disable the `IMEXDaemonsWithDNSNames` (and `ComputeDomainCliques`) feature gates.

2. Deploy a workload that requests a `computedomain` resource across multiple nodes (> 1).
The issue reproduces with a modified version of `imex-channel-injection.yaml` (used only for testing) in which the `numNodes` field is set to `2`, and also with the MPI operator workload listed on the validation page.

```
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 2
  channel:
    allocationMode: All
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0

```

### Expected Behavior

The `ComputeDomain` moves to the `Ready` state once the expected number of nodes has joined. Example output when `ComputeDomainCliques` is enabled:

```
$ k describe computedomain nvbandwidth-test-compute-domain
Name:         nvbandwidth-test-compute-domain
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  resource.nvidia.com/v1beta1
Kind:         ComputeDomain
Metadata:
  Creation Timestamp:  2026-04-15T22:46:21Z
  Finalizers:
    resource.nvidia.com/computeDomain
  Generation:        1
  Resource Version:  4221096
  UID:               d7245c1a-3bf6-460b-90a3-d8ed8bbf12d4
Spec:
  Channel:
    Allocation Mode:  Single
    Resource Claim Template:
      Name:   nvbandwidth-test-compute-domain-channel
  Num Nodes:  2
Status:
  Nodes:
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       0
    Ip Address:  10.42.1.165
    Name:        cluster1-mgx-00017
    Status:      Ready
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       1
    Ip Address:  10.42.0.69
    Name:        cluster1-mgx-00010
    Status:      Ready
  Status:        Ready
Events:          <none>
```
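
The expected convergence rule can be written down as a small sketch. It assumes, based only on the sentence above, that the domain counts as `Ready` once at least `numNodes` distinct node entries report `Ready`; the driver's actual condition may differ. `domainStatus` and the simplified `NodeInfo` type are hypothetical.

```go
package main

import "fmt"

// NodeInfo is a simplified stand-in for an entry in status.nodes.
type NodeInfo struct {
	Name   string
	Status string
}

// domainStatus returns "Ready" once at least numNodes distinct nodes report
// Ready (assumed rule), and "NotReady" otherwise.
func domainStatus(numNodes int, nodes []NodeInfo) string {
	ready := make(map[string]bool)
	for _, n := range nodes {
		if n.Status == "Ready" {
			ready[n.Name] = true
		}
	}
	if len(ready) >= numNodes {
		return "Ready"
	}
	return "NotReady"
}

func main() {
	// Both expected nodes are present and Ready, as in the output above.
	converged := []NodeInfo{
		{Name: "cluster1-mgx-00017", Status: "Ready"},
		{Name: "cluster1-mgx-00010", Status: "Ready"},
	}
	// A peer entry was lost, so only one node is visible.
	lostPeer := []NodeInfo{
		{Name: "cluster1-mgx-00017", Status: "Ready"},
	}

	fmt.Println(domainStatus(2, converged))
	fmt.Println(domainStatus(2, lostPeer))
}
```

Under this rule, a lost peer entry is enough to keep the domain `NotReady` regardless of how healthy the remaining node is.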

### DRA Driver Version

v25.12.0

### Kubernetes Version

v1.35 rke

### GPU Model

NVIDIA GB300

### NVIDIA Driver Version

_No response_

### OS / Kernel

_No response_

### Container Runtime

_No response_

### Feature Gates (non-default settings)

IMEXDaemonsWithDNSNames (disabled)

### Helm Values (non-default)

```yaml

```

### Relevant Logs

```shell

```

### Debug Information Attached

- [ ] `kubectl get pods -n dra-driver-nvidia-gpu -o wide`
- [ ] `kubectl get resourceclaims -n <namespace>`
- [ ] `kubectl get resourceslices.resource.k8s.io`
- [ ] `kubectl describe pod <pod>` or `kubectl events --for pod/<pod>`
- [ ] Kubelet plugin logs: `kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400`
- [ ] `nvidia-smi` output from the host
- [ ] Kubelet logs

### IMEX / ComputeDomain Debug Information (if applicable)

- [ ] Host IMEX service disabled: `systemctl status nvidia-imex.service` (must be masked/disabled)
- [ ] Node clique labels: `kubectl get nodes -L nvidia.com/gpu.clique`
- [ ] Per-GPU clique info: `nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'`
- [ ] NVLink/NVSwitch topology: `nvidia-smi topo -m`
- [ ] ComputeDomain status: `kubectl get computedomains.resource.nvidia.com -o yaml`
- [ ] ComputeDomainClique status: `kubectl get computedomaincliques.resource.nvidia.com -o yaml`
- [ ] IMEX daemon pods: `kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain`
- [ ] IMEX daemon logs: `kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1`
- [ ] IMEX domain status: `nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N`

[Bug]: Concurrent ComputeDomain status updates can lose peer entries and leave the domain non-converged #1050
