Component
compute-domain-daemon
Bug Description
When multiple compute-domain daemon pods update ComputeDomain.Status.Nodes concurrently, each pod rewrites the full status.nodes list from its own, possibly stale, copy of the CD object. A later write can therefore drop a node entry that another pod has just added, leaving a ComputeDomain that stays NotReady even after the expected number of nodes has joined.
Snippets from a real cluster that highlight the issue:
Status:
  Nodes:
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 0
    Ip Address: 10.42.3.138
    Name: cluster1-mgx-00011
    Status: NotReady
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 1
    Ip Address: 10.42.1.100
    Name: cluster1-mgx-00017
    Status: NotReady
  Status: NotReady
Events:
I0415 16:14:28.038085 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:28.038098 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.044285 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.044300 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.053413 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.053424 1 cdstatus.go:354] IP set for clique did not change
I0415 16:14:30.033811 1 cdstatus.go:259] CD status does not contain node name 'cluster1-mgx-00011' yet, try to insert myself: &{cluster1-mgx-00011 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady}
I0415 16:14:30.037427 1 round_trippers.go:632] "Response" verb="PUT" url="https://10.43.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=3
I0415 16:14:30.037620 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:30.037656 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.042979 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.042995 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.052870 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.052888 1 cdstatus.go:354] IP set for clique did not change
A unit test that reproduces the issue is available in #1049.
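The race described in the bug description can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `Store`, `CD`, and `Node` types, the `Version` counter, and the `check` flag are hypothetical and are neither the driver's code nor the Kubernetes API. The sketch compares an unconditional full-list write with a version-checked write that re-reads and retries on conflict.

```go
package main

import (
	"errors"
	"fmt"
)

// Node keeps only two of the fields visible in the status output above.
type Node struct {
	Name   string
	Status string
}

// CD is a hypothetical stand-in for a ComputeDomain object.
type CD struct {
	Version int
	Nodes   []Node
}

// Store is a hypothetical in-memory object store, not the Kubernetes API.
type Store struct{ cd CD }

var ErrConflict = errors.New("conflict: object was modified")

// Get returns a copy so callers cannot mutate the stored object.
func (s *Store) Get() CD {
	c := s.cd
	c.Nodes = append([]Node(nil), s.cd.Nodes...)
	return c
}

// Put replaces the whole node list. With check=true it rejects writes
// that were based on an outdated Version.
func (s *Store) Put(c CD, check bool) error {
	if check && c.Version != s.cd.Version {
		return ErrConflict
	}
	c.Version = s.cd.Version + 1
	c.Nodes = append([]Node(nil), c.Nodes...)
	s.cd = c
	return nil
}

// upsert inserts n, or updates the entry with the same name.
func upsert(nodes []Node, n Node) []Node {
	for i := range nodes {
		if nodes[i].Name == n.Name {
			nodes[i] = n
			return nodes
		}
	}
	return append(nodes, n)
}

// NaiveRace: both daemons read the same snapshot, then each writes back
// its full node list. The second write drops the first daemon's entry.
func NaiveRace() int {
	s := &Store{}
	a, b := s.Get(), s.Get()
	a.Nodes = upsert(a.Nodes, Node{Name: "cluster1-mgx-00011", Status: "NotReady"})
	b.Nodes = upsert(b.Nodes, Node{Name: "cluster1-mgx-00017", Status: "NotReady"})
	_ = s.Put(a, false)
	_ = s.Put(b, false)
	return len(s.Get().Nodes)
}

// RetryRace: same interleaving, but stale writes are rejected and the
// losing daemon re-reads the object and reapplies its own entry.
func RetryRace() int {
	s := &Store{}
	a, b := s.Get(), s.Get()
	a.Nodes = upsert(a.Nodes, Node{Name: "cluster1-mgx-00011", Status: "NotReady"})
	_ = s.Put(a, true)
	nb := Node{Name: "cluster1-mgx-00017", Status: "NotReady"}
	b.Nodes = upsert(b.Nodes, nb)
	for errors.Is(s.Put(b, true), ErrConflict) {
		b = s.Get()
		b.Nodes = upsert(b.Nodes, nb)
	}
	return len(s.Get().Nodes)
}

func main() {
	fmt.Println("naive:", NaiveRace(), "retry:", RetryRace())
}
```

In this model the unconditional write ends with one node in the list and the version-checked retry ends with two. Whether the real fix should rely on conflict retries, a patch that touches only the pod's own entry, or something else is left to the maintainers.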
Steps to Reproduce
- Deploy with helm:
  helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --namespace nvidia-dra-driver-gpu --create-namespace --set nvidiaDriverRoot=/ --set gpuResourcesEnabledOverride=true --version="25.12.0"
  Disable the IMEXDaemonsWithDNSNames (and ComputeDomainCliques) feature flags.
- Deploy a workload requesting a computedomain resource with multiple nodes (> 1).
I was able to reproduce the issue with a modified version of imex-channel-injection.yaml (used only for testing) in which the numNodes field is set to 2, and also with the MPI operator workload listed on the validation page.
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 2
  channel:
    allocationMode: All
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
Expected Behavior
The ComputeDomain moves to the Ready state once the expected number of nodes has joined. Example output when ComputeDomainCliques is enabled:
$ k describe computedomain nvbandwidth-test-compute-domain
Name: nvbandwidth-test-compute-domain
Namespace: default
Labels: <none>
Annotations: <none>
API Version: resource.nvidia.com/v1beta1
Kind: ComputeDomain
Metadata:
  Creation Timestamp: 2026-04-15T22:46:21Z
  Finalizers:
    resource.nvidia.com/computeDomain
  Generation: 1
  Resource Version: 4221096
  UID: d7245c1a-3bf6-460b-90a3-d8ed8bbf12d4
Spec:
  Channel:
    Allocation Mode: Single
    Resource Claim Template:
      Name: nvbandwidth-test-compute-domain-channel
  Num Nodes: 2
Status:
  Nodes:
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 0
    Ip Address: 10.42.1.165
    Name: cluster1-mgx-00017
    Status: Ready
    Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index: 1
    Ip Address: 10.42.0.69
    Name: cluster1-mgx-00010
    Status: Ready
  Status: Ready
Events: <none>
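The expected behavior above can be expressed as a small check. This is a sketch only: the `NodeInfo` type and the `DomainReady` function are hypothetical names, and the rule (at least numNodes entries reporting Ready) is an assumption inferred from the output above, not the driver's actual readiness logic.

```go
package main

import "fmt"

// NodeInfo holds the two status fields this sketch needs (hypothetical type).
type NodeInfo struct {
	Name   string
	Status string
}

// DomainReady reports whether at least numNodes entries are "Ready".
// Assumption: this rule is illustrative, not the driver's exact logic.
func DomainReady(numNodes int, nodes []NodeInfo) bool {
	ready := 0
	for _, n := range nodes {
		if n.Status == "Ready" {
			ready++
		}
	}
	return ready >= numNodes
}

func main() {
	nodes := []NodeInfo{
		{Name: "cluster1-mgx-00017", Status: "Ready"},
		{Name: "cluster1-mgx-00010", Status: "Ready"},
	}
	fmt.Println(DomainReady(2, nodes))     // both entries present and Ready
	fmt.Println(DomainReady(2, nodes[:1])) // one entry missing from the list
}
```

Under this assumed rule, a node entry dropped from status.nodes by a concurrent write keeps the domain from ever being counted as Ready, which matches the symptom reported here.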
DRA Driver Version
v25.12.0
Kubernetes Version
v1.35 rke
GPU Model
NVIDIA GB300
NVIDIA Driver Version
No response
OS / Kernel
No response
Container Runtime
No response
Feature Gates (non-default settings)
IMEXDaemonsWithDNSNames (disabled)
Helm Values (non-default)
Relevant Logs
Debug Information Attached
- kubectl get pods -n dra-driver-nvidia-gpu -o wide
- kubectl get resourceclaims -n <namespace>
- kubectl get resourceslices.resource.k8s.io
- kubectl describe pod <pod> or kubectl events --for pod/<pod>
- kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
- nvidia-smi output from the host
IMEX / ComputeDomain Debug Information (if applicable)
- systemctl status nvidia-imex.service (must be masked/disabled)
- kubectl get nodes -L nvidia.com/gpu.clique
- nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
- nvidia-smi topo -m
- kubectl get computedomains.resource.nvidia.com -o yaml
- kubectl get computedomaincliques.resource.nvidia.com -o yaml
- kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
- kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
- nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N