[Bug]: Concurrent ComputeDomain status updates can lose peer entries and leave the domain non-converged #1050

@aditighag

Component

compute-domain-daemon

Bug Description

When multiple compute-domain daemon pods update ComputeDomain.Status.Nodes concurrently, each pod rewrites the full status.nodes list from its own copy of the CD object. A later write can therefore drop node entries added by other pods, leaving a ComputeDomain that stays NotReady even after the expected number of nodes have joined.
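
The race described above can be sketched in isolation. The following Go program is a minimal in-memory model, not the driver's code: `NodeInfo`, `Store`, `Get`, `Put`, and `simulateLostUpdate` are hypothetical names, and the store is assumed to accept full-list overwrites without any version check. Two daemons read the same empty snapshot, each adds only itself, and the second write drops the first daemon's entry.

```go
package main

import "fmt"

// NodeInfo is a hypothetical, trimmed-down version of a status.nodes entry.
type NodeInfo struct {
	Name      string
	IPAddress string
}

// Store is a hypothetical in-memory stand-in for the stored CD status.
type Store struct {
	nodes []NodeInfo
}

// Get returns a copy of the node list, like a daemon's own snapshot of the CD.
func (s *Store) Get() []NodeInfo {
	return append([]NodeInfo(nil), s.nodes...)
}

// Put replaces the whole node list. Assumption: last writer wins, no version check.
func (s *Store) Put(nodes []NodeInfo) {
	s.nodes = append([]NodeInfo(nil), nodes...)
}

// simulateLostUpdate interleaves two full-list rewrites that are both
// based on the same stale (empty) snapshot.
func simulateLostUpdate() []NodeInfo {
	store := &Store{}

	// Both daemons read before either one writes.
	snapA := store.Get()
	snapB := store.Get()

	// Each daemon adds only itself to its own snapshot and writes the full list.
	store.Put(append(snapA, NodeInfo{Name: "cluster1-mgx-00011", IPAddress: "10.42.3.138"}))
	store.Put(append(snapB, NodeInfo{Name: "cluster1-mgx-00017", IPAddress: "10.42.1.100"}))

	return store.Get()
}

func main() {
	final := simulateLostUpdate()
	fmt.Printf("nodes stored: %d\n", len(final))
	for _, n := range final {
		fmt.Printf("  %s (%s)\n", n.Name, n.IPAddress)
	}
}
```

With this interleaving the store ends up with one entry instead of two, because the second `Put` was computed from a snapshot that never contained the first daemon's entry.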

Snippets from a real cluster that highlight the issue:

Status:
  Nodes:
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       0
    Ip Address:  10.42.3.138
    Name:        cluster1-mgx-00011
    Status:      NotReady
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       1
    Ip Address:  10.42.1.100
    Name:        cluster1-mgx-00017
    Status:      NotReady
  Status:        NotReady
Events:

I0415 16:14:28.038085 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:28.038098 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.044285 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.044300 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.053413 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.053424 1 cdstatus.go:354] IP set for clique did not change

I0415 16:14:30.033811 1 cdstatus.go:259] CD status does not contain node name 'cluster1-mgx-00011' yet, try to insert myself: &{cluster1-mgx-00011 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady}
I0415 16:14:30.037427 1 round_trippers.go:632] "Response" verb="PUT" url="https://10.43.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=3
I0415 16:14:30.037620 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:30.037656 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.042979 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.042995 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.052870 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.052888 1 cdstatus.go:354] IP set for clique did not change

A unit test that reproduces the issue is in #1049.
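
For context, one common way to avoid this class of lost update is optimistic concurrency: each write carries the version it was based on, a stale write is rejected, and the writer re-reads and reapplies only its own entry. The sketch below models that pattern in memory. It is not the driver's code and not a description of any actual fix; `Snapshot`, `VersionedStore`, `ErrConflict`, `withSelf`, `upsertSelf`, and `runScenario` are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeInfo is a hypothetical, trimmed-down version of a status.nodes entry.
type NodeInfo struct {
	Name      string
	IPAddress string
}

// Snapshot is a hypothetical versioned view of the CD status.
type Snapshot struct {
	Version int
	Nodes   []NodeInfo
}

// ErrConflict signals that a write was based on a stale snapshot.
var ErrConflict = errors.New("conflict: snapshot is stale")

const maxAttempts = 5

// VersionedStore is an in-memory stand-in that rejects stale writes.
type VersionedStore struct {
	current Snapshot
}

// Get returns a copy of the current snapshot.
func (s *VersionedStore) Get() Snapshot {
	return Snapshot{Version: s.current.Version, Nodes: append([]NodeInfo(nil), s.current.Nodes...)}
}

// Put accepts a write only if it was based on the current version.
func (s *VersionedStore) Put(snap Snapshot) error {
	if snap.Version != s.current.Version {
		return ErrConflict
	}
	s.current = Snapshot{Version: snap.Version + 1, Nodes: append([]NodeInfo(nil), snap.Nodes...)}
	return nil
}

// withSelf returns a copy of nodes with self inserted, or updated if already present.
func withSelf(nodes []NodeInfo, self NodeInfo) []NodeInfo {
	out := append([]NodeInfo(nil), nodes...)
	for i := range out {
		if out[i].Name == self.Name {
			out[i] = self
			return out
		}
	}
	return append(out, self)
}

// upsertSelf writes self into the store, starting from a possibly stale snapshot.
// On conflict it re-reads and reapplies only its own entry. It returns the
// number of attempts used.
func upsertSelf(s *VersionedStore, snap Snapshot, self NodeInfo) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		snap.Nodes = withSelf(snap.Nodes, self)
		err := s.Put(snap)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, ErrConflict) {
			return attempt, err
		}
		snap = s.Get() // refresh and try again on the latest state
	}
	return maxAttempts, ErrConflict
}

// runScenario replays the racy interleaving: both daemons read first, then write.
func runScenario() (Snapshot, int, error) {
	store := &VersionedStore{}
	snapA, snapB := store.Get(), store.Get()

	if _, err := upsertSelf(store, snapA, NodeInfo{Name: "cluster1-mgx-00011", IPAddress: "10.42.3.138"}); err != nil {
		return store.Get(), 0, err
	}
	attempts, err := upsertSelf(store, snapB, NodeInfo{Name: "cluster1-mgx-00017", IPAddress: "10.42.1.100"})
	return store.Get(), attempts, err
}

func main() {
	final, attempts, err := runScenario()
	fmt.Printf("nodes stored: %d, second writer attempts: %d, err: %v\n", len(final.Nodes), attempts, err)
}
```

In this model the second writer's first attempt is rejected, it re-reads the list that already contains the first node, and both entries survive.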

Steps to Reproduce

  1. Deploy with helm:
    helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --namespace nvidia-dra-driver-gpu --create-namespace --set nvidiaDriverRoot=/ --set gpuResourcesEnabledOverride=true --version="25.12.0"

    Disable the IMEXDaemonsWithDNSNames (and ComputeDomainCliques) feature gates.

  2. Deploy a workload that requests a ComputeDomain resource spanning multiple nodes (> 1).
    The issue reproduces with a modified version of imex-channel-injection.yaml (used only for testing) in which the numNodes field is set to 2, and also with the MPI operator workload listed on the validation page.
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 2
  channel:
    allocationMode: All
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0

Expected Behavior

The ComputeDomain moves to the Ready state once the expected number of nodes have joined. Example output when ComputeDomainCliques is enabled:

$ k describe computedomain nvbandwidth-test-compute-domain
Name:         nvbandwidth-test-compute-domain
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  resource.nvidia.com/v1beta1
Kind:         ComputeDomain
Metadata:
  Creation Timestamp:  2026-04-15T22:46:21Z
  Finalizers:
    resource.nvidia.com/computeDomain
  Generation:        1
  Resource Version:  4221096
  UID:               d7245c1a-3bf6-460b-90a3-d8ed8bbf12d4
Spec:
  Channel:
    Allocation Mode:  Single
    Resource Claim Template:
      Name:   nvbandwidth-test-compute-domain-channel
  Num Nodes:  2
Status:
  Nodes:
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       0
    Ip Address:  10.42.1.165
    Name:        cluster1-mgx-00017
    Status:      Ready
    Clique Id:   9f410827-489f-46fd-8203-51266ca08eb7.32766
    Index:       1
    Ip Address:  10.42.0.69
    Name:        cluster1-mgx-00010
    Status:      Ready
  Status:        Ready
Events:          <none>
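
The expected end state can be written down as a small predicate. This is a sketch under a stated assumption, not the driver's actual readiness logic: it assumes a domain counts as converged when at least `numNodes` entries are present and every entry reports `Ready`. `NodeStatus` and `domainReady` are hypothetical names.

```go
package main

import "fmt"

// NodeStatus is a hypothetical, trimmed-down view of a status.nodes entry.
type NodeStatus struct {
	Name   string
	Status string // "Ready" or "NotReady", as shown in the describe output
}

// domainReady is an assumed convergence check: enough node entries are
// present and every one of them is Ready.
func domainReady(numNodes int, nodes []NodeStatus) bool {
	if len(nodes) < numNodes {
		return false // an entry is missing, e.g. lost to a concurrent overwrite
	}
	for _, n := range nodes {
		if n.Status != "Ready" {
			return false
		}
	}
	return true
}

func main() {
	converged := []NodeStatus{
		{Name: "cluster1-mgx-00017", Status: "Ready"},
		{Name: "cluster1-mgx-00010", Status: "Ready"},
	}
	missingPeer := []NodeStatus{
		{Name: "cluster1-mgx-00011", Status: "NotReady"},
	}
	fmt.Println(domainReady(2, converged))   // both entries present and Ready
	fmt.Println(domainReady(2, missingPeer)) // one entry missing
}
```

Under this assumption, a dropped peer entry is enough to keep the domain from converging, regardless of the state of the remaining nodes.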

DRA Driver Version

v25.12.0

Kubernetes Version

v1.35 rke

GPU Model

NVIDIA GB300

NVIDIA Driver Version

No response

OS / Kernel

No response

Container Runtime

No response

Feature Gates (non-default settings)

IMEXDaemonsWithDNSNames (disabled)

Helm Values (non-default)

Relevant Logs

Debug Information Attached

  • kubectl get pods -n dra-driver-nvidia-gpu -o wide
  • kubectl get resourceclaims -n <namespace>
  • kubectl get resourceslices.resource.k8s.io
  • kubectl describe pod <pod> or kubectl events --for pod/<pod>
  • Kubelet plugin logs: kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
  • nvidia-smi output from the host
  • Kubelet logs

IMEX / ComputeDomain Debug Information (if applicable)

  • Host IMEX service disabled: systemctl status nvidia-imex.service (must be masked/disabled)
  • Node clique labels: kubectl get nodes -L nvidia.com/gpu.clique
  • Per-GPU clique info: nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
  • NVLink/NVSwitch topology: nvidia-smi topo -m
  • ComputeDomain status: kubectl get computedomains.resource.nvidia.com -o yaml
  • ComputeDomainClique status: kubectl get computedomaincliques.resource.nvidia.com -o yaml
  • IMEX daemon pods: kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
  • IMEX daemon logs: kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
  • IMEX domain status: nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N
