Maintain CD daemon info in a new per-CD, per-clique ComputeDomainClique object #826

Merged
klueska merged 17 commits into kubernetes-sigs:main from klueska:shard-nodeinfo
Jan 27, 2026

@klueska klueska commented Jan 25, 2026

This PR introduces a new ComputeDomainClique CRD to store daemon information in separate per-clique objects rather than directly in ComputeDomain.Status.Nodes. This architectural change improves scalability by sharding daemon state across multiple objects and enables better ownership tracking through Kubernetes owner references. All changes are hidden behind the ComputeDomainCliques feature gate (default: disabled).
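To make the sharding idea concrete, here is a minimal sketch, not the code from this PR. All type names, field names, and the `<cd>.<clique>` naming scheme below are hypothetical stand-ins chosen for illustration; the real CRD types live in the repository and use the Kubernetes API machinery. The sketch groups per-node daemon records into one object per clique, attaches an owner reference pointing at the parent ComputeDomain, and shows that the shards can be flattened back into a single node list.

```go
package main

import (
	"fmt"
	"sort"
)

// OwnerRef is a simplified, hypothetical stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind string
	Name string
	UID  string
}

// DaemonInfo is a hypothetical per-node daemon record.
type DaemonInfo struct {
	NodeName string
	CliqueID string
	Status   string
}

// Clique is a hypothetical per-CD, per-clique object holding daemon records.
type Clique struct {
	Name    string
	Owner   OwnerRef
	Daemons []DaemonInfo
}

// ShardByClique groups daemon records into one Clique per clique ID.
// Every Clique carries an owner reference to the given ComputeDomain.
func ShardByClique(cdName, cdUID string, daemons []DaemonInfo) map[string]*Clique {
	shards := make(map[string]*Clique)
	for _, d := range daemons {
		c, ok := shards[d.CliqueID]
		if !ok {
			c = &Clique{
				// Assumed naming scheme, for illustration only.
				Name:  fmt.Sprintf("%s.%s", cdName, d.CliqueID),
				Owner: OwnerRef{Kind: "ComputeDomain", Name: cdName, UID: cdUID},
			}
			shards[d.CliqueID] = c
		}
		c.Daemons = append(c.Daemons, d)
	}
	return shards
}

// Aggregate flattens all shards into one list sorted by node name.
func Aggregate(shards map[string]*Clique) []DaemonInfo {
	var all []DaemonInfo
	for _, c := range shards {
		all = append(all, c.Daemons...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].NodeName < all[j].NodeName })
	return all
}

func main() {
	daemons := []DaemonInfo{
		{NodeName: "node-b", CliqueID: "clique-1", Status: "Ready"},
		{NodeName: "node-a", CliqueID: "clique-0", Status: "Ready"},
		{NodeName: "node-c", CliqueID: "clique-1", Status: "NotReady"},
	}
	shards := ShardByClique("my-cd", "uid-1234", daemons)
	for _, d := range Aggregate(shards) {
		fmt.Println(d.NodeName, d.CliqueID, d.Status)
	}
}
```

In this sketch a daemon update touches only the shard for its own clique, which is the scalability argument made above; the owner reference is what would let Kubernetes garbage-collect the shards when the parent object is deleted.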

@klueska klueska force-pushed the shard-nodeinfo branch 7 times, most recently from 966e978 to df7db5f Compare January 25, 2026 20:42
Comment thread cmd/compute-domain-controller/daemonset.go
Comment thread cmd/compute-domain-controller/cdclique.go Outdated
Comment thread cmd/compute-domain-controller/cdclique.go Outdated
@jgehrcke

Connecting dots: we motivate this architectural change in #829 (and show initial measurement results).

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This keeps all CD status updates local to the CD status manager.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
These will also be used by the upcoming CD clique manager.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>

@jgehrcke jgehrcke left a comment

LVGTM

(for anyone following this: this PR was now re-worked by Kevin into a state where the CD object contents appear just as before this patch)

  • I manually tested this second generation of the PR with the performance/scaling test (analogous to what's shown in #829). It still has the same scaling characteristics as the first generation.
  • We discussed the 2nd-gen approach at length, and feel good about it. We agreed that we want to iterate on details from here in separate patches. I skimmed over the code one more time and it looks good.
  • Ran the complete test suite locally; patched another log mismatch (used the opportunity to also improve the log message while having to change things anyway).

Once CI passes, let's merge.

edit: @klueska maybe fix the failing DCO check and then we can merge finally :)

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
klueska and others added 3 commits January 27, 2026 10:57
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

@jgehrcke jgehrcke left a comment

✔️

Full test suite run output:

test_basics.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [214]
 ✓ confirm no kubelet plugin pods running [182]
 ✓ GPU Operator installed [185]
 ✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.12.0-dev [12451]
 ✓ helm list: validate output [244]
 ✓ get crd computedomains.resource.nvidia.com [168]
 ✓ wait for plugin & controller pods READY [765]
 ✓ validate CD controller container image spec [174]
 ✓ SIGUSR2 handler: GPU plugin, CD plugin [752]
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [4441]
 ✓ 2 pod(s), 1 full GPU each [5078]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [4996]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4645]
test_gpu_mig.bats
 ✓ static MIG: allocate (1 cnt) [16977]
 ✓ static MIG: mutual exclusivity with physical GPU [29074]
test_cd_imex_chan_inject.bats
 ✓ IMEX channel injection (single) [14664]
 ✓ IMEX channel injection (all) [8989]
test_cd_mnnvl_workload.bats
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11255]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [13148]
test_cd_misc.bats
 ✓ CD daemon shutdown: confirm CD status cleanup [8557]
 ✓ reject unknown field in opaque cfg in CD chan ResourceClaim [17830]
 ✓ self-initiated unprepare of stale RCs in PrepareStarted [39764]
 ✓ global CD status [9008]
 ✓ IMEX channel injection (featureGates.ComputeDomainClique=true) [22254]
test_cd_logging.bats
 ✓ CD controller/plugin: startup config / detail in logs on level 0 [13742]
 ✓ CD controller: test log verbosity levels [116528]
 ✓ CD daemon: test log verbosity levels [46115]
test_cd_failover.bats
 ✓ CD failover nvb2: force-delete worker pod 0 [49041]
 ✓ CD failover nvb2: force-delete all IMEX daemons [52070]
 ✓ CD failover nvb2: regular-delete worker pod 1 [51127]
test_cd_updowngrade.bats
 ✓ downgrade: current-dev -> last-stable [24267]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [39086]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [155523]

33 tests, 0 failures in 824 seconds

BATS_RUN_TMPDIR: /tmp/k8s-dra-driver-gpu-tests-out-jgehrcke/bats-tests-1769539733/bats-run-F5IzGR
make[1]: Leaving directory '/home/jgehrcke/dev/k8s-dra-driver-gpu'

jgehrcke previously approved these changes Jan 27, 2026

@jgehrcke jgehrcke left a comment

It seems like this needs another approval, or github is slow.


@jgehrcke jgehrcke left a comment

another attempt to approve (and for github to understand that)

@klueska klueska merged commit 08f3d9a into kubernetes-sigs:main Jan 27, 2026
16 checks passed