Maintain CD daemon info in a new per-CD, per-clique ComputeDomainClique object #826
Merged
klueska
commented
Jan 25, 2026
Contributor
Connecting dots: we motivate this architectural change in #829 (and show initial measurement results).
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This keeps all CD status updates local to the CD status manager. Signed-off-by: Kevin Klues <kklues@nvidia.com>
These will also be used by the upcoming CD clique manager. Signed-off-by: Kevin Klues <kklues@nvidia.com>
jgehrcke
approved these changes
Jan 27, 2026
Contributor
LVGTM
(for anyone following this: Kevin has now reworked this PR into a state where the CD object contents appear just as they did before this patch)
- I tested this second generation of the PR manually with the 'performance/scaling testing' (analogous to what's shown in #829). It still has the same scaling characteristics as the first generation.
- We discussed the 2nd-gen approach at length and feel good about it. We agreed that we want to iterate on details from here in separate patches. I skimmed over the code one more time and it looks good.
- Ran the complete test suite locally and patched another log mismatch (I used the opportunity to also improve the log message, since it had to change anyway).
Once CI passes, let's merge.
edit: @klueska maybe fix the failing DCO check and then we can merge finally :)

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
jgehrcke
approved these changes
Jan 27, 2026
Contributor
✔️
Full test suite run output:
test_basics.bats
✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [214]
✓ confirm no kubelet plugin pods running [182]
✓ GPU Operator installed [185]
✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.12.0-dev [12451]
✓ helm list: validate output [244]
✓ get crd computedomains.resource.nvidia.com [168]
✓ wait for plugin & controller pods READY [765]
✓ validate CD controller container image spec [174]
✓ SIGUSR2 handler: GPU plugin, CD plugin [752]
test_gpu_basic.bats
✓ 1 pod(s), 1 full GPU [4441]
✓ 2 pod(s), 1 full GPU each [5078]
✓ 2 pod(s), 1 full GPU (shared, 1 RC) [4996]
✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4645]
test_gpu_mig.bats
✓ static MIG: allocate (1 cnt) [16977]
✓ static MIG: mutual exclusivity with physical GPU [29074]
test_cd_imex_chan_inject.bats
✓ IMEX channel injection (single) [14664]
✓ IMEX channel injection (all) [8989]
test_cd_mnnvl_workload.bats
✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11255]
✓ nvbandwidth (2 nodes, 2 GPUs each) [13148]
test_cd_misc.bats
✓ CD daemon shutdown: confirm CD status cleanup [8557]
✓ reject unknown field in opaque cfg in CD chan ResourceClaim [17830]
✓ self-initiated unprepare of stale RCs in PrepareStarted [39764]
✓ global CD status [9008]
✓ IMEX channel injection (featureGates.ComputeDomainClique=true) [22254]
test_cd_logging.bats
✓ CD controller/plugin: startup config / detail in logs on level 0 [13742]
✓ CD controller: test log verbosity levels [116528]
✓ CD daemon: test log verbosity levels [46115]
test_cd_failover.bats
✓ CD failover nvb2: force-delete worker pod 0 [49041]
✓ CD failover nvb2: force-delete all IMEX daemons [52070]
✓ CD failover nvb2: regular-delete worker pod 1 [51127]
test_cd_updowngrade.bats
✓ downgrade: current-dev -> last-stable [24267]
✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [39086]
test_gpu_stress.bats
✓ Stress: shared ResourceClaim across 15 pods x 5 loops [155523]
33 tests, 0 failures in 824 seconds
BATS_RUN_TMPDIR: /tmp/k8s-dra-driver-gpu-tests-out-jgehrcke/bats-tests-1769539733/bats-run-F5IzGR
make[1]: Leaving directory '/home/jgehrcke/dev/k8s-dra-driver-gpu'
jgehrcke
previously approved these changes
Jan 27, 2026
Contributor
jgehrcke
left a comment
It seems like this needs another approval, or github is slow.
jgehrcke
approved these changes
Jan 27, 2026
Contributor
jgehrcke
left a comment
another attempt to approve (and for github to understand that)
This PR introduces a new ComputeDomainClique CRD to store daemon information in separate per-clique objects rather than directly in ComputeDomain.Status.Nodes. This architectural change improves scalability by sharding daemon state across multiple objects and enables better ownership tracking through Kubernetes owner references. All changes are hidden behind the ComputeDomainCliques feature gate (default: disabled).
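To make the sharding idea concrete, here is a minimal, self-contained Go sketch. It is not the code from this PR: the types `DaemonInfo`, `OwnerRef`, and `CliqueObject`, the function `shardByClique`, and the `<CD UID>.<clique ID>` naming scheme are all hypothetical and chosen only for illustration. It shows a flat list of daemon records being split into one object per clique, each of which carries an owner reference back to its ComputeDomain.

```go
package main

import (
	"fmt"
	"sort"
)

// DaemonInfo is a hypothetical stand-in for one per-node daemon record
// (the kind of entry that would otherwise live in the CD status).
type DaemonInfo struct {
	NodeName string
	CliqueID string
	Status   string
}

// OwnerRef is a simplified, hypothetical imitation of a Kubernetes
// owner reference; the real type has more fields.
type OwnerRef struct {
	Kind string
	Name string
	UID  string
}

// CliqueObject is a hypothetical per-CD, per-clique object holding
// only the daemons that belong to a single clique.
type CliqueObject struct {
	Name    string
	Owner   OwnerRef
	Daemons []DaemonInfo
}

// shardByClique groups daemon records by clique ID and returns one
// object per clique, sorted by clique ID for deterministic output.
// The object naming scheme used here is an assumption.
func shardByClique(cdName, cdUID string, daemons []DaemonInfo) []CliqueObject {
	byClique := map[string][]DaemonInfo{}
	for _, d := range daemons {
		byClique[d.CliqueID] = append(byClique[d.CliqueID], d)
	}

	ids := make([]string, 0, len(byClique))
	for id := range byClique {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	out := make([]CliqueObject, 0, len(ids))
	for _, id := range ids {
		out = append(out, CliqueObject{
			Name:    fmt.Sprintf("%s.%s", cdUID, id),
			Owner:   OwnerRef{Kind: "ComputeDomain", Name: cdName, UID: cdUID},
			Daemons: byClique[id],
		})
	}
	return out
}

func main() {
	daemons := []DaemonInfo{
		{NodeName: "node-0", CliqueID: "clique-a", Status: "Ready"},
		{NodeName: "node-1", CliqueID: "clique-b", Status: "Ready"},
		{NodeName: "node-2", CliqueID: "clique-a", Status: "NotReady"},
	}
	for _, obj := range shardByClique("my-cd", "uid-1234", daemons) {
		fmt.Printf("%s owner=%s/%s daemons=%d\n",
			obj.Name, obj.Owner.Kind, obj.Owner.Name, len(obj.Daemons))
	}
}
```

In this sketch, a daemon that updates its own record only touches the object for its clique, so writers in different cliques do not contend on one shared object. The owner reference lets an object be traced back to the ComputeDomain it belongs to.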