CDClique: SSA per-node status, conflict retry on cleanup, scoped pod selection by shivamerla · Pull Request #1101 · kubernetes-sigs/dra-driver-nvidia-gpu

shivamerla · 2026-05-05T05:28:10Z

What type of PR is this?

/kind robustness

What this PR does / why we need it:

This change adds miscellaneous improvements to address lingering issues causing 409 (conflicts) / 429 (rate limit) errors reported with #816.

Specifically this change adds:

- Per-node status with Server-Side Apply (SSA): each compute-domain-daemon (one per node in the clique) patches only its own node’s daemon status subtree on the CDClique using SSA with a distinct field manager, so concurrent updates from up to ~18 nodes do not overwrite each other’s fields the way a single shared JSON merge patch on the whole object would.
- Cleanup path: wrap the CDClique daemon cleanup logic in client-go’s retry.OnError with retry.DefaultRetry, and reload the latest CDClique from the API on conflict. The object is hot under concurrent SSA writers and the informer can lag, so acting on a stale cached copy caused repeated 409 Conflict responses. Refetching with backoff improves convergence of ComputeDomain status by letting cleanup finish without stalling on a stale resourceVersion.
- CD cleanup: when collecting pods for teardown, restrict the selection to pods associated with the current clique only (avoids unrelated pod churn; orthogonal to the 409/429 mitigation).

Which issue(s) this PR is related to:

Fixes lingering issues reported in #816

Special notes for your reviewer:

Does this PR introduce a user-facing change?

/release-to-note-none

Additional documentation (design docs, usage docs, etc.):

Checklist

[*] make check test passes locally
[*] make check-generate passes if api/ changed (CRDs, deepcopy, informers, listers, clientset)
[*] make check-modules passes if go.mod / go.sum changed
Tests added or updated for the change
Helm chart (deployments/helm) updated if flags, RBAC, or defaults changed

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

k8s-ci-robot · 2026-05-05T05:28:13Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-05-05T05:28:13Z

@shivamerla: The label(s) kind/robustness cannot be applied, because the repository doesn't have them.

netlify · 2026-05-05T05:28:15Z

✅ Deploy Preview for dra-driver-nvidia-gpu ready!

Name	Link
🔨 Latest commit	`e9f1da5`
🔍 Latest deploy log	https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/69f97fed83bfd600086a06ae
😎 Deploy Preview	https://deploy-preview-1101--dra-driver-nvidia-gpu.netlify.app

k8s-ci-robot · 2026-05-05T05:28:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [shivamerla]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shivamerla · 2026-05-05T05:33:16Z

/kind cleanup
/release-note-none

shivamerla · 2026-05-05T05:35:12Z

/retest

k8s-ci-robot · 2026-05-05T06:16:47Z

@shivamerla: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200 | `e9f1da5` | link | false | `/test pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200` |
| pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind | `e9f1da5` | link | false | `/test pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

herb-duan · 2026-05-06T03:26:09Z

@shivamerla

Concern about SSA updates:

Before CDClique was introduced, the compute-domain-daemon had previously attempted to use SSA for CD status updates (when CD status aggregated states from all nodes):

- #822 introduced SSA
- #835 reverted daemon status updates from SSA back to optimistic concurrency

The revert was necessary because SSA converged slower for large-scale CD status updates:

- Conflict displacement: The conflict essentially shifted from compute-domain-daemon → apiserver to apiserver → etcd, causing heavy etcd read/write load
- Unthrottled retries: The massive conflicts and retries between apiserver and etcd could not be constrained by apiserver rate limiting
- High CPU impact: Large-scale CD status SSA updates spiked apiserver CPU significantly

Now with CDClique SSA, while a single CDClique caps at 18 nodes, there are still many CDCliques being updated in parallel. A potential concern is whether the conflict/retry pattern is simply shifting from compute-domain-daemon → apiserver to apiserver → etcd, which could lead to:

- Elevated apiserver CPU usage
- Excessive etcd reads/writes due to continuous conflict retries between apiserver and etcd

shivamerla · 2026-05-06T14:02:03Z

@herb-duan Thanks for raising this — the #822 / #835 history is exactly why we’re being careful with SSA here.

I think the important difference is the scale and shape of the object being updated.

Back then, a single ComputeDomain status aggregated state for thousands of nodes (~2k–5k) into one hot object. Even with SSA, we still ended up with large whole-object churn at the apiserver/etcd layer: many writers updating overlapping parts of a huge status surface + managedFields state. Conflicts didn’t really localize, retries amplified load, and we saw the apiserver ↔ etcd CPU impact you mentioned.

CDClique is a pretty different scenario:

- A CDClique only carries status for ~18 nodes max, so concurrent SSA writers per object are bounded to a small constant instead of thousands targeting one CD.
- Contention is sharded across many small objects instead of concentrated into one giant status blob.
- Each daemon only patches its own node subtree with its own field manager, so we get isolation at the node level rather than many actors stomping over the same surface.

That’s the main reason SSA makes more sense here than it did for monolithic CD status. In the old model, SSA didn’t buy enough conflict avoidance relative to the cost. Here the goal is narrower: avoid cross-node overwrites from merge-patch semantics while keeping ownership isolated to small per-node slices in a bounded-size object.

We’re still treating these as hot objects though. Cleanup paths refetch and use bounded retry-on-conflict specifically to avoid spinning on stale informer state or creating unbounded retry loops.

So I agree with the #835 conclusion for cluster-scale monolithic CD status. The CDClique design is intentionally trying to avoid that exact failure mode by bounding object size, bounding writers, and spreading contention across many small objects instead of one giant hot object.

If we still see sustained apiserver/etcd pressure in practice, we can definitely revisit with additional backoff/rate limiting/tuning, but I don’t think this lands in the same operational regime as the original CD status issues.

sakura-3 · 2026-05-07T02:46:17Z

+		"nodeName":  myDaemon.NodeName,
+		"ipAddress": myDaemon.IPAddress,
+		"cliqueID":  myDaemon.CliqueID,
+		"index":     myDaemon.Index,


How is the global uniqueness of (cliqueID, index) guaranteed here? For example, due to race conditions, two daemons may hold the same CDC object and each independently find the same available index. In the SSA (Server-Side Apply) scenario, the patches from these two daemons would not conflict with each other, resulting in the final CDC object containing duplicate (cliqueID, index) tuples.
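
The race described above can be reproduced in a toy model. This is only a sketch under stated assumptions: `cliqueState`, `applyRow`, `lowestFreeIndex`, and `duplicateIndices` are hypothetical names, and `applyRow` imitates just one property of SSA on a list keyed by nodeName, namely that each writer's row is merged independently with no resourceVersion precondition.

```go
package main

import "sort"

// daemonRow is a toy version of one per-node entry in the CDClique.
type daemonRow struct {
	NodeName string
	Index    int
}

// cliqueState models the daemon list as a map keyed by nodeName, which is
// how a list-as-map merge behaves from the writer's point of view.
type cliqueState map[string]daemonRow

// lowestFreeIndex returns the smallest index in [0, max) unused in snapshot,
// or -1 if every index is taken.
func lowestFreeIndex(snapshot cliqueState, max int) int {
	used := make(map[int]bool, len(snapshot))
	for _, r := range snapshot {
		used[r.Index] = true
	}
	for i := 0; i < max; i++ {
		if !used[i] {
			return i
		}
	}
	return -1
}

// applyRow merges one writer's row without any version check.
func applyRow(state cliqueState, row daemonRow) {
	state[row.NodeName] = row
}

// duplicateIndices returns, in ascending order, the indices claimed by more
// than one node.
func duplicateIndices(state cliqueState) []int {
	count := make(map[int]int)
	for _, r := range state {
		count[r.Index]++
	}
	var dups []int
	for idx, n := range count {
		if n > 1 {
			dups = append(dups, idx)
		}
	}
	sort.Ints(dups)
	return dups
}
```

If two daemons both compute their index from the same empty snapshot and then apply their rows, both rows are accepted and `duplicateIndices` reports index 0.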

@sakura-3 Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache. With a single large Update (whole daemons slice / whole object) guarded by resourceVersion, concurrent writers often hit 409 conflicts because they replace the same blob. Retries then recompute from the latest object, so in practice things may converge without duplicate indices.

With SSA + per-node field managers + list-as-map keyed by nodeName, the two patches no longer conflict at the merge layer, so the more likely outcome is two distinct daemon rows carrying the same index. So I agree this issue could become more visible with SSA.

One thing we could do is, after a successful SSA apply (or during the next sync), fetch the latest object and check whether another row already holds the same index (same index, different nodeName). If so, recompute getNextAvailableIndex() from the full daemon list and re-patch only the current row.
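
A minimal sketch of that post-apply check, assuming a flat list of rows; `daemon` and `resolveClash` are hypothetical names and this is not code from the PR:

```go
package main

// daemon is a toy version of one row in the CDClique daemon list.
type daemon struct {
	NodeName string
	Index    int
}

// resolveClash returns the index that self should hold after inspecting the
// full list: its current index if no other node uses it, otherwise the
// smallest index in [0, max) that no other node uses (-1 if none is free).
func resolveClash(daemons []daemon, self string, max int) int {
	current := -1
	usedByOthers := make(map[int]bool)
	for _, d := range daemons {
		if d.NodeName == self {
			current = d.Index
		} else {
			usedByOthers[d.Index] = true
		}
	}
	if current >= 0 && !usedByOthers[current] {
		return current
	}
	for i := 0; i < max; i++ {
		if !usedByOthers[i] {
			return i
		}
	}
	return -1
}
```

For rows `{node-a, 0}`, `{node-b, 0}`, `{node-c, 1}`, node-c keeps index 1, while node-a and node-b each compute index 2 if they run the check against the same list. The check detects the clash but does not by itself stop both parties from moving to the same new index.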

@shivamerla I don't think re-patching is a good idea, for two reasons.

Currently, getNextAvailableIndex() scans the existing CDC state and returns the smallest unused index. This can lead to a worst-case infinite retry loop. For example, when the node list is initially empty, all daemons independently select index 0; upon detecting the conflict, they all fall back to index 1, and so on. We would need to revise getNextAvailableIndex() to mitigate this, for instance by having each daemon randomly pick an unused index from the range [0, 17] instead of always choosing the minimum.

However, even with the optimized approach, mathematically speaking, with 18 daemons contending, it is expected to take more than 5 rounds to reach convergence. I don't see a clear advantage over the non-SSA approach — the only difference is that the conflict detector shifts from the API server to the daemon itself.

Is there any special requirement for the index field? If not, we could consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.

> consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.

This could in theory work, and I was specifically also thinking about constructing some kind of hash function with node IP and/or node name as input (and talked it through with @klueska back in January). We rejected this because of the relative complexity, potential oversights, and mainly because we wanted to retain a simple method that ensures a dense set of indices with predictable re-usage. Achieving that with a hash function seemed mathematically challenging.
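
To make the difficulty concrete, here is a small sketch of the naive variant (hash modulo slot count). It is an illustration only: `hashIndex` and `collisions` are hypothetical names, and FNV-1a is an arbitrary choice, not something proposed in this thread.

```go
package main

import "hash/fnv"

// hashIndex maps a node name to a slot in [0, slots) using FNV-1a.
func hashIndex(nodeName string, slots int) int {
	h := fnv.New32a()
	h.Write([]byte(nodeName))
	return int(h.Sum32() % uint32(slots))
}

// collisions groups node names by slot and returns only the slots that are
// claimed by two or more names.
func collisions(nodeNames []string, slots int) map[int][]string {
	bySlot := make(map[int][]string)
	for _, n := range nodeNames {
		s := hashIndex(n, slots)
		bySlot[s] = append(bySlot[s], n)
	}
	for s, names := range bySlot {
		if len(names) < 2 {
			delete(bySlot, s)
		}
	}
	return bySlot
}
```

The assignment is deterministic, but nothing makes it unique or dense: for a roughly uniform hash, 18 names landing in 18 distinct slots is very unlikely, so `collisions` will usually be non-empty for a full clique and some slots stay unused.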

klueska · 2026-05-07T06:43:00Z

> Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache.

This race most definitely does not exist today. The conflicts on writing to the API server force a recompute, ensuring once a write succeeds, a unique index is chosen.
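
A toy model of that argument, as a sketch only: `apiServer`, `cliqueObj`, `entry`, and `claimIndex` are hypothetical names, and a resourceVersion comparison stands in for the API server's 409. Two writers start from the same snapshot; the second write is rejected, which forces a recompute from the latest object.

```go
package main

import "errors"

// errConflict stands in for a 409 Conflict response (hypothetical).
var errConflict = errors.New("409 conflict")

type entry struct {
	NodeName string
	Index    int
}

type cliqueObj struct {
	ResourceVersion int
	Daemons         []entry
}

// apiServer is a toy store that enforces resourceVersion on update.
type apiServer struct{ cur cliqueObj }

func (a *apiServer) get() cliqueObj {
	o := a.cur
	o.Daemons = append([]entry(nil), a.cur.Daemons...)
	return o
}

func (a *apiServer) update(o cliqueObj) error {
	if o.ResourceVersion != a.cur.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	a.cur = o
	return nil
}

// lowestFree returns the smallest unused index in [0, max), or -1.
func lowestFree(daemons []entry, max int) int {
	used := make(map[int]bool, len(daemons))
	for _, d := range daemons {
		used[d.Index] = true
	}
	for i := 0; i < max; i++ {
		if !used[i] {
			return i
		}
	}
	return -1
}

// claimIndex writes a row for node based on snap. A rejected write triggers
// a fresh read and a recompute, so an accepted write always carries an index
// that was free in the version it was based on.
func claimIndex(a *apiServer, snap cliqueObj, node string, max int) (index, attempts int) {
	for {
		attempts++
		index = lowestFree(snap.Daemons, max)
		if index < 0 {
			return -1, attempts
		}
		next := cliqueObj{
			ResourceVersion: snap.ResourceVersion,
			Daemons:         append(append([]entry(nil), snap.Daemons...), entry{NodeName: node, Index: index}),
		}
		if err := a.update(next); err == nil {
			return index, attempts
		}
		snap = a.get()
	}
}
```

Starting both claims from the same empty snapshot, the first writer gets index 0 in one attempt and the second gets index 1 after one rejected write.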

> Is there any special requirement for the index field?

Yes, they must be exactly from the range 0-17, so that all nodes agree on them in the predetermined nodes_config.cfg file.

I had a never-merged PR that could pre-calculate the index here, but we decided not to go with it because its performance combined with SSA proved to be worse than just allowing conflicts:
#824

shivamerla · 2026-05-07T07:40:42Z

> > Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache.
>
> This race most definitely does not exist today. The conflicts on writing to the API server force a recompute, ensuring once a write succeeds, a unique index is chosen.
>
> > Is there any special requirement for the index field?
>
> Yes, they must be exactly from the range 0-17, so that all nodes agree on them in the predetermined nodes_config.cfg file.
>
> I had a never-merged PR that could pre-calculate the index here, but we decided not to go with it because its performance combined with SSA proved to be worse than just allowing conflicts: #824

What if we shift index allocation to the compute-domain-controller? Each daemon can continue using SSA for its own row (nodeName, ipAddress, status), but stop assigning index locally (getNextAvailableIndex). The controller, as a single writer with the full clique view, computes and assigns unique indices centrally. Daemons then just consume/update their row with the controller-assigned index. This removes cross-daemon coordination while keeping stable DNS slots (0..17).
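
A rough sketch of what such central assignment could look like, assuming the controller sees the full node list for a clique; `assignIndices` is a hypothetical helper, not existing controller code:

```go
package main

import "sort"

// assignIndices keeps every valid, non-conflicting existing index and gives
// each remaining node the lowest free index in [0, max). Nodes are processed
// in sorted name order so the result is deterministic. Nodes that do not fit
// are left out of the result.
func assignIndices(nodes []string, existing map[string]int, max int) map[string]int {
	sorted := append([]string(nil), nodes...)
	sort.Strings(sorted)

	result := make(map[string]int, len(sorted))
	used := make(map[int]bool, max)

	// Pass 1: preserve stable slots for nodes that already hold one.
	for _, n := range sorted {
		if idx, ok := existing[n]; ok && idx >= 0 && idx < max && !used[idx] {
			result[n] = idx
			used[idx] = true
		}
	}

	// Pass 2: fill the gaps with the lowest free indices.
	next := 0
	for _, n := range sorted {
		if _, done := result[n]; done {
			continue
		}
		for next < max && used[next] {
			next++
		}
		if next == max {
			break
		}
		result[n] = next
		used[next] = true
	}
	return result
}
```

Because a single writer computes the whole mapping, the result is dense and free of duplicates by construction. The cost is an extra write path from the controller to each CDClique.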

klueska · 2026-05-07T07:52:40Z

That is what we attempted in #810, though this was before we introduced the ComputeDomainClique object (which proved to be the real winner, so we dropped this). I believe the issue we ran into there was API server throttling, but now that it's spread across lots of objects, maybe that wouldn't be a problem.

That said, I'm still skeptical that anything we come up with will be faster than just allowing the conflicts on such a small number of nodes.

One concrete thing I would do immediately (to reduce the number of conflicts) is what @sakura-3 suggested in one of his comments. That is, compute a random index from any available ones not yet selected, so that everyone isn't constantly competing on the same ones.

jgehrcke · 2026-05-07T14:21:19Z

Most of this discussion is about the systematic comparison of these two approaches:

1. informer-driven, resourceVersion-based conflict resolution (the classical approach)
2. SSA-based conflict resolution

Here is what we concluded in #822 (note that we also explored randomly assigning DNS indices):

> We noticed the same problems over the weekend. We fixed them (by adding jitter and more back-off, and by getting a random available DNS index). Notably, even with these changes in place, the SSA approach was still not great.

The overall conclusion from the work in #829 and #822 was:

> For a given number of conflicts C, the classical conflict resolution approach (1) always has better performance characteristics than SSA (2).

This is the reason we backed out of anything-SSA for performance improvements. It may very well be that we didn't properly write down all insights of that effort. I'll try to summarize what I remember our conclusions were:

SSA can only ever shine when different racers can independently update fragments of the same object. But if these updates are not independent, and have to be reconciled somewhere, then SSA worsens things by making conflict detection more costly. This may be easy to miss.

The intra-clique indices are as of today clearly not independent.

SSA in particular cannot magically resolve any business logic conflict by itself in the API server. The following example may make that obvious: if two racers both want to set index 7, then

- with SSA, the API server does costly serialization work and ultimately allows both updates, emitting one good response payload and one bad response payload (the problem needs to be detected client-side via response payload inspection).
- in the classical method, one of the two updates is rejected with a 409 response right away, and that rejection is cheap for the API server.

@sakura-3 alluded to that above by saying

> the patches from these two daemons would not conflict with each other, resulting in the final CDC object containing duplicate (cliqueID, index) tuples

We really learned that SSA is not a tool for improving performance when one has to coordinate conflicting updates. The opposite is true: introducing SSA yields various new challenges, as @herb-duan alluded to (mainly, additional stress on the API server, which now needs to serialize requests and may still return a response reflecting bad state). When I measured system properties, I also kept track of CPU time consumed by the API server as a metric to quantify that. API server CPU utilization was dramatically higher when using SSA.

@sakura-3 you said above that

> mathematically speaking, with 18 daemons contending, it is expected to take more than 5 rounds to reach convergence.

This is a productive point of view, because it's about the fundamental conflict resolution work that is to be done either way. You continue with

> I don't see a clear advantage over the non-SSA approach

I very much agree -- that's really the same core insight: "For a given number of conflicts C, ...."

Small correction about this:

> the only difference is that the conflict detector shifts from the API server to the daemon itself.

The true conflict detector is in the daemon even with SSA (detecting the same index being used more than once; as you've yourself said).

In general, @sakura-3 I think you're spot-on and our views are aligned.

@shivamerla I saw you said

> I think the important difference is the scale and shape of the object being updated.

What I just tried to explain about SSA is generally true for small objects and large objects. Of course, after all, only proper measurements of the key optimization target (CD convergence time) should guide us.

Now, quick input about the more trivial / more obvious part: reducing the number of conflicts to be resolved (per object) is a conceptual improvement that matters independently of the conflict resolution method used.

This is great, and we actually tried it @klueska @sakura-3 @shivamerla:

> compute a random index from any available ones not yet selected

See this commit. It introduces getRandomAvailableIndex(). We didn't commit this to main at the time, to keep things separate and because for small N the performance impact was almost negligible.

Btw, the random index selection was part of the measurement series shown as a green line in this plot.

Front-loading index determination (completely eliminating index-related conflict resolution in the hot path) or significantly reducing index conflicts is I think a viable path forward.

I also want to briefly say something about

> lingering issues causing 409 (conflicts)

Informer-driven (resourceVersion-based) reconciliation implies seeing 409 responses. Seeing these responses is not generally an issue. Also, the observed rate / count of these responses is not generally a useful performance metric.

shivamerla added 4 commits April 30, 2026 05:45

Avoid cleanup of stale daemons from other cliques

dec1d68

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

Fix version conflicts during CD status updates

61fe1d3

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

Use SSA for patching CDClique object with individual node daemon stat…

0f15809

…uses Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

Use retryOnConflict during CDClique cleanup logic and avoid informer …

e9f1da5

…cache for stale data Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

github-project-automation Bot added this to DRA Driver for NVIDIA GPUs May 5, 2026

github-project-automation Bot moved this to Backlog in DRA Driver for NVIDIA GPUs May 5, 2026

k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label May 5, 2026

k8s-ci-robot requested review from dims and varunrsekar May 5, 2026 05:28

shivamerla self-assigned this May 5, 2026

k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 5, 2026

shivamerla marked this pull request as draft May 5, 2026 05:34

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026

sakura-3 reviewed May 7, 2026

sakura-3 mentioned this pull request May 7, 2026

[Feature]: Centrally assign indices to daemons and update the Compute Domain Clique state. #1107

Open
