CDClique: SSA per-node status, conflict retry on cleanup, scoped pod selection#1101

Draft
shivamerla wants to merge 4 commits into
kubernetes-sigs:mainfrom
shivamerla:bd_sync_improvements

@shivamerla
Contributor

@shivamerla shivamerla commented May 5, 2026

What type of PR is this?

/kind robustness

What this PR does / why we need it:

This change adds miscellaneous improvements to address lingering issues causing 409 (conflicts) / 429 (rate limit) errors reported with #816.

Specifically this change adds:

  • Per-node status with Server-Side Apply (SSA): each compute-domain-daemon (one per node in the clique) patches only its own node’s daemon status subtree on the CDClique using SSA with a distinct field manager, so concurrent updates from up to ~18 nodes do not overwrite each other’s fields the way a single shared JSON merge patch on the whole object would.
  • Cleanup path: wrap the CDClique daemon cleanup logic in client-go’s retry.OnError with retry.DefaultRetry, and reload the latest CDClique from the API on conflict. The object is hot under concurrent SSA writers and the informer can lag, so acting on a stale cached copy caused repeated 409 Conflict errors. Refetch plus backoff improves convergence of ComputeDomain status by letting cleanup finish without stalling on a stale resourceVersion.
  • CD cleanup: when collecting pods for teardown, restrict to pods associated with the current clique only (avoids unrelated pod churn; orthogonal to 409/429 mitigation).
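
The refetch-on-conflict idea in the cleanup bullet can be sketched with the standard library alone. This is a simulation under stated assumptions, not the code in this PR: `store`, `clique`, and `removeDaemon` are hypothetical stand-ins for the API server, the CDClique object, and the cleanup routine, and the doubling delay is an illustrative backoff rather than the parameters of client-go's retry.DefaultRetry.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for an HTTP 409 returned by the API server.
var errConflict = errors.New("409 conflict: stale resourceVersion")

// clique is a hypothetical, trimmed-down stand-in for a CDClique object.
type clique struct {
	ResourceVersion int
	Daemons         []string
}

// store simulates the API server: Update succeeds only when the caller's
// resourceVersion matches the stored one.
type store struct{ obj clique }

// Get returns a copy of the latest stored object.
func (s *store) Get() clique {
	c := s.obj
	c.Daemons = append([]string(nil), s.obj.Daemons...)
	return c
}

// Update applies optimistic concurrency control on resourceVersion.
func (s *store) Update(c clique) error {
	if c.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	c.ResourceVersion++
	s.obj = c
	return nil
}

// removeDaemon first tries with the (possibly stale) cached copy. On a
// conflict it waits, reloads the latest object from the store, and retries,
// for at most `steps` attempts. It returns the number of attempts used.
func removeDaemon(s *store, cached clique, node string, steps int, base time.Duration) (int, error) {
	cur, delay := cached, base
	for attempt := 1; attempt <= steps; attempt++ {
		next := clique{ResourceVersion: cur.ResourceVersion}
		for _, d := range cur.Daemons {
			if d != node {
				next.Daemons = append(next.Daemons, d)
			}
		}
		err := s.Update(next)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errConflict) {
			return attempt, err
		}
		time.Sleep(delay)
		delay *= 2
		cur = s.Get() // refetch instead of reusing the stale copy
	}
	return steps, errConflict
}

func main() {
	s := &store{obj: clique{ResourceVersion: 1, Daemons: []string{"node-a", "node-b"}}}
	cached := s.Get() // what a lagging informer cache might still hold

	// Another writer updates the object after the cache was filled.
	other := s.Get()
	other.Daemons = append(other.Daemons, "node-c")
	_ = s.Update(other)

	attempts, err := removeDaemon(s, cached, "node-a", 4, 10*time.Millisecond)
	fmt.Println(attempts, err, s.Get().Daemons)
}
```

The first attempt fails because the cached copy carries an outdated resourceVersion; the second attempt works on the refetched object and therefore also preserves the row added by the other writer.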

Which issue(s) this PR is related to:

Fixes lingering issues reported in #816

Special notes for your reviewer:

Does this PR introduce a user-facing change?

/release-to-note-none

Additional documentation (design docs, usage docs, etc.):

Checklist

  • [*] make check test passes locally
  • [*] make check-generate passes if api/ changed (CRDs, deepcopy, informers, listers, clientset)
  • [*] make check-modules passes if go.mod / go.sum changed
  • Tests added or updated for the change
  • Helm chart (deployments/helm) updated if flags, RBAC, or defaults changed

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
…uses

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
…cache for stale data

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

@k8s-ci-robot
Contributor

@shivamerla: The label(s) kind/robustness cannot be applied, because the repository doesn't have them.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label May 5, 2026
@netlify

netlify Bot commented May 5, 2026

Deploy Preview for dra-driver-nvidia-gpu ready!

Name Link
🔨 Latest commit e9f1da5
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/69f97fed83bfd600086a06ae
😎 Deploy Preview https://deploy-preview-1101--dra-driver-nvidia-gpu.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shivamerla

@k8s-ci-robot k8s-ci-robot requested review from dims and varunrsekar May 5, 2026 05:28
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 5, 2026
@shivamerla shivamerla self-assigned this May 5, 2026
@shivamerla
Contributor Author

shivamerla commented May 5, 2026

/kind cleanup
/release-note-none

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 5, 2026
@shivamerla shivamerla marked this pull request as draft May 5, 2026 05:34
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@shivamerla
Contributor Author

/retest

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@shivamerla: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200 e9f1da5 link false /test pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200
pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind e9f1da5 link false /test pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind

@herb-duan
Contributor

herb-duan commented May 6, 2026

@shivamerla

Concern about SSA updates:

Before CDClique was introduced, the compute-domain-daemon had previously attempted to use SSA for CD status updates (when CD status aggregated states from all nodes):

  • #822 introduced SSA
  • #835 reverted daemon status updates from SSA back to optimistic concurrency

The revert was necessary because SSA converged slower for large-scale CD status updates:

  1. Conflict displacement: The conflict essentially shifted from compute-domain-daemon → apiserver to apiserver → etcd, causing heavy etcd read/write load
  2. Unthrottled retries: The massive conflicts and retries between apiserver and etcd could not be constrained by apiserver rate limiting
  3. High CPU impact: Large-scale CD status SSA updates spiked apiserver CPU significantly

Now with CDClique SSA, while a single CDClique caps at 18 nodes, there are still many parallel CDCliques running concurrently. A potential consideration is whether the conflict/retry pattern is simply shifting from compute-domain-daemon → apiserver to apiserver → etcd, which could lead to:

  • Elevated apiserver CPU usage
  • Excessive etcd read/writes due to continuous conflict retries between apiserver and etcd

@shivamerla
Contributor Author

@herb-duan Thanks for raising this — the #822 / #835 history is exactly why we’re being careful with SSA here.

I think the important difference is the scale and shape of the object being updated.

Back then, a single ComputeDomain status aggregated state for thousands of nodes (~2k–5k) into one hot object. Even with SSA, we still ended up with large whole-object churn at the apiserver/etcd layer: many writers updating overlapping parts of a huge status surface + managedFields state. Conflicts didn’t really localize, retries amplified load, and we saw the apiserver ↔ etcd CPU impact you mentioned.

CDClique is a pretty different scenario:

  • A CDClique only carries status for ~18 nodes max, so concurrent SSA writers per object are bounded to a small constant instead of thousands targeting one CD.

  • Contention is sharded across many small objects instead of concentrated into one giant status blob.

  • Each daemon only patches its own node subtree with its own field manager, so we get isolation at the node level rather than many actors stomping over the same surface.

That’s the main reason SSA makes more sense here than it did for monolithic CD status. In the old model, SSA didn’t buy enough conflict avoidance relative to the cost. Here the goal is narrower: avoid cross-node overwrites from merge-patch semantics while keeping ownership isolated to small per-node slices in a bounded-size object.
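
As a rough illustration of the difference between replacing the whole daemon list and merging one row keyed by nodeName, here is a small sketch. It is a toy model under assumptions: `daemonRow`, `replaceAll`, and `applyRow` are hypothetical names, and the real merge is done by the API server from the CRD's list-map schema, not by client code like this.

```go
package main

import "fmt"

// daemonRow is a hypothetical per-node status entry.
type daemonRow struct {
	NodeName string
	Status   string
}

// replaceAll mimics a patch that sends the writer's whole view of the list
// (JSON merge patch replaces arrays wholesale): rows the writer did not
// know about are dropped.
func replaceAll(_ []daemonRow, writerView []daemonRow) []daemonRow {
	return append([]daemonRow(nil), writerView...)
}

// applyRow mimics a keyed merge (list-as-map keyed by nodeName): only the
// row with the matching key is created or overwritten.
func applyRow(current []daemonRow, row daemonRow) []daemonRow {
	out := append([]daemonRow(nil), current...)
	for i := range out {
		if out[i].NodeName == row.NodeName {
			out[i] = row
			return out
		}
	}
	return append(out, row)
}

func main() {
	// Both writers start from the same empty view of the list.
	var list []daemonRow
	list = replaceAll(list, []daemonRow{{NodeName: "node-a", Status: "Ready"}})
	list = replaceAll(list, []daemonRow{{NodeName: "node-b", Status: "Ready"}})
	fmt.Println("whole-list replace:", list) // node-a's row is gone

	var keyed []daemonRow
	keyed = applyRow(keyed, daemonRow{NodeName: "node-a", Status: "Ready"})
	keyed = applyRow(keyed, daemonRow{NodeName: "node-b", Status: "Ready"})
	fmt.Println("keyed per-node merge:", keyed) // both rows survive
}
```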

We’re still treating these as hot objects though. Cleanup paths refetch and use bounded retry-on-conflict specifically to avoid spinning on stale informer state or creating unbounded retry loops.

So I agree with the #835 conclusion for cluster-scale monolithic CD status. The CDClique design is intentionally trying to avoid that exact failure mode by bounding object size, bounding writers, and spreading contention across many small objects instead of one giant hot object.

If we still see sustained apiserver/etcd pressure in practice, we can definitely revisit with additional backoff/rate limiting/tuning, but I don’t think this lands in the same operational regime as the original CD status issues.

"nodeName": myDaemon.NodeName,
"ipAddress": myDaemon.IPAddress,
"cliqueID": myDaemon.CliqueID,
"index": myDaemon.Index,

How is the global uniqueness of (cliqueID, index) guaranteed here? For example, due to race conditions, two daemons may hold the same CDC object and each independently find the same available index. In the SSA (Server-Side Apply) scenario, the patches from these two daemons would not conflict with each other, resulting in the final CDC object containing duplicate (cliqueID, index) tuples.

Contributor Author

@sakura-3 Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache. With a single large Update (whole daemons slice / whole object) guarded by resourceVersion, concurrent writers often hit 409 conflicts because they replace the same blob. Retries then recompute from the latest object, so in practice things may converge without duplicate indices.

With SSA + per-node field managers + list-as-map keyed by nodeName, the two patches no longer conflict at the merge layer, so the more likely outcome is two distinct daemon rows carrying the same index. So I agree this issue could become more visible with SSA.

One thing we could do is, after a successful SSA apply (or during the next sync), fetch the latest object and check whether another row already holds the same index (same index, different nodeName). If so, recompute getNextAvailableIndex() from the full daemon list and re-patch only the current row.
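
A minimal sketch of that check-and-re-patch step, assuming hypothetical names throughout: `daemon`, `smallestFreeIndex`, `indexClash`, and `repatchOwnRow` are illustrative and do not reflect the signature of the real getNextAvailableIndex(). Note that if both clashing rows run this at the same moment they can pick the same replacement again, so it does not by itself guarantee convergence.

```go
package main

import "fmt"

// daemon is a hypothetical, trimmed-down daemon row.
type daemon struct {
	NodeName string
	Index    int
}

// smallestFreeIndex returns the lowest index in [0, slots) that no row other
// than self uses, or -1 if every slot is taken.
func smallestFreeIndex(rows []daemon, self string, slots int) int {
	used := make(map[int]bool)
	for _, r := range rows {
		if r.NodeName != self {
			used[r.Index] = true
		}
	}
	for i := 0; i < slots; i++ {
		if !used[i] {
			return i
		}
	}
	return -1
}

// indexClash reports whether another row carries the same index as self.
func indexClash(rows []daemon, self string) bool {
	for _, mine := range rows {
		if mine.NodeName != self {
			continue
		}
		for _, r := range rows {
			if r.NodeName != self && r.Index == mine.Index {
				return true
			}
		}
	}
	return false
}

// repatchOwnRow returns the row self should re-apply and whether a re-patch
// is needed at all. Only self's row is ever changed.
func repatchOwnRow(rows []daemon, self string, slots int) (daemon, bool) {
	if !indexClash(rows, self) {
		return daemon{}, false
	}
	return daemon{NodeName: self, Index: smallestFreeIndex(rows, self, slots)}, true
}

func main() {
	// State after two daemons both applied index 0 without a merge conflict.
	rows := []daemon{{"node-a", 0}, {"node-b", 0}, {"node-c", 1}}
	fmt.Println(repatchOwnRow(rows, "node-b", 18))
	fmt.Println(repatchOwnRow(rows, "node-c", 18))
}
```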


@shivamerla I don't think re-patching is a good idea, for two reasons.

  1. Currently, the logic of getNextAvailableIndex() scans the existing CDC state and returns the smallest unused index in ascending order. This approach can lead to a worst-case infinite retry loop. For example, when the nodes list is initially empty, all daemons would independently select index 0; upon detecting the conflict, they would all fall back to index 1, and so on. We need to revise the implementation of getNextAvailableIndex() to mitigate this issue — for instance, by having each daemon randomly pick an unused index from the range [0, 18] instead of always choosing the minimum.

  2. However, even with the optimized approach, mathematically speaking, with 18 daemons contending, it is expected to take more than 5 rounds to reach convergence. I don't see a clear advantage over the non-SSA approach — the only difference is that the conflict detector shifts from the API server to the daemon itself.
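
To get a feel for the two points above, the contention can be simulated. The sketch below is a toy model under assumptions that are not taken from the driver: in every round all still-unassigned daemons pick from the same snapshot of free indices, a daemon keeps its pick only if nobody else chose it in that round, and all colliding daemons retry. `simulate`, `pickMin`, `pickRandom`, and `newRNG` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// picker chooses one index from the (ascending, non-empty) list of free ones.
type picker func(free []int, rng *rand.Rand) int

// pickMin always takes the smallest free index.
func pickMin(free []int, _ *rand.Rand) int { return free[0] }

// pickRandom takes a uniformly random free index.
func pickRandom(free []int, rng *rand.Rand) int { return free[rng.Intn(len(free))] }

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// simulate runs rounds until every one of the n daemons holds a unique index
// in [0, n) or maxRounds is reached. It returns the assignment, the number
// of rounds used, and whether the system converged.
func simulate(n, maxRounds int, pick picker, rng *rand.Rand) (assigned []int, rounds int, ok bool) {
	assigned = make([]int, n)
	for i := range assigned {
		assigned[i] = -1
	}
	for rounds = 1; rounds <= maxRounds; rounds++ {
		taken := make([]bool, n)
		for _, idx := range assigned {
			if idx >= 0 {
				taken[idx] = true
			}
		}
		var free []int
		for idx := 0; idx < n; idx++ {
			if !taken[idx] {
				free = append(free, idx)
			}
		}
		picks := make(map[int][]int) // index -> daemons that chose it
		for d := range assigned {
			if assigned[d] < 0 {
				idx := pick(free, rng)
				picks[idx] = append(picks[idx], d)
			}
		}
		for idx, ds := range picks {
			if len(ds) == 1 { // unique pick: keep it; colliders retry
				assigned[ds[0]] = idx
			}
		}
		done := true
		for _, idx := range assigned {
			if idx < 0 {
				done = false
			}
		}
		if done {
			return assigned, rounds, true
		}
	}
	return assigned, maxRounds, false
}

func main() {
	_, r, ok := simulate(18, 50, pickMin, nil)
	fmt.Println("min strategy:", r, ok)
	_, r, ok = simulate(18, 1000, pickRandom, newRNG(1))
	fmt.Println("random strategy:", r, ok)
}
```

Under this model the smallest-index strategy never converges for more than one daemon, because every retry lands on the same index again. The random strategy does converge; how many rounds it needs depends on the seed and on the collision model chosen here.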


Is there any special requirement for the index field? If not, we could consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.

Contributor

@jgehrcke jgehrcke May 7, 2026

consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.

This could in theory work, and I was specifically also thinking about constructing some kind of hash function with node IP and/or node name as input (and talked it through with @klueska back in January). We rejected this because of the relative complexity, potential oversights, and mainly because we wanted to retain a simple method that ensures a dense set of indices with predictable re-usage. Achieving that with a hash function seemed mathematically challenging.
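
The difficulty can be made concrete with a naive version of the idea. The sketch below is only an illustration with an arbitrarily chosen hash (FNV-1a) and hypothetical helpers `hashIndex` and `collisions`; it is not a proposal from this thread. Reducing a hash modulo the slot count is deterministic, but nothing keeps two node names from landing on the same slot, and a small clique does not end up with a dense set of indices starting at 0.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashIndex maps a node name to a slot in [0, slots) using FNV-1a.
func hashIndex(nodeName string, slots int) int {
	h := fnv.New32a()
	h.Write([]byte(nodeName)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(slots))
}

// collisions returns only those slots that more than one node hashes to.
func collisions(nodes []string, slots int) map[int][]string {
	bySlot := make(map[int][]string)
	for _, n := range nodes {
		i := hashIndex(n, slots)
		bySlot[i] = append(bySlot[i], n)
	}
	for i, ns := range bySlot {
		if len(ns) < 2 {
			delete(bySlot, i)
		}
	}
	return bySlot
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e", "node-f"}
	for _, n := range nodes {
		fmt.Println(n, "->", hashIndex(n, 18))
	}
	fmt.Println("colliding slots:", collisions(nodes, 18))
}
```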

@klueska
Contributor

klueska commented May 7, 2026

Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache.

This race most definitely does not exist today. The conflicts on writing to the API server force a recompute, ensuring once a write succeeds, a unique index is chosen.

Is there any special requirement for the index field?

Yes, they must be exactly from the range 0-17, so that all nodes agree on them in the predetermined nodes_config.cfg file.

I had a never-merged PR that could pre-calculate the index here, but we decided not to go with it because its performance combined with SSA proved to be worse than just allowing conflicts:
#824

@shivamerla
Contributor Author

Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache.

This race most definitely does not exist today. The conflicts on writing to the API server force a recompute, ensuring once a write succeeds, a unique index is chosen.

Is there any special requirement for the index field?

Yes, they must be exactly from the range 0-17, so that all nodes agree on them in the predetermined nodes_config.cfg file.

I had a never-merged PR that could pre-calculate the index here, but we decided not to go with it because its performance combined with SSA proved to be worse than just allowing conflicts: #824

What if we shift index allocation to the compute-domain-controller? Each daemon can continue using SSA for its own row (nodeName, ipAddress, status), but stop assigning index locally (getNextAvailableIndex). The controller, as a single writer with the full clique view, computes and assigns unique indices centrally. Daemons then just consume/update their row with the controller-assigned index. This removes cross-daemon coordination while keeping stable DNS slots (0..17).
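
A rough sketch of what such a single-writer assignment could look like, under assumptions: `assignIndices` is a hypothetical helper, not existing controller code. It keeps indices that present nodes already hold and gives new nodes the smallest free slot, which keeps the set dense and reuses freed slots predictably.

```go
package main

import (
	"fmt"
	"sort"
)

// assignIndices computes an index in [0, slots) for every node in nodes.
// Nodes that already hold a valid index in existing keep it; the remaining
// nodes are sorted by name and receive the smallest free indices.
func assignIndices(existing map[string]int, nodes []string, slots int) (map[string]int, error) {
	if len(nodes) > slots {
		return nil, fmt.Errorf("%d nodes do not fit into %d slots", len(nodes), slots)
	}
	out := make(map[string]int, len(nodes))
	used := make([]bool, slots)
	var pending []string
	for _, n := range nodes {
		if idx, ok := existing[n]; ok && idx >= 0 && idx < slots && !used[idx] {
			out[n] = idx
			used[idx] = true
		} else {
			pending = append(pending, n)
		}
	}
	sort.Strings(pending) // deterministic order for new nodes
	next := 0
	for _, n := range pending {
		for used[next] {
			next++
		}
		out[n] = next
		used[next] = true
	}
	return out, nil
}

func main() {
	existing := map[string]int{"node-a": 0, "node-b": 1, "node-c": 2}
	// node-b left the clique, node-d joined.
	got, err := assignIndices(existing, []string{"node-a", "node-c", "node-d"}, 18)
	fmt.Println(got, err)
}
```

In this example node-d takes over the slot freed by node-b, while node-a and node-c keep their indices.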

@klueska
Contributor

klueska commented May 7, 2026

That is what we attempted here #810, though this was before we introduced the ComputeDomainClique object (which proved to be the real winner, so we dropped this). I believe the issue we ran into here was API server throttling, but now that it's spread across lots of objects maybe that wouldn't be a problem.

That said, I'm still skeptical that anything we come up with will be faster than just allowing the conflicts on such a small number of nodes.

One concrete thing I would do immediately (to reduce the number of conflicts) is what @sakura-3 suggested in one of his comments. That is, compute a random index from any available ones not yet selected, so that everyone isn't constantly competing on the same ones.

@jgehrcke
Contributor

jgehrcke commented May 7, 2026

Most of this discussion is about the systematic comparison of these two approaches:

  1. informer-driven, resourceVersion-based conflict resolution (the classical approach)
  2. SSA-based conflict resolution

Here is what we concluded in #822 (note that we also explored randomly assigning DNS indices):

We noticed the same problems over the weekend. We fixed them (by adding jitter and more back-off, and by getting a random available DNS index). Notably, even with these changes in place, the SSA approach was still not great.

The overall conclusion from the work in #829 and #822 was:

For a given number of conflicts C, the classical conflict resolution approach (1) always has better performance characteristics than SSA (2).

This is the reason why we backed out of anything-SSA for performance improvements. It may very well be that we didn't properly write down all insights of that effort. I'll try to summarize what I remember our conclusions were:

SSA can only ever shine when different racers can independently update fragments of the same object. But if these updates are not independent, and have to be reconciled somewhere, then SSA worsens things by making conflict detection more costly. This may be easy to miss.

The intra-clique indices are as of today clearly not independent.

SSA in particular cannot magically resolve any business logic conflict by itself in the API server. The following example maybe makes that obvious: if two racers want to both set index 7 then

  • with SSA, the API server does costly serialization work and after all allows both updates, and it emits one good response payload and one bad response payload (the problem needs to be detected client-side via response payload inspection).
  • in the classical method, one of the two updates is rejected with a 409 response right away, and it was cheap for the API server to reject one of the requests.
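
The index-7 example can be played through in a toy model. This is a simplified sketch under assumptions, not API server behavior in detail: `object`, `update`, `apply`, and `duplicateIndices` are hypothetical names, and the real SSA path involves managedFields and schema-driven merging that are left out here.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errConflict stands in for an HTTP 409 response.
var errConflict = errors.New("409 conflict")

type row struct {
	Node  string
	Index int
}

// object is a hypothetical stored object with a version and per-node rows.
type object struct {
	ResourceVersion int
	Rows            []row
}

// update models the classical path: the write is rejected unless the writer
// saw the current resourceVersion.
func update(o *object, seenVersion int, r row) error {
	if seenVersion != o.ResourceVersion {
		return errConflict
	}
	o.Rows = append(o.Rows, r)
	o.ResourceVersion++
	return nil
}

// apply models a per-row keyed merge without a resourceVersion check: rows
// owned by different nodes never conflict, whatever their content.
func apply(o *object, r row) {
	for i := range o.Rows {
		if o.Rows[i].Node == r.Node {
			o.Rows[i] = r
			o.ResourceVersion++
			return
		}
	}
	o.Rows = append(o.Rows, r)
	o.ResourceVersion++
}

// duplicateIndices is the client-side check needed after a keyed merge.
func duplicateIndices(rows []row) []int {
	count := make(map[int]int)
	for _, r := range rows {
		count[r.Index]++
	}
	var dups []int
	for idx, n := range count {
		if n > 1 {
			dups = append(dups, idx)
		}
	}
	sort.Ints(dups)
	return dups
}

func main() {
	// Classical: both racers read version 1, then both try to write index 7.
	classical := &object{ResourceVersion: 1}
	fmt.Println(update(classical, 1, row{"node-a", 7}))
	fmt.Println(update(classical, 1, row{"node-b", 7}))

	// Keyed merge: both writes are accepted; the clash shows up only later.
	merged := &object{ResourceVersion: 1}
	apply(merged, row{"node-a", 7})
	apply(merged, row{"node-b", 7})
	fmt.Println(duplicateIndices(merged.Rows))
}
```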

@sakura-3 alluded to that above by saying

the patches from these two daemons would not conflict with each other, resulting in the final CDC object containing duplicate (cliqueID, index) tuples

We really learned that SSA is not a tool for improving performance when one has to coordinate conflicting updates. The opposite is really true; introduction of SSA yields various new challenges as @herb-duan alluded to (mainly, additional stress on the API server which now needs to serialize requests and may still return a response reflecting bad state). When I measured system properties, I also kept track of CPU time consumed by the API server as a metric to quantify that. API server CPU utilization was dramatically higher when using SSA.

@sakura-3 you said above that

mathematically speaking, with 18 daemons contending, it is expected to take more than 5 rounds to reach convergence.

This is a productive point of view, because it's about the fundamental conflict resolution work that is to be done either way. You continue with

I don't see a clear advantage over the non-SSA approach

I very much agree -- that's really the same core insight: "For a given number of conflicts C, ...."

Small correction about this:

the only difference is that the conflict detector shifts from the API server to the daemon itself.

The true conflict detector is in the daemon even with SSA (detecting the same index being used more than once; as you've yourself said).

In general, @sakura-3 I think you're spot-on and our views are aligned.

@shivamerla I saw you said

I think the important difference is the scale and shape of the object being updated.

What I just tried to explain about SSA is generally true for small objects and large objects. Of course, after all, only proper measurements of the key optimization target (CD convergence time) should guide us.

Now, quick input about the more trivial / more obvious part: reducing the number of conflicts to be resolved (per object) is a conceptual improvement that matters independently of the conflict resolution method used.

This is great, and we actually tried it @klueska @sakura-3 @shivamerla:

compute a random index from any available ones not yet selected

See this commit. It introduces getRandomAvailableIndex(). We didn't commit this to main at the time, to keep things separate and because for small N the performance impact was almost negligible.

Btw, the random index selection was part of the measurement series shown as a green line in this plot.

Front-loading index determination (completely eliminating index-related conflict resolution in the hot path) or significantly reducing index conflicts is I think a viable path forward.

I also want to briefly say something about

lingering issues causing 409 (conflicts)

Informer-driven (resourceVersion-based) reconciliation implies seeing 409 responses. Seeing these responses is not generally an issue. Also, the observed rate / count of these responses is not generally a useful performance metric.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Backlog

6 participants