CDClique: SSA per-node status, conflict retry on cleanup, scoped pod selection#1101
CDClique: SSA per-node status, conflict retry on cleanup, scoped pod selection#1101shivamerla wants to merge 4 commits into
Conversation
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
…uses Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
…cache for stale data Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@shivamerla: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for dra-driver-nvidia-gpu ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shivamerla The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/kind cleanup |
|
/retest |
|
@shivamerla: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Concern about SSA updates: Before CDClique was introduced, the compute-domain-daemon had previously attempted to use SSA for CD status updates (when CD status aggregated states from all nodes): The revert was necessary because SSA converged slower for large-scale CD status updates:
Now with CDClique SSA, while a single CDClique caps at 18 nodes, there are still many parallel CDCliques running concurrently. A potential consideration is whether the conflict/retry pattern is simply shifting from
|
|
@herb-duan Thanks for raising this — the #822 / #835 history is exactly why we’re being careful with SSA here. I think the important difference is the scale and shape of the object being updated. Back then, a single ComputeDomain status aggregated state for thousands of nodes (~2k–5k) into one hot object. Even with SSA, we still ended up with large whole-object churn at the apiserver/etcd layer: many writers updating overlapping parts of a huge status surface + managedFields state. Conflicts didn’t really localize, retries amplified load, and we saw the apiserver ↔ etcd CPU impact you mentioned. CDClique is a pretty different scenario:
That’s the main reason SSA makes more sense here than it did for monolithic CD status. In the old model, SSA didn’t buy enough conflict avoidance relative to the cost. Here the goal is narrower: avoid cross-node overwrites from merge-patch semantics while keeping ownership isolated to small per-node slices in a bounded-size object. We’re still treating these as hot objects though. Cleanup paths refetch and use bounded retry-on-conflict specifically to avoid spinning on stale informer state or creating unbounded retry loops. So I agree with the #835 conclusion for cluster-scale monolithic CD status. The CDClique design is intentionally trying to avoid that exact failure mode by bounding object size, bounding writers, and spreading contention across many small objects instead of one giant hot object. If we still see sustained apiserver/etcd pressure in practice, we can definitely revisit with additional backoff/rate limiting/tuning, but I don’t think this lands in the same operational regime as the original CD status issues. |
| "nodeName": myDaemon.NodeName, | ||
| "ipAddress": myDaemon.IPAddress, | ||
| "cliqueID": myDaemon.CliqueID, | ||
| "index": myDaemon.Index, |
There was a problem hiding this comment.
How is the global uniqueness of (cliqueID, index) guaranteed here? For example, due to race conditions, two daemons may hold the same CDC object and each independently find the same available index. In the SSA (Server-Side Apply) scenario, the patches from these two daemons would not conflict with each other, resulting in the final CDC object containing duplicate (cliqueID, index) tuples.
There was a problem hiding this comment.
@sakura-3 Good point! This race already exists today even without SSA, since individual daemon controllers determine the index via getNextAvailableIndex() from the latest CDClique object visible through the informer/mutation cache. With a single large Update (whole daemons slice / whole object) guarded by resourceVersion, concurrent writers often hit 409 conflicts because they replace the same blob. Retries then recompute from the latest object, so in practice things may converge without duplicate indices.
With SSA + per-node field managers + list-as-map keyed by nodeName, the two patches no longer conflict at the merge layer, so the more likely outcome is two distinct daemon rows carrying the same index. So I agree this issue could become more visible with SSA.
One thing we could do is, after a successful SSA apply (or during the next sync), fetch the latest object and check whether another row already holds the same index (same index, different nodeName). If so, recompute getNextAvailableIndex() from the full daemon list and re-patch only the current row.
There was a problem hiding this comment.
@shivamerla I don't think re-patch is a good idea.There are two reasons.
-
Currently, the logic of getNextAvailableIndex() scans the existing CDC state and returns the smallest unused index in ascending order. This approach can lead to a worst-case infinite retry loop. For example, when the nodes list is initially empty, all daemons would independently select index 0; upon detecting the conflict, they would all fall back to index 1, and so on. We need to revise the implementation of getNextAvailableIndex() to mitigate this issue — for instance, by having each daemon randomly pick an unused index from the range [0, 18] instead of always choosing the minimum.
-
However, even with the optimized approach, mathematically speaking, with 18 daemons contending, it is expected to take more than 5 rounds to reach convergence. I don't see a clear advantage over the non-SSA approach — the only difference is that the conflict detector shifts from the API server to the daemon itself.
There was a problem hiding this comment.
Is there any special requirement for the index field? If not, we could consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.
There was a problem hiding this comment.
consider deriving it from a hash of the node name, which would make index assignment fully deterministic and eliminate the negotiation process altogether.
This could in theory work, and I was specifically also thinking about constructing some kind of hash function with node IP and/or node name as input (and talked it through with @klueska back in January). We rejected this because of the relative complexity, potential oversights, and mainly because we wanted to retain a simple method that ensures a dense set of indices with predictable re-usage. Achieving that with a hash function seemed mathematically challenging.
This race most definitely does not exist today. The conflicts on writing to the API server force a recompute, ensuring once a write succeeds, a unique index is chosen.
Yes, they must be exactly from the range 0-17, so that all nodes agree on them in the predetermined I had a never-merged PR that could pre-calculate the index here, but we decided not to go with it because the perfomance of it combined with SSA proved to be worse then just allowing conflicts: |
What if we shift index allocation to the compute-domain-controller? Each daemon can continue using SSA for its own row (nodeName, ipAddress, status), but stop assigning index locally (getNextAvailableIndex). The controller, as a single writer with the full clique view, computes and assigns unique indices centrally. Daemons then just consume/update their row with the controller-assigned index. This removes cross-daemon coordination while keeping stable DNS slots (0..17). |
|
That is what we attempted here #810, though this was before we introduced the That said, I'm still skeptical that anything we come up with will be faster than just allowing the conflicts on such a small number of nodes. One concrete thing I would do immediately (to reduce the number of conflicts) is what @sakura-3 suggested in one of his comments. That is, compute a random index from any available ones not yet selected, so that everyone isn't constantly competing on the same ones. |
|
Most of this discussion is about the systematic comparison of these two approaches:
Here is what we concluded in #822 (note that we also explored randomly assigning DNS indices):
The overall conclusion from the work in in #829 and #822 was: For a given number of conflicts C, the classical conflict resolution approach (1) always has better performance characteristics than SSA (2). This is the reason for why we backed out of anything-SSA for performance improvements. It may very well be that we didn't properly write down all insights of that effort. I'll try to summarize what I remember our conclusions were: SSA can only ever shine when different racers can independently update fragments of the same object. But if these updates are not independent, and have to be reconciled somewhere, then SSA worsens things by making conflict detection more costly. This may be easy to miss. The intra-clique indices are as of today clearly not independent. SSA in particular cannot magically resolve any business logic conflict by itself in the API server. The following example maybe makes that obvious: if two racers want to both set index 7 then
@sakura-3 eluded to that above by saying
We really learned that SSA is not a tool for improving performance when one has to coordinate conflicting updates. The opposite is really true; introduction of SSA yields various new challenges as @herb-duan eluded to (mainly, additional stress on the API server which now needs to serialize requests and may still return a response reflecting bad state). When I measured system properties, I also kept track of CPU time consumed by the API server as a metric to quantify that. API server CPU utilization was dramatically higher when using SSA. @sakura-3 you said above that
This is a productive point of view, because it's about the fundamental conflict resolution work that is to be done either way. You continue with
I very much agree -- that's really the same core insight: "For a given number of conflicts C, ...." Small correction about this:
The true conflict detector is in the daemon even with SSA (detecting the same index being used more than once; as you've yourself said). In general, @sakura-3 I think you're spot-on and our views are aligned. @shivamerla I saw you said
What I just tried to explain about SSA is generally true for small objects and large objects. Of course, after all, only proper measurements of the key optimization target (CD convergence time) should guide us. Now, quick input about the more trivial / more obvious part: reducing the number of conflicts to be resolved (per object) is a conceptual improvement that matters independently of the conflict resolution method used. This is great, and we actually tried it @klueska @sakura-3 @shivamerla:
See this commit. It introduces Btw, the random index selection was part of the measurement series shown as a green line in this plot. Front-loading index determination (completely eliminating index-related conflict resolution in the hot path) or significantly reducing index conflicts is I think a viable path forward. I also want to briefly say something about
Informer-driven (resourceVersion-based) reconciliation implies seeing 409 responses. Seeing these responses is not generally an issue. Also, the observed rate / count of these responses is not generally a useful performance metric. |
What type of PR is this?
/kind robustness
What this PR does / why we need it:
This change adds miscellaneous improvements to address lingering issues causing 409 (conflicts) / 429 (rate limit) errors reported with #816.
Specifically this change adds:
Which issue(s) this PR is related to:
Fixes lingering issues reported in #816
Special notes for your reviewer:
Does this PR introduce a user-facing change?
/release-to-note-none
Additional documentation (design docs, usage docs, etc.):
Checklist
make check testpasses locallymake check-generatepasses ifapi/changed (CRDs, deepcopy, informers, listers, clientset)make check-modulespasses ifgo.mod/go.sumchangeddeployments/helm) updated if flags, RBAC, or defaults changed