cmd/compute-domain-daemon: non-converging CD status due to lost node updates#1049
cmd/compute-domain-daemon: non-converging CD status due to lost node updates#1049aditighag wants to merge 1 commit into
Conversation
|
|
|
Welcome @aditighag! |
|
Hi @aditighag. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
ArangoGutierrez
left a comment
There was a problem hiding this comment.
The repro is clean and the race is real. A few notes on the test shape before this lands.
1. noopMutationCache hides what a fix needs to do. Calling syncNodeInfoToCD directly with a no-op mutation cache exposes the writer race, but a fix that relies on the mutation cache (reading back the server's post-write object in onAddOrUpdate → m.Get()) is indistinguishable from a fix that just adds retry-on-conflict in the writer — both pass. Consider a second test that drives onAddOrUpdate through the real NewIntegerResourceVersionMutationCache and asserts a stale inbound UPDATE does not overwrite fresher state.
2. N=2 scenario only. The repro has two writers. The production failure mode is N writers racing after an informer resync burst. A fix that serializes only two writers (local mutex, etc.) would pass this test and still break at scale. Suggest parameterising on N, or asserting "all N entries survive" as the invariant.
3. Test-only PR, CI implications. This is the first _test.go under cmd/compute-domain-daemon/. If that path is on the default go test / make test, merging leaves CI red. Either land the fix in the same PR, or gate the repro behind t.Skip("tracked in #<issue>") with a follow-up issue.
4. Design question for the fix (non-blocking here). Full-object UpdateStatus from a stale snapshot is racy by construction. Two idiomatic paths: retry.RetryOnConflict with a ResourceVersion precondition, or a targeted patch that mutates only this daemon's own .status.nodes[i] entry. The latter eliminates the class of bug rather than retrying around it. Worth calling out in the fix PR so reviewers can anchor on one model.
Requesting changes pending the fix + broader-N test.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aditighag The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Sure, I'll extend the test by parameterizing
I can take a stab at the fix in a follow-up PR. Any objections to merging this PR that adds a unit test repro for the bug? |
4d86370 to
7cc7929
Compare
✅ Deploy Preview for dra-driver-nvidia-gpu ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
@ArangoGutierrez Addressed review comments in the latest push. PTAL! |
…updates
When multiple compute-domain daemon pods update ComputeDomain status
from their own view of the object, syncNodeInfoToCD rewrites the
full status.nodes list and can drop a peer node entry. The CD status
fails to converge.
Add a unit test that reproduces this lost-update behavior by simulating
multiple daemon instances reading the same initial ComputeDomain object and
then writing their node info sequentially.
Snippets from a real cluster deployment highlighting the bug -
```
Status:
Nodes:
Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
Index: 0
Ip Address: 10.42.3.138
Name: cluster1-mgx-00011
Status: NotReady
Clique Id: 9f410827-489f-46fd-8203-51266ca08eb7.32766
Index: 1
Ip Address: 10.42.1.100
Name: cluster1-mgx-00017
Status: NotReady
Status: NotReady
Events: <none>
I0415 16:14:28.038085 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:28.038098 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.044285 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.044300 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:28.053413 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:28.053424 1 cdstatus.go:354] IP set for clique did not change
I0415 16:14:30.033811 1 cdstatus.go:259] CD status does not contain node name 'cluster1-mgx-00011' yet, try to insert myself: &{cluster1-mgx-00011 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady}
I0415 16:14:30.037427 1 round_trippers.go:632] "Response" verb="PUT" url="https://10.43.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=3
I0415 16:14:30.037620 1 cdstatus.go:276] Successfully inserted/updated node in CD (nodeinfo: &{cluster1-mgx-00011 10.42.3.138 9f410827-489f-46fd-8203-51266ca08eb7.32766 0 NotReady})
I0415 16:14:30.037656 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.042979 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.042995 1 cdstatus.go:339] numNodes: 2, nodes seen: 1
I0415 16:14:30.052870 1 cdstatus.go:239] syncNodeInfoToCD noop: pod IP unchanged (10.42.3.138)
I0415 16:14:30.052888 1 cdstatus.go:354] IP set for clique did not change
```
Signed-off-by: Aditi Ghag <aditi.ghag@lambdal.com>
7cc7929 to
8489b57
Compare
|
@ArangoGutierrez Ping? :) |
|
/ok-to-test |
|
/assign @shivamerla |
|
test failure is a flake. |
Could you re-trigger the test? Thanks! |
|
/retest |
When multiple compute-domain daemon pods update ComputeDomain status from their own view of the object, syncNodeInfoToCD rewrites the full status.nodes list and can drop a peer node entry. The CD status fails to converge.
Add a unit test that reproduces this lost-update behavior by simulating two daemon instances reading the same initial ComputeDomain object and then writing their node info sequentially.
Snippets from a real cluster deployment highlighting the bug -