scheduler: fix state corruption from rescheduler tracker updates #25698
In #12319 we fixed a bug where updates to the reschedule tracker were dropped if the scheduler failed to place the follow-up allocation in the later evaluation. We did this by mutating the previous allocation's reschedule tracker, but without first copying the previous allocation and then making sure the updated copy was in the plan. This is unsafe: it corrupts the state store on the server where the scheduler ran, it may cause a race condition in RPC handlers, and it causes that server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but it likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap the updated allocation into the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan.

Ref: #12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
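The copy-then-swap pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Allocation`, `RescheduleTracker`, `RescheduleEvent`, and `Plan` types and the `recordReschedule` helper are simplified, hypothetical stand-ins for the real scheduler structs.

```go
package main

import "fmt"

// RescheduleEvent, RescheduleTracker, Allocation, and Plan are simplified,
// hypothetical stand-ins for illustration; they are not Nomad's real structs.
type RescheduleEvent struct {
	PrevAllocID string
}

type RescheduleTracker struct {
	Events []RescheduleEvent
}

type Allocation struct {
	ID                string
	RescheduleTracker *RescheduleTracker
}

// Copy returns a deep copy so the caller can mutate the result without
// touching the original, which may be shared with the state store.
func (a *Allocation) Copy() *Allocation {
	if a == nil {
		return nil
	}
	c := &Allocation{ID: a.ID}
	if a.RescheduleTracker != nil {
		c.RescheduleTracker = &RescheduleTracker{
			Events: append([]RescheduleEvent(nil), a.RescheduleTracker.Events...),
		}
	}
	return c
}

// Plan holds the allocation updates that would be submitted, keyed by ID.
type Plan struct {
	Updates map[string]*Allocation
}

// recordReschedule copies prev, appends the event to the copy only, and puts
// the copy in the plan. The object passed in as prev is never mutated.
func recordReschedule(plan *Plan, prev *Allocation, ev RescheduleEvent) *Allocation {
	updated := prev.Copy()
	if updated.RescheduleTracker == nil {
		updated.RescheduleTracker = &RescheduleTracker{}
	}
	updated.RescheduleTracker.Events = append(updated.RescheduleTracker.Events, ev)
	plan.Updates[updated.ID] = updated
	return updated
}

func main() {
	// stored plays the role of an allocation read from the state store.
	stored := &Allocation{ID: "alloc-1", RescheduleTracker: &RescheduleTracker{}}
	plan := &Plan{Updates: map[string]*Allocation{}}

	updated := recordReschedule(plan, stored, RescheduleEvent{PrevAllocID: "alloc-0"})

	// The stored allocation is untouched; only the copy in the plan changed.
	fmt.Println(len(stored.RescheduleTracker.Events), len(updated.RescheduleTracker.Events)) // prints "0 1"
}
```

The point of the sketch is that the shared object is only ever read; any change travels through the plan, so every server would apply it the same way.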
Force-pushed from 9e8d64d to 4cef876.
Fixed. The disconnected clients behavior copied an Allocation correctly, but left the old references around in the results allocsets.
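The stale-reference problem mentioned in this comment can be illustrated with a small sketch. The `allocSet` type, the set names, and the `replaceInSets` helper are hypothetical and only loosely modeled on the idea of several sets holding pointers to the same allocation; they are not the reconciler's actual code.

```go
package main

import "fmt"

// Allocation and allocSet are simplified, hypothetical types for illustration.
type Allocation struct {
	ID           string
	ClientStatus string
}

type allocSet map[string]*Allocation

// replaceInSets swaps the updated copy into every set that already has an
// entry under the same ID, so no set keeps pointing at the old object.
func replaceInSets(updated *Allocation, sets ...allocSet) {
	for _, set := range sets {
		if _, ok := set[updated.ID]; ok {
			set[updated.ID] = updated
		}
	}
}

func main() {
	orig := &Allocation{ID: "alloc-1", ClientStatus: "running"}

	// Two sets (names are assumptions) reference the same allocation.
	untainted := allocSet{orig.ID: orig}
	disconnecting := allocSet{orig.ID: orig}

	// Copy before mutating; a value copy is enough for this flat struct.
	c := *orig
	updated := &c
	updated.ClientStatus = "unknown"

	// Updating only one set leaves the old pointer in the other.
	disconnecting[updated.ID] = updated
	fmt.Println(untainted["alloc-1"].ClientStatus) // prints "running"

	// Replacing the entry in every set removes the stale reference.
	replaceInSets(updated, untainted, disconnecting)
	fmt.Println(untainted["alloc-1"].ClientStatus) // prints "unknown"
}
```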
Force-pushed from d1ed956 to c44da75.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Contributor Checklist
In addition to the updated tests here, I deployed a job and broke it to trigger the reschedule tracker update for blocked evals. I then verified that the expected events appear in the event stream and compared the results against the existing behavior; they look as expected.