scheduler: fix state corruption from rescheduler tracker updates by tgross · Pull Request #25698 · hashicorp/nomad

tgross · 2025-04-16T18:59:29Z

In #12319 we fixed a bug where updates to the reschedule tracker would be dropped if the follow-up allocation failed to be placed by the scheduler in the later evaluation. We did this by mutating the previous allocation's reschedule tracker. But we did this without copying the previous allocation first and then making sure the updated copy was in the plan. This is unfortunately unsafe and corrupts the state store on the server where the scheduler ran; it may cause a race condition in RPC handlers and it causes the server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap out the updated allocation in the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan.

Ref: #12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357

Contributor Checklist

Changelog Entry If this PR changes user-facing behavior, please generate and add a
changelog entry using the make cl command.
Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
ensure regressions will be caught.

In addition to updated tests here, I deployed a job and broke it to trigger the reschedule tracking update for blocked evals. I then verified that I get the expected events in the event stream and did some comparison to the existing behavior and that looks as expected.

Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
and job configuration, please update the Nomad website documentation to reflect this. Refer to
the website README for docs guidelines. Please also consider whether the
change requires notes within the upgrade guide.

Reviewer Checklist

Backport Labels Please add the correct backport labels as described by the internal
backporting document.
Commit Type Ensure the correct merge method is selected which should be "squash and merge"
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
within the public repository.

In #12319 we fixed a bug where updates to the reschedule tracker would be dropped if the follow-up allocation failed to be placed by the scheduler in the later evaluation. We did this by mutating the previous allocation's reschedule tracker. But we did this without copying the previous allocation first and then making sure the updated copy was in the plan. This is unfortunately unsafe and corrupts the state store on the server where the scheduler ran; it may cause a race condition in RPC handlers and it causes the server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but likely impacts production users. Copy the previous allocation before updating the reschedule tracker, and swap out the updated allocation in the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan. Ref: #12319 Fixes: https://hashicorp.atlassian.net/browse/NET-12357

tgross · 2025-04-16T19:36:38Z

Looks like this breaks a disconnected clients test. I'm investigating. Edit: with some debugging it looks like there's another location in the code that's performing a mutation without us noticing, and that's getting detected now that we're copying the struct. Still looking for where.

Fixed. The disconnected clients behavior copied an Allocation correctly, but left the old references around in the results allocsets.

github-actions · 2025-08-17T02:32:00Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added theme/scheduling type/bug labels Apr 16, 2025

tgross added this to the 1.10.x milestone Apr 16, 2025

vercel Bot deployed to Preview – nomad-ui April 16, 2025 19:00 View deployment

tgross force-pushed the NET12357-scheduler-race branch from 9e8d64d to 4cef876 Compare April 16, 2025 19:00

tgross added backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.9.x+ent labels Apr 16, 2025

vercel Bot deployed to Preview – nomad-ui April 16, 2025 19:02 View deployment

vercel Bot deployed to Preview – nomad-ui April 16, 2025 21:04 View deployment

ensure disconnected allocs update the pointer to previous alloc

c44da75

tgross force-pushed the NET12357-scheduler-race branch from d1ed956 to c44da75 Compare April 17, 2025 12:49

vercel Bot deployed to Preview – nomad-ui April 17, 2025 12:50 View deployment

tgross marked this pull request as ready for review April 17, 2025 13:58

tgross requested review from a team as code owners April 17, 2025 13:58

tgross requested review from jrasell and schmichael April 17, 2025 13:58

schmichael approved these changes Apr 17, 2025

View reviewed changes

tgross merged commit c205688 into main Apr 18, 2025
32 checks passed

tgross deleted the NET12357-scheduler-race branch April 18, 2025 12:42

hc-github-team-nomad-core mentioned this pull request Apr 18, 2025

Backport of scheduler: fix state corruption from rescheduler tracker updates into release/1.10.x #25706

Merged

6 tasks

tgross mentioned this pull request Aug 6, 2025

Batch job reschedule attempts sometimes not being respected #20462

Closed

github-actions Bot locked as resolved and limited conversation to collaborators Aug 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scheduler: fix state corruption from rescheduler tracker updates#25698

scheduler: fix state corruption from rescheduler tracker updates#25698
tgross merged 2 commits intomainfrom
NET12357-scheduler-race

tgross commented Apr 16, 2025 •

edited

Loading

Uh oh!

tgross commented Apr 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

github-actions Bot commented Aug 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tgross commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contributor Checklist

Reviewer Checklist

Uh oh!

tgross commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Aug 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tgross commented Apr 16, 2025 •

edited

Loading

tgross commented Apr 16, 2025 •

edited

Loading