Skip to content

scheduler: fix state corruption from rescheduler tracker updates#25698

Merged
tgross merged 2 commits intomainfrom
NET12357-scheduler-race
Apr 18, 2025
Merged

scheduler: fix state corruption from rescheduler tracker updates#25698
tgross merged 2 commits intomainfrom
NET12357-scheduler-race

Conversation

@tgross
Copy link
Copy Markdown
Member

@tgross tgross commented Apr 16, 2025

In #12319 we fixed a bug where updates to the reschedule tracker would be dropped if the follow-up allocation failed to be placed by the scheduler in the later evaluation. We did this by mutating the previous allocation's reschedule tracker. But we did this without copying the previous allocation first and then making sure the updated copy was in the plan. This is unfortunately unsafe and corrupts the state store on the server where the scheduler ran; it may cause a race condition in RPC handlers and it causes the server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap out the updated allocation in the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan.

Ref: #12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.

In addition to updated tests here, I deployed a job and broke it to trigger the reschedule tracking update for blocked evals. I then verified that I get the expected events in the event stream and did some comparison to the existing behavior and that looks as expected.

  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

In #12319 we fixed a bug where updates to the reschedule tracker would be
dropped if the follow-up allocation failed to be placed by the scheduler in the
later evaluation. We did this by mutating the previous allocation's reschedule
tracker. But we did this without copying the previous allocation first and then
making sure the updated copy was in the plan. This is unfortunately unsafe and
corrupts the state store on the server where the scheduler ran; it may cause a
race condition in RPC handlers and it causes the server to be out of sync with
the other servers. This was discovered while trying to make all our tests
race-free, but likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap
out the updated allocation in the plan. This also requires that we include the
reschedule tracker in the "normalized" (stripped-down) allocations we send to
the leader as part of a plan.

Ref: #12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
@tgross
Copy link
Copy Markdown
Member Author

tgross commented Apr 16, 2025

Looks like this breaks a disconnected clients test. I'm investigating. Edit: with some debugging it looks like there's another location in the code that's performing a mutation without us noticing, and that's getting detected now that we're copying the struct. Still looking for where.

Fixed. The disconnected clients behavior copied an Allocation correctly, but left the old references around in the results allocsets.

@tgross tgross force-pushed the NET12357-scheduler-race branch from d1ed956 to c44da75 Compare April 17, 2025 12:49
@tgross tgross marked this pull request as ready for review April 17, 2025 13:58
@tgross tgross requested review from a team as code owners April 17, 2025 13:58
@tgross tgross requested review from jrasell and schmichael April 17, 2025 13:58
@tgross tgross merged commit c205688 into main Apr 18, 2025
32 checks passed
@tgross tgross deleted the NET12357-scheduler-race branch April 18, 2025 12:42
@github-actions
Copy link
Copy Markdown

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Aug 17, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants