
Conversation

@tgross (Member) commented Apr 23, 2025

When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than `migrate.max_parallel` canaries on the node and the canaries are not yet healthy (e.g. they have a long `update.min_healthy_time`). In this circumstance, the drainer correctly marks the first batch of canaries for migration. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work left to do for them). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case that removes such a canary from the list of active canaries so we can replace it.
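
The shape of the change, as a minimal standalone sketch rather than the actual reconciler code (the type and field names below are illustrative stand-ins, not Nomad's):

```go
// A toy model of the canary bookkeeping: a terminal canary that the drainer
// marked for migration is dropped from the active set so the reconciler
// places a replacement instead of counting it toward the expected canaries.
package main

import "fmt"

type allocation struct {
	ID            string
	Terminal      bool // the allocation has stopped
	MarkedMigrate bool // the drainer set the migrate desired-transition
}

// activeCanaries returns the canaries that should still count toward the
// deployment's expected canary total.
func activeCanaries(canaries []*allocation) []*allocation {
	var active []*allocation
	for _, c := range canaries {
		if c.Terminal && c.MarkedMigrate {
			// Previously such a canary stayed in the list ("terminal, so no
			// work to do"), which blocked the creation of a replacement.
			continue
		}
		active = append(active, c)
	}
	return active
}

func main() {
	canaries := []*allocation{
		{ID: "c1", Terminal: true, MarkedMigrate: true}, // drained before it became healthy
		{ID: "c2"},                                      // still running
	}
	for _, c := range activeCanaries(canaries) {
		fmt.Println("still counted as a canary:", c.ID)
	}
}
```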

I've adjusted an existing reconciler test to cover multiple canaries, and I've added a new test that covers the whole scheduler based on a cluster state I extracted from the reproduction described in #17842. In addition to verifying this fixes that case, I've also run the job with canaries and forced them to fail and/or reschedule to ensure we were still properly detecting failed canaries.

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: #17842

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@tgross force-pushed the NMD560-drained-canaries branch from 7773ddb to 3f9d86e on April 23, 2025 18:32
@tgross added the backport/ent/1.8.x+ent, backport/ent/1.9.x+ent, and backport/1.10.x labels on Apr 23, 2025
@tgross added this to the 1.10.x milestone on Apr 23, 2025
tgross added a commit that referenced this pull request Apr 23, 2025
While working on #25726, I found a method in the drainer code that creates
a map of job IDs to allocations.

At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.
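
Roughly what that refactor looks like, sketched with stand-in types rather than the drainer's real ones:

```go
// Instead of a map keyed by job ID whose values are allocations (which only
// looks like deduplication), build a map whose values are the jobs themselves,
// since the consumer only ever needs the job.
package main

import "fmt"

type job struct {
	ID string
}

type allocation struct {
	ID  string
	Job *job
}

// jobsFromAllocs collects the distinct jobs that the given allocations belong
// to, keyed by job ID.
func jobsFromAllocs(allocs []*allocation) map[string]*job {
	jobs := make(map[string]*job)
	for _, alloc := range allocs {
		jobs[alloc.Job.ID] = alloc.Job
	}
	return jobs
}

func main() {
	j := &job{ID: "web"}
	allocs := []*allocation{{ID: "a1", Job: j}, {ID: "a2", Job: j}}
	fmt.Println(len(jobsFromAllocs(allocs)), "job(s) with allocations on the draining node")
}
```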

Ref: #25726
tgross added a commit that referenced this pull request Apr 23, 2025
While working on #25726, I explored a hypothesis that the problem could be
in the state store, but this proved to be a dead end. While I was in this area
of the code I migrated the tests to `shoenig/test`.

Ref: #25726
@tgross marked this pull request as ready for review on April 23, 2025 19:03
@tgross requested review from a team as code owners on April 23, 2025 19:03
@jrasell (Member) left a comment

LGTM, thanks @tgross!

@tgross merged commit 5208ad4 into main on Apr 24, 2025 (55 checks passed)
@tgross deleted the NMD560-drained-canaries branch on April 24, 2025 13:24
tgross added a commit that referenced this pull request Apr 24, 2025
While working on #25726, I found a method in the drainer code that creates
a map of job IDs to allocations.

At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.

Ref: #25726
tgross added a commit that referenced this pull request Jul 17, 2025
In #25726 we added a test of how canaries were treated when on draining
nodes. But the test didn't correctly configure the job with an update block,
leading to misleading test behavior. Fix the test to exercise the intended
behavior and refactor for clarity.
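
For reference, the kind of job setup the fixed test needs looks roughly like the sketch below. This assumes Nomad's `mock.Job()` test helper and the `structs.UpdateStrategy`/`structs.MigrateStrategy` types; the specific values are illustrative, not taken from the actual test:

```go
package scheduler_test

import (
	"time"

	"github.com/hashicorp/nomad/nomad/mock"
	"github.com/hashicorp/nomad/nomad/structs"
)

// mockCanaryJob builds a job whose canaries outnumber migrate.max_parallel
// and stay unhealthy long enough to be caught by a node drain. Without the
// Update block, the scheduler never treats the new allocations as canaries,
// so a test that omits it does not exercise the drained-canary path.
func mockCanaryJob() *structs.Job {
	job := mock.Job()
	job.TaskGroups[0].Count = 4
	job.TaskGroups[0].Update = &structs.UpdateStrategy{
		MaxParallel:    2,
		Canary:         4,                // canary every allocation in the group
		MinHealthyTime: 10 * time.Second, // canaries are still unhealthy when the drain starts
	}
	job.TaskGroups[0].Migrate = &structs.MigrateStrategy{
		MaxParallel: 1, // the drainer migrates fewer canaries per batch than exist on the node
	}
	return job
}
```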

Ref: #25726
tgross added a commit that referenced this pull request Jul 18, 2025
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same
when the group has been scaled to zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change, server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.
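
As a self-contained sketch of the idea (illustrative names, not the reconciler's real types), the early filtering looks something like this:

```go
// Server-terminal allocations are split out before any further group
// computation, so they can no longer fall through as neither "stop" nor
// "ignore".
package main

import "fmt"

type allocation struct {
	ID            string
	DesiredStatus string // e.g. "run", "stop", "evict"
}

// serverTerminal reports whether the server has already decided this
// allocation should not be running.
func (a *allocation) serverTerminal() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// filterServerTerminal separates server-terminal allocations up front so the
// rest of the group computation only ever sees live allocations.
func filterServerTerminal(allocs []*allocation) (live, terminal []*allocation) {
	for _, alloc := range allocs {
		if alloc.serverTerminal() {
			terminal = append(terminal, alloc)
			continue
		}
		live = append(live, alloc)
	}
	return live, terminal
}

func main() {
	allocs := []*allocation{
		{ID: "a1", DesiredStatus: "run"},
		{ID: "a2", DesiredStatus: "stop"},
	}
	live, terminal := filterServerTerminal(allocs)
	fmt.Printf("live=%d terminal=%d\n", len(live), len(terminal))
}
```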

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In #25726 we added a test of how
  canaries were treated when on draining nodes. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2025