
Conversation

@tgross (Member) commented Apr 23, 2025

When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than `migrate.max_parallel` canaries on the node and the canaries are not yet healthy (e.g. they have a long `update.min_healthy_time`). In this circumstance, the drainer correctly marks the first batch of canaries for migration. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work left to do for them). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case that removes such a canary from the list of active canaries so we can replace it.
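
The shape of the change, as a minimal standalone sketch rather than the actual reconciler code (the type and field names below are illustrative stand-ins, not Nomad's):

```go
// A toy model of the canary bookkeeping: a terminal canary that the drainer
// marked for migration is dropped from the active set so the reconciler
// places a replacement instead of counting it toward the expected canaries.
package main

import "fmt"

type allocation struct {
	ID            string
	Terminal      bool // the allocation has stopped
	MarkedMigrate bool // the drainer set the migrate desired-transition
}

// activeCanaries returns the canaries that should still count toward the
// deployment's expected canary total.
func activeCanaries(canaries []*allocation) []*allocation {
	var active []*allocation
	for _, c := range canaries {
		if c.Terminal && c.MarkedMigrate {
			// Previously such a canary stayed in the list ("terminal, so no
			// work to do"), which blocked the creation of a replacement.
			continue
		}
		active = append(active, c)
	}
	return active
}

func main() {
	canaries := []*allocation{
		{ID: "c1", Terminal: true, MarkedMigrate: true}, // drained before it became healthy
		{ID: "c2"},                                      // still running
	}
	for _, c := range activeCanaries(canaries) {
		fmt.Println("still counted as a canary:", c.ID)
	}
}
```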

I've adjusted an existing reconciler test to cover multiple canaries, and I've added a new test that covers the whole scheduler based on a cluster state I extracted from the reproduction described in #17842. In addition to verifying this fixes that case, I've also run the job with canaries and forced them to fail and/or reschedule to ensure we were still properly detecting failed canaries.

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: #17842

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@tgross force-pushed the NMD560-drained-canaries branch from 7773ddb to 3f9d86e on April 23, 2025 18:32
@tgross added the backport/ent/1.8.x+ent, backport/ent/1.9.x+ent, and backport/1.10.x labels on Apr 23, 2025
@tgross added this to the 1.10.x milestone on Apr 23, 2025
tgross added a commit that referenced this pull request Apr 23, 2025
While working on #25726, I found a method in the drainer code that creates
a map of job IDs to allocations.

At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.
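
Roughly what that refactor looks like, sketched with stand-in types rather than the drainer's real ones:

```go
// Instead of a map keyed by job ID whose values are allocations (which only
// looks like deduplication), build a map whose values are the jobs themselves,
// since the consumer only ever needs the job.
package main

import "fmt"

type job struct {
	ID string
}

type allocation struct {
	ID  string
	Job *job
}

// jobsFromAllocs collects the distinct jobs that the given allocations belong
// to, keyed by job ID.
func jobsFromAllocs(allocs []*allocation) map[string]*job {
	jobs := make(map[string]*job)
	for _, alloc := range allocs {
		jobs[alloc.Job.ID] = alloc.Job
	}
	return jobs
}

func main() {
	j := &job{ID: "web"}
	allocs := []*allocation{{ID: "a1", Job: j}, {ID: "a2", Job: j}}
	fmt.Println(len(jobsFromAllocs(allocs)), "job(s) with allocations on the draining node")
}
```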

Ref: #25726
tgross added a commit that referenced this pull request Apr 23, 2025
While working on #25726, I explored a hypothesis that the problem could be
in the state store, but this proved to be a dead end. While I was in this area
of the code I migrated the tests to `shoenig/test`.

Ref: #25726
@tgross marked this pull request as ready for review on April 23, 2025 19:03
@tgross requested review from a team as code owners on April 23, 2025 19:03
@jrasell (Member) left a comment

LGTM, thanks @tgross!

@tgross merged commit 5208ad4 into main on Apr 24, 2025 (55 checks passed)
@tgross deleted the NMD560-drained-canaries branch on April 24, 2025 13:24
tgross added a commit that referenced this pull request Apr 24, 2025
While working on #25726, I found a method in the drainer code that creates
a map of job IDs to allocations.

At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.

Ref: #25726
tgross added a commit that referenced this pull request Jul 17, 2025
In #25726 we added a test of how canaries were treated when on draining
nodes. But the test didn't correctly configure the job with an update block,
leading to misleading test behavior. Fix the test to exercise the intended
behavior and refactor for clarity.
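
For reference, the kind of job setup the fixed test needs looks roughly like the sketch below. This assumes Nomad's `mock.Job()` test helper and the `structs.UpdateStrategy`/`structs.MigrateStrategy` types; the specific values are illustrative, not taken from the actual test:

```go
package scheduler_test

import (
	"time"

	"github.com/hashicorp/nomad/nomad/mock"
	"github.com/hashicorp/nomad/nomad/structs"
)

// mockCanaryJob builds a job whose canaries outnumber migrate.max_parallel
// and stay unhealthy long enough to be caught by a node drain. Without the
// Update block, the scheduler never treats the new allocations as canaries,
// so a test that omits it does not exercise the drained-canary path.
func mockCanaryJob() *structs.Job {
	job := mock.Job()
	job.TaskGroups[0].Count = 4
	job.TaskGroups[0].Update = &structs.UpdateStrategy{
		MaxParallel:    2,
		Canary:         4,                // canary every allocation in the group
		MinHealthyTime: 10 * time.Second, // canaries are still unhealthy when the drain starts
	}
	job.TaskGroups[0].Migrate = &structs.MigrateStrategy{
		MaxParallel: 1, // the drainer migrates fewer canaries per batch than exist on the node
	}
	return job
}
```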

Ref: #25726
tgross added a commit that referenced this pull request Jul 18, 2025
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same
when the group has been scaled to zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change, server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.
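
As a self-contained sketch of the idea (illustrative names, not the reconciler's real types), the early filtering looks something like this:

```go
// Server-terminal allocations are split out before any further group
// computation, so they can no longer fall through as neither "stop" nor
// "ignore".
package main

import "fmt"

type allocation struct {
	ID            string
	DesiredStatus string // e.g. "run", "stop", "evict"
}

// serverTerminal reports whether the server has already decided this
// allocation should not be running.
func (a *allocation) serverTerminal() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// filterServerTerminal separates server-terminal allocations up front so the
// rest of the group computation only ever sees live allocations.
func filterServerTerminal(allocs []*allocation) (live, terminal []*allocation) {
	for _, alloc := range allocs {
		if alloc.serverTerminal() {
			terminal = append(terminal, alloc)
			continue
		}
		live = append(live, alloc)
	}
	return live, terminal
}

func main() {
	allocs := []*allocation{
		{ID: "a1", DesiredStatus: "run"},
		{ID: "a2", DesiredStatus: "stop"},
	}
	live, terminal := filterServerTerminal(allocs)
	fmt.Printf("live=%d terminal=%d\n", len(live), len(terminal))
}
```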

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In #25726 we added a test of how
  canaries were treated when on draining nodes. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2025