scheduler: allow canaries to be migrated on node drain #25726
Conversation
When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than `migrate.max_parallel` canaries on the node and the canaries are not yet healthy (ex. they have a long `update.min_healthy_time`). In this circumstance, the first batch of canaries is marked for migration by the drainer correctly. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work to do). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case to cause the canary to be removed from the list of active canaries so we can replace it.

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: #17842
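As a rough illustration of the fix, here is a minimal sketch using simplified, hypothetical types (`alloc`, `activeCanaries`) rather than the actual reconciler code: a canary that is already server-terminal and was marked for migration is dropped from the active canary set so a replacement can be placed.

```go
package main

import "fmt"

// alloc is a simplified, hypothetical stand-in for a Nomad allocation;
// the real structs.Allocation carries far more state.
type alloc struct {
	ID               string
	ServerTerminal   bool // server decided this allocation should stop
	MarkedForMigrate bool // the drainer marked it for migration
}

// activeCanaries drops canaries that are already terminal because the
// drainer migrated them, so the reconciler places replacements instead
// of stalling the deployment.
func activeCanaries(canaries []alloc) []alloc {
	var active []alloc
	for _, a := range canaries {
		if a.ServerTerminal && a.MarkedForMigrate {
			continue // replace this canary rather than counting it
		}
		active = append(active, a)
	}
	return active
}

func main() {
	canaries := []alloc{
		{ID: "c1", ServerTerminal: true, MarkedForMigrate: true}, // migrated off the drained node
		{ID: "c2"}, // still running, not yet healthy
	}
	fmt.Println(len(activeCanaries(canaries))) // 1, so one replacement canary gets placed
}
```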
Force-pushed from 7773ddb to 3f9d86e
While working on #25726, I found a method in the drainer code that creates a map of job IDs to allocations. At first glance this looks like a bug because it effectively de-duplicates the allocations per job. But the consumer of the map is only concerned with jobs, not allocations, and simply reads the job off the allocation. Refactor this to make it obvious we're looking at the job. Ref: #25726
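To illustrate the refactor (with simplified, hypothetical types, not the real drainer code): a map keyed by job ID that stores allocations silently keeps only one allocation per job, which is harmless only because the consumer reads the job back off the allocation; storing the job itself makes that intent explicit.

```go
package main

import "fmt"

// Simplified stand-ins for the real structs.
type job struct{ ID string }
type allocation struct {
	ID  string
	Job *job
}

// Before: keyed by job ID but storing an allocation, which de-duplicates
// allocations per job as a side effect.
func jobsFromAllocsOld(allocs []*allocation) map[string]*allocation {
	m := make(map[string]*allocation)
	for _, a := range allocs {
		m[a.Job.ID] = a // later allocs for the same job overwrite earlier ones
	}
	return m
}

// After: store the job directly, since the consumer only cares about jobs.
func jobsFromAllocs(allocs []*allocation) map[string]*job {
	m := make(map[string]*job)
	for _, a := range allocs {
		m[a.Job.ID] = a.Job
	}
	return m
}

func main() {
	j := &job{ID: "web"}
	allocs := []*allocation{{ID: "a1", Job: j}, {ID: "a2", Job: j}}
	fmt.Println(len(jobsFromAllocsOld(allocs)), len(jobsFromAllocs(allocs))) // 1 1
}
```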
LGTM, thanks @tgross!
When a task group is removed from a jobspec, the reconciler stops all allocations and immediately returns from `computeGroup`. We can do the same for when the group has been scaled to zero, but doing so runs into an inconsistency in the way that server-terminal allocations are handled. Prior to this change, server-terminal allocations fall through `computeGroup` without being marked as `ignore`, unless they are terminal canaries, in which case they are marked `stop` (but this is a no-op). This inconsistency causes a _tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it more difficult to make test assertions for `stop` vs `ignore` vs fallthrough.

Remove this inconsistency by filtering out server-terminal allocations early in `computeGroup`. This brings the cluster reconciler's behavior closer to the node reconciler's behavior, except that the node reconciler discards _all_ terminal allocations because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were a bit of a mess:

* In #25726 we added a test of how canaries were treated when on draining nodes. But the test didn't correctly configure the job with an `update` block, leading to misleading test behavior. Fix the test to exercise the intended behavior and refactor for clarity.
* While working on reconciler behaviors around stopped allocations, I found it extremely hard to follow the intent of the disconnected client tests because many of the fields in the table-driven test are switches for more complex behavior or are just tersely named. Attempt to make this a little more legible by moving some branches directly into fields, renaming some fields, and flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
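A minimal sketch of the early-filter idea, using hypothetical helper names rather than the real `computeGroup` internals: server-terminal allocations are discarded before any other bucketing, so they can neither fall through nor be redundantly marked `stop`.

```go
package main

import "fmt"

// allocation is a simplified, hypothetical stand-in for structs.Allocation.
type allocation struct {
	ID            string
	DesiredStatus string // e.g. "run", "stop", "evict"
}

// serverTerminal reports whether the server has already decided this
// allocation should not run; such allocations need no further decisions.
func (a *allocation) serverTerminal() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// filterServerTerminal discards server-terminal allocations up front so
// they can neither fall through later branches nor be redundantly
// marked stop.
func filterServerTerminal(allocs []*allocation) []*allocation {
	var keep []*allocation
	for _, a := range allocs {
		if a.serverTerminal() {
			continue
		}
		keep = append(keep, a)
	}
	return keep
}

func main() {
	allocs := []*allocation{
		{ID: "a1", DesiredStatus: "run"},
		{ID: "a2", DesiredStatus: "stop"},
	}
	fmt.Println(len(filterServerTerminal(allocs))) // 1
}
```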
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than `migrate.max_parallel` canaries on the node and the canaries are not yet healthy (ex. they have a long `update.min_healthy_time`). In this circumstance, the first batch of canaries is marked for migration by the drainer correctly. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted (see the toy calculation after this description).

When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work to do). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case to cause the canary to be removed from the list of active canaries so we can replace it.
I've adjusted an existing reconciler test to cover multiple canaries, and I've added a new test that covers the whole scheduler based on a cluster state I extracted from the reproduction described in #17842. In addition to verifying this fixes that case, I've also run the job with canaries and forced them to fail and/or reschedule to ensure we were still properly detecting failed canaries.
Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: #17842
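To make the stall concrete, here is a toy calculation with illustrative numbers only, not Nomad's actual accounting: when terminal-but-migrated canaries still count toward the expected canary total, the reconciler computes zero new placements and the healthy count can never reach the promotion threshold.

```go
package main

import "fmt"

func main() {
	// Toy numbers: two canaries requested, one migrate.max_parallel slot
	// on the draining node.
	const desiredCanaries = 2
	const migratedTerminal = 1 // first batch migrated by the drainer, now terminal

	// Buggy accounting: terminal migrated canaries still count as existing,
	// so no replacement is placed and the deployment never sees enough
	// healthy canaries to be promoted.
	existingBuggy := desiredCanaries
	fmt.Println("placements (buggy):", desiredCanaries-existingBuggy) // 0

	// Fixed accounting: terminal migrated canaries are removed from the
	// active set, so a replacement canary is placed on another node.
	existingFixed := desiredCanaries - migratedTerminal
	fmt.Println("placements (fixed):", desiredCanaries-existingFixed) // 1
}
```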
Contributor Checklist
- Add a changelog entry using the `make cl` command.
- Add or update tests to ensure regressions will be caught.
- If the change impacts user-facing behavior or job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- Apply the appropriate backport labels per the backporting document.
- Squash and merge in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- Ensure discussion that should not be public is not included within the public repository.