Skip to content

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #25726 to be assessed for backporting due to the inclusion of the label backport/1.10.x.

The below text is copied from the body of the original PR.


When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than migrate.max_parallel canaries on the node and the canaries are not yet healthy (ex. they have a long update.min_healthy_time). In this circumstance, the first batch of canaries are marked for migration by the drainer correctly. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work to do). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case to cause the canary to be removed from the list of active canaries so we can replace it.

I've adjusted an existing reconciler test to cover multiple canaries, and I've added a new test that covers the whole scheduler based on a cluster state I extracted from the reproduction described in #17842. In addition to verifying this fixes that case, I've also run the job with canaries and forced them to fail and/or reschedule to ensure we were still properly detecting failed canaries.

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: #17842

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Overview of commits

@tgross tgross merged commit 8df29d7 into release/1.10.x Apr 24, 2025
31 checks passed
@tgross tgross deleted the backport/NMD560-drained-canaries/vaguely-wired-ewe branch April 24, 2025 13:54
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants