Skip to content

Conversation

tgross
Copy link
Member

@tgross tgross commented Jul 17, 2025

When a task group is removed from a jobspec, the reconciler stops all allocations and immediately returns from computeGroup. We can do the same for when the group has been scaled-to-zero, but doing so runs into an inconsistency in the way that server-terminal allocations are handled.

Prior to this change server-terminal allocations fall through computeGroup without being marked as ignore, unless they are terminal canaries, in which case they are marked stop (but this is a no-op). This inconsistency causes a tiny amount of extra Plan.Submit/Raft traffic, but more importantly makes it more difficult to make test assertions for stop vs ignore vs fallthrough. Remove this inconsistency by filtering out server-terminal allocations early in computeGroup.

This brings the cluster reconciler's behavior closer to the node reconciler's behavior, except that the node reconciler discards all terminal allocations because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were a bit of a mess:

  • In scheduler: allow canaries to be migrated on node drain #25726 we added a test of how canaries were treated when on draining nodes. But the test didn't correctly configure the job with an update block, leading to misleading test behavior. Fix the test to exercise the intended behavior and refactor for clarity.
  • While working on reconciler behaviors around stopped allocations, I found it extremely hard to follow the intent of the disconnected client tests because many of the fields in the table-driven test are switches for more complex behavior or just tersely named. Attempt to make this a little more legible by moving some branches directly into fields, renaming some fields, and flattening out some branching.

There's a good bit here, but I've broken this PR up into commits that should stand alone.

Ref: https://hashicorp.atlassian.net/browse/NMD-819

tgross added 5 commits July 17, 2025 14:31
In #25726 we added a test of how canaries were treated when on draining
nodes. But the test didn't correctly configure the job with an update block,
leading to misleading test behavior. Fix the test to exercise the intended
behavior and refactor for clarity.

Ref: #25726
While working on reconciler behaviors around stopped allocations, I found it
extremely hard to follow the intent of the disconnected client tests because
many of the fields in the table-driven test are switches for more complex
behavior or just tersely named. Attempt to make this a little more legible by
moving some branches directly into fields, renaming some fields, and flattening
out some branching.
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same for
when the group has been scaled-to-zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
tiny amount of extra Raft traffic, but more importantly makes it more difficult
to make test assertions for `stop` vs `ignore` vs fallthrough. Remove this
inconsistency by filtering out server-terminal allocations early in
`computeGroup`.

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.
@tgross
Copy link
Member Author

tgross commented Jul 17, 2025

The test-windows failure here appears to be an upstream infrastructure issue with NuGet. I'll try re-running later.

@tgross tgross marked this pull request as ready for review July 17, 2025 19:06
@tgross tgross requested review from a team as code owners July 17, 2025 19:06
@tgross tgross requested review from pkazmierczak and jrasell July 17, 2025 19:06
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@pkazmierczak pkazmierczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tgross tgross merged commit 333dd94 into main Jul 18, 2025
46 of 47 checks passed
@tgross tgross deleted the reconciler-eager-stop branch July 18, 2025 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants