Conversation

allisonlarson commented May 14, 2025

Description

When running a system job with a constraint, any run after the initial one returns exit code 2 and a warning about allocations left unplaced due to constraints, an error that is not encountered on the initial run even though the constraint stays the same. This is because the node that satisfies the constraint is already running the allocation, so that placement is ignored. Another placement is attempted, but the only node(s) left are the ones that do not satisfy the constraint. Nomad views this case (none of the attempted placements could be made successfully) as an error, and reports it as such. In reality, no allocations should be placed or updated in this case, and it should not be treated as an error.

This change uses the ignored and in-place updated placements from diffSystemAlloc to determine whether the case encountered is an error (no ignored or in-place updated placements means nothing is already running, which is an error) or not (an ignored placement means the task is already running on some node). It does this at the point where failedTGAlloc is populated, so placement functionality isn't changed, only the field that reports the error.

There is functionality that should be preserved which (correctly) notifies a user when a job's constraints filter out all available nodes, so it cannot run anywhere. This should still behave as expected, and an explicit test has been added for it.
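
To make the decision concrete, here is a minimal, self-contained sketch of the rule in Go. This is not the PR's actual code: placementResult and shouldReportFailure are illustrative stand-ins for the scheduler state that feeds failedTGAlloc.

package main

import "fmt"

// placementResult summarizes one task group's outcome from a scheduler
// pass. The struct and its fields are illustrative, not Nomad's actual
// internal types.
type placementResult struct {
    failedPlacements int // placements attempted but excluded by the constraint
    ignored          int // allocs already running on satisfying nodes, left alone
    inPlaceUpdated   int // allocs updated in place on satisfying nodes
}

// shouldReportFailure captures the decision described above: a failed
// placement is only an error when nothing for the task group is already
// running, i.e. there are no ignored or in-place updated allocations.
func shouldReportFailure(r placementResult) bool {
    if r.failedPlacements == 0 {
        return false // nothing failed, nothing to report
    }
    return r.ignored == 0 && r.inPlaceUpdated == 0
}

func main() {
    // Second run of the example job below: one alloc already on
    // nomad-client01 (ignored), two nodes excluded by the constraint.
    fmt.Println(shouldReportFailure(placementResult{failedPlacements: 2, ignored: 1})) // false

    // A job whose constraint excludes every node: nothing is running and
    // every placement failed, so the warning must still be surfaced.
    fmt.Println(shouldReportFailure(placementResult{failedPlacements: 3})) // true
}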

Testing & Reproduction steps

Define a system jobspec with a constraint matching one node in the node pool, and run it. Once an allocation is running on the node that satisfies the constraint, run (or plan) the job again. In the example below, there are 3 nodes and the constraint on the job is defined as:

constraint {
    attribute = "${attr.unique.hostname}"
    operator  = "="
    value     = "nomad-client01"
}

Previous behavior (on second run):

$ nomad job run job.nomad.hcl
==> 2025-05-14T11:04:05-07:00: Monitoring evaluation "da38faeb"
    2025-05-14T11:04:05-07:00: Evaluation triggered by job "example"
    2025-05-14T11:04:06-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-05-14T11:04:06-07:00: Evaluation "da38faeb" finished with status "complete" but failed to place all allocations:
    2025-05-14T11:04:06-07:00: Task Group "cache" (failed to place 1 allocation):
      * Constraint "${attr.unique.hostname} = nomad-client01": 2 nodes excluded by filter

Nomad reports a failure to place an allocation due to the constraint filtering out the remaining nodes.

New behavior (on second run):

$ nomad job run job.nomad.hcl
==> 2025-05-14T11:08:27-07:00: Monitoring evaluation "446123ac"
    2025-05-14T11:08:27-07:00: Evaluation triggered by job "example"
    2025-05-14T11:08:28-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-05-14T11:08:28-07:00: Evaluation "446123ac" finished with status "complete"

Links

Fixes #12748 #12016 #19413 #12366

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

When running `nomad job run <JOB>` multiple times with constraints
defined, there should be no error as a result of filtering out nodes
that do not, and never have, satisfied the constraints.

allisonlarson added the backport/1.10.x (backport to 1.10.x release line) label May 14, 2025
allisonlarson added the backport/ent/1.8.x+ent (Changes are backported to 1.8.x+ent) and backport/ent/1.9.x+ent (Changes are backported to 1.9.x+ent) labels May 14, 2025
allisonlarson marked this pull request as ready for review May 14, 2025 22:53
allisonlarson requested review from a team as code owners May 14, 2025 22:53
pkazmierczak previously approved these changes May 15, 2025
pkazmierczak left a comment:

LGTM!

tgross left a comment:

Looks great! I've left a few small comments, but once those are resolved / dismissed we should be good-to-go here.

Comment on lines -1440 to -1443
// Ensure `groupA` fails to be placed due to its constraint, but `groupB` doesn't
require.Len(t, h.Evals[2].FailedTGAllocs, 1)
require.Contains(t, h.Evals[2].FailedTGAllocs, "groupA")
require.NotContains(t, h.Evals[2].FailedTGAllocs, "groupB")
tgross:

If we're only suppressing the error in the case where a specific task group has an alloc, shouldn't these assertions and the ones in scheduler/scheduler_sysbatch_test.go still work? Or am I misunderstanding why we're removing these? (totally a possibility! 😁 )

allisonlarson (author):

I think these assertions are a bit of a misdirect from what is actually being tested. The test is checking that a node can be added to an existing node pool where allocations are already running, that the node is correctly evaluated against the defined constraints, and that the new node only gets allocs that match the constraint. Since the allocs are already running in this case, the new behavior says none of them should be marked as failed.

There's an assertion later in the test that the allocations are only running on the nodes they are expected to run on, which covers the desired behavior.

tgross:

Ok, sounds good! 👍


// Test that the system scheduler can handle a job with a constraint on
// subsequent runs, and report the outcome appropriately
func TestSystemSched_JobConstraint_RunMultipleTimes(t *testing.T) {
tgross:

This is a great test!

Co-authored-by: Piotr Kazmierczak <[email protected]>
tgross left a comment:

LGTM!
