
Conversation

@Juanadelacuesta Juanadelacuesta (Member) commented May 22, 2025

Description

Currently, when a job is stopped, its scaling policies are not updated and remain enabled; as a side effect, the autoscaler keeps monitoring them as if they were active. This PR updates the job deregister path so that scaling policies are set to disabled when a job is deregistered.
To start the job again, the user needs to either resubmit the job, which will set the policy to whatever state is in the job spec, or use the nomad job start command, in which case the latest submitted spec of the job will be used to set the policy. Given that scaling policies can't be modified via the CLI or API, this should be the most recent state before the job was stopped.
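
To make the described behavior concrete, here is a minimal, self-contained Go sketch; the ScalingPolicy struct and the disablePoliciesForJob helper are simplified illustrations of the idea, not Nomad's actual state-store types:

package main

import "fmt"

// ScalingPolicy is a simplified stand-in for Nomad's scaling policy type;
// the fields here are illustrative only.
type ScalingPolicy struct {
	ID      string
	JobID   string
	Enabled bool
}

// disablePoliciesForJob marks every policy that belongs to jobID as disabled,
// mirroring what this PR does during job deregistration so the autoscaler
// stops monitoring policies of a stopped job.
func disablePoliciesForJob(policies []*ScalingPolicy, jobID string) {
	for _, p := range policies {
		if p.JobID == jobID {
			p.Enabled = false
		}
	}
}

func main() {
	policies := []*ScalingPolicy{
		{ID: "p1", JobID: "web", Enabled: true},
		{ID: "p2", JobID: "batch", Enabled: true},
	}
	disablePoliciesForJob(policies, "web")
	for _, p := range policies {
		fmt.Printf("%s (job %s): enabled=%v\n", p.ID, p.JobID, p.Enabled)
	}
}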

Testing & Reproduction steps

Links

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@Juanadelacuesta Juanadelacuesta requested review from a team as code owners May 22, 2025 08:40
@Juanadelacuesta Juanadelacuesta marked this pull request as draft May 22, 2025 08:40
@Juanadelacuesta Juanadelacuesta marked this pull request as ready for review May 22, 2025 09:14
@Juanadelacuesta Juanadelacuesta added backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.9.x+ent Changes are backported to 1.9.x+ent backport/1.10.x backport to 1.10.x release line labels May 22, 2025
@tgross tgross (Member) left a comment

I've left some comments about some behavior changes we've introduced that don't seem necessary to touch here. But also, what happens to the scaling policies of a stopped job if we start it again? Won't they still be disabled?

nomad/fsm.go Outdated
Comment on lines 814 to 817
// If it is periodic remove it from the dispatcher
if err := n.periodicDispatcher.Remove(namespace, jobID); err != nil {
	return fmt.Errorf("periodicDispatcher.Remove failed: %w", err)
}
Member

We've made a subtle reordering here which changes the behavior slightly in two ways:

  • We should probably try to remove the job from the periodic dispatcher even if current == nil. Just because it's not in the state store doesn't mean we can guarantee it's been removed from the periodic dispatcher (it should be, but that operation isn't happening in memdb, so it's a lot harder to guarantee); see the sketch after this list.
  • Checking for nil and returning early means that we have to guarantee we've deleted all allocations for a job in the same Raft entry in which the job is deleted. Are we sure that always happens?
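
As a toy illustration of the first point (plain maps stand in for the state store and the periodic dispatcher; this is not the real FSM code):

package main

import "fmt"

// Toy stand-ins: "jobs" plays the state store, "dispatcher" the periodic
// dispatcher, which lives outside memdb and needs explicit cleanup.

// Reordered version (the concern): an early return when the job is already
// gone from the state store skips the periodic-dispatcher cleanup entirely.
func deregisterEarlyReturn(jobs, dispatcher map[string]bool, jobID string) {
	if !jobs[jobID] {
		return // a stale dispatcher entry, if any, is never removed
	}
	delete(dispatcher, jobID)
	delete(jobs, jobID)
}

// Original ordering: always attempt the dispatcher removal, whether or not
// the job is still in the state store.
func deregisterAlwaysRemove(jobs, dispatcher map[string]bool, jobID string) {
	delete(dispatcher, jobID)
	delete(jobs, jobID)
}

func main() {
	// The job is already gone from the "state store" but a dispatcher entry lingers.
	dispatcher := map[string]bool{"cron-job": true}

	deregisterEarlyReturn(map[string]bool{}, dispatcher, "cron-job")
	fmt.Println("early-return version leaves the stale entry:", dispatcher["cron-job"])

	deregisterAlwaysRemove(map[string]bool{}, dispatcher, "cron-job")
	fmt.Println("always-remove version cleans it up:", !dispatcher["cron-job"])
}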

@Juanadelacuesta Juanadelacuesta (Member, Author) commented May 23, 2025

I changed it back, but I find these "hidden" side effects error-prone; how can we make them more explicit? Shouldn't we check somewhere that when a job is deleted, it is also deleted from the periodic dispatcher so there won't be any "orphans"? It is very odd that we do a whole bunch of things before even checking that the job exists.

Member

Shouldn't we check somewhere that when a job is deleted, it is also deleted from the periodic dispatcher so there won't be any "orphans"?

Well, that's exactly what we're doing here, right? This FSM function is what gets run when the job is deleted (actually, deregistered, so that we're removing it on job stop and not just job stop -purge). Which I guess means that my first bullet point doesn't make a ton of sense in terms of the ordering of periodicDispatcher.Remove.

But I think the second bullet point still applies for setting the desired transitions. If you call job stop -purge and there's no shutdown delay, we're marking a desired transition for the allocs and then deleting the job. Which shows we definitely will have allocs that aren't being atomically deleted with the job.

but I find these "hidden" side effects error-prone; how can we make them more explicit?

Totally agreed! Unfortunately, there is state on the leader, like the periodic dispatcher and eval broker, that isn't in the state store, but we need to ensure it gets set as part of the FSM apply so that we can be sure it's run when a new leader takes over (the alternative is to have something like the deployment watcher / drainer / volume watcher that polls state, but those are quite resource-intensive).

What I'd like to see us do where possible is to avoid calling memdb methods directly in the FSM and instead push all that logic into the nomad/state package. That at least avoids splitting the logic up (except for the leader brokers, which we can't avoid), makes it more testable, and probably avoids some errors in terms of transactions. For example, right now applyDeregisterJob starts a transaction but doesn't upsert the evals inside that transaction. applyUpsertJob has a similar problem with evals and deployments.
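
As a self-contained sketch of keeping related writes in one transaction, here is a toy example using hashicorp/go-memdb directly; the Job and Eval types and the two-table schema are illustrative stand-ins, not Nomad's real state schema:

package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Toy stand-ins for the real Nomad structs; fields are illustrative only.
type Job struct{ ID string }
type Eval struct{ ID, JobID string }

func main() {
	// A tiny schema with a jobs table and an evals table, each indexed by ID.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"jobs": {
				Name: "jobs",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
				},
			},
			"evals": {
				Name: "evals",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Seed a job so there is something to deregister.
	seed := db.Txn(true)
	if err := seed.Insert("jobs", &Job{ID: "web"}); err != nil {
		panic(err)
	}
	seed.Commit()

	// Deregister: delete the job and upsert its eval inside one write
	// transaction, so readers never observe one change without the other.
	txn := db.Txn(true)
	defer txn.Abort() // no-op once Commit has run
	if err := txn.Delete("jobs", &Job{ID: "web"}); err != nil {
		panic(err)
	}
	if err := txn.Insert("evals", &Eval{ID: "eval-1", JobID: "web"}); err != nil {
		panic(err)
	}
	txn.Commit()

	fmt.Println("job deleted and eval upserted atomically")
}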

case "json":
err = json.Unmarshal([]byte(sub.Source), &job)
if err != nil {
return nil, fmt.Errorf("command: unable to parce job submission: %w", err)
Member

Suggested change
return nil, fmt.Errorf("command: unable to parce job submission: %w", err)
return nil, fmt.Errorf("Unable to parse job submission to re-enable scaling policies: %w", err)

Co-authored-by: Tim Gross <[email protected]>
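
For context, the code above re-parses the most recently submitted spec so that nomad job start can restore the scaling policy state. A minimal sketch of that shape, with hypothetical Submission and Job types standing in for the real ones (the HCL branch is omitted):

package main

import (
	"encoding/json"
	"fmt"
)

// Submission is a toy stand-in for the stored job submission; in practice the
// source format and raw source come back with the latest job version.
type Submission struct {
	Format string // e.g. "json" or "hcl2"
	Source string
}

// Job is a toy job spec carrying only the scaling state we care about here.
type Job struct {
	ID      string `json:"ID"`
	Scaling struct {
		Enabled bool `json:"Enabled"`
	} `json:"Scaling"`
}

// jobFromSubmission re-parses the most recently submitted spec so the scaling
// policy can be restored to the state it had before the job was stopped.
func jobFromSubmission(sub *Submission) (*Job, error) {
	var job Job
	switch sub.Format {
	case "json":
		if err := json.Unmarshal([]byte(sub.Source), &job); err != nil {
			return nil, fmt.Errorf("unable to parse job submission to re-enable scaling policies: %w", err)
		}
	default:
		// An HCL submission would go through the jobspec parser instead;
		// omitted in this sketch.
		return nil, fmt.Errorf("unsupported submission format %q", sub.Format)
	}
	return &job, nil
}

func main() {
	sub := &Submission{
		Format: "json",
		Source: `{"ID":"web","Scaling":{"Enabled":true}}`,
	}
	job, err := jobFromSubmission(sub)
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %s scaling enabled: %v\n", job.ID, job.Scaling.Enabled)
}
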
@tgross tgross (Member) left a comment

LGTM!

@Juanadelacuesta Juanadelacuesta merged commit bdfd573 into main Jun 2, 2025
36 checks passed
@Juanadelacuesta Juanadelacuesta deleted the NMD-410-autoscaler branch June 2, 2025 14:11