Update the scaling policies when deregistering a job #25911
Conversation
I've left some comments about behavior changes we've introduced in code that didn't seem necessary to touch here. But also, what happens to the scaling policies of a stopped job if we start it again? Won't they still be disabled?
nomad/fsm.go
Outdated
// If it is periodic remove it from the dispatcher
if err := n.periodicDispatcher.Remove(namespace, jobID); err != nil {
	return fmt.Errorf("periodicDispatcher.Remove failed: %w", err)
}
We've made a subtle reordering here which changes the behavior slightly in two ways:
- We should probably try to remove the job from the periodic dispatcher even if current == nil. Just because it's not in the state store doesn't mean we can guarantee it's been removed from the periodic dispatcher (it should be, but that operation isn't happening in memdb so it's a lot harder to guarantee).
- Checking for nil and returning early means we have to guarantee we've deleted all allocations for a job in the same Raft entry in which the job is deleted. Are we sure that always happens?
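A minimal sketch of the ordering the first bullet describes, with the dispatcher cleanup moved ahead of the existence check. The dispatcher and store types and the JobByID signature below are hypothetical stand-ins, not the real FSM fields:

package fsm

import "fmt"

// Hypothetical collaborators; the real FSM uses its own dispatcher and
// state-store types.
type periodicDispatcher interface {
	Remove(namespace, jobID string) error
}

type job struct{}

type jobStore interface {
	JobByID(namespace, jobID string) (*job, error)
}

// deregisterSketch removes the job from the periodic dispatcher before the
// existence check, so an early return for a job that is already gone from
// the state store cannot leave an orphaned periodic entry behind.
func deregisterSketch(pd periodicDispatcher, store jobStore, namespace, jobID string) error {
	if err := pd.Remove(namespace, jobID); err != nil {
		return fmt.Errorf("periodicDispatcher.Remove failed: %w", err)
	}

	current, err := store.JobByID(namespace, jobID)
	if err != nil {
		return fmt.Errorf("looking up job %q: %w", jobID, err)
	}
	if current == nil {
		// Already deleted from the state store; the dispatcher has been
		// cleaned up above, so returning here strands nothing.
		return nil
	}

	// ... continue with the rest of the deregister handling ...
	return nil
}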
I changed it back, but I find these "hidden" side effects error-prone. How can we make it more explicit? Shouldn't we check somewhere that when a job is deleted it is also deleted from the periodic dispatcher, so there won't be any "orphans"? It is very odd that we do a whole bunch of things before even checking that the job exists.
> Shouldn't we check somewhere that when a job is deleted it is also deleted from the periodic dispatcher, so there won't be any "orphans"?

Well, that's exactly what we're doing here, right? This FSM function is what gets run when the job is deleted (actually, deregistered, so we're removing it on job stop and not just job stop -purge). Which I guess means that my first bullet point doesn't make a ton of sense in terms of the ordering of periodicDispatcher.Remove.

But I think the second bullet point still applies for setting the desired transitions. If you call job stop -purge and there's no shutdown delay, we're marking a desired transition for the allocs and then deleting the job. Which shows we definitely will have allocs that aren't being deleted atomically with the job.
> but I find these "hidden" side effects error-prone, how can we make it more explicit?

Totally agreed! Unfortunately there is state on the leader, like the periodic dispatcher and the eval broker, that isn't in the state store, but we need to ensure it gets set as part of the FSM apply so that we can be sure it's run when a new leader takes over (the alternative is to have something like the deployment watcher / drainer / volume watcher that polls state, but those are quite resource intensive).

What I'd like to see us do, where possible, is to avoid calling memdb methods directly in the FSM and instead push all that logic into the nomad/state package. That at least avoids splitting the logic up (except for the leader brokers, which we can't avoid), makes it more testable, and probably avoids some errors in terms of transactions. For example, right now applyDeregisterJob starts a transaction but doesn't upsert the evals inside that transaction, and applyUpsertJob has a similar problem with evals and deployments.
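As a rough illustration of that shape, a sketch where the apply function would delegate to a single state-store method and both writes happen inside one transaction. All type and method names below are hypothetical, not the real nomad/state API:

package state

import "fmt"

// Hypothetical transaction and value types; the real state store is built on
// go-memdb with its own schema and helpers.
type Evaluation struct{ ID string }

type txn interface {
	DeleteJob(namespace, jobID string) error
	UpsertEvals(evals []*Evaluation) error
	Commit() error
	Abort()
}

type StateStore struct {
	beginTxn func() txn
}

// DeregisterJobAndEvals keeps the job deletion and the eval upsert inside the
// same transaction, so the FSM's deregister apply only has to call one
// state-store method instead of mixing memdb writes with its own logic.
func (s *StateStore) DeregisterJobAndEvals(namespace, jobID string, evals []*Evaluation) error {
	t := s.beginTxn()

	if err := t.DeleteJob(namespace, jobID); err != nil {
		t.Abort()
		return fmt.Errorf("deleting job: %w", err)
	}
	if err := t.UpsertEvals(evals); err != nil {
		t.Abort()
		return fmt.Errorf("upserting evals: %w", err)
	}
	return t.Commit()
}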
Force-pushed from 9489559 to 4a7f23b (…icy after job start)
command/job_start.go
Outdated
case "json": | ||
err = json.Unmarshal([]byte(sub.Source), &job) | ||
if err != nil { | ||
return nil, fmt.Errorf("command: unable to parce job submission: %w", err) |
return nil, fmt.Errorf("command: unable to parce job submission: %w", err) | |
return nil, fmt.Errorf("Unable to parse job submission to re-enable scaling policies: %w", err) |
Co-authored-by: Tim Gross <[email protected]>
LGTM!
Description
Currently, when a job is stopped its scaling policies are not updated and remain enabled; as a side effect, the autoscaler keeps monitoring them as if they were active. This PR updates job deregistration to disable the scaling policies when a job is deregistered.
To start the job again, the user needs to either resubmit the job, which sets the policy to whatever state is in the job spec, or use the command nomad job start, in which case the latest submitted spec of the job is used to set the policy. Given that scaling policies can't be modified via the CLI or API, this should be the most recent state before the job was stopped.
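Illustratively, the effect on the policies might be sketched as below; the types and the spec lookup here are simplified stand-ins, not the actual Nomad structs or state-store code:

package scaling

// ScalingPolicy is a simplified stand-in for Nomad's scaling policy struct;
// only the field relevant to this PR is shown.
type ScalingPolicy struct {
	ID      string
	Enabled bool
}

// disableOnDeregister mirrors the new behavior: when the job is stopped
// (deregistered without purge), all of its scaling policies are marked
// disabled so the autoscaler stops acting on them.
func disableOnDeregister(policies []*ScalingPolicy) {
	for _, p := range policies {
		p.Enabled = false
	}
}

// restoreFromSpec mirrors what happens on resubmit or nomad job start: each
// policy's Enabled flag is set back to whatever the latest submitted job
// spec declares for it.
func restoreFromSpec(policies []*ScalingPolicy, enabledInSpec map[string]bool) {
	for _, p := range policies {
		if enabled, ok := enabledInSpec[p.ID]; ok {
			p.Enabled = enabled
		}
	}
}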
Testing & Reproduction steps
Links
Contributor Checklist
- If this PR changes user-facing behavior, add a changelog entry using the make cl command.
- Add tests to ensure regressions will be caught.
- If the change impacts user-facing functionality such as the CLI, API, UI, and job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- Add the correct backport labels as described by the backporting document.
- Ensure the correct merge method is selected, which should be "squash and merge" in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- If this is an enterprise-only change, add any required changelog entry within the public repository.