UMBRELLA: design and refactor graceful termination #764

Closed
tapih opened this issue Jan 20, 2020 · 16 comments · Fixed by #967

tapih commented Jan 20, 2020

To shut down a controller gracefully, its container should be terminated only after its endpoints have been deleted.

If time.Sleep(someSeconds) is added between these lines, applications built with controller-runtime could wait for endpoint deletion by default.
What do you think?


alexeldeib commented Jan 20, 2020

See some of the previous discussions about graceful termination.

I think a better approach than sleeping (one that's been discussed a few times) is to properly wire up all the contexts (well, currently stop channels) and use either a wait group (for manager shutdown) or some sort of map + lock + shutdown mechanism (for manager shutdown plus dynamically (un)loading controllers). A minimal sketch of the wait-group idea follows below.

There's a tentative desire to replace all the stop channels with a context plumbed all the way through -- graceful manager termination could plausibly be done in tandem with that change (or not, but it seems natural).
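
A minimal sketch of that wait-group mechanism; all names here (gracefulManager, Add, Stop) are illustrative, not controller-runtime API:

```go
package manager

import "sync"

// gracefulManager is a sketch of the wait-group idea: every runnable is
// tracked, and Stop blocks until all of them have returned.
type gracefulManager struct {
	wg     sync.WaitGroup
	stopCh chan struct{}
}

func newGracefulManager() *gracefulManager {
	return &gracefulManager{stopCh: make(chan struct{})}
}

// Add launches a runnable and registers it with the wait group.
func (m *gracefulManager) Add(run func(stop <-chan struct{})) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		run(m.stopCh)
	}()
}

// Stop signals every runnable to exit, then blocks until they all have.
func (m *gracefulManager) Stop() {
	close(m.stopCh)
	m.wg.Wait()
}
```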


tapih commented Jan 21, 2020

Sorry for my poor explanation. What I meant to propose is that, by inserting a sleep just after the signal handler fires, we can omit /bin/sleep in the preStop hook of Kubernetes Pods.

When a Pod is shutting down, pod termination and endpoint deletion run concurrently.
The problem is that some requests cannot be served if the pod termination finishes before the endpoint deletion does.
By "graceful shutdown" in the previous comment, I meant avoiding this situation by ensuring that the pod terminates only after the endpoint has been deleted.

One way to wait for endpoint deletion is to add the sleep command to preStop, but if the base image does not ship a sleep binary, this approach cannot be taken.
So I would like to add time.Sleep(someSeconds) to resolve this problem at the framework level, as in the sketch below.
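
A minimal sketch of that idea, modeled on controller-runtime's stop-channel-based signal handler of the time; SetupSignalHandlerWithDelay and its delay parameter are hypothetical, not existing API:

```go
package signals

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// SetupSignalHandlerWithDelay is a hypothetical variant of
// signals.SetupSignalHandler: on the first SIGTERM/SIGINT it waits for
// `delay` (giving endpoint deletion time to propagate) before closing
// the stop channel; a second signal exits immediately.
func SetupSignalHandlerWithDelay(delay time.Duration) <-chan struct{} {
	stop := make(chan struct{})
	c := make(chan os.Signal, 2)
	signal.Notify(c, os.Interrupt, syscall.SIGTERM)
	go func() {
		<-c
		time.Sleep(delay) // replaces /bin/sleep in the Pod's preStop hook
		close(stop)
		<-c
		os.Exit(1) // second signal received: exit directly
	}()
	return stop
}
```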

@vincepri (Member)

/kind design

We'd need a design document, in the form of a PR to the controller-runtime repository.

/help
/priority important-soon

@k8s-ci-robot (Contributor)

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind design

We'd need a design document, in the form of a PR to the controller-runtime repository.

/help
/priority important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 20, 2020

seh commented Feb 20, 2020

Presumably the duration to sleep there would need to be configurable, with zero meaning not to sleep at all. But sleeping like this is a brittle cover for the lack of proper coordination. The chosen duration always turns out to be wrong.

@alexeldeib (Contributor)

Based on the discussion today, I think ideally we'd see one design sweep out all of the related context issues:

  • dynamically adding/removing controllers
  • stopping the manager and gracefully waiting for termination
  • replacing stop channels with context everywhere (maybe?)

If the manager and all of its dependencies respect context behavior properly, the ask here becomes a matter of usage -- wire up your signal handler to wait for the manager to exit cleanly (see the sketch below).

I don't think the design will be anything crazy, but having poked at this a bit, I think it will take some time and thought to cleanly flesh out the corner cases.
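
A sketch of that eventual usage, assuming the context-based signatures this issue proposes (a SetupSignalHandler that returns a context, and a Start that blocks until every runnable has shut down):

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	// The signal handler cancels the returned context on SIGTERM/SIGINT.
	// Once Start honors that context, it blocks until every runnable has
	// shut down, so returning here means termination was graceful.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```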


negz commented Feb 20, 2020

It seems a small handful of issues around stopping managers and controllers (including #730, which I am fairly invested in) just got deduplicated into this one. Could we consider renaming this issue to reflect that scope? Perhaps a new issue is warranted to track all of this? The title (and to some extent the content) of this issue doesn't immediately reflect the scope it seems to have taken on.

@vincepri vincepri changed the title Sleep before stop channel close to make controller terminate gracefully UMBRELLA: design and refactor graceful termination Feb 20, 2020
@vincepri (Member)

@negz fair point! I've retitled this issue; let me know if you have further feedback.

@answer1991 (Contributor)

Consider my solution in PR #805, which follows the approach used by kube-controller-manager.

@answer1991 (Contributor)

The controller program's exit sequence might include these steps (sketched below):

  1. close the controller stop channel to notify the controller and its workers to exit
  2. wait for the controller's queue to be shut down, which means no new reconcile action will be invoked by the controller's workers; even better, also wait for all workers to exit
  3. close the lease stop channel to notify the lease to quit leading
  4. wait for leading to stop
  5. exit the program
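
A minimal sketch of that sequence, assuming illustrative field names (the struct here is not controller-runtime's actual type):

```go
package controller

import (
	"os"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// Controller's fields are illustrative only.
type Controller struct {
	queue       workqueue.RateLimitingInterface
	workerWG    sync.WaitGroup
	stopCh      chan struct{} // signals the controller and its workers
	leaseStopCh chan struct{} // signals leader election
	leaseDone   chan struct{} // closed once leadership has been released
}

func (c *Controller) gracefulExit() {
	close(c.stopCh)      // 1. notify the controller and workers to exit
	c.queue.ShutDown()   // 2. shut down the queue: no new reconciles start...
	c.workerWG.Wait()    //    ...and wait for in-flight workers to finish
	close(c.leaseStopCh) // 3. notify leader election to quit leading
	<-c.leaseDone        // 4. wait until leadership has been released
	os.Exit(0)           // 5. exit the program
}
```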

sophieliu15 added a commit to sophieliu15/multi-tenancy that referenced this issue Mar 9, 2020

…is removed from the HNCConfiguration Spec

If a type is removed from the HNCConfiguration Spec, we will set the corresponding object reconciler to "ignore" mode.

Ideally, we should shut down the corresponding object reconciler. Gracefully terminating an object reconciler is still under development (kubernetes-sigs/controller-runtime#764). Once the feature is released, we will see if we can shut down the object reconciler instead of setting it to "ignore" mode.

negz commented Mar 14, 2020

There's a tentative desire to replace all the stop channels with context plumbed all the way through -- graceful manager termination could plausibly be done in tandem with that change (or not, but seems natural).

Is it common to use contexts for this purpose, given that contexts are intended to be "request scoped"? It feels like using a context to replace a stop channel could be a misuse, depending on how you interpret a "request".


negz commented Mar 14, 2020

Is it common to use contexts for this purpose

Apparently it is - kubernetes/kubernetes#57932 is an example of leader election being migrated from stop channels to contexts.
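
For reference, the pattern that migration landed on looks roughly like this; the lock setup is assumed to happen elsewhere, and the timing values are illustrative:

```go
package election

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection shows the context-based pattern from
// kubernetes/kubernetes#57932: cancelling ctx takes the place of
// closing a stop channel.
func runLeaderElection(ctx context.Context, lock resourcelock.Interface) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// controllers would run here until ctx is cancelled
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// leadership lost or ctx cancelled: shut down cleanly
			},
		},
	})
}
```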


alexeldeib commented Apr 16, 2020

Making incremental notes on the context changes required (sketched below):

  • source should accept context -> ⚠ make Start on Source interface cancellable #903
  • cache/multinamespace cache start and wait for sync should accept context
  • webhooks should accept context
  • dynamic rest mapper should wrap the underlying requests in a context to be non-blocking
  • runnables should accept a context
  • manager start should accept a context
  • controllers' Start should accept a context
  • informers map should accept a context
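
A sketch of the shape those changes point toward; the exact signatures are illustrative, not the final controller-runtime API:

```go
package sketch

import (
	"context"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// Runnable sketches the context-plumbed contract: Start blocks until
// ctx is cancelled and returns only after the component has finished
// shutting down. Managers, controllers, caches, and webhook servers
// would all satisfy this.
type Runnable interface {
	Start(ctx context.Context) error
}

// Source sketches the same change applied to event sources (see #903):
// the source stops delivering events once ctx is done.
type Source interface {
	Start(ctx context.Context, h handler.EventHandler, q workqueue.RateLimitingInterface) error
}
```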

@alvaroaleman (Member)

@alexeldeib just for my understanding, maybe I am missing something: how does graceful termination depend on having plumbed context through everywhere? Graceful termination just means that we have to redesign some interfaces to allow Runnables to signal that they are done shutting down, and then wait for that or for a timeout, doesn't it?

@alexeldeib (Contributor)

Yeah, I think this issue arguably re-conflated two things:

  • our usage of stop channels is very inconsistent and messy, and it'd be preferable to cleanly wire context
  • there's no way to do graceful shutdown properly

@alexeldeib (Contributor)

One more item for the notes above:

  • things that should be non-blocking are actually non-blocking
