UMBRELLA: design and refactor graceful termination #764

Closed
tapih opened this issue Jan 20, 2020 · 16 comments · Fixed by #967

tapih commented Jan 20, 2020

To shut down a controller gracefully, its container should be terminated only after its endpoints have been deleted.

If time.Sleep(someSeconds) is added between these lines, applications built with controller-runtime could wait for endpoint deletion by default.
What do you think?


alexeldeib commented Jan 20, 2020

See some of the previous discussions about graceful termination.

I think a better approach than sleeping (one that's been discussed a few times) is to properly wire up all the contexts (well, currently stop channels) and use either a wait group (for manager shutdown) or some sort of map + lock + shutdown mechanism (for manager shutdown plus dynamically (un)loading controllers). A minimal sketch of the wait-group idea follows below.

There's a tentative desire to replace all the stop channels with a context plumbed all the way through -- graceful manager termination could plausibly be done in tandem with that change (or not, but it seems natural).
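
A minimal sketch of that wait-group mechanism; all names here (gracefulManager, Add, Stop) are illustrative, not controller-runtime API:

```go
package manager

import "sync"

// gracefulManager is a sketch of the wait-group idea: every runnable is
// tracked, and Stop blocks until all of them have returned.
type gracefulManager struct {
	wg     sync.WaitGroup
	stopCh chan struct{}
}

func newGracefulManager() *gracefulManager {
	return &gracefulManager{stopCh: make(chan struct{})}
}

// Add launches a runnable and registers it with the wait group.
func (m *gracefulManager) Add(run func(stop <-chan struct{})) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		run(m.stopCh)
	}()
}

// Stop signals every runnable to exit, then blocks until they all have.
func (m *gracefulManager) Stop() {
	close(m.stopCh)
	m.wg.Wait()
}
```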


tapih commented Jan 21, 2020

Sorry for my poor explanation. What I meant to propose is that, by inserting a sleep just after the signal handler fires, we can omit /bin/sleep in the preStop hook of Kubernetes Pods.

When a Pod is shutting down, pod termination and endpoint deletion run concurrently.
The problem is that some requests cannot be served if the pod termination finishes before the endpoint deletion does.
By "graceful shutdown" in the previous comment, I meant avoiding this situation by ensuring that the pod terminates only after the endpoint has been deleted.

One way to wait for endpoint deletion is to add the sleep command to preStop, but if the base image does not ship a sleep binary, this approach cannot be taken.
So I would like to add time.Sleep(someSeconds) to resolve this problem at the framework level, as in the sketch below.
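
A minimal sketch of that idea, modeled on controller-runtime's stop-channel-based signal handler of the time; SetupSignalHandlerWithDelay and its delay parameter are hypothetical, not existing API:

```go
package signals

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// SetupSignalHandlerWithDelay is a hypothetical variant of
// signals.SetupSignalHandler: on the first SIGTERM/SIGINT it waits for
// `delay` (giving endpoint deletion time to propagate) before closing
// the stop channel; a second signal exits immediately.
func SetupSignalHandlerWithDelay(delay time.Duration) <-chan struct{} {
	stop := make(chan struct{})
	c := make(chan os.Signal, 2)
	signal.Notify(c, os.Interrupt, syscall.SIGTERM)
	go func() {
		<-c
		time.Sleep(delay) // replaces /bin/sleep in the Pod's preStop hook
		close(stop)
		<-c
		os.Exit(1) // second signal received: exit directly
	}()
	return stop
}
```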

@vincepri (Member)

/kind design

We'd need a design document, in the form of a PR to the controller-runtime repository.

/help
/priority important-soon

@k8s-ci-robot (Contributor)

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind design

We'd need a design document, in the form of a PR to the controller-runtime repository.

/help
/priority important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 20, 2020

seh commented Feb 20, 2020

Presumably the duration to sleep there would need to be configurable, with zero meaning not to sleep at all. But sleeping like this is a brittle cover for the lack of proper coordination. The chosen duration always turns out to be wrong.

@alexeldeib (Contributor)

Based on the discussion today, I think ideally we'd see one design sweep out all of the related context issues:

  • dynamically adding/removing controllers
  • stopping the manager and gracefully waiting for termination
  • replacing stop channels with context everywhere (maybe?)

If the manager and all of its dependencies respect context behavior properly, the ask here becomes a matter of usage -- wire up your signal handler to wait for the manager to exit cleanly (see the sketch below).

I don't think the design will be anything crazy, but having poked at this a bit, I think it will take some time and thought to cleanly flesh out the corner cases.
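
A sketch of that eventual usage, assuming the context-based signatures this issue proposes (a SetupSignalHandler that returns a context, and a Start that blocks until every runnable has shut down):

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	// The signal handler cancels the returned context on SIGTERM/SIGINT.
	// Once Start honors that context, it blocks until every runnable has
	// shut down, so returning here means termination was graceful.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```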


negz commented Feb 20, 2020

It seems a small handful of issues around stopping managers and controllers (including #730, which I am fairly invested in) just got deduplicated into this one. Could we consider renaming this issue to reflect that scope? Perhaps a new issue is warranted to track all of this? The title (and to some extent the content) of this issue doesn't immediately reflect the scope it seems to have taken on.

@vincepri vincepri changed the title Sleep before stop channel close to make controller terminate gracefully UMBRELLA: design and refactor graceful termination Feb 20, 2020
@vincepri (Member)

@negz fair point! I've retitled this issue; let me know if you have further feedback.

@answer1991 (Contributor)

Consider my solution in PR #805, which follows the approach used by kube-controller-manager.

@answer1991 (Contributor)

The controller program's exit sequence might include these steps (sketched below):

  1. close the controller stop channel to notify the controller and its workers to exit
  2. wait for the controller's queue to be shut down, which means no new reconcile action will be invoked by the controller's workers; even better, also wait for all workers to exit
  3. close the lease stop channel to notify the lease to quit leading
  4. wait for leading to stop
  5. exit the program
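
A minimal sketch of that sequence, assuming illustrative field names (the struct here is not controller-runtime's actual type):

```go
package controller

import (
	"os"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// Controller's fields are illustrative only.
type Controller struct {
	queue       workqueue.RateLimitingInterface
	workerWG    sync.WaitGroup
	stopCh      chan struct{} // signals the controller and its workers
	leaseStopCh chan struct{} // signals leader election
	leaseDone   chan struct{} // closed once leadership has been released
}

func (c *Controller) gracefulExit() {
	close(c.stopCh)      // 1. notify the controller and workers to exit
	c.queue.ShutDown()   // 2. shut down the queue: no new reconciles start...
	c.workerWG.Wait()    //    ...and wait for in-flight workers to finish
	close(c.leaseStopCh) // 3. notify leader election to quit leading
	<-c.leaseDone        // 4. wait until leadership has been released
	os.Exit(0)           // 5. exit the program
}
```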

sophieliu15 added a commit to sophieliu15/multi-tenancy that referenced this issue Mar 9, 2020

…is removed from the HNCConfiguration Spec

If a type is removed from the HNCConfiguration Spec, we will set the corresponding object reconciler to "ignore" mode.

Ideally, we should shut down the corresponding object reconciler. Gracefully terminating an object reconciler is still under development (kubernetes-sigs/controller-runtime#764). Once the feature is released, we will see if we can shut down the object reconciler instead of setting it to "ignore" mode.

negz commented Mar 14, 2020

There's a tentative desire to replace all the stop channels with context plumbed all the way through -- graceful manager termination could plausibly be done in tandem with that change (or not, but seems natural).

Is it common to use contexts for this purpose, given that contexts are intended to be "request scoped"? It feels like using a context to replace a stop channel could be a misuse, depending on how you interpret a "request".


negz commented Mar 14, 2020

Is it common to use contexts for this purpose

Apparently it is - kubernetes/kubernetes#57932 is an example of leader election being migrated from stop channels to contexts.
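
For reference, the pattern that migration landed on looks roughly like this; the lock setup is assumed to happen elsewhere, and the timing values are illustrative:

```go
package election

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection shows the context-based pattern from
// kubernetes/kubernetes#57932: cancelling ctx takes the place of
// closing a stop channel.
func runLeaderElection(ctx context.Context, lock resourcelock.Interface) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// controllers would run here until ctx is cancelled
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// leadership lost or ctx cancelled: shut down cleanly
			},
		},
	})
}
```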


alexeldeib commented Apr 16, 2020

Making incremental notes on the context changes required (sketched below):

  • source should accept context -> ⚠ make Start on Source interface cancellable #903
  • cache/multinamespace cache start and wait for sync should accept context
  • webhooks should accept context
  • dynamic rest mapper should wrap the underlying requests in a context to be non-blocking
  • runnables should accept a context
  • manager start should accept a context
  • controllers' Start should accept a context
  • informers map should accept a context
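
A sketch of the shape those changes point toward; the exact signatures are illustrative, not the final controller-runtime API:

```go
package sketch

import (
	"context"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// Runnable sketches the context-plumbed contract: Start blocks until
// ctx is cancelled and returns only after the component has finished
// shutting down. Managers, controllers, caches, and webhook servers
// would all satisfy this.
type Runnable interface {
	Start(ctx context.Context) error
}

// Source sketches the same change applied to event sources (see #903):
// the source stops delivering events once ctx is done.
type Source interface {
	Start(ctx context.Context, h handler.EventHandler, q workqueue.RateLimitingInterface) error
}
```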

@alvaroaleman (Member)

@alexeldeib just for my understanding, maybe I am missing something: how does graceful termination depend on having plumbed context through everywhere? Graceful termination just means that we have to redesign some interfaces to allow Runnables to signal that they are done shutting down, and then wait for that or for a timeout, doesn't it?

@alexeldeib (Contributor)

Yeah, I think this issue arguably re-conflated two things:

  • our usage of stop channels is very inconsistent and messy, and it'd be preferable to cleanly wire context
  • there's no way to do graceful shutdown properly

@alexeldeib (Contributor)

One more item for the notes above:

  • things that should be non-blocking are actually non-blocking
