⚠ Support shutdown controllers and watches dynamically #2099

FillZpp · 2022-12-14T03:42:15Z

Signed-off-by: FillZpp [email protected]

Support shutting down controllers and watches dynamically.

API changes:

An optional ControllerCtx is added to controller.Options to let developers stop a specific controller and its watches.
A Stop() method is added to some stoppable sources, e.g., Kind, KindWithCache, Informer, to let developers stop only a specific watch.
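
As a rough illustration of "some stoppable sources", the sketch below models Stop() as an optional capability that a caller discovers with a type assertion. All names here (source, stopper, kindSource, funcSource, stopIfPossible) are hypothetical and exist only for this example; they are not the controller-runtime types or the code in this PR.

```go
package main

import "fmt"

// source and stopper are hypothetical interfaces for this sketch only.
type source interface{ Start() error }
type stopper interface{ Stop() error }

// kindSource stands in for a source that can be stopped.
type kindSource struct{ stopped bool }

func (k *kindSource) Start() error { return nil }
func (k *kindSource) Stop() error  { k.stopped = true; return nil }

// funcSource stands in for a source that has no way to stop.
type funcSource func() error

func (f funcSource) Start() error { return f() }

// stopIfPossible stops src when it implements stopper and reports whether it did.
func stopIfPossible(src source) (bool, error) {
	s, ok := src.(stopper)
	if !ok {
		return false, nil
	}
	return true, s.Stop()
}

func main() {
	k := &kindSource{}
	f := funcSource(func() error { return nil })
	for _, src := range []source{k, f} {
		ok, err := stopIfPossible(src)
		fmt.Println(ok, err)
	}
}
```

The type assertion keeps the base interface unchanged, so sources that cannot stop need no extra method.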

k8s-ci-robot · 2022-12-14T03:42:17Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2022-12-14T03:42:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FillZpp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [FillZpp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

FillZpp · 2022-12-14T04:10:38Z

K8s 1.26 has been released, and it adds support for removing event handlers (kubernetes/kubernetes#111122), so c-r had better support this in the next v0.14 release.

I'm still adding tests. One thing I'm not sure about is whether we should add a ControllerCtx to controller.Options or add a Stop method to the Controller. Which one is the better way to let developers stop a running controller?

Any suggestions? @alvaroaleman @vincepri @joelanford

pkg/internal/controller/controller.go

alvaroaleman · 2022-12-18T20:29:59Z

pkg/source/source.go

+	is.mu.Lock()
+	defer is.mu.Unlock()
+	if is.canceled {
+		return nil


Shouldn't we error rather than silently doing nothing?

I'm not sure; maybe we have to guard against the source being stopped multiple times, by the controller or by the user?
For example, someone may create a controller with three watches, then stop one watch manually, and finally stop the whole controller. At that point, the controller will try to stop the already-stopped watch source and will get the error.
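
The scenario above can be sketched with a mutex-guarded flag, mirroring the `is.mu` / `is.canceled` lines in the diff. This is a minimal sketch under assumptions: stoppableSource, its removed counter, and stopAll are hypothetical stand-ins, and the real handler removal is replaced by a counter.

```go
package main

import (
	"fmt"
	"sync"
)

// stoppableSource is a hypothetical stand-in for a watch source; it is not
// the controller-runtime type.
type stoppableSource struct {
	mu       sync.Mutex
	canceled bool
	removed  int // counts how many times the handler was actually removed
}

// Stop removes the handler once; later calls are no-ops that return nil.
func (s *stoppableSource) Stop() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.canceled {
		return nil
	}
	s.canceled = true
	s.removed++ // placeholder for the real event-handler removal
	return nil
}

// stopAll mimics a controller shutting down every source it owns.
func stopAll(sources []*stoppableSource) error {
	for _, s := range sources {
		if err := s.Stop(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	sources := []*stoppableSource{{}, {}, {}}
	_ = sources[1].Stop()         // the user stops one watch manually
	fmt.Println(stopAll(sources)) // the controller stops everything afterwards
}
```

With an idempotent Stop, the later controller-wide shutdown does not fail on the watch that was already stopped by hand. Returning an error on the second call instead, as the reviewer suggests, would require the controller to tolerate that specific error.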

alvaroaleman · 2022-12-18T20:31:01Z

pkg/controller/controller.go

+
+	// ControllerCtx is the optional context for only this Controller. If it is set and been canceled,
+	// this controller and its watches will be stopped dynamically.
+	ControllerCtx context.Context


Why not add a Stop method instead?

Yeah, as I asked in #2099 (comment).
I have now changed ControllerCtx to a Stop method.

alvaroaleman · 2022-12-18T20:33:04Z

pkg/source/source.go

+	is.canceled = true
+
+	if is.eventHandlerRegistration != nil {
+		return is.Informer.RemoveEventHandler(is.eventHandlerRegistration)


None of this actually stops the informer, which I'd argue is the more important part. Is that a follow-up?

Admittedly not easy, we will likely need to do some kind of refcounting

Good question. It's hard to determine when we should stop the informer. The informers in the cache are created not only by source watches but also by users' Get and List calls to DelegatingClient, so we can't manage an informer's lifecycle purely by counting its active watches.

Thinking out loud here a few options:

Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?

Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?
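
The second option above can be sketched as plain bookkeeping. This is only an illustration under assumptions: informerRefs and its methods are hypothetical names, kinds are represented as strings, and no real informer is started or stopped.

```go
package main

import (
	"fmt"
	"sync"
)

// informerRefs is a hypothetical bookkeeping type; it does not start or stop
// real informers, it only decides when one could be removed.
type informerRefs struct {
	mu      sync.Mutex
	watches map[string]int  // active watch count per kind
	pinned  map[string]bool // kinds whose informer was first created by a get/list
}

func newInformerRefs() *informerRefs {
	return &informerRefs{watches: map[string]int{}, pinned: map[string]bool{}}
}

// has reports whether an informer exists for kind (caller holds mu).
func (r *informerRefs) has(kind string) bool {
	return r.watches[kind] > 0 || r.pinned[kind]
}

// AddWatch records one more watch on kind.
func (r *informerRefs) AddWatch(kind string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.watches[kind]++
}

// Read records a get/list; it pins the informer only if no watch created it first.
func (r *informerRefs) Read(kind string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.has(kind) {
		r.pinned[kind] = true
	}
}

// RemoveWatch drops one watch and reports whether the informer can now be removed.
func (r *informerRefs) RemoveWatch(kind string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.watches[kind] == 0 {
		return false
	}
	r.watches[kind]--
	return !r.has(kind)
}

func main() {
	r := newInformerRefs()
	r.AddWatch("Pod")
	r.Read("Pod") // informer already exists, so it is not pinned
	fmt.Println("Pod removable:", r.RemoveWatch("Pod"))

	r.Read("Secret") // the get/list creates the informer, so it is pinned
	r.AddWatch("Secret")
	fmt.Println("Secret removable:", r.RemoveWatch("Secret"))
}
```

In this sketch the Pod informer becomes removable while the Secret informer does not, purely because of the order in which the watch and the get/list happened.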

Hey! Just found this PR.

Gatekeeper is doing something similar, forking controller-runtime's informer cache to add a function that can remove informers. Would it be possible to have this available before solving the ref count issue? That would at least give users who are comfortable handling that complexity the ability to do so, but probably wouldn't catch unwary users.

Here is the key function we rely on:

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/third_party/sigs.k8s.io/controller-runtime/pkg/dynamiccache/informer_cache.go#L242-L251

We currently use it by maintaining a separate cache for dynamic watches:

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/watch/manager.go#L64-L91

That exports "registrars" to managing controllers to add or remove watches.

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/watch/registrar.go#L213-L276

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/controller/constrainttemplate/constrainttemplate_controller.go#L570-L582

Static watches (i.e. old-school controller.Watch() and watches initiated by client.Get()) use a separate cache.

Happy to talk more about this model, if interested, or help implementing parts if it means we can stop maintaining a fork!

This is great background! I might rebase and continue on the work on this PR (unless @FillZpp has time) to get it to the finish line before 0.15 is released. Feel free to reach out on slack if we want to chat more about it and brainstorm.

Happy to brainstorm! (sorry, it's been a busy week, so haven't reached out yet)

Sorry for the late reply (I was on vacation for the last two weeks). I'll continue working on this and get it into 0.15.

About the removal of informers, it seems there are several options:

1. Keep ref counts to remove informers automatically, as @joelanford suggested:

    Thinking out loud here a few options:

    1.1. Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?

    1.2. Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?

   I prefer this way, but I'm not sure whether it will confuse users. Most of them won't know when or why an informer has or has not been removed, and the order in which watch and get/list are called also affects whether the informer will be removed...

2. Expose a method to let users manually remove an informer, as @maxsmythe suggested, IIUC.

Maybe provide both 1 and 2, if both are needed.

WDYT @vincepri @alvaroaleman @joelanford @sbueringer

To be clear, I don't think (1) or (2) are in conflict: our real need is to have informer cache support a Remove() function, which is something a reference counter would need anyway. (2) just makes that behavior public.

Also, WRT reference counting, Gatekeeper has the additional nuance of using a different cache entirely for reference-count-style watch governance to avoid interference with watches established in more conventional ways (client.Get()) -- this seems similar to option (1.2).

In addition, because Gatekeeper uses generic controllers (we have one controller that listens for all constraint kinds), (2) would be a better fit for our model than governing watch livelihood at the controller granularity.

Because dynamic watches are probably best for generic-type controllers (if you have the type hard-coded, why the need for dynamic watches?), managing the watches directly may be a better fit than managing controllers. I think both are workable, but (2) would definitely be less of a reach for Gatekeeper to integrate.

inteon · 2023-01-27T14:47:20Z

pkg/controller/controller.go

+
+	// Stop stops the controller and all its watches dynamically.
+	// Note that it will only trigger the stop but will not wait for them all stopped.
+	Stop() error


@FillZpp Why do we add a Stop function here instead of canceling the context that was passed to Start?

Not all of the implementations of Source can stop, e.g., Func or some custom types.

Users have no way to cancel the context that the internal controller passes to a source, so they cannot stop just a single watch.
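
To make the context point concrete, here is a small sketch using only the standard library. It assumes a hypothetical controller that derives one child context per watch and hands the caller a handle for it; the watch and controller types and their methods are invented for illustration and are not this PR's implementation.

```go
package main

import (
	"context"
	"fmt"
)

// watch and controller are hypothetical types for this sketch only.
type watch struct {
	ctx    context.Context
	cancel context.CancelFunc
}

// Stop cancels only this watch's context.
func (w *watch) Stop() { w.cancel() }

// Stopped reports whether this watch's context has been canceled.
func (w *watch) Stopped() bool { return w.ctx.Err() != nil }

type controller struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func newController() *controller {
	ctx, cancel := context.WithCancel(context.Background())
	return &controller{ctx: ctx, cancel: cancel}
}

// Watch starts a watch whose context is a child of the controller's context.
func (c *controller) Watch() *watch {
	ctx, cancel := context.WithCancel(c.ctx)
	return &watch{ctx: ctx, cancel: cancel}
}

// Stop cancels the controller context and therefore every child watch.
func (c *controller) Stop() { c.cancel() }

func main() {
	c := newController()
	a, b := c.Watch(), c.Watch()
	a.Stop()
	fmt.Println(a.Stopped(), b.Stopped()) // only a is stopped
	c.Stop()
	fmt.Println(a.Stopped(), b.Stopped()) // both are stopped
}
```

Canceling the parent context stops every watch at once; stopping a single watch needs a per-watch cancel function, which the caller can only reach if the controller exposes some handle for it.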

Signed-off-by: FillZpp <[email protected]>

FillZpp · 2023-01-31T11:29:30Z

How about getting this PR merged first, and then I will open new PRs to support removal of informers (automatically and manually)? Otherwise it will become a huge PR with too many API changes to review.
/cc @vincepri @alvaroaleman

FillZpp · 2023-01-31T11:39:54Z

/retest

k8s-ci-robot · 2023-01-31T11:53:23Z

@FillZpp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-controller-runtime-test-master	`7728c9e`	link	true	`/test pull-controller-runtime-test-master`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

inteon · 2023-01-31T12:55:22Z

@FillZpp I created another PR (#2159) that tries to solve the same problem as your PR.
PTAL, feel free to accept the approach in that PR or to copy (some of) the code from that PR to this PR instead.

FillZpp · 2023-02-01T09:56:04Z

Thanks @inteon, I'll close this PR and help out on the new one.

/close

k8s-ci-robot · 2023-02-01T09:56:09Z

@FillZpp: Closed this PR.

In response to this:

Thanks @inteon, I'll close this PR and help out on the new one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 14, 2022

k8s-ci-robot requested review from joelanford and varshaprasad96 December 14, 2022 03:42

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2022

FillZpp mentioned this pull request Dec 14, 2022

Add and remove watches at runtime #1884

Closed

jnan806 mentioned this pull request Dec 14, 2022

How to remove watches at runtime | 怎样在运行过程中，移除对 k8s 中 CRD的watcher opensergo/opensergo-control-plane#23

Open

alvaroaleman reviewed Dec 18, 2022


k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 6, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2023

maxsmythe mentioned this pull request Jan 25, 2023

chore: Upgrade to k8s v0.26.1 and controller-runtime fork open-policy-agent/gatekeeper#2530

Merged

inteon reviewed Jan 27, 2023


Support shutdown controllers and watches dynamically

7728c9e

Signed-off-by: FillZpp <[email protected]>

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2023

FillZpp marked this pull request as ready for review January 31, 2023 11:22

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2023

k8s-ci-robot requested review from sbueringer and vincepri January 31, 2023 11:22

k8s-ci-robot requested a review from alvaroaleman January 31, 2023 11:29

inteon mentioned this pull request Jan 31, 2023

⚠ Support shutdown watches dynamically (v2) #2159

Closed

k8s-ci-robot closed this Feb 1, 2023

basti1302 mentioned this pull request Oct 14, 2024

Stop watches on controller stop #2983

Closed


FillZpp commented Dec 14, 2022

k8s-ci-robot commented Dec 14, 2022

k8s-ci-robot commented Dec 14, 2022

FillZpp commented Dec 14, 2022

alvaroaleman Dec 18, 2022

FillZpp Jan 13, 2023

alvaroaleman Dec 18, 2022

FillZpp Jan 13, 2023

alvaroaleman Dec 18, 2022

FillZpp Jan 13, 2023

joelanford Jan 18, 2023

maxsmythe Jan 25, 2023

vincepri Jan 25, 2023

maxsmythe Jan 28, 2023

FillZpp Jan 30, 2023 • edited

maxsmythe Jan 30, 2023 • edited

maxsmythe Jan 30, 2023 • edited

inteon Jan 27, 2023 • edited

FillZpp Jan 30, 2023

FillZpp commented Jan 31, 2023

FillZpp commented Jan 31, 2023

k8s-ci-robot commented Jan 31, 2023

inteon commented Jan 31, 2023 • edited

FillZpp commented Feb 1, 2023

k8s-ci-robot commented Feb 1, 2023
