⚠ Support shutdown controllers and watches dynamically #2099
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: FillZpp
K8s 1.26 has been released and adds removal of event handlers (kubernetes/kubernetes#111122), so controller-runtime should support this in the next v0.14 release. I'm still adding tests. One thing I'm not sure about is whether we should add a [...]. Any suggestions? @alvaroaleman @vincepri @joelanford
is.mu.Lock()
defer is.mu.Unlock()
if is.canceled {
	return nil
Shouldn't we error rather than silently doing nothing?
I'm not sure. Maybe we have to tolerate the source being stopped multiple times, by the controller or by the user?
For example, someone may create a controller with three watches, stop one watch manually, and finally stop the whole controller. At that point the controller will try to stop the already-stopped watch source, and it will get the error.
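To make the trade-off in this thread concrete, here is a minimal Go sketch of the two behaviors being discussed: a `Stop` that silently tolerates repeated calls, and a strict variant that reports them. All names (`stoppableSource`, `deregister`, `StopStrict`, `ErrAlreadyStopped`) are hypothetical and are not part of controller-runtime; the `deregister` callback merely stands in for removing the event handler from the informer.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAlreadyStopped is a hypothetical sentinel error used only in this sketch.
var ErrAlreadyStopped = errors.New("source already stopped")

// stoppableSource is a hypothetical stand-in for a watch source.
type stoppableSource struct {
	mu       sync.Mutex
	canceled bool
	// deregister stands in for removing the event handler from the informer.
	deregister func() error
}

// Stop is idempotent: only the first call deregisters, later calls return nil.
func (s *stoppableSource) Stop() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.canceled {
		return nil
	}
	s.canceled = true
	if s.deregister != nil {
		return s.deregister()
	}
	return nil
}

// StopStrict behaves like Stop but returns ErrAlreadyStopped on repeated calls.
func (s *stoppableSource) StopStrict() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.canceled {
		return ErrAlreadyStopped
	}
	s.canceled = true
	if s.deregister != nil {
		return s.deregister()
	}
	return nil
}

func main() {
	calls := 0
	src := &stoppableSource{deregister: func() error { calls++; return nil }}
	_ = src.Stop()    // the user stops one watch manually
	err := src.Stop() // the controller later stops all of its watches
	fmt.Println("deregister calls:", calls, "second Stop error:", err)

	strict := &stoppableSource{}
	_ = strict.StopStrict()
	fmt.Println("strict second stop:", strict.StopStrict())
}
```

With the strict variant, the controller-wide shutdown in the three-watch example would have to recognize and ignore `ErrAlreadyStopped` itself, which is the extra burden the reply above is worried about.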
pkg/controller/controller.go
// ControllerCtx is the optional context for only this Controller. If it is set and then canceled,
// this controller and its watches will be stopped dynamically.
ControllerCtx context.Context
Why not add a Stop method instead?
Yeah, as I asked in #2099 (comment).
Now I have changed the ControllerCtx to a Stop method.
is.canceled = true

if is.eventHandlerRegistration != nil {
	return is.Informer.RemoveEventHandler(is.eventHandlerRegistration)
None of this actually stops the informer, which I'd argue is the more important part. Is that a follow-up?
Admittedly not easy; we will likely need to do some kind of refcounting.
Good question. It's hard to determine when we should stop the informer. The informers in the cache are created not only by source watches but also by users' Get and List calls to the DelegatingClient. So we can't manage the lifecycle of an informer based only on a reference count of its active watches.
Thinking out loud here a few options:
- Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?
- Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?
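As a rough illustration of the second option, the Go sketch below reference-counts watches and "pins" any informer that was first created by a get/list so that it is never removed. Everything here (`registry`, `informerEntry`, `AddWatch`, `Get`, `RemoveWatch`, string keys for kinds) is hypothetical bookkeeping invented for this example; it does not reflect controller-runtime's actual cache types and is not safe for concurrent use.

```go
package main

import "fmt"

// informerEntry is a hypothetical bookkeeping record for one informer.
type informerEntry struct {
	watchers int  // number of active watches using this informer
	pinned   bool // true if a get/list created the informer; it is then never removed
}

// registry tracks informers by kind name (a plain string for brevity).
// It is not concurrency-safe; a real implementation would need locking.
type registry struct {
	entries map[string]*informerEntry
}

func newRegistry() *registry {
	return &registry{entries: map[string]*informerEntry{}}
}

// AddWatch records a new watch, creating the informer entry if needed.
func (r *registry) AddWatch(kind string) {
	e, ok := r.entries[kind]
	if !ok {
		e = &informerEntry{}
		r.entries[kind] = e
	}
	e.watchers++
}

// Get records a get/list call. If no informer exists yet, the new one is pinned.
func (r *registry) Get(kind string) {
	if _, ok := r.entries[kind]; !ok {
		r.entries[kind] = &informerEntry{pinned: true}
	}
}

// RemoveWatch drops one watch and reports whether the informer was removed.
func (r *registry) RemoveWatch(kind string) bool {
	e, ok := r.entries[kind]
	if !ok || e.watchers == 0 {
		return false
	}
	e.watchers--
	if e.watchers == 0 && !e.pinned {
		delete(r.entries, kind)
		return true
	}
	return false
}

func main() {
	r := newRegistry()

	// Watch first, then get: the watch owns the informer, so it can be removed.
	r.AddWatch("Pod")
	r.Get("Pod")
	fmt.Println("Pod informer removed:", r.RemoveWatch("Pod"))

	// Get first, then watch: the informer is pinned and survives.
	r.Get("ConfigMap")
	r.AddWatch("ConfigMap")
	fmt.Println("ConfigMap informer removed:", r.RemoveWatch("ConfigMap"))
}
```

Note that the outcome depends on whether the watch or the get/list happened first, which is one way such a scheme could surprise users.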
Hey! Just found this PR.
Gatekeeper is doing something similar, forking controller-runtime's informer cache to add a function that can remove informers. Would it be possible to have this available before solving the ref count issue? That would at least give users who are comfortable handling that complexity the ability to do so, but probably wouldn't catch unwary users.
Here is the key function we rely on:
We currently use it by maintaining a separate cache for dynamic watches:
That exports "registrars" to managing controllers to add or remove watches.
Static watches (i.e. old-school controller.Watch() and watches initiated by client.Get()) use a separate cache.
Happy to talk more about this model, if interested, or help implementing parts if it means we can stop maintaining a fork!
This is great background! I might rebase and continue the work on this PR (unless @FillZpp has time) to get it to the finish line before 0.15 is released. Feel free to reach out on Slack if you want to chat more about it and brainstorm.
Happy to brainstorm! (sorry, it's been a busy week, so haven't reached out yet)
Sorry for the late reply (I was on vacation for the last two weeks). I'll continue on this and get it into 0.15.
About removing informers, there seem to be several options:
1. Keep ref counts to remove informers automatically, as @joelanford suggested:
   Thinking out loud here a few options:
   - Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?
   - Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?
   I prefer this way, but I'm not sure whether it will confuse users. Most of them won't know when and why an informer is or isn't removed, and the order in which watches and gets/lists happen also affects whether the informer will be removed.
2. Expose a method to let users manually remove an informer, as @maxsmythe suggested, IIUC.
3. Maybe provide both 1 and 2, if both turn out to be needed.
To be clear, I don't think (1) or (2) are in conflict: our real need is to have informer cache support a Remove() function, which is something a reference counter would need anyway. (2) just makes that behavior public.
Also, WRT reference counting, Gatekeeper has the additional nuance of using a different cache entirely for reference-count-style watch governance to avoid interference with watches established in more conventional ways (client.Get()) -- this seems similar to option (1.2).
In addition, because Gatekeeper uses generic controllers (we have one controller that listens for all constraint kinds), (2) would be a better fit for our model than governing watch lifetime at the controller granularity.
Because dynamic watches are probably best for generic-type controllers (if you have the type hard-coded, why the need for dynamic watches?), managing the watches directly may be a better fit than managing controllers. I think both are workable, but (2) would definitely be less of a reach for Gatekeeper to integrate.
// Stop stops the controller and all its watches dynamically.
// Note that it only triggers the stop and does not wait for them all to stop.
Stop() error
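The "trigger but do not wait" semantics in that doc comment can be sketched as follows. This is only an illustration under assumed names: `worker`, `startWorker`, and `Done` are hypothetical and do not appear in the PR; a caller that does need to wait would block on the `Done` channel after calling `Stop`.

```go
package main

import (
	"context"
	"fmt"
)

// worker is a hypothetical stand-in for a running controller.
type worker struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// startWorker launches a goroutine that runs until its context is canceled.
func startWorker() *worker {
	ctx, cancel := context.WithCancel(context.Background())
	w := &worker{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(w.done)
		<-ctx.Done() // stand-in for the controller's processing loop
	}()
	return w
}

// Stop only signals shutdown and returns immediately.
// Calling it more than once is safe because context cancellation is idempotent.
func (w *worker) Stop() error {
	w.cancel()
	return nil
}

// Done returns a channel that is closed once the worker goroutine has exited.
func (w *worker) Done() <-chan struct{} {
	return w.done
}

func main() {
	w := startWorker()
	_ = w.Stop() // returns without waiting for the goroutine
	<-w.Done()   // callers that need to wait can do so explicitly
	fmt.Println("worker exited")
}
```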
@FillZpp Why do we add a Stop function here instead of canceling the context that was passed to Start?
- Not all implementations of Source can stop, e.g., Func or some custom types.
- Users have no way to cancel the context that the controller passes internally to a source, so they cannot stop only a single watch.
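One common Go pattern for the first point is an optional interface that only stoppable sources implement, discovered with a type assertion. The sketch below is an assumption about how that could look, not the PR's actual code: `Source` is a simplified stand-in, and `Stoppable`, `kindSource`, `funcSource`, `stopSource`, and `ErrNotStoppable` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Source is a simplified, hypothetical stand-in for a watch source.
type Source interface {
	Name() string
}

// Stoppable is a hypothetical optional interface for sources that can be stopped.
type Stoppable interface {
	Stop() error
}

// ErrNotStoppable is a hypothetical error for sources without Stop support.
var ErrNotStoppable = errors.New("source does not support Stop")

// kindSource models a source that can be stopped.
type kindSource struct{ stopped bool }

func (k *kindSource) Name() string { return "kind" }
func (k *kindSource) Stop() error  { k.stopped = true; return nil }

// funcSource models a source (such as a custom function) that cannot be stopped.
type funcSource struct{}

func (funcSource) Name() string { return "func" }

// stopSource stops s if it implements Stoppable and reports ErrNotStoppable otherwise.
func stopSource(s Source) error {
	st, ok := s.(Stoppable)
	if !ok {
		return ErrNotStoppable
	}
	return st.Stop()
}

func main() {
	for _, s := range []Source{&kindSource{}, funcSource{}} {
		fmt.Printf("%s: %v\n", s.Name(), stopSource(s))
	}
}
```

Because each stoppable source exposes its own `Stop`, a user can stop a single watch without needing access to the context the controller passed in.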
Signed-off-by: FillZpp <[email protected]>
How about getting this PR merged first? I will then post new PRs to support removal of informers (automatically and manually). Otherwise it will become a huge PR with too many API changes to review.
/retest
Thanks @inteon, I'll close this PR and help out on the new one. /close
@FillZpp: Closed this PR.
Signed-off-by: FillZpp [email protected]
Support shutdown controllers and watches dynamically.
API changes:
- ControllerCtx added into controller.Options to let developers stop a specific controller and its watches.
- Stop() method added into some stoppable sources, e.g., Kind, KindWithCache, Informer, to let developers stop only a specific watch.
fixes #1884