✨ Add support for controller-manager webhook shutdown delay #2601

Open · wants to merge 11 commits into main
13 changes: 13 additions & 0 deletions pkg/webhook/server.go
@@ -101,6 +101,10 @@ type Options struct {

// WebhookMux is the multiplexer that handles different webhooks.
WebhookMux *http.ServeMux

// ShutdownDelay delays server shutdown to wait for clients to stop opening
// new connections before closing server listeners. Defaults to 0.
Member:

Open question:
Would it make sense to set a default greater than 0, e.g. 15s, which is less than the default 30-second graceful shutdown period?

Author (@dippynark), Dec 12, 2023:

It's a tricky one for sure, but because it's so specific to the Kubernetes cluster you're running on, it doesn't feel right to align with any particular cluster or cloud provider (e.g. I'm not sure what values EKS and AKS use).

IMO we should keep it at 0, both for backwards compatibility and so that connection draining consistently 'doesn't work' across clusters by default. That is also more likely to encourage applications that care a lot about this (e.g. Gatekeeper) to expose a flag, making it easier to reconfigure for the cluster you're running on (i.e. push the decision onto users of controller-runtime, since an optimal default isn't clear).

No problem with changing the default to 15 seconds if you were leaning towards that, though (that would be better for us anyway, since we're on GKE 😄). We could alternatively align with the kube-proxy default of 1 second, but on GKE at least that won't help much.

ShutdownDelay time.Duration
}
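For context, a downstream consumer would opt in through `webhook.Options`. A minimal, hypothetical sketch assuming this PR's `ShutdownDelay` field (not compilable against released controller-runtime; the 15s value is an arbitrary example, not a recommended default):

```go
import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

func newWebhookServer() webhook.Server {
	return webhook.NewServer(webhook.Options{
		Port: 9443,
		// Delay listener close so kube-proxy can stop routing new
		// connections to this endpoint first (hypothetical field from this PR).
		ShutdownDelay: 15 * time.Second,
	})
}
```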

// NewServer constructs a new webhook.Server from the provided options.
@@ -246,6 +250,15 @@ func (s *DefaultServer) Start(ctx context.Context) error {
idleConnsClosed := make(chan struct{})
go func() {
<-ctx.Done()

	// Disable HTTP keep-alives to close persistent connections after the
	// next HTTP request. Clients may reconnect until routes are updated and
	// the server listeners are closed; however, this starts gradually
	// migrating clients to server instances that are not about to shut down.
srv.SetKeepAlivesEnabled(false)
Comment on lines +254 to +258
Member:
This seems like a good addition.

// Wait before shutting down webhook server
time.Sleep(s.Options.ShutdownDelay)
Comment on lines +259 to +260
Member:

Why is this needed? Reading the intent of the PR, there is a context with a timeout that will be given to shutdown which allows those connections to be closed, was that not enough?

Author (@dippynark), Jul 4, 2025:

Although srv.Shutdown(ctx) allows active connections to finish, it closes listeners straight away, before kube-proxy has had a chance to remove the iptables rules on all cluster nodes that are routing new connections. Adding a sleep here gives kube-proxy time to sync ready endpoints, so it stops routing new connections to listeners that are about to close.

This comment speaks more about how kube-proxy iptables are involved in webhook connectivity: #2601 (comment)


log.Info("Shutting down webhook server with timeout of 1 minute")

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)