
feat: wait for pods to be deleted to report version ready #839


Draft

avorima wants to merge 1 commit into master

Conversation

@avorima (Contributor) commented Jun 11, 2025

I was looking through the 1.33 release notes and saw that KEP 3973 had been included in alpha state.
I was trying to find an easy way to get this to work in #718, as noted in my comment describing the conditions, but there weren't many options. The KEP looks to be the solution I was looking for.
For context: we observed that apiserver requests sometimes failed immediately after the TCP reported "Ready" following an update. There could be several reasons for this in our own setup (I'm actually thinking about taking a closer look at our load balancers), but given that the duration of the probe failures correlates strongly with how long terminating pods are still present, I think just waiting a bit longer until things settle down is a reasonable approach.
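To sketch the idea (not necessarily how this PR ends up implementing it): the readiness check would also require that no terminating replicas remain before reporting the version as ready. The snippet below assumes Kubernetes 1.33 with the alpha `DeploymentReplicaSetTerminatingReplicas` feature gate, which per KEP-3973 should populate `.status.terminatingReplicas`, plus a controller-runtime client; the package and function names are made up for illustration.

```go
// Hypothetical package and function names, for illustration only.
package tcpreadiness

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deploymentSettled reports whether the kube-apiserver Deployment has finished
// rolling out AND no terminating pods from the previous ReplicaSet are left.
func deploymentSettled(ctx context.Context, c client.Client, key client.ObjectKey) (bool, error) {
	var deploy appsv1.Deployment
	if err := c.Get(ctx, key, &deploy); err != nil {
		return false, fmt.Errorf("getting deployment: %w", err)
	}

	// Usual rollout checks: every replica updated and available.
	desired := int32(1)
	if deploy.Spec.Replicas != nil {
		desired = *deploy.Spec.Replicas
	}
	if deploy.Status.UpdatedReplicas < desired || deploy.Status.AvailableReplicas < desired {
		return false, nil
	}

	// KEP-3973 (alpha in 1.33, behind DeploymentReplicaSetTerminatingReplicas):
	// count of pods that are still terminating. A nil pointer means the feature
	// gate is off, in which case we fall back to the checks above.
	if tr := deploy.Status.TerminatingReplicas; tr != nil && *tr > 0 {
		return false, nil
	}
	return true, nil
}
```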


netlify bot commented Jun 11, 2025

Deploy Preview for kamaji-documentation canceled.

🔨 Latest commit: e1aed40
🔍 Latest deploy log: https://app.netlify.com/projects/kamaji-documentation/deploys/6849f7a06e24c30008d12375

@prometherion (Member) commented

> we observed that apiserver requests sometimes failed immediately after the TCP reported "Ready" following an update

Could this be related to the underlying EndpointSlice not being updated yet, and stale local iptables rules still sending traffic to the old pods?

@avorima (Contributor, Author) commented Jun 12, 2025

> we observed that apiserver requests sometimes failed immediately after the TCP reported "Ready" following an update

> Could this be related to the underlying EndpointSlice not being updated yet, and stale local iptables rules still sending traffic to the old pods?

Yes, that could also be the case. It's not that common, but it happens often enough when testing a high volume of updates.

@prometherion (Member) commented

Back in the day, to avoid these minor edge cases, I added a preStop hook to the Pods, something like `sleep X`, where X covers the lag in the EndpointSlice update.

However, that solution wouldn't be feasible here, since the kube-apiserver is a single-binary container with no bash support.

As a workaround, if you're using an Ingress Controller such as the HAProxy one, the `option redispatch` setting in the backend could do the trick.
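For reference, the preStop trick looks roughly like the sketch below, expressed with the `corev1` types; the 10-second sleep is just a placeholder for the observed EndpointSlice propagation lag, and the container name and image are hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// preStop "sleep" workaround: delay container shutdown so EndpointSlices
	// (and node-local iptables rules) have time to drop the pod before it
	// actually exits. This assumes a `sleep` binary exists in the image, which
	// the kube-apiserver image does not provide.
	container := corev1.Container{
		Name:  "example",           // hypothetical
		Image: "example:latest",    // hypothetical
		Lifecycle: &corev1.Lifecycle{
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{
					Command: []string{"sleep", "10"},
				},
			},
		},
	}
	fmt.Printf("preStop command: %v\n", container.Lifecycle.PreStop.Exec.Command)
}
```

If memory serves, newer Kubernetes releases also added a native `sleep` lifecycle handler (`corev1.SleepAction`) that doesn't need any binary in the image, but I haven't checked whether it would fit here.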
