What happened?
When a pod enters its Terminating state, it receives a SIGTERM asking it to finish up its work, after which Kubernetes proceeds with deleting the pod.
At the same time that the pod starts terminating, an ingress controller receives the updated endpoints object and starts removing the pod from the load balancer's list of targets that traffic can be sent to.
Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the list of endpoints - are decoupled from one another, so the SIGTERM might be handled before, or at the same time as, the target is removed from the target group.
As a result, the ingress controller might still route traffic to targets that are still in its endpoint list but have already shut down cleanly. This can lead to dropped connections, as the LB keeps sending requests to the already-terminated pod and in turn replies to clients with 5xx responses.
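For context, one common way an application can bridge that gap is to keep serving for a short drain window after SIGTERM before shutting down. Below is a minimal, illustrative Go sketch; the 20-second drain window, the port, and the handler are assumptions for the example, and the delay has to stay below terminationGracePeriodSeconds.

```go
// main.go - minimal sketch: keep serving for a short drain window after
// SIGTERM so requests still routed by a load balancer that has not yet
// removed this pod from its targets do not fail.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Serve in the background.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for the kubelet's SIGTERM.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep accepting traffic while the endpoints/target-group update propagates.
	// The 20s value is an assumption and must be tuned to the LB's deregistration delay.
	log.Println("SIGTERM received, draining for 20s before shutdown")
	time.Sleep(20 * time.Second)

	// Stop accepting new connections and finish in-flight requests.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```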
What did you expect to happen?
No traffic should be dropped during shutdown.
The SIGTERM should only be sent after the ingress controller/LB has removed the target from the target group. Readiness gates work well for pod startup/rollout, but there is no equivalent for pod deletion.
How can we reproduce it (as minimally and precisely as possible)?
This is a very theoretical problem, which is very hard to reproduce:
- Provision an ingress controller (the AWS Load Balancer Controller, for example)
- Create an ingress
- Create a service and pods (multiple ones through a deployment work best) for this ingress
- (Add some delay/load to the cluster that will cause the LB synchronization to be slower or delayed)
- Start an HTTP benchmark to produce some artificial load (a minimal load probe is sketched after this list)
- Roll out a change to the deployment or just evict some pods
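For the benchmark step, a hypothetical load probe like the following Go sketch (the URL, run duration, and request rate are arbitrary assumptions) can be run against the ingress hostname while the rollout or eviction happens; any 5xx responses or connection errors it logs correspond to the dropped traffic described above.

```go
// loadprobe.go - hammer the ingress URL and report non-2xx responses or
// connection errors observed while pods are being replaced.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := "http://example.ingress.local/" // placeholder; pass your ingress URL as the first argument
	if len(os.Args) > 1 {
		url = os.Args[1]
	}

	client := &http.Client{Timeout: 2 * time.Second}
	var total, failed int

	// Run for one minute; trigger the rollout/eviction while this is running.
	deadline := time.Now().Add(1 * time.Minute)
	for time.Now().Before(deadline) {
		total++
		resp, err := client.Get(url)
		if err != nil {
			failed++
			fmt.Printf("request error: %v\n", err)
			continue
		}
		if resp.StatusCode >= 500 {
			failed++
			fmt.Printf("got %d from %s\n", resp.StatusCode, url)
		}
		resp.Body.Close()
		time.Sleep(50 * time.Millisecond)
	}
	fmt.Printf("total=%d failed=%d\n", total, failed)
}
```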
Anything else we need to know?
We've been relying on Pod-Graceful-Drain, which unfortunately intercepts and breaks Kubernetes internals.
You can also achieve a reasonably good result using a sleep as a preStop hook, but that's not reliable at all - it's just a guessing game whether your traffic will actually be drained after X seconds - and it requires statically linked binaries to be mounted into each container or the existence of sleep in the container's operating system.
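For reference, such a statically linked sleep substitute can be built from a few lines of Go (the file name, the 30-second default, and the build invocation below are only illustrative), e.g. with CGO_ENABLED=0 go build so the binary also works in scratch/distroless images:

```go
// sleep.go - tiny sleep substitute for containers that ship neither a shell
// nor coreutils; intended to be mounted in and used as the preStop command.
package main

import (
	"os"
	"strconv"
	"time"
)

func main() {
	seconds := 30 // default drain time (assumption) if no argument is given
	if len(os.Args) > 1 {
		if n, err := strconv.Atoi(os.Args[1]); err == nil {
			seconds = n
		}
	}
	time.Sleep(time.Duration(seconds) * time.Second)
}
```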
I also opened an issue on the ingress controller's repo.
Kubernetes version
$ kubectl version
v1.18.20
Cloud provider
AWS
OS version
# On Linux:
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ uname -a
Linux xxx 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux