
Pod Termination handling kicks in before the ingress controller has had time to process #106476

@nirnanaaa

What happened?

When a pod enters its Terminating state, it receives a SIGTERM asking it to finish its in-flight work, after which Kubernetes proceeds to delete the pod.

At the same time the pod starts terminating, the ingress controller receives the updated endpoints object and begins removing the pod's IP from the load balancer's list of targets that traffic can be sent to.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the list of endpoints - are decoupled from one another, so the SIGTERM might be handled before, or at the same time as, the target in the target group is being deregistered.

As a result, the load balancer might still send traffic to targets that are still in the controller's endpoints list but have already shut down cleanly. This can lead to dropped connections, as the LB keeps trying to send requests to the already-terminated pod and in turn replies with 5xx responses.

What did you expect to happen?

No traffic should be dropped during shutdown.

The SIGTERM should only be delivered after the ingress controller/LB has removed the target from the target group. Readiness gates work well for pod startup/rollout, but there is no equivalent mechanism for pod deletion.
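
For context, this is roughly how readiness gates cover the startup side with the AWS Load Balancer Controller (the injection label below is taken from that controller's documentation and may differ between versions; it is an illustration, not part of this report). There is no counterpart that holds back SIGTERM until the target has been deregistered:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                      # placeholder namespace name
  labels:
    # The controller injects a target-health readiness gate into pods
    # created in namespaces carrying this label, so a pod only becomes
    # Ready once its target is healthy in the ALB target group.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled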

How can we reproduce it (as minimally and precisely as possible)?

This is a timing-dependent problem, which makes it hard to reproduce reliably:

  • Provision an ingress controller (the AWS Load Balancer Controller, for example)
  • Create an Ingress
  • Create a Service and pods for this Ingress (multiple pods through a Deployment work best; a minimal manifest sketch follows this list)
  • (Add some delay/load to the cluster that will cause the LB synchronization to be slower or delayed)
  • Start an HTTP benchmark to produce some artificial load
  • Roll out a change to the Deployment, or just evict some pods
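
A minimal sketch of the objects above (all names, the image, and the annotations are placeholders chosen for illustration; the report itself does not include manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo                        # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          # agnhost's netexec serves plain HTTP, which is enough for the benchmark
          image: k8s.gcr.io/e2e-test-images/agnhost:2.39
          args: ["netexec", "--http-port=8080"]
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echo
  annotations:
    kubernetes.io/ingress.class: alb            # aws-load-balancer-controller
    alb.ingress.kubernetes.io/target-type: ip   # register pod IPs directly
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo
                port:
                  number: 80

With a benchmark such as hey or wrk pointed at the ALB hostname, a kubectl rollout restart deployment/echo (or evicting pods) should surface any 5xx responses in the benchmark output while terminated pods are still registered as targets.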

Anything else we need to know?

We've been relying on Pod-Graceful-Drain, which unfortunately intercepts and breaks k8s internals.

You can also get a reasonably good result using a sleep in a preStop hook, but that is not reliable at all - it is a guessing game whether your traffic will actually be drained after X seconds - and it requires either a statically linked sleep binary mounted into each container or the existence of sleep in the container's operating system.
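
For reference, the preStop workaround mentioned above looks roughly like this (the 30-second value is an arbitrary guess, which is exactly the problem described):

spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop sleep plus shutdown time
  containers:
    - name: app
      image: my-app:latest            # placeholder image
      lifecycle:
        preStop:
          exec:
            # Delays SIGTERM in the hope that the LB controller has
            # deregistered the target by then; requires a sleep binary
            # inside the image and remains pure guesswork.
            command: ["sleep", "30"]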

I also opened an issue on the ingress controller's repo.

Kubernetes version

$ kubectl version
v1.18.20

Cloud provider

AWS/EKS

OS version

# On Linux:
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

$ uname -a
Linux xxx 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

- [https://github.com/kubernetes-sigs/aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller)

Container runtime (CRI) and version (if applicable)

Docker version 20.10.7, build f0df350

Related plugins (CNI, CSI, ...) and versions (if applicable)


Labels

  • kind/bug - Categorizes issue or PR as related to a bug.
  • lifecycle/frozen - Indicates that an issue or PR should not be auto-closed due to staleness.
  • sig/cloud-provider - Categorizes an issue or PR as relevant to SIG Cloud Provider.
  • sig/network - Categorizes an issue or PR as relevant to SIG Network.
  • triage/accepted - Indicates an issue or PR is ready to be actively worked on.
