Description
Summary
Once Envoy is enabled to validate instance identity between Gorouter and the containers, HTTP 502 errors start to appear whenever containers are stopped (restart, evacuation, scale down).
The root cause seems to be that Envoy keeps accepting connections, and once a connection has been accepted Gorouter is not able to retry the request on another instance.
What seems to happen is the following:
- BBS sends a request to stop one node
- the chosen REP sends SIGTERM to the container
- at the same time TPS, route-emitter and Gorouter coordinate to remove the route
- when the container receives the SIGTERM, it is running both the app's web server and the Envoy proxy
- when the web server gets the SIGTERM, it stops accepting new requests and waits some time for in-flight requests to finish (Tomcat waits 30 s by default, but it gets only 10 s for this)
- when the Envoy proxy gets the SIGTERM, it does not stop accepting new requests, because it does not seem to have a consistent graceful shutdown concept in which ingress listeners are removed and egress ones are kept (here and here)
So we have a race condition: Gorouter has not yet removed the route, the web server is refusing new connections, and Envoy is still accepting them. Gorouter forwards a request to the container, and Envoy accepts the connection but is not able to forward it to the app. Gorouter then cannot retry a request on a connection that was already opened, so the customer gets a 502.
If I haven't missed something, this is a design flaw in CF that was caused by the introduction of Envoy.
In reality, well-written HTTP applications should be aware that HTTP communication is best-effort and should rely on retries, idempotent calls, or transactions. But I fear most customers assume HTTP has exactly-once QoS and write their apps accordingly.
The same issue can happen during any stop of an instance (restart, evacuation), but this probably happens rarely.
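The race can be illustrated with a small loopback model. This is only a sketch under stated assumptions, not Gorouter's or Envoy's actual code: `route`, `accepting_proxy`, `healthy_app` and `start_listener` are hypothetical names, the "router" is assumed to move on to the next backend only when the TCP connect itself fails, and the "proxy" is assumed to simply close the client connection when the app behind it refuses the connection.

```python
import socket
import threading

OK = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
REQUEST = b"POST / HTTP/1.1\r\nHost: app\r\nContent-Length: 0\r\n\r\n"


def start_listener(handle):
    """Listen on a free loopback port and run handle(conn) per connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()

    def loop():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            with conn:
                handle(conn)

    threading.Thread(target=loop, daemon=True).start()
    return listener


def healthy_app(conn):
    """Stand-in for a running app instance: read the request, answer 200."""
    conn.recv(4096)
    conn.sendall(OK)


def accepting_proxy(app_addr):
    """Stand-in for the sidecar: always accepts, then tries to reach the app."""
    def handle(conn):
        request = conn.recv(4096)
        try:
            with socket.create_connection(app_addr, timeout=1) as app:
                app.sendall(request)
                conn.sendall(app.recv(4096))
        except OSError:
            pass  # app refused the connection; the client side is just closed
    return handle


def route(backends):
    """Simplified router: retry only while no backend has accepted the connection."""
    for addr in backends:
        try:
            conn = socket.create_connection(addr, timeout=1)
        except OSError:
            continue  # nothing accepted: safe to try the next backend
        with conn:
            try:
                conn.sendall(REQUEST)
                reply = conn.recv(4096)
            except OSError:
                reply = b""
        # The request was sent on an accepted connection, so it is not retried.
        return int(reply.split()[1]) if reply else 502
    return 502
```

In this model, a stopping instance whose port refuses connections is skipped and the next backend answers, while the same instance behind an always-accepting proxy turns into a 502 even though a healthy backend is available.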
Steps to Reproduce
- Deploy an application.
- Send a POST request to the application endpoint with some script.
- Scale down / restart the application.
- A few 502s are returned, which should not happen.
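The "some script" in the steps above could look like the following. It is a minimal sketch, not the script used for this report: the function names, the JSON body, the request count and the interval are all assumptions. It opens a fresh connection for every POST and tallies the status codes, so any 502s seen while the app is scaled down or restarted show up in the result.

```python
import collections
import http.client
import time
import urllib.parse


def post_once(url, body=b"{}", timeout=5.0):
    """Send one POST on a new connection; return the status code or an error label."""
    parts = urllib.parse.urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("POST", path, body=body,
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException) as err:
        # Transport-level failure (refused, reset, timeout, malformed reply).
        return "error:" + type(err).__name__
    finally:
        conn.close()


def tally_statuses(url, count, interval=0.05):
    """Send `count` POSTs, `interval` seconds apart, and count each outcome."""
    tally = collections.Counter()
    for _ in range(count):
        tally[post_once(url)] += 1
        time.sleep(interval)
    return tally
```

For example, run `print(tally_statuses("https://my-app.example.com/endpoint", 2000))` (hypothetical route) in one terminal and `cf scale my-app -i 1` or `cf restart my-app` in another, then check whether the tally contains a `502` entry.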
Diego repo
Environment Details
CF-Deployment v26.4.0