Fix: Envoy breaking ZDM during lifecycle operations #711

Closed
@vlast3k

Summary

Once Envoy is enabled to enforce instance identity validation between Gorouter and the containers, HTTP 502 errors start to appear whenever containers are stopped (restart, evacuation, scale-down).

The root cause seems to be that Envoy keeps accepting connections, and once a connection has been accepted, Gorouter cannot retry the request on another instance.

What seems to happen is the following:

  • BBS sends a request to stop one instance
  • the chosen REP sends SIGTERM to the container
  • at the same time, TPS, route-emitter and Gorouter coordinate to remove the route
  • when the container receives the SIGTERM, both the app's web server and the Envoy proxy are running inside it
  • when the web server gets the SIGTERM, it stops accepting new requests and waits some time for in-flight requests to finish (for Tomcat the default is 30 seconds, but it gets only 10 seconds for this)
  • when the Envoy proxy gets the SIGTERM, it does not stop accepting new requests, because it does not seem to have a consistent graceful shutdown concept in which ingress listeners are removed and egress ones are kept (here and here)

So we have a race condition: Gorouter has not yet removed the route, the web server is refusing new connections, and Envoy is still accepting them. Gorouter forwards a request to the container, and Envoy accepts the connection but cannot forward it to the app. Gorouter then cannot retry the request, because the connection was already opened, so the customer gets a 502.
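
To make the race easier to reason about, here is a deliberately simplified Python model of the retry behaviour described above. It is not Gorouter's actual code: the `Backend` class, its two flags and the `route` function are hypothetical names introduced only for illustration, and the model assumes the router retries on another instance only when the TCP connection itself is refused.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    proxy_accepts: bool  # sidecar proxy (Envoy) still accepts TCP connections
    app_accepts: bool    # app web server still accepts new requests


def route(backends):
    """Toy router: move to the next instance only if the connect is refused."""
    for backend in backends:
        if not backend.proxy_accepts:
            # Connection refused: nothing was sent, so a retry is safe.
            continue
        # Connection accepted: the request is now committed to this instance.
        return 200 if backend.app_accepts else 502
    return 502  # no instance accepted the connection


# Instance is stopping, but the proxy still accepts connections.
stopping = Backend("stopping", proxy_accepts=True, app_accepts=False)
# Same instance if the proxy refused connections once the app stopped.
refusing = Backend("stopping", proxy_accepts=False, app_accepts=False)
healthy = Backend("healthy", proxy_accepts=True, app_accepts=True)

with_race = route([stopping, healthy])
without_race = route([refusing, healthy])
```

In this model `with_race` ends up as 502 because the stopping instance swallows the request, while `without_race` is 200 because the refused connection lets the router fall through to the healthy instance.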

If I haven't missed anything, this is a design flaw in CF caused by the introduction of Envoy.
In reality, well-written HTTP applications should be aware that HTTP communication is best-effort and should use retries, idempotent calls or transactions. But I fear most customers assume HTTP gives exactly-once delivery and write their apps accordingly.
The same issue can occur whenever an instance is stopped (restart, evacuation), though those events are probably rarer.

Steps to Reproduce

  • Deploy an application.
  • Send POST requests to the application endpoint continuously from a script.
  • Scale down or restart the application.
  • A few 502s are returned, which should not happen.
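
The script used for the steps above is not part of this report. A minimal sketch of such a load script, using only the Python standard library, could look like the following. The helper names `post_once` and `count_statuses`, the request body and the timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from collections import Counter


def post_once(url, body=b"{}", timeout=5.0):
    """Send one POST request and return the HTTP status code."""
    req = urllib.request.Request(url, data=body, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code (e.g. 502) is on the error.
        return err.code


def count_statuses(send, attempts):
    """Call send() `attempts` times and tally the returned status codes."""
    return Counter(send() for _ in range(attempts))
```

For example, `count_statuses(lambda: post_once("https://my-app.example.com/endpoint"), 1000)` (with a placeholder URL) can be started just before the scale-down or restart; any 502 entries in the resulting tally correspond to the failures described here.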

Diego repo

Environment Details

CF-Deployment v26.4.0
