What happened
Under sustained load (warm pool of ~150, bursts of ~10 QPS) on OSS agent-sandbox on GKE, a fraction of /execute requests through sandbox-router fail with 502 Bad Gateway. Client-side this surfaces as SandboxRequestError: Failed to communicate with the sandbox ... too many 502 error responses. Roughly 6-7% of requests failed in our runs. The failure is intermittent and load/timing dependent (a clean run can show zero failures), which is probably why it hasn't been explicitly documented yet.
Root cause
Every 502 in the router logs is the same failure:
Proxying request for sandbox 'python-sandbox-warmpool-<name>' to URL: http://python-sandbox-warmpool-<name>.default.svc.cluster.local:8888/execute
ERROR: Connection to sandbox ... failed. Error: [Errno -2] Name or service not known
"POST /execute HTTP/1.1" 502 Bad Gateway
This is a DNS resolution failure (NXDOMAIN), not a connection/dial failure. The router routes to {sandbox_id}.{namespace}.svc.cluster.local when a request does not carry an X-Sandbox-Pod-IP header (around sandbox_router.py L101-103). But warm-pool sandboxes have no per-sandbox Service (kubectl get svc -n default shows only kubernetes and sandbox-router-svc), so that name never resolves.
Requests that succeed are routed by direct pod IP (the client supplied X-Sandbox-Pod-IP). The failures are requests to warm-pool sandboxes where the client did not supply the pod IP, often retries against a sandbox that was already released/deleted during warm-pool rotation.
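To make the two routing paths concrete, here is a minimal sketch of the decision described above. It is not the router's actual code: the function name resolve_upstream, its parameters, and the case-insensitive header handling are assumptions for illustration. Only the header name, the DNS name pattern, the port, and the path come from the logs.

```python
from typing import Mapping

POD_IP_HEADER = "X-Sandbox-Pod-IP"


def resolve_upstream(
    headers: Mapping[str, str],
    sandbox_id: str,
    namespace: str = "default",
    port: int = 8888,
    path: str = "/execute",
) -> str:
    """Hypothetical helper: pick the upstream URL the way the issue describes."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    pod_ip = lowered.get(POD_IP_HEADER.lower())
    if pod_ip:
        # Direct pod IP: no DNS lookup involved.
        host = pod_ip
    else:
        # Fallback: a per-sandbox Service name, which only resolves
        # if such a Service actually exists.
        host = f"{sandbox_id}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}{path}"
```

With no header, the returned URL has the same shape as the one in the failing log line, so every request that takes the fallback path depends on a Service that warm-pool sandboxes do not have.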
Impact
~6-7% of /execute requests fail under sustained load with warm-pool sandboxes.
Environment
OSS (self-managed) agent-sandbox on GKE, c4 nodes, Python sandbox-router, warm pool 150, ~10 QPS.
Possible directions (not prescriptive)
- Resolve the upstream from Sandbox.status.PodIPs rather than a *.svc.cluster.local name, removing the dependency on a per-sandbox Service.
- Ensure the client always supplies X-Sandbox-Pod-IP, and does not retry already-released sandboxes.
- Or provision a headless Service per sandbox so the name resolves.
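The first direction could look roughly like the sketch below. It assumes the Sandbox object has already been fetched and is available as a plain dict, and that the pod IPs appear under status.podIPs as a list of strings. That JSON key, the function name upstream_from_status, and the None return for "no usable IP" are all assumptions, not the project's confirmed schema or API.

```python
import ipaddress
from typing import Any, Mapping, Optional


def upstream_from_status(
    sandbox: Mapping[str, Any],
    port: int = 8888,
    path: str = "/execute",
) -> Optional[str]:
    """Hypothetical helper: build the upstream URL from the Sandbox status."""
    # Assumed layout: {"status": {"podIPs": ["10.8.1.4", ...]}}
    pod_ips = (sandbox.get("status") or {}).get("podIPs") or []
    for raw in pod_ips:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            continue  # skip malformed entries rather than routing to them
        # IPv6 literals must be bracketed inside a URL.
        host = f"[{ip}]" if ip.version == 6 else str(ip)
        return f"http://{host}:{port}{path}"
    # No usable IP: the sandbox is likely not ready or already released.
    return None
```

Returning None lets the caller fail fast with a clear "sandbox has no pod IP" error instead of falling through to a DNS name that cannot resolve.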
Note: the Go router rewrite (#838) adds dial-failure retries, but those would not address an NXDOMAIN for a nonexistent Service, so this likely needs a resolution/routing change rather than retries.
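The distinction between a resolution failure and a dial failure can be sketched with Python's standard exception types. The helper name is_retryable is hypothetical, and an HTTP client library may wrap these exceptions in its own types, so treat this as an illustration of the classification rather than drop-in router code.

```python
import socket


def is_retryable(exc: BaseException) -> bool:
    """Hypothetical classifier: retry dial failures, not name-resolution failures."""
    # "[Errno -2] Name or service not known" is raised as socket.gaierror.
    # Retrying cannot help when the name has no Service behind it.
    if isinstance(exc, socket.gaierror):
        return False
    # Refused/reset connections and timeouts can be transient, so a
    # bounded retry is reasonable for them.
    return isinstance(exc, (ConnectionError, TimeoutError))
```

Under this classification, the error in the router logs above falls into the non-retryable branch, which is why a routing change is needed rather than more retries.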