sandbox-router returns 502 (NXDOMAIN) for warm-pool sandboxes routed by Service name

## What happened

Under sustained load (warm pool of ~150, ~10 QPS burst) on OSS agent-sandbox on GKE, a fraction of `/execute` requests through `sandbox-router` fail with `502 Bad Gateway`. On the client side this surfaces as `SandboxRequestError: Failed to communicate with the sandbox ... too many 502 error responses`. Roughly 6-7% of requests failed in our runs. The failure is intermittent and depends on load and timing (a clean run can show zero failures), which is probably why it has not been explicitly documented yet.

## Root cause

Every 502 in the router logs is the same failure:

```
Proxying request for sandbox 'python-sandbox-warmpool-<name>' to URL: http://python-sandbox-warmpool-<name>.default.svc.cluster.local:8888/execute
ERROR: Connection to sandbox ... failed. Error: [Errno -2] Name or service not known
"POST /execute HTTP/1.1" 502 Bad Gateway
```

This is a DNS resolution failure (NXDOMAIN), not a connection/dial failure. The router routes to `{sandbox_id}.{namespace}.svc.cluster.local` when a request does not carry an `X-Sandbox-Pod-IP` header (around `sandbox_router.py` L101-103). But warm-pool sandboxes have no per-sandbox Service (`kubectl get svc -n default` shows only `kubernetes` and `sandbox-router-svc`), so that name never resolves.

Requests that succeed are routed directly by pod IP, because the client supplied `X-Sandbox-Pod-IP`. The failures are requests to warm-pool sandboxes where the client did not supply the pod IP; many of them are retries against a sandbox that was already released or deleted during warm-pool rotation.
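
To make the two paths concrete, here is a minimal sketch of the routing decision as described above. It is not the actual code from `sandbox_router.py`: the function name `pick_upstream`, its signature, and the sandbox name in the usage note are hypothetical; only the header name and the hostname pattern come from the logs and description above.

```python
from typing import Mapping, Tuple

POD_IP_HEADER = "X-Sandbox-Pod-IP"


def pick_upstream(
    headers: Mapping[str, str],
    sandbox_id: str,
    namespace: str,
    port: int,
    path: str,
) -> Tuple[str, str]:
    """Return (mode, url) for the upstream request.

    Hypothetical helper mirroring the behavior described in this issue.
    Note: a plain dict lookup is case-sensitive; real HTTP frameworks
    usually expose case-insensitive header mappings.
    """
    pod_ip = headers.get(POD_IP_HEADER)
    if pod_ip:
        # Direct pod-IP routing: no DNS lookup involved.
        return "pod-ip", f"http://{pod_ip}:{port}{path}"
    # Fallback: per-sandbox Service name. For warm-pool sandboxes no such
    # Service exists, so resolving this host fails with
    # "[Errno -2] Name or service not known".
    host = f"{sandbox_id}.{namespace}.svc.cluster.local"
    return "service-dns", f"http://{host}:{port}{path}"
```

Calling `pick_upstream({}, "python-sandbox-warmpool-abc", "default", 8888, "/execute")` takes the `service-dns` branch, which is the one that produces the 502s.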

## Impact

~6-7% of `/execute` requests fail under sustained load with warm-pool sandboxes.

## Environment

OSS (self-managed) agent-sandbox on GKE, c4 nodes, Python sandbox-router, warm pool 150, ~10 QPS.

## Possible directions (not prescriptive)

- Resolve the upstream from `Sandbox.status.PodIPs` rather than a `*.svc.cluster.local` name, removing the dependency on a per-sandbox Service.
- Ensure the client always supplies `X-Sandbox-Pod-IP`, and does not retry already-released sandboxes.
- Or provision a headless Service per sandbox so the name resolves.
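
As a rough illustration of the first direction, the sketch below builds the upstream URL from a Sandbox object's status instead of a Service name. Everything here is an assumption for illustration: the helper name `upstream_from_status`, the sandbox being passed as a plain dict, and the JSON key `status.podIPs` holding a list of IP strings. The actual CRD field shape should be checked before relying on this.

```python
import ipaddress
from typing import Any, Mapping, Optional


def upstream_from_status(sandbox: Mapping[str, Any], port: int, path: str) -> Optional[str]:
    """Build an upstream URL from the first pod IP in the Sandbox status.

    Hypothetical helper. Assumes `sandbox["status"]["podIPs"]` is a list of
    IP strings. Returns None when no IP is available (for example, the
    sandbox was released), so the caller can fail fast instead of
    attempting a DNS lookup that cannot succeed.
    """
    pod_ips = (sandbox.get("status") or {}).get("podIPs") or []
    if not pod_ips:
        return None
    ip = ipaddress.ip_address(pod_ips[0])
    # IPv6 literals must be bracketed inside a URL authority.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"http://{host}:{port}{path}"
```

Returning `None` for a sandbox with no pod IP lets the router answer with an explicit "sandbox gone" style error rather than a generic 502, which would also help clients avoid retrying released sandboxes.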

Note: the Go router rewrite (#838) adds dial-failure retries, but those would not address an NXDOMAIN for a nonexistent Service, so this likely needs a resolution/routing change rather than retries.
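
The distinction in the note can be sketched at the socket level in Python. This is only an illustration of the classification, not code from either router; `is_retryable` is a hypothetical name, and a real HTTP client library will typically wrap these exceptions in its own types.

```python
import socket


def is_retryable(exc: BaseException) -> bool:
    """Classify a low-level connection error (hypothetical helper).

    Name-resolution failures (socket.gaierror, e.g. EAI_NONAME reported as
    "[Errno -2] Name or service not known" on Linux) are treated as
    non-retryable: if the Service does not exist, the name will not start
    resolving on the next attempt. Dial-level failures may be transient.
    """
    if isinstance(exc, socket.gaierror):
        return False
    return isinstance(
        exc,
        (ConnectionRefusedError, ConnectionResetError, TimeoutError, socket.timeout),
    )
```

With a classification like this, a retry loop would give up immediately on the error seen in the router logs, while still retrying refused or reset connections.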
