sandbox-router returns 502 (NXDOMAIN) for warm-pool sandboxes routed by Service name #883

@geojaz

What happened

Under sustained load (warm pool of ~150, ~10 QPS burst) on OSS agent-sandbox on GKE, a fraction of /execute requests through sandbox-router fail with 502 Bad Gateway. On the client side this surfaces as SandboxRequestError: Failed to communicate with the sandbox ... too many 502 error responses. Roughly 6-7% of requests fail in our runs. The failure is intermittent and load/timing dependent (a clean run can show zero failures), which is probably why it hasn't been explicitly documented yet.

Root cause

Every 502 in the router logs is the same failure:

Proxying request for sandbox 'python-sandbox-warmpool-<name>' to URL: http://python-sandbox-warmpool-<name>.default.svc.cluster.local:8888/execute
ERROR: Connection to sandbox ... failed. Error: [Errno -2] Name or service not known
"POST /execute HTTP/1.1" 502 Bad Gateway

This is a DNS resolution failure (NXDOMAIN), not a connection/dial failure. The router routes to {sandbox_id}.{namespace}.svc.cluster.local when a request does not carry an X-Sandbox-Pod-IP header (around sandbox_router.py L101-103). But warm-pool sandboxes have no per-sandbox Service (kubectl get svc -n default shows only kubernetes and sandbox-router-svc), so that name never resolves.
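
For illustration, the routing rule described above can be sketched roughly as follows. This is not the actual sandbox_router.py code: `pick_upstream` and its parameters are hypothetical names, and the port 8888 and /execute path are simply taken from the log lines above.

```python
from typing import Mapping

POD_IP_HEADER = "X-Sandbox-Pod-IP"


def pick_upstream(sandbox_id: str, namespace: str, headers: Mapping[str, str],
                  port: int = 8888, path: str = "/execute") -> str:
    """Return the URL the router would proxy to (hypothetical helper)."""
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    pod_ip = lowered.get(POD_IP_HEADER.lower(), "").strip()
    if pod_ip:
        # Direct pod IP: no DNS lookup is involved.
        return f"http://{pod_ip}:{port}{path}"
    # Fallback: assumes a Service named after the sandbox exists.
    # Warm-pool sandboxes have none, so this name fails to resolve.
    return f"http://{sandbox_id}.{namespace}.svc.cluster.local:{port}{path}"
```

Only the second return depends on DNS, which matches the observation that every failing request took the Service-name path.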

Requests that succeed route by direct pod IP (the client supplied X-Sandbox-Pod-IP). The failures are warm-pool sandboxes where the client did not supply the pod IP, often retrying against a sandbox already released/deleted during warm-pool rotation.

Impact

~6-7% of /execute requests fail under sustained load with warm-pool sandboxes.

Environment

OSS (self-managed) agent-sandbox on GKE, c4 nodes, Python sandbox-router, warm pool 150, ~10 QPS.

Possible directions (not prescriptive)

  • Resolve the upstream from Sandbox.status.PodIPs rather than a *.svc.cluster.local name, removing the dependency on a per-sandbox Service.
  • Ensure the client always supplies X-Sandbox-Pod-IP, and does not retry already-released sandboxes.
  • Or provision a headless Service per sandbox so the name resolves.
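
A minimal sketch of the first direction, assuming the router already holds the Sandbox object as a plain dict (for example, from a Kubernetes API read). The `upstream_from_status` helper is hypothetical, and the JSON key `podIPs` (a list of IP strings) is an assumption about how Sandbox.status.PodIPs is serialised; it should be checked against the CRD.

```python
import ipaddress
from typing import Any, Mapping, Optional


def upstream_from_status(sandbox: Mapping[str, Any], port: int = 8888,
                         path: str = "/execute") -> Optional[str]:
    """Build the upstream URL from the Sandbox status, or None if no IP is set."""
    # ASSUMPTION: status exposes pod IPs as a list of strings under "podIPs".
    status = sandbox.get("status") or {}
    for raw in status.get("podIPs") or []:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            continue  # skip malformed entries
        # IPv6 literals must be bracketed inside a URL.
        host = f"[{ip}]" if ip.version == 6 else str(ip)
        return f"http://{host}:{port}{path}"
    return None
```

A `None` result would let the router reject the request with an explicit "sandbox has no pod IP" error instead of falling back to a Service name that cannot resolve.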

Note: the Go router rewrite (#838) adds dial-failure retries, but those would not address an NXDOMAIN for a nonexistent Service, so this likely needs a resolution/routing change rather than retries.
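
To make the distinction concrete, here is a hedged Python sketch of how a retry policy might separate the two failure classes. It does not reflect the actual logic in #838; `is_retryable` is a hypothetical helper.

```python
import socket


def is_retryable(exc: BaseException) -> bool:
    """Return True only for failures where a retry could plausibly succeed."""
    if isinstance(exc, socket.gaierror):
        # EAI_AGAIN is a temporary resolver failure. EAI_NONAME ("Name or
        # service not known", errno -2 on Linux) will not change on retry.
        return exc.errno == socket.EAI_AGAIN
    # Dial failures such as refused, reset, or timed-out connections may be transient.
    return isinstance(exc, (ConnectionRefusedError, ConnectionResetError, TimeoutError))
```

Under this policy the `[Errno -2]` error from the router logs is never retried, so retries alone cannot reduce the 502 rate.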

Metadata

    Labels

    priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

    Projects

    Status
    Backlog
