feat: add Go sandbox-router (drop-in replacement for Python)#838
mastersingh24 wants to merge 14 commits into
Pull request overview
This PR introduces a Go implementation of the sandbox-router (drop-in compatible with the existing Python router’s X-Sandbox-* header contract and sandbox-router-svc Service name), adding production features like TLS/mTLS, structured access logging, Prometheus + optional OTLP metrics export, tracing, dial retries, and graceful shutdown. It also updates build/release tooling and provides example Kubernetes manifests plus a local load-test harness.
Changes:
- Added a new Go `sandbox-router` binary with proxying, retry/backoff, TLS cert hot-reload, probes, metrics, and tracing support.
- Added deployment examples (`deploy/`) and a local load test harness (`dev/load-test/router/`).
- Updated build tooling and dependencies (Makefile target, image push tooling overrides, go.mod/go.sum).
Reviewed changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| Makefile | Adds build-sandbox-router and makes build include router build. |
| go.mod | Adds/promotes deps needed for router (fsnotify, OTel metrics bridge/exporter, etc.). |
| go.sum | Updates checksums for new/updated dependencies. |
| dev/tools/push-images | Adds special-case build context + image name override for Go router Dockerfile. |
| dev/load-test/router/main.go | Adds local load generator for throughput/latency measurements. |
| clients/go/sandbox-router/README.md | Documents contract, flags, TLS/mTLS, metrics, tracing, scaling, and deployment. |
| clients/go/sandbox-router/main.go | Implements router binary wiring: config, middleware, metrics/tracing init, TLS reload, server lifecycle. |
| clients/go/sandbox-router/Dockerfile | Builds a static distroless multi-arch image for the Go router. |
| clients/go/sandbox-router/config/config.go | Defines router config, defaults, and validation rules. |
| clients/go/sandbox-router/config/config_test.go | Tests defaults/env/flags and config validation behavior. |
| clients/go/sandbox-router/config/flags.go | Registers CLI flags and applies env-derived defaults for Python parity. |
| clients/go/sandbox-router/config/file.go | Implements YAML config loading and pre-parse config file detection. |
| clients/go/sandbox-router/config/file_test.go | Tests config file path selection and YAML application/validation. |
| clients/go/sandbox-router/config/testmain_test.go | Adds goleak verification for config package tests. |
| clients/go/sandbox-router/observability/access_log.go | Adds structured per-request access logging middleware. |
| clients/go/sandbox-router/observability/context.go | Adds request-scoped labels/logger plumbing via context. |
| clients/go/sandbox-router/observability/metrics.go | Defines Prometheus collectors + middleware and build info metric. |
| clients/go/sandbox-router/observability/otel_metrics.go | Adds optional OTLP metrics push via OTel↔Prometheus bridge. |
| clients/go/sandbox-router/observability/tracing.go | Adds per-request tracing middleware with trace↔log correlation. |
| clients/go/sandbox-router/observability/testmain_test.go | Adds goleak verification for observability package tests. |
| clients/go/sandbox-router/proxy/errors.go | Implements Python-compatible JSON error shape ({"detail": ...}). |
| clients/go/sandbox-router/proxy/errors_test.go | Tests JSON error response formatting and error interface behavior. |
| clients/go/sandbox-router/proxy/headers.go | Parses/validates X-Sandbox-* routing headers and defaults. |
| clients/go/sandbox-router/proxy/headers_test.go | Unit tests for routing header parsing/namespace validation parity. |
| clients/go/sandbox-router/proxy/target.go | Constructs upstream URL (DNS form vs pod-IP form) preserving path/query. |
| clients/go/sandbox-router/proxy/target_test.go | Tests upstream URL construction behavior. |
| clients/go/sandbox-router/proxy/retry.go | Adds dial-class retry transport with exponential backoff. |
| clients/go/sandbox-router/proxy/retry_test.go | Unit tests for retry eligibility, backoff, and cancellation behavior. |
| clients/go/sandbox-router/proxy/proxy.go | Core reverse proxy handler: header parse, upstream routing, tracing propagation, timeout bound. |
| clients/go/sandbox-router/proxy/proxy_integration_test.go | Integration tests for round-trip proxying, streaming body, 400/502 behavior. |
| clients/go/sandbox-router/proxy/retry_integration_test.go | Integration tests for retry success/give-up behavior and retry metrics. |
| clients/go/sandbox-router/proxy/healthz_integration_test.go | Integration test for Python-compatible /healthz response shape. |
| clients/go/sandbox-router/proxy/testmain_test.go | Adds goleak verification for proxy package tests. |
| clients/go/sandbox-router/server/probes.go | Implements /healthz and /readyz with readiness flip for draining. |
| clients/go/sandbox-router/server/probes_test.go | Unit tests for probe behavior and readiness transitions. |
| clients/go/sandbox-router/server/server.go | Coordinates multiple listeners (proxy/http(s), metrics, probes) and shutdown behavior. |
| clients/go/sandbox-router/server/testmain_test.go | Adds goleak verification for server package tests. |
| clients/go/sandbox-router/tlsutil/loader.go | Adds fsnotify-based hot-reloading cert loader with debounce. |
| clients/go/sandbox-router/tlsutil/loader_test.go | Unit tests for cert reloader initial load and hot reload behavior. |
| clients/go/sandbox-router/tlsutil/config.go | Builds server TLS config and loads client CA pool for mTLS. |
| clients/go/sandbox-router/tlsutil/config_test.go | Unit tests for TLS config mapping and CA pool loading errors. |
| clients/go/sandbox-router/tlsutil/testcerts_test.go | Test helper for generating/writing self-signed certs. |
| clients/go/sandbox-router/tlsutil/tls_integration_test.go | Integration tests for TLS + all mTLS modes with real handshakes. |
| clients/go/sandbox-router/tlsutil/testmain_test.go | Adds goleak verification for tlsutil package tests. |
| clients/go/sandbox-router/deploy/README.md | Describes example manifests and production hardening checklist. |
| clients/go/sandbox-router/deploy/deployment.yaml | Example Deployment with probes, security context, and baseline args. |
| clients/go/sandbox-router/deploy/service.yaml | Example Service preserving sandbox-router-svc name and proxy port mapping. |
| clients/go/sandbox-router/deploy/serviceaccount.yaml | Example ServiceAccount (no RBAC by default). |
| clients/go/sandbox-router/deploy/pdb.yaml | Example PDB for router availability during disruptions. |
| clients/go/sandbox-router/deploy/networkpolicy.yaml | Example NetworkPolicy for ingress/egress constraints. |

Thanks for the PR. Impressive work. @mastersingh24

/hold for #758 agreement

Thanks for pushing this forward. One thing I wanted to ask about before this becomes the preferred router implementation: should WebSocket / upgraded-connection forwarding be part of the compatibility contract? A concrete use case is
A few things that seem worth clarifying or testing:

Force-pushed: f083a1c to 6313dfb
Update: aligned with KEP-NNNN (#758)
Pushed 9 follow-up commits on
KEP requirements → status
Extras beyond the KEP
Open questions
Happy to split any of the extras out of the PR if the review prefers a tighter first cut.

@barney-s - couple things

@ctm8788 thanks for catching this; addressed in e1a5d3d. Status against your four sub-questions:
The regression test pins

IPv6 PodIP support: fixed in 44d8e2d. Same root cause as the Copilot finding on #850 - both

Nice! I just wanted to confirm because the Python version definitely does not support WebSockets properly.

Port validation hardening: fixed in 1dda38a. Carrying over the Copilot finding from #850 -

We have a working example that uses a custom router (Go-based) and sandboxes running vscode here:
Force-pushed: 1dda38a to 6d8813e

Thanks for the pointer; pulled your handler down. Two enhancements landed in d04bfb5: 1. 2. README's "WebSockets and other protocol upgrades" section documents both with rationale.
Mirror the browser-backend compatibility fix that just landed on the from-scratch Go router (kubernetes-sigs#838 d04bfb5), in the shapes that make sense for the ext_proc design.

1. Origin stripping on upgrade, in the Go handler. When readHeaders sees BOTH Connection: Upgrade AND a non-empty Upgrade header (matching the predicate httputil.ReverseProxy uses internally), the resulting HeaderMutation gains RemoveHeaders: ["origin"] alongside the existing dst-host SetHeaders. Envoy normalizes header keys to lowercase, so the lowercase "origin" is what Envoy actually removes. Backends that validate Origin == Host for CSRF (vscode-server, Jupyter) no longer reject the upgrade with a 1006 close. Non-upgrade requests are unaffected so CORS preflights and any Origin-aware non-WebSocket logic still work.
2. X-Forwarded-Host, in the Envoy config. One-liner request_headers_to_add at the virtual_host level with value "%REQ(:AUTHORITY)%". X-Forwarded-For / -Proto were already free from use_remote_address: true; -Host doesn't come with that setting and needs to be wired separately.

Tests: TestHandle_StripsOriginOnUpgrade (asserts the mutation's RemoveHeaders contains "origin" while the dst-host SetHeaders is preserved), TestHandle_NonUpgradePreservesOrigin (guards the non-upgrade path), TestReadHeaders_UpgradeDetection table covering the 8 corner cases of the upgrade predicate.

README documents both behaviors and the rationale.
aditya-shantanu left a comment
Thanks for the thorough port — the observability/TLS/retry work looks solid. One substantive gap I wanted to flag on the "drop-in replacement" claim: the Go router drops two input-validation checks the Python router performs, and with the default `AllowAll` authorizer there is nothing else gating the request path.
1. X-Sandbox-Pod-IP is dialed unvalidated (SSRF).
In proxy/headers.go, ParseSandboxHeaders stores PodIP: h.Get(HeaderSandboxPodIP) verbatim, and proxy/resolve.go Resolve() uses it directly as the dial host (SourcePodIP). The Python router (sandbox_router.py) parses this header with ipaddress.ip_address() and rejects loopback / link-local / multicast / unspecified addresses:

    ip = ipaddress.ip_address(pod_ip)
    if ip.is_loopback or ip.is_link_local or ip.is_multicast or ip.is_unspecified:
        raise HTTPException(status_code=400, detail="Invalid target IP address.")

As written, a caller can set X-Sandbox-Pod-IP: 169.254.169.254 (cloud metadata) or 127.0.0.1 and the router will proxy to it. Since NewHandler defaults to authz.AllowAll{} to preserve the Python no-auth contract, the default deployment has no compensating control. Recommend porting the same IP-class rejection (net.ParseIP + IsLoopback()/IsLinkLocalUnicast()/IsLinkLocalMulticast()/IsMulticast()/IsUnspecified(), and likely also link-local/ULA) before accepting the header.
2. X-Sandbox-ID is not validated.
Python validates sandbox_id against a DNS-label regex ("to prevent DNS injection and directory traversal style attacks") before interpolating it into the FQDN. The Go path validates only namespace (validNamespace) and port; ID flows straight into t.ID + "." + t.Namespace + ".svc." + clusterDomain in Resolve(). Worth applying the same DNS-label check to ID for parity.
The KEP-requirements table marks "Strict input validation" as done with "namespace charset + port numeric" — these two cases appear to be the missing pieces relative to the Python behavior this is meant to replace.
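The IP-class rejection recommended in point 1 could be sketched in Go roughly as follows. This is an illustration only: `checkPodIPClass` and `errInvalidTargetIP` are hypothetical names, not identifiers from the PR, and the sketch covers only the classes the Python snippet rejects.

```go
package main

import (
	"errors"
	"net"
)

// errInvalidTargetIP is a hypothetical sentinel; the text mirrors the Python
// router's 400 detail, lowercased to follow Go error-string conventions.
var errInvalidTargetIP = errors.New("invalid target IP address")

// checkPodIPClass (hypothetical helper) rejects values that do not parse as an
// IP address, plus the loopback, link-local, multicast, and unspecified
// classes that the Python router refuses to dial.
func checkPodIPClass(raw string) error {
	ip := net.ParseIP(raw)
	if ip == nil {
		return errInvalidTargetIP
	}
	if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() ||
		ip.IsMulticast() || ip.IsUnspecified() {
		return errInvalidTargetIP
	}
	return nil
}
```

With this check in front of the dial, 169.254.169.254 and 127.0.0.1 are refused while a routable address such as 10.0.0.5 passes.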
Re-implements the Python sandbox-router
(clients/python/agentic-sandbox-client/sandbox-router/) in Go with the
controls needed for enterprise deployments. Preserves the X-Sandbox-*
header contract, Service name (sandbox-router-svc), JSON error shape,
and PROXY_TIMEOUT_SECONDS / CLUSTER_DOMAIN env vars so existing callers
and Gateway/HTTPRoute resources keep working.
The Python router source stays in the tree until deprecation is
formalized. The Go router builds as a separate image named
"sandbox-router-go" to avoid colliding with the Python "sandbox-router"
image.
Features beyond the Python router:
- TLS termination with hot-reloading server cert (fsnotify on the
parent dir, atomic-rename safe for Kubernetes Secret projection)
- Optional or required mTLS for clients
- Prometheus metrics on /metrics and OTLP push via OTel-Prometheus
bridge (--enable-otel-metrics)
- OpenTelemetry tracing via OTLP gRPC (--enable-tracing); spans carry
sandbox.id / sandbox.namespace attributes; trace context propagated
to the upstream sandbox
- Trace-log correlation: trace_id and span_id baked into every
per-request log line
- Dial-retry with exponential backoff for upstream startup races
(--upstream-max-retries; only dial-class failures retried so request
bodies are never replayed)
- Structured access logging (--access-log, defaults on; skips
/healthz, /readyz, /metrics)
- YAML config file (--config / SANDBOX_ROUTER_CONFIG); precedence is
CLI > file > env > defaults
- Graceful shutdown with readiness flip + parallel listener drain
- Configurable request-body size limit (--max-request-body-bytes)
- Multi-arch distroless static image (gcr.io/distroless/static:nonroot)
Layout:
clients/go/sandbox-router/ library packages + main.go + Dockerfile
config/ Config struct, flags, YAML loader
proxy/ Handler, headers, target, errors, retry
tlsutil/ Hot-reloading cert + tls.Config builder
observability/ Prometheus + OTel + access log + tracing
server/ Four HTTP servers (HTTP, HTTPS, metrics,
probes) with parallel-drain shutdown
deploy/ Example K8s manifests (Deployment,
Service, PDB, NetworkPolicy, RBAC)
README.md Architecture, contract, flags, scaling
dev/load-test/router/ Self-contained Go load harness
Modified files:
Makefile — adds build-sandbox-router target
dev/tools/push-images — uses repo root as build context for the
sandbox-router Dockerfile; image named
"sandbox-router-go"
go.mod / go.sum — promotes fsnotify and x/sync from indirect;
adds OTLP metric exporter, OTel SDK metric,
and the OTel-Prometheus bridge
Tests:
- Unit tests across all packages (headers, target, errors, retry, cert
reloader, TLS config, config validation, YAML loader, probes,
access log fields)
- Integration tests gated by //go:build integration: end-to-end proxy
round-trip + body streaming, healthz, all three mTLS modes with
real handshakes, retry succeeds when backend comes up mid-window,
retry gives up within budget
- goleak.VerifyTestMain in every test-bearing package
- All 14 Python test_sandbox_router.py cases have Go equivalents
Verification:
- go build ./... clean
- go test ./clients/go/sandbox-router/... — green
- go test -tags=integration ./clients/go/sandbox-router/... — green
- make lint-go — 0 issues
- Live load-test numbers captured in clients/go/sandbox-router/README.md
Out of scope (called out in README for future work):
- Per-sandbox authorization (deferred until the CRD-level identity
contract is designed)
- Rate limiting and circuit breaker (Envoy handles these well; the
README's "When to consider Envoy instead" section discusses the
architectural trade-off)
- CA bundle hot-reload (server cert hot-reloads; client CA does not)
- WebSocket / hijacked-connection graceful drain
Two related fixes in server.Run() raised in PR review:

1. /readyz used to flip to 200 immediately after spawning the listener goroutines, before they had necessarily called Listen + bound their ports. On a fresh pod the LB could therefore briefly route traffic to a port that wasn't accepting yet. The new flow calls net.Listen synchronously for every listener up front, closes any successfully bound listeners if a later bind fails, and only calls MarkReady() after every port is bound. Bind failures now surface as a direct error from Run() instead of from an async goroutine.
2. The docstring promised parallel shutdown but the code looped through the listeners serially, so one slow drain could consume the whole --shutdown-timeout budget. Shutdown calls are now driven from a sync.WaitGroup (Go 1.25 WaitGroup.Go form) so they run concurrently.

Added server_test.go covering: bind-failure surfaces synchronously and keeps readiness false; partial-bind failure releases the earlier port for retry; happy path flips readiness only after all binds succeed and clears it on shutdown.
- config: --config flag now wins over SANDBOX_ROUTER_CONFIG env var, matching the documented overall precedence (CLI > file > env > defaults). The previous order returned env first, contradicting the docstring. Flipped the codified test case.
- config/flags: drop stale KUBECONFIG mention from RegisterFlags docstring. The --kubeconfig flag was removed earlier in development when it collided with controller-runtime's auto-registered one; the comment was left behind.
- observability: tracing and OTel metrics push are now auto-enabled when OTEL_EXPORTER_OTLP_ENDPOINT (or the signal-specific _TRACES_ENDPOINT / _METRICS_ENDPOINT variants) is set. The README and flag help text already implied this behavior; the implementation now matches. Explicit --enable-tracing=false / --enable-otel-metrics=false still wins, detected via flag.Visit so we can distinguish "user did not set" from "user explicitly set to default value." Added TestApplyPostParseEnvDefaults covering 8 scenarios (generic endpoint, signal-specific endpoints, explicit false-override, explicit true-with-no-env, empty env value treated as unset).
- README: updated the flag table to call out the auto-enable behavior explicitly.
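The flag.Visit technique mentioned in the observability item can be sketched as below. `tracingEnabled` is a hypothetical function that takes the endpoint as a parameter instead of reading the environment; only the `--enable-tracing` flag name comes from the commit message.

```go
package main

import (
	"flag"
	"io"
)

// tracingEnabled (hypothetical helper) parses args and decides whether tracing
// is on. flag.Visit only walks flags that were set on the command line, which
// is what separates "not set" from "explicitly set to the default value".
func tracingEnabled(args []string, otlpEndpoint string) (bool, error) {
	fs := flag.NewFlagSet("sandbox-router", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	enable := fs.Bool("enable-tracing", false, "export traces via OTLP")
	if err := fs.Parse(args); err != nil {
		return false, err
	}

	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "enable-tracing" {
			explicit = true
		}
	})
	if explicit {
		return *enable, nil // an explicit flag always wins
	}
	return otlpEndpoint != "", nil // otherwise auto-enable when an endpoint is set
}
```

An explicit `--enable-tracing=false` therefore keeps tracing off even when an endpoint is configured, and an empty endpoint is treated as unset.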
Introduce an in-process Pod informer cache keyed by Sandbox CR UID so requests carrying the new X-Sandbox-UID header dial the live PodIP directly, bypassing DNS. Resolution priority remains stable: explicit X-Sandbox-Pod-IP > UID cache hit > DNS form. The informer is server-side filtered on the agents.x-k8s.io/sandbox-name-hash label so memory and API traffic scale with sandbox count, not cluster size. Only Pods that pass PodReady=True with a non-empty PodIP are stored; Pods that flip out of Ready are evicted so traffic does not get steered at degraded backends. Add active cache invalidation per the KEP: when the proxy dials a cached IP and the dial fails, the entry is evicted immediately so the next request for the same UID falls through to DNS instead of retrying the stale IP. A new sandbox_router_cache_invalidations_total counter tracks how often this fires. The cache is opt-in via --cache-enabled (default off). When enabled, the router blocks readiness on the initial Pod LIST so a misconfigured RBAC fails fast at startup rather than silently degrading to DNS-only service. Pod get/list/watch RBAC ships in deploy/rbac.yaml; the example deployment.yaml turns the flag on.
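The resolution priority and eviction behavior described above can be illustrated with a small in-memory stand-in. Everything here (`target`, `podCache`, `resolveHost`) is a hypothetical simplification; the real cache is fed by a Pod informer, which this sketch does not model.

```go
package main

import "sync"

// target holds the parsed routing inputs (hypothetical shape).
type target struct {
	ID, Namespace, UID, PodIP string
}

// podCache is a minimal UID -> PodIP map guarded by a mutex.
type podCache struct {
	mu  sync.RWMutex
	ips map[string]string
}

func newPodCache() *podCache { return &podCache{ips: map[string]string{}} }

func (c *podCache) put(uid, ip string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ips[uid] = ip
}

func (c *podCache) get(uid string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	ip, ok := c.ips[uid]
	return ip, ok
}

// evict drops an entry, for example after a dial to the cached IP fails.
func (c *podCache) evict(uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.ips, uid)
}

// resolveHost applies the priority: explicit Pod IP > UID cache hit > DNS form.
func resolveHost(t target, c *podCache, clusterDomain string) (host, source string) {
	if t.PodIP != "" {
		return t.PodIP, "pod-ip-header"
	}
	if t.UID != "" {
		if ip, ok := c.get(t.UID); ok {
			return ip, "uid-cache"
		}
	}
	return t.ID + "." + t.Namespace + ".svc." + clusterDomain, "dns"
}
```

After an eviction, the next request for the same UID falls through to the DNS form instead of retrying the stale IP.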
Introduce package authz with an Authorizer interface, the no-op AllowAll default, and helpers for the two credential shapes the router will see in practice: TLS client certs (IdentityFromTLS pulls SPIFFE → DNS SAN → CN with O groups) and Bearer tokens (BearerTokenFromRequest extracts and trims the Authorization header). The proxy now calls Authorize(r, ns, sandbox) after header parsing and before resolving the upstream. ErrUnauthenticated maps to 401, ErrForbidden to 403, anything else to 500; any unknown error is treated as an authorizer bug rather than a silent forbid. Per-decision metrics land in sandbox_router_authz_decisions_total. AllowAll is the only implementation wired into main.go today so the default behavior is unchanged. The TokenReview-backed authorizer that satisfies the KEP's authn requirement ships in the next commit.
Implement the KEP-NNNN authentication requirement: the router can now validate every inbound request's Authorization: Bearer token against the cluster's authentication.k8s.io/v1.TokenReview API. Decisions are cached in an LRU keyed by SHA-256 hash of the token (raw tokens never sit in memory) for a configurable TTL — short enough to catch revocations, long enough to keep authn off the hot path. Negative TokenReview results are cached at the full TTL so a flood of bad tokens does not amplify to apiserver load. API failures are cached briefly (TTL/3, minimum 1s) so a flapping apiserver self-heals without getting pummeled. Wired via --authz-mode=tokenreview with related --authz-tokenreview-* flags for TTL, cache size, audience filter, and require-token mode. Default remains allow-all to preserve the Python router contract. deploy/rbac.yaml adds the standard system:auth-delegator binding required by tokenreviews.authentication.k8s.io. Scope note: this is authentication only — the resulting principal is not yet checked against per-sandbox ownership. That tightening needs an agreed identity contract on the Sandbox CR and is tracked as follow-up.
…config flag controller-runtime's pkg/client/config registers a --kubeconfig flag in its package init. Our own RegisterFlags would panic on re-registration when main.go imports ctrl. Detect the existing flag, skip our duplicate StringVar, and pull the value into c.Kubeconfig in ApplyPostParseEnvDefaults so the rest of the code reads from a single field regardless of which package owns the flag. Add clients/go/sandbox-router/dev/smoke-test/run.sh: end-to-end verification on a real kind cluster covering DNS-form routing, UID cache hit, metrics exposure, active cache invalidation on pod deletion, and tokenreview-mode (rejecting tokenless requests with 401 and accepting a fresh projected SA token). Idempotent; tears its cluster down on exit unless KEEP_CLUSTER=1. Documented for use as the manual release-gate check rather than per-PR CI.
Add an auditor-facing note to deploy/rbac.yaml (and pointers from both READMEs) covering the gap between the RBAC grant and the runtime behavior: the grant has to be cluster-wide because K8s RBAC has no negative-namespace primitive, but the informer's server-side label selector (agents.x-k8s.io/sandbox-name-hash) means system-namespace Pods are never returned and never cached. Document the two ways to tighten the grant itself (enumerated per-namespace RoleBindings, Kyverno/OPA policy) for deployments that need RBAC and behavior to match exactly.
The KEP positions the sandbox-router as a top-level component of the
project. Move clients/go/sandbox-router/ → sandbox-router/ (matching
the controller's top-level layout) with the binary entry at
sandbox-router/cmd/main.go.
Mechanical rename only — no behavior change:
- `git mv` every file; main.go relocates from the package root into
cmd/ so the library packages and the binary stay clearly separated
- all `sigs.k8s.io/agent-sandbox/clients/go/sandbox-router/...`
import paths rewritten to `sigs.k8s.io/agent-sandbox/sandbox-router/...`
- Dockerfile narrows its COPY set to the dirs it actually needs
(sandbox-router, api, extensions/api, internal) and builds
./sandbox-router/cmd
- Makefile build-sandbox-router target points at ./sandbox-router/cmd
- dev/tools/push-images go_router_dir special case follows the move;
image name stays sandbox-router-go to avoid colliding with the
Python router still living at clients/python/...
- smoke-test paths and REPO_ROOT computation updated
- README cross-links updated
- smoke test gains a wait_router_serving probe (actively reaches
/healthz via the Service VIP) used after the tokenreview rollout
where iptables/IPVS plumbing can lag the endpoints update
Full unit + integration test suites green; kind smoke test (6/6) green.
The ServeHTTP handler wrapped every request in context.WithTimeout(ctx, ProxyTimeout), which silently tore down WebSocket / Upgrade connections at the timeout boundary. With the 180s default a healthy code-server editing session (single long-lived WebSocket per session) would surface to the client as a WebSocket close 1006 at the 3-minute mark. Detect Upgrade requests with httpguts.HeaderValuesContainsToken (same test httputil.ReverseProxy uses internally) and skip the WithTimeout wrapper for them; once the 101 handshake is done, TCP keepalive is the liveness signal, not our handler context. Normal HTTP requests continue to be bounded by ProxyTimeout. Add three integration tests covering: round-trip through the router to a WebSocket echo backend, an upgraded connection outliving a deliberately-tiny ProxyTimeout (the regression test for the reviewer comment), and a non-upgrade request still getting cut off by the timeout (guards against the carve-out being too broad). Documents the carve-out and the 101 metric semantics in the README.
Resolve() and UpstreamURL() were concatenating "host + \":\" + port" when building the upstream URL, which produces an unparseable string for IPv6 Pod IPs. Pod.Status.PodIP is a bare IPv6 literal on dual-stack or IPv6-only clusters, and "::1:8080" is ambiguous with the address itself — net/http rejects the URL before the request leaves the router. Swap both call sites to net.JoinHostPort, which brackets IPv6 literals per RFC 3986 and is a no-op for IPv4 / DNS names. Add table cases (loopback, full v6) to TestUpstreamURL and a TestResolveBracketsIPv6PodIP that exercises both the cache-hit and the X-Sandbox-Pod-IP override paths. Same bug, same fix as b841c55 on the ext_proc branch.
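The effect of net.JoinHostPort on IPv6 literals is easy to see in isolation. `upstreamURL` below is a hypothetical reduction of the URL construction, assuming a plain-HTTP upstream.

```go
package main

import (
	"net"
	"net/url"
	"strconv"
)

// upstreamURL (hypothetical helper) builds the upstream URL. net.JoinHostPort
// brackets IPv6 literals and leaves IPv4 addresses and DNS names unchanged.
func upstreamURL(host string, port int, path string) string {
	u := url.URL{
		Scheme: "http",
		Host:   net.JoinHostPort(host, strconv.Itoa(port)),
		Path:   path,
	}
	return u.String()
}
```

For example, `upstreamURL("fd00::1", 8080, "/api")` yields `http://[fd00::1]:8080/api`, whereas naive concatenation would produce the ambiguous `fd00::1:8080`.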
ParseSandboxHeaders only rejected non-numeric X-Sandbox-Port values (strconv.Atoi failure). Zero, negative, and out-of-range values sailed through and ended up in the upstream URL via net.JoinHostPort as a syntactically valid but semantically junk host:port that surfaces as an opaque 502 once net/http tries to dial it. Tighten the bound to [1, 65535] and add table cases for the four rejected values (0, -1, 65536) and the two accepted boundaries (1, 65535). Same fix as 6437679 on the ext_proc branch.
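A minimal sketch of the tightened bound, assuming the header value arrives as a string; `parsePort` and its error messages are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort (hypothetical helper) accepts only numeric ports in [1, 65535].
func parsePort(raw string) (int, error) {
	port, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("port %q is not numeric", raw)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("port %d is outside [1, 65535]", port)
	}
	return port, nil
}
```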
Two enhancements prompted by the WebSocket / vscode-server reference implementation at gke-labs/gemini-for-kubernetes-development:

1. Origin stripping on Upgrade. WebSocket backends that validate Origin == Host for CSRF protection (vscode-server, Jupyter, and friends) reject the upgrade with WebSocket close 1006 when the router rewrites Host but leaves Origin pointing at the router's external hostname. Drop Origin on upgrade requests so the backend sees "no Origin assertion available"; CSRF-aware backends typically allow that path for non-browser callers, and vscode / Jupyter work as-is. Normal HTTP requests preserve Origin so CORS preflights stay intact.
2. X-Forwarded-Host / -Proto / -For on every outbound request via httputil.ReverseProxy's SetXForwarded helper. Browser-facing backends (Jupyter, vscode-server) need these to construct correct self-links and redirects when sitting behind a proxy that rewrites Host.

Reuse the upgrade bool the timeout block already computes so the Rewrite callback and the WithTimeout carve-out stay in sync.

Tests: TestIntegration_WebSocketStripsOriginOnUpgrade (sets Origin on the dial, asserts the backend sees it empty), TestIntegration_NonUpgradePreservesOrigin (guard against the strip leaking into regular HTTP and breaking CORS), TestIntegration_XForwardedHeadersSet (Host matches router VIP, Proto=http for plain-HTTP, For non-empty).

README "WebSockets" section documents both behaviors.
Force-pushed: d04bfb5 to 651c119
Reviewer @aditya-shantanu flagged two missed checks the Python router performs, and an audit against sandbox_router.py found two more. Close all four to deliver on the "drop-in replacement" claim.

1. X-Sandbox-Pod-IP class check (SSRF). Python rejects loopback / link-local / multicast / unspecified addresses via ipaddress.ip_address(). We were storing the header verbatim and dialing it. A caller could set X-Sandbox-Pod-IP: 169.254.169.254 to reach cloud metadata, 127.0.0.1 to hit the router pod's own loopback, etc., and with AllowAll as the default authorizer there was no compensating control. validPodIP uses net.ParseIP + the equivalent class checks.
2. X-Sandbox-ID DNS-label validation. Python validates the ID against a DNS-1123 label regex specifically to block DNS injection ("foo.evil.com") and traversal-style ("foo/bar") inputs that would otherwise be interpolated into "<id>.<ns>.svc.<cluster-domain>". We validated namespace and port but not ID. Now applied to both ID and namespace through a shared validDNSLabel helper that matches Python's ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ with the 63-char cap.
3. Tightened namespace validation. The old validNamespace was more permissive than Python's check (allowed uppercase, leading/trailing hyphens, unbounded length). K8s itself won't accept namespaces matching the loose rule, so the practical risk is just routing to FQDNs that can't exist, but tightening to the same DNS-1123 label rule used for ID keeps the validation surface uniform and matches Python exactly.
4. Authorization header strip. Python drops Authorization right next to Host before forwarding. We were forwarding it verbatim. With --authz-mode=tokenreview the router consumes the caller's K8s bearer token; leaking it to the sandbox would let any sandbox impersonate the caller against the K8s API or any other Bearer-protected service. Done in the Rewrite callback alongside the existing Host strip.

Loopback escape hatch: --allow-loopback-pod-ip (default false). The sidecar deployment shape (sandbox shares a Pod with the router, so 127.0.0.1 is the correct dial target) is a legitimate use case and integration tests need it too. Link-local, multicast, and unspecified classes stay rejected regardless of this flag.

ParseSandboxHeaders gains a ParseOptions struct so the loopback toggle threads cleanly through without a positional bool. The Handler reads h.cfg.AllowLoopbackPodIP and passes it through.

Tests: existing integration tests flip AllowLoopbackPodIP=true after config.Defaults() since httptest binds to 127.0.0.1. New cases cover DNS-label rejection on ID (dot, slash, underscore, uppercase, leading / trailing hyphen), Pod-IP class rejection (loopback v4/v6, 169.254.169.254, fe80::1, 224.0.0.1, 0.0.0.0, ::), routable accept (v4 + v6), and the loopback flag's effect. New TestValidDNSLabel + TestValidPodIP cover the helpers directly. TestIntegration_AuthorizationStrippedFromUpstream asserts the Authorization header does not reach the backend.

README documents the new validation surface and error responses, adds the --allow-loopback-pod-ip flag row, and notes the Authorization strip in the routing-contract paragraph.
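The DNS-label rule quoted in this commit (the `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` pattern with a 63-character cap) translates to Go as sketched below. The helper name matches the commit message, but the signature and body are assumptions.

```go
package main

import "regexp"

// dnsLabelRE is the pattern quoted in the commit message.
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validDNSLabel (signature assumed) reports whether s is a single DNS-1123
// label: lowercase alphanumerics and hyphens, alphanumeric at both ends, and
// at most 63 characters. Dots and slashes are rejected, which blocks inputs
// such as "foo.evil.com" and "foo/bar".
func validDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRE.MatchString(s)
}
```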
…st trust

Mirror the parity fixes that just landed on PR kubernetes-sigs#838, in the shapes that fit the ext_proc design. Together with the per-PR ingress strip, this closes the four-finding security review from @aditya-shantanu.

1. X-Sandbox-ID DNS-label validation. The Python router runs ID through ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ (max 63 chars) so a caller can't interpolate extra DNS components into "<id>.<ns>.svc.<cluster-domain>" via the DNS-form fallback. We validated namespace and port but not ID. Now applied to both ID and namespace through a shared validDNSLabel helper.

2. Tightened namespace validation. The previous validNamespace allowed uppercase, leading/trailing hyphens, and unbounded length. K8s itself rejects all three, so the practical risk was just routing to FQDNs that can't exist, but tightening to the same DNS-1123 rule used for ID keeps the validation surface uniform and matches Python exactly. validNamespace removed in favor of validDNSLabel.

3. Authorization header strip. Always add "authorization" to the HeaderMutation.RemoveHeaders so the sandbox never sees the caller's bearer credential. Without this, a sandbox could impersonate the caller against the K8s API or any other Bearer-protected service. Matches the Python router and the same fix on kubernetes-sigs#838.

4. x-envoy-original-dst-host ingress strip (defense in depth). A new envoy.filters.http.header_mutation filter runs BEFORE ext_proc in the HCM chain and removes any client-supplied x-envoy-original-dst-host. ext_proc still always sets it via HeaderMutation, so after the filter chain the value reaching the ORIGINAL_DST cluster is provably the one ext_proc wrote, or absent if a future route disables ext_proc, in which case the cluster fails closed with 503 rather than dispatching to whatever the client asked for. Without this, the security of the data path would rest on "ext_proc is enabled on every route", which the existing /healthz route already demonstrates is not a load-bearing assumption.

Tests: TestHandle_InvalidIDRejected covers the six classes of DNS-injection / traversal inputs (dot, slash, underscore, uppercase, leading hyphen, trailing hyphen). TestHandle_AlwaysStripsAuthorization asserts the RemoveHeaders mutation contains "authorization" on both upgrade and non-upgrade paths. TestValidDNSLabel replaces the old TestValidNamespace with the stricter table (accepts 1abc per RFC 1123, rejects MY-NS, -x, x-, length > 63).

README documents the new validation surface, the headers we strip before forwarding (Authorization, Origin-on-upgrade, and the listener-level x-envoy-original-dst-host strip), and the rationale for each.
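The shared DNS-label check named in both commit messages can be sketched as follows. The regex and the 63-character cap come from the text above; the variable name and the exact function shape are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabelRE is the DNS-1123 label pattern quoted in the commit message.
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validDNSLabel reports whether s is a single lowercase DNS label of at
// most 63 characters. Hypothetical shape; the PR's helper may differ.
func validDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRE.MatchString(s)
}

func main() {
	// Inputs drawn from the cases the commit messages mention.
	for _, s := range []string{"sandbox-1", "1abc", "foo.evil.com", "foo/bar", "foo_bar", "MY-NS", "-x", "x-"} {
		fmt.Printf("%-14q valid=%v\n", s, validDNSLabel(s))
	}
}
```

Because a dot or slash can never match, a rejected ID cannot add extra components to the interpolated "<id>.<ns>.svc.<cluster-domain>" name.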
mastersingh24
left a comment
@aditya-shantanu — thanks, both findings were real. Fixed in 4bc00c6, along with two related gaps the audit surfaced:
- `X-Sandbox-Pod-IP` SSRF (your finding): added `validPodIP` using `net.ParseIP` + the same class checks (`IsUnspecified` / `IsLoopback` / `IsLinkLocalUnicast` / `IsLinkLocalMulticast` / `IsMulticast`). Rejects `169.254.169.254`, `127.0.0.1`, `fe80::1`, etc. with a 400 + the Python-router error shape.
- `X-Sandbox-ID` DNS-label check (your finding): added `validDNSLabel` matching the Python `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` + 63-char cap, applied to both `ID` and `Namespace`. The old `validNamespace` was also more permissive than Python's regex (accepted uppercase, leading/trailing hyphens, unbounded length); both fields now share the strict check.
- `Authorization` header strip (surfaced by the audit): the Python router drops `Authorization` right next to `Host` before forwarding, and we were leaving it. With `--authz-mode=tokenreview` the router consumes the caller's K8s bearer token; leaking it to the sandbox would let the sandbox impersonate the caller. Done in the `Rewrite` callback. Integration test `TestIntegration_AuthorizationStrippedFromUpstream` asserts the backend sees an empty `Authorization`.
One pragmatic addition: --allow-loopback-pod-ip (default false, the safe behavior). The sidecar deployment shape (sandbox shares the router's Pod, so 127.0.0.1 is the correct dial address) is a legitimate use case, and our integration tests using httptest also need it. Link-local, multicast, and unspecified classes stay rejected regardless of the flag.
Same X-Sandbox-ID DNS-label gap + Authorization strip + ingress-trust point landed on #850 in b23a65b.
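For readers unfamiliar with the `Rewrite` callback, the `Authorization` strip discussed in this comment can be sketched with the standard library's `httputil.ReverseProxy`. This is an illustrative sketch, not the PR's handler: `newProxy` and `authorizationSeenUpstream` are hypothetical names, and the real router resolves its target per request from the `X-Sandbox-*` headers.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newProxy forwards to a fixed target and removes the caller's
// Authorization header before the request leaves the proxy.
func newProxy(target *url.URL) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			pr.SetURL(target)                    // route to the upstream; also rewrites Host
			pr.Out.Header.Del("Authorization") // never leak the bearer token upstream
		},
	}
}

// authorizationSeenUpstream sends one request carrying a bearer token
// through the proxy and returns the value the backend observed.
func authorizationSeenUpstream() string {
	seen := make(chan string, 1)
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen <- r.Header.Get("Authorization")
		io.WriteString(w, "ok")
	}))
	defer backend.Close()

	target, _ := url.Parse(backend.URL)
	router := httptest.NewServer(newProxy(target))
	defer router.Close()

	req, _ := http.NewRequest(http.MethodGet, router.URL, nil)
	req.Header.Set("Authorization", "Bearer caller-token")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "error: " + err.Error()
	}
	resp.Body.Close()
	return <-seen
}

func main() {
	fmt.Printf("backend saw Authorization=%q\n", authorizationSeenUpstream())
}
```

Deleting the header on `pr.Out` leaves the inbound request untouched, so an authorizer that runs before proxying can still read the token.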
This PR adds a Go reimplementation of the sandbox-router at
`clients/go/sandbox-router/`, as a drop-in alternative to the existing
Python router at `clients/python/agentic-sandbox-client/sandbox-router/`.

Why
The Python router is a small reverse proxy that fans HTTP traffic out to
sandbox pods by reading `X-Sandbox-*` headers and constructing the
internal Kubernetes DNS name. It works, but for enterprise deployments it
has gaps: no TLS, no mTLS, no metrics, no structured access logging, no
tracing, no per-request retry on dial failures.
This Go version closes those gaps while keeping the contract identical
so existing callers (the Go and Python SDKs, plus any direct HTTP clients)
keep working unchanged.
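As a rough illustration of the routing step described above, the upstream address can be derived from the validated header values. The "<id>.<ns>.svc.<cluster-domain>" form is quoted elsewhere in this PR; the function name, the parameter order, and the assumption that a supplied Pod IP takes precedence over the DNS form are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// upstreamAddr returns the host:port the router would dial.
// Assumption: when podIP is non-empty it is dialed directly; otherwise
// the in-cluster DNS name is built from the sandbox ID and namespace.
// All inputs are expected to be validated before this is called.
func upstreamAddr(podIP, id, namespace, port, clusterDomain string) string {
	if podIP != "" {
		// JoinHostPort adds brackets around IPv6 literals.
		return net.JoinHostPort(podIP, port)
	}
	host := id + "." + namespace + ".svc." + clusterDomain
	return net.JoinHostPort(host, port)
}

func main() {
	// Example values are illustrative, not the router's documented defaults.
	fmt.Println(upstreamAddr("", "sb-1", "default", "8888", "cluster.local"))
	fmt.Println(upstreamAddr("fd00::5", "sb-1", "default", "8888", "cluster.local"))
}
```

The string concatenation is the reason ID and namespace need strict label validation: any dot or slash that slipped through would change which host is dialed.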
Scope
Drop-in protocol contract preserved:
- `sandbox-router-svc` on port 8080
- `X-Sandbox-ID` / `X-Sandbox-Namespace` / `X-Sandbox-Port` / `X-Sandbox-Pod-IP` headers with the same defaults and validation rules
- `{"detail": "..."}` JSON error shape
- `PROXY_TIMEOUT_SECONDS` and `CLUSTER_DOMAIN` env-var support
- `GET /healthz` returning `{"status":"ok"}` (used by Gateway HealthCheckPolicy)
Production controls added:
- TLS / mTLS with certificate hot-reload (watches the certificate's parent directory, safe under K8s Secret atomic-rename rotation)
- Structured access logging with `trace_id` and `span_id` baked into every per-request log line
- Upstream dial retries (`--upstream-max-retries`; only dial-class failures are retried so request bodies are never replayed)
- Prometheus metrics at `/metrics` and optional OTLP push via the OpenTelemetry → Prometheus bridge (`--enable-otel-metrics`)
- Tracing (`--enable-tracing`); trace context propagated to the upstream sandbox
- Config file (`--config` / `SANDBOX_ROUTER_CONFIG`); precedence is CLI > file > env > defaults
- Request body size cap (`--max-request-body-bytes`)
- Distroless non-root base image (`gcr.io/distroless/static:nonroot`)

Tested: unit tests across every package, integration tests
(`-tags=integration`) covering end-to-end proxy round-trip with body
streaming, healthz, all three mTLS modes with real handshakes, retry
success when backend comes up mid-window, retry give-up within budget.
All 14 cases from the Python router's `test_sandbox_router.py` have Go
equivalents.
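The "only dial-class failures are retried" rule from the production controls can be sketched as a wrapper around the dial function, so that a retry happens before any request bytes are written. This is a minimal sketch under assumptions: `retryDial`, `dialFunc`, and the doubling backoff are illustrative and not taken from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFunc matches the signature of net.Dialer.DialContext.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// retryDial wraps dial so a failed connection attempt is retried up to
// maxRetries times with a doubling delay. Because the retry happens at
// dial time, no request body has been sent and nothing is replayed.
func retryDial(dial dialFunc, maxRetries int, base time.Duration) dialFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		delay := base
		for attempt := 0; ; attempt++ {
			conn, err := dial(ctx, network, addr)
			if err == nil {
				return conn, nil
			}
			if attempt >= maxRetries {
				return nil, fmt.Errorf("dial %s: gave up after %d attempts: %w", addr, attempt+1, err)
			}
			select {
			case <-ctx.Done(): // respect the caller's deadline or cancellation
				return nil, ctx.Err()
			case <-time.After(delay):
			}
			delay *= 2
		}
	}
}

// attempts runs retryDial against a fake dialer that fails failFirst times
// and reports how many dial calls were made.
func attempts(failFirst, maxRetries int) (int, error) {
	calls := 0
	fake := func(ctx context.Context, network, addr string) (net.Conn, error) {
		calls++
		if calls <= failFirst {
			return nil, errors.New("connection refused")
		}
		c, peer := net.Pipe()
		peer.Close()
		return c, nil
	}
	conn, err := retryDial(fake, maxRetries, time.Millisecond)(context.Background(), "tcp", "sandbox:8888")
	if conn != nil {
		conn.Close()
	}
	return calls, err
}

func main() {
	calls, err := attempts(2, 3)
	fmt.Println("backend came up mid-window:", calls, err)
	calls, err = attempts(5, 2)
	fmt.Println("gave up within budget:", calls, err)
}
```

A wrapper like this could be assigned to `http.Transport.DialContext`, for example `retryDial((&net.Dialer{}).DialContext, 3, 50*time.Millisecond)`, which keeps the retry logic out of the request path entirely.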
Compatibility
The Python router source stays in the tree; the Go router builds as a
separate image named `sandbox-router-go` to avoid colliding with the
Python `sandbox-router` image (this required a small patch to
`dev/tools/push-images`). Operators opt in by switching their workload's
image. Existing Gateway / HTTPRoute / Service manifests don't change.

Example Kubernetes manifests are in `clients/go/sandbox-router/deploy/`
(Deployment, Service, PDB, NetworkPolicy, ServiceAccount) with a README
walking through what to tighten for production.

A local load-test harness is in `dev/load-test/router/`; reference
throughput / latency numbers are captured in the package README's
"Scaling guidance" section.
Files
New code under `clients/go/sandbox-router/` (config, proxy, tlsutil,
observability, server packages + main + Dockerfile + deploy/), plus a
load-test harness under `dev/load-test/router/`. The `Makefile` gets a
`build-sandbox-router` target. `go.mod` promotes `fsnotify` and
`golang.org/x/sync` from indirect, and adds the OTLP metric exporter,
the OTel SDK metric package, and the OTel-to-Prometheus bridge.
Release Note