Summary
The agent_sandbox_claim_creation_total counter is labeled with pod_condition ("ready" / "not_ready"). For claims served via warm-pool adoption, this label correctly reflects the pod's readiness at the moment of adoption. For cold-start claims, the label is hardcoded to "not_ready" and is never updated when the underlying pod subsequently becomes Ready. As a result, every successful cold-start claim is permanently counted as not_ready in the metric.
Steps to reproduce
- Deploy the agent-sandbox controller.
- Create a SandboxTemplate with no associated warm pool (or create more SandboxClaims than the warm pool can satisfy).
- Create a SandboxClaim.
- Wait for the resulting pod to reach Ready.
- Scrape the metrics endpoint and query agent_sandbox_claim_creation_total.
Observed:
agent_sandbox_claim_creation_total{launch_type="cold", pod_condition="not_ready", ...} 1
There is no corresponding pod_condition="ready" series for that same claim. The cold-start counter for pod_condition="ready" stays at zero regardless of how many claims succeed.
By contrast, claims adopted from a warm pool produce:
agent_sandbox_claim_creation_total{launch_type="warm", pod_condition="ready", ...} 1
(when the warm pod was already Ready at adoption time).
Root cause
In extensions/controllers/sandboxclaim_controller.go:
Warm path (adoptSandboxFromCandidates, ~line 706) correctly derives the condition from sandbox state:
podCondition := "not_ready"
if isSandboxReady(adopted) {
	podCondition = "ready"
}
asmetrics.RecordSandboxClaimCreation(
	claim.Namespace, claim.Spec.TemplateRef.Name,
	asmetrics.LaunchTypeWarm, poolName, podCondition,
)
Cold path (createSandbox, ~line 1070) hardcodes the literal "not_ready" immediately after r.Create(ctx, sandbox):
asmetrics.RecordSandboxClaimCreation(
	claim.Namespace, claim.Spec.TemplateRef.Name,
	asmetrics.LaunchTypeCold, "none", "not_ready",
)
The Sandbox CR has only just been created at this point, so the pod isn't running yet and "not_ready" is mechanically correct at that instant. The problem is that RecordSandboxClaimCreation is never called again for this claim, so the increment with pod_condition="not_ready" becomes the permanent record of the claim's outcome. Because agent_sandbox_claim_creation_total is a Counter, an increment on a given label set cannot be undone or moved to another label set, so there is no way to retroactively re-classify the claim as Ready after the pod comes up.
A repo-wide grep for RecordSandboxClaimCreation confirms these are the only two call sites:
extensions/controllers/sandboxclaim_controller.go:710 (warm)
extensions/controllers/sandboxclaim_controller.go:1070 (cold)
internal/metrics/metrics.go (definition)