[Bug] agent_sandbox_claim_creation_total cold-start path always records pod_condition="not_ready" #872

@wll1203

Summary

The agent_sandbox_claim_creation_total counter is labeled with pod_condition ("ready" / "not_ready"). For claims served via warm-pool adoption, this label correctly reflects the pod's readiness at the moment of adoption. For cold-start claims, the label is hardcoded to "not_ready" and is never updated when the underlying pod subsequently becomes Ready. As a result, every successful cold-start claim is permanently counted as not_ready in the metric.

Steps to reproduce

  1. Deploy the agent-sandbox controller.
  2. Create a SandboxTemplate with no associated warm pool (or create more SandboxClaims than the warm pool can satisfy).
  3. Create a SandboxClaim.
  4. Wait for the resulting pod to reach Ready.
  5. Scrape the metrics endpoint and query agent_sandbox_claim_creation_total.

Observed:

agent_sandbox_claim_creation_total{launch_type="cold", pod_condition="not_ready", ...} 1

No pod_condition="ready" series is ever incremented for a cold-start claim. The launch_type="cold", pod_condition="ready" counter stays at zero regardless of how many cold-start claims succeed.

By contrast, claims adopted from a warm pool produce:

agent_sandbox_claim_creation_total{launch_type="warm", pod_condition="ready", ...} 1

(when the warm pod was already Ready at adoption time).
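To check a scrape for this pattern without a Prometheus server, the text output can be summed per (launch_type, pod_condition) pair. The following is a rough sketch using only the Go standard library; the helper name sumByCondition and the sample scrape text (including the template label) are made up for illustration and are not taken from the repository.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

const metricName = "agent_sandbox_claim_creation_total"

// labelRe matches one key="value" pair inside the braces of a sample line.
var labelRe = regexp.MustCompile(`(\w+)="([^"]*)"`)

// sumByCondition (hypothetical helper) adds up sample values of the metric
// per {launch_type, pod_condition} pair, ignoring all other labels.
func sumByCondition(scrape string) map[[2]string]float64 {
	totals := make(map[[2]string]float64)
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricName+"{") {
			continue // comments, other metrics, blank lines
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := map[string]string{}
		for _, m := range labelRe.FindAllStringSubmatch(line[len(metricName)+1:end], -1) {
			labels[m[1]] = m[2]
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		totals[[2]string{labels["launch_type"], labels["pod_condition"]}] += v
	}
	return totals
}

func main() {
	// Illustrative scrape text, not real controller output.
	scrape := `
# TYPE agent_sandbox_claim_creation_total counter
agent_sandbox_claim_creation_total{launch_type="cold", pod_condition="not_ready", template="a"} 4
agent_sandbox_claim_creation_total{launch_type="warm", pod_condition="ready", template="a"} 2
`
	totals := sumByCondition(scrape)
	fmt.Println("cold/ready:    ", totals[[2]string{"cold", "ready"}])
	fmt.Println("cold/not_ready:", totals[[2]string{"cold", "not_ready"}])
}
```

With the bug present, the cold/ready total stays at 0 while cold/not_ready grows with every cold-start claim.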

Root cause

In extensions/controllers/sandboxclaim_controller.go:

Warm path (adoptSandboxFromCandidates, ~line 706) correctly derives the condition from sandbox state:

podCondition := "not_ready"
if isSandboxReady(adopted) {
    podCondition = "ready"
}
asmetrics.RecordSandboxClaimCreation(
    claim.Namespace, claim.Spec.TemplateRef.Name,
    asmetrics.LaunchTypeWarm, poolName, podCondition,
)

Cold path (createSandbox, ~line 1070) hardcodes the literal "not_ready" immediately after r.Create(ctx, sandbox):

asmetrics.RecordSandboxClaimCreation(
    claim.Namespace, claim.Spec.TemplateRef.Name,
    asmetrics.LaunchTypeCold, "none", "not_ready",
)

The Sandbox CR has only just been created at this point, so the pod isn't running yet and "not_ready" is accurate at that instant. The problem is that RecordSandboxClaimCreation is never called again for this claim, so the pod_condition="not_ready" increment becomes the permanent record of the claim's outcome. agent_sandbox_claim_creation_total is a Counter, and a Counter can only increase: once a label set has been incremented, there is no way to retroactively reclassify the claim as ready after the pod comes up.
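The effect can be illustrated with a toy model. This is a sketch only: counterVec and recordColdCreate are hypothetical stand-ins written for this example, not the Prometheus client or the controller code, and only the two labels discussed here are modeled.

```go
package main

import "fmt"

// labelSet carries the two labels discussed in this issue; other labels are omitted.
type labelSet struct {
	launchType   string
	podCondition string
}

// counterVec is a toy stand-in for a labeled counter. Inc is its only
// mutation, mirroring the fact that a counter series can only go up.
type counterVec map[labelSet]float64

func (c counterVec) Inc(ls labelSet) { c[ls]++ }

// recordColdCreate models the cold path: it runs once per claim, right
// after the Sandbox is created, and always uses "not_ready".
func recordColdCreate(c counterVec) {
	c.Inc(labelSet{launchType: "cold", podCondition: "not_ready"})
}

func main() {
	c := counterVec{}
	for i := 0; i < 3; i++ {
		recordColdCreate(c)
		// The pod becomes Ready later, but nothing records that transition.
	}
	fmt.Println(c[labelSet{"cold", "not_ready"}], c[labelSet{"cold", "ready"}]) // prints "3 0"
}
```

Three successful cold-start claims leave the not_ready series at 3 and the ready series at 0, which matches the observed output above.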

A repo-wide grep for RecordSandboxClaimCreation confirms these are the only two call sites:

extensions/controllers/sandboxclaim_controller.go:710  (warm)
extensions/controllers/sandboxclaim_controller.go:1070 (cold)
internal/metrics/metrics.go                            (definition)

Labels: priority/backlog
