[Bug] SandboxClaim controller startup latency histogram can overcount Ready transitions under high concurrency #940

@zf930530

What happened?

While running repeated high-concurrency SandboxClaim load tests, I observed that the agent_sandbox_claim_controller_startup_latency_ms histogram sometimes reports more observations than the number of SandboxClaims created in the test batch.

In my environment, I repeatedly created batches of 500 SandboxClaims concurrently and in quick succession. I expected the histogram delta for each batch to be exactly 500 observations, but it was occasionally slightly higher:

  • expected: agent_sandbox_claim_controller_startup_latency_ms_count delta = 500
  • observed intermittently: _count delta = 502 or 503
  • the corresponding _bucket samples, including le="+Inf", appear to reflect the same overcount

During some runs I also observed SandboxClaims in Terminating state while the test was still settling.

This makes the histogram unsuitable as an exact per-claim throughput/sample-count source in high-concurrency tests, because the count/bucket deltas can be slightly higher than the number of created claims.

My current suspicion is that, under high concurrency, the controller can record the Ready transition more than once for a small number of claims. One possible path is repeated reconcile events combined with status/cache lag: one reconcile records the transition to Ready and patches the status, then a subsequent reconcile reads a stale snapshot in which the claim is not yet Ready and records the same transition again.
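To make the suspected path concrete, here is a toy model of it in Python. It is not the controller's code: `Claim`, `Recorder`, `reconcile`, and `run` are hypothetical names, the stale cache read is simulated by reusing one snapshot for two reconciles, and the UID-based guard is only one possible way to make the observation idempotent, not a confirmed fix.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Claim:
    """Hypothetical stand-in for a SandboxClaim as stored in the API server."""
    uid: str
    ready: bool = False


@dataclass
class Recorder:
    """Hypothetical stand-in for the startup latency histogram."""
    observations: List[str] = field(default_factory=list)
    seen: Set[str] = field(default_factory=set)

    def observe_naive(self, uid: str) -> None:
        # Records every call, so a repeated transition is counted twice.
        self.observations.append(uid)

    def observe_once(self, uid: str) -> None:
        # Records at most one observation per claim UID.
        if uid in self.seen:
            return
        self.seen.add(uid)
        self.observations.append(uid)


def reconcile(snapshot_ready: bool, claim: Claim,
              observe: Callable[[str], None]) -> None:
    """Record the Ready transition based on a possibly stale snapshot."""
    if not snapshot_ready:
        observe(claim.uid)
        claim.ready = True  # the status patch reaches the API server


def run(observe_name: str) -> int:
    """Run two reconciles that both read the same stale snapshot."""
    recorder = Recorder()
    claim = Claim(uid="claim-1")
    observe = getattr(recorder, observe_name)
    stale_ready = claim.ready  # cache snapshot taken before any patch
    reconcile(stale_ready, claim, observe)
    reconcile(stale_ready, claim, observe)  # cache has not caught up yet
    return len(recorder.observations)


print(run("observe_naive"))  # 2 observations for one claim
print(run("observe_once"))   # 1 observation for one claim
```

In this model the naive recorder counts the same claim twice, which is the shape of the 502/503 overcount, while the guarded recorder counts it once. An in-memory set like `seen` would be lost on controller restart, so a real fix would likely need a different source of truth.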

How can we reproduce it (as minimally and precisely as possible)?

I do not yet have a minimal standalone reproducer, but the issue shows up in repeated batch/load testing:

  1. Run a Kubernetes cluster with multiple control-plane and worker nodes.
  2. Deploy agent-sandbox with extensions enabled.
  3. Create a SandboxTemplate and, if applicable, a warm pool for the test path.
  4. Scrape the controller metrics endpoint and record the current values for:
    • agent_sandbox_claim_controller_startup_latency_ms_count
    • agent_sandbox_claim_controller_startup_latency_ms_bucket
  5. Concurrently/frequently create 500 SandboxClaims.
  6. Wait for the claims to become Ready and for the test batch to settle.
  7. Scrape the same metrics again and compute the delta.
  8. Repeat the test multiple times.

Intermittently, the histogram delta is greater than the 500 claims created in the batch, for example 502 or 503.
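Steps 4 and 7 can be sketched as a small Python helper that parses two scrapes in the Prometheus text exposition format and computes the delta for `_count` and the `le="+Inf"` bucket. This is a minimal sketch: `histogram_totals` and `batch_delta` are hypothetical helper names, the parser handles only simple sample lines, and the `BEFORE`/`AFTER` strings are made-up illustrative values, not output from the actual test runs.

```python
import re
from typing import Dict

METRIC = "agent_sandbox_claim_controller_startup_latency_ms"

# Matches "name{labels} value" or "name value"; an optional timestamp is ignored.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def histogram_totals(exposition: str, metric: str = METRIC) -> Dict[str, float]:
    """Sum _count and the le="+Inf" bucket across all label sets in one scrape."""
    totals = {"count": 0.0, "inf_bucket": 0.0}
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name = match.group("name")
        labels = match.group("labels") or ""
        value = float(match.group("value"))
        if name == metric + "_count":
            totals["count"] += value
        elif name == metric + "_bucket" and 'le="+Inf"' in labels:
            totals["inf_bucket"] += value
    return totals


def batch_delta(before: str, after: str, metric: str = METRIC) -> Dict[str, float]:
    """Return the per-batch increase between two scrapes."""
    b = histogram_totals(before, metric)
    a = histogram_totals(after, metric)
    return {key: a[key] - b[key] for key in a}


# Illustrative values only, chosen to mirror the 503 case described above.
BEFORE = """
# TYPE agent_sandbox_claim_controller_startup_latency_ms histogram
agent_sandbox_claim_controller_startup_latency_ms_bucket{le="1000"} 90
agent_sandbox_claim_controller_startup_latency_ms_bucket{le="+Inf"} 100
agent_sandbox_claim_controller_startup_latency_ms_count 100
"""
AFTER = """
agent_sandbox_claim_controller_startup_latency_ms_bucket{le="1000"} 580
agent_sandbox_claim_controller_startup_latency_ms_bucket{le="+Inf"} 603
agent_sandbox_claim_controller_startup_latency_ms_count 603
"""

delta = batch_delta(BEFORE, AFTER)
created = 500
print(delta, "excess observations:", delta["count"] - created)
```

Comparing `delta["count"]` against the number of claims created in the batch gives the overcount directly, and checking that `delta["inf_bucket"]` matches it confirms that the buckets reflect the same extra observations.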

Version

  • Kubernetes: v1.29.14
  • Cluster shape: 3 control-plane nodes + 4 worker nodes
  • agent-sandbox: v0.4.3
  • Note: local code-path review for the suspected cause was later done against main at commit 32c4f23, but the observed test run used v0.4.3.

Anything else we need to know?

Scope note: this report is about the histogram recording more observations than the number of claims created, not about the recorded latency values being slightly high or low.

Metadata

Labels

priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Projects

Status: Backlog
