feat: warmpool selection strategy#878

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:warmpoolSelectionStrategy
Jun 5, 2026

Conversation

@vicentefb
Member

@vicentefb vicentefb commented May 28, 2026

What this PR does / why we need it:

This PR introduces a unified smart sandbox selection strategy for warm pool adoption. Based on review feedback, we eliminated the configurable adoptionStrategy field from SandboxTemplate to simplify the user-facing API. The controller now automatically applies this smart strategy by default.

Changes:

  • Sandbox API Update (SandboxStatus.NodeName):

    • Added nodeName string to SandboxStatus to track the name of the node where the underlying sandbox pod is scheduled. This allows controllers to inspect workload distribution dynamically.
    • Regenerated deepcopy files and CRDs.
  • Smart Sandbox Selection Strategy (smartSelector):

    • The controller automatically adopts sandboxes from the warm pool using a decoupled, package-level selection strategy:
      • Readiness Priority: Prioritizes ready sandboxes over unready ones to ensure instant workload availability.
      • Node Work Spread: Prioritizes spreading workloads across different GKE/Kubernetes nodes by preferring candidate sandboxes on the node with the fewest active workloads (using the new nodeName status field).
      • Oldest Ready Fallback: Resolves ties (equal node workloads or unready candidates) by choosing the oldest created sandbox (FIFO fallback).
  • Queue Self-Cleaning & Fallback:

    • If a queued sandbox key refers to a deleted ("ghost") sandbox that no longer exists in the cluster, or the sandbox fails candidate validation (e.g. missing warm pool labels or an incorrect template hash), the selector immediately returns key, true so that the key is popped and discarded from the in-memory queue.
    • Propagates transient r.List API server errors immediately rather than silently falling back to a cold-start, ensuring correct controller requeuing.
  • Thread-Safe Queue Pops with Retry Loop:

    • Extended SimpleSandboxQueue and synchronizedQueue to support GetWithStrategy and PopWithStrategy operations.
    • Implemented a retry loop inside PopWithStrategy to handle races safely: if another goroutine pops the selected key while selection is in progress, the pop retries with a fresh snapshot rather than failing.
  • Performance & API Alignment:

    • Aligned queue keying to use warmPoolName instead of templateHash to follow the v1beta1 API model. Partitioning the queues per warm pool prevents evaluating candidates from other warm pools.
    • Added upfront namespace filtering in the smart selector before calling r.Get to save expensive Kubernetes API round-trips.
    • Updated API definitions in extensions/api/v1beta1/sandboxtemplate_types.go, regenerated docs/api.md, and updated all generated deepcopy files and CRDs.
  • Unit Testing:

    • Added a comprehensive unit test suite TestSmartSelector in extensions/controllers/sandboxclaim_controller_test.go covering readiness prioritization, node spreading, namespace filtering, and queue cleaning/deletion logic.
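
The three-level ordering described under "Smart Sandbox Selection Strategy" can be sketched as a comparator. This is a minimal illustration, not the PR's smartSelector: the candidate struct, the pickBest function, and the nodeLoad map are hypothetical names, and it assumes node spreading is applied only among ready candidates while unready ones are ordered by age alone.

```go
package main

import "sort"

// candidate is a hypothetical, trimmed-down view of a warm-pool sandbox.
type candidate struct {
	Name      string
	NodeName  string // mirrors the new status.nodeName field
	Ready     bool
	CreatedAt int64 // Unix seconds; stands in for metadata.creationTimestamp
}

// pickBest returns the preferred candidate, or false when there are none.
// nodeLoad maps a node name to the number of active workloads on that node.
func pickBest(cands []candidate, nodeLoad map[string]int) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	// Sort a copy so the caller's slice keeps its order.
	sorted := append([]candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Ready != b.Ready {
			return a.Ready // 1. ready before unready
		}
		if a.Ready {
			// 2. among ready candidates, prefer the least-loaded node
			if la, lb := nodeLoad[a.NodeName], nodeLoad[b.NodeName]; la != lb {
				return la < lb
			}
		}
		return a.CreatedAt < b.CreatedAt // 3. oldest first (FIFO fallback)
	})
	return sorted[0], true
}
```

With two ready candidates on a node carrying one workload and one ready candidate on a node carrying five, the sketch returns the older of the two on the quieter node.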

Which issue(s) this PR is related to:

Fixes #491

Release Note

- **api/v1beta1**: Added `NodeName` to `SandboxStatus` to expose the scheduled node name of the underlying sandbox pod.
- **extensions/controllers**: Implemented a unified smart sandbox selection strategy for warm pool adoption. The strategy automatically prioritizes ready sandboxes, spreads workloads across nodes, and falls back to creation-time tie-breaking.
- **extensions/queue**: Extended `SimpleSandboxQueue` to support thread-safe, strategy-based queue pops (`GetWithStrategy`) with built-in retry protection against concurrent pops.

Generated by Overseer (powered by the gemini model).

Summary by CodeRabbit

  • New Features

    • Sandbox status now exposes the node name where the underlying pod is scheduled.
    • Warm-pool queue supports strategy-aware retrieval for targeted sandbox selection.
  • Behavioral Improvements

    • Adoption now prefers ready sandboxes and balances allocations across nodes.
  • Documentation

    • API docs and CRDs updated to include the new Sandbox status nodeName field.
  • Tests

    • Added and updated unit and e2e tests covering adoption selection and status behavior.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@netlify

netlify Bot commented May 28, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit d027475
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a21fdd2760f830008ab4e1f

@k8s-ci-robot k8s-ci-robot requested review from barney-s and justinsb May 28, 2026 20:23
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 28, 2026
@janetkuo janetkuo requested a review from Copilot May 28, 2026 21:00

Copilot AI left a comment

Pull request overview

This PR adds configurable warm-pool sandbox adoption strategies, allowing SandboxTemplates to choose between the existing ready/oldest preference and a node-spread strategy for distributing adopted sandboxes.

Changes:

  • Adds adoptionStrategy to SandboxTemplateSpec and CRDs.
  • Adds Sandbox.status.nodeName and populates it from the underlying Pod.
  • Extends the warm sandbox queue and adoption logic to support strategy-based candidate selection, with unit tests.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| extensions/api/v1beta1/sandboxtemplate_types.go | Adds the adoption strategy API enum and field. |
| api/v1beta1/sandbox_types.go | Adds nodeName to Sandbox status. |
| controllers/sandbox_controller.go | Updates Sandbox status with the scheduled node name. |
| extensions/controllers/sandboxclaim_controller.go | Implements strategy selection during warm-pool adoption. |
| extensions/controllers/sandboxclaim_controller_test.go | Adds tests for adoption strategy behavior. |
| extensions/controllers/queue/simple_sandbox_queue.go | Adds queue support for strategy-based popping. |
| extensions/controllers/queue/simple_sandbox_queue_test.go | Tests strategy-based queue removal. |
| k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml | Regenerates SandboxTemplate CRD schema. |
| k8s/crds/agents.x-k8s.io_sandboxes.yaml | Regenerates Sandbox CRD schema. |
| helm/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml | Updates Helm CRD copy for SandboxTemplate. |
| helm/crds/agents.x-k8s.io_sandboxes.yaml | Updates Helm CRD copy for Sandbox. |

Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
Comment thread extensions/controllers/queue/simple_sandbox_queue.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2026
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch 2 times, most recently from 2a75b12 to 1b6ff48 Compare June 1, 2026 20:52
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2026
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from 1b6ff48 to 955f8d0 Compare June 1, 2026 20:59
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from 955f8d0 to 976e5eb Compare June 1, 2026 22:24
@vicentefb vicentefb changed the title (WIP) feat: warmpool selection strategy feat: warmpool selection strategy Jun 1, 2026
@vicentefb vicentefb marked this pull request as ready for review June 1, 2026 22:29
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 1, 2026
@k8s-ci-robot k8s-ci-robot requested a review from igooch June 1, 2026 22:29
@vicentefb
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 1, 2026
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from 976e5eb to b2eb881 Compare June 1, 2026 22:32
@vicentefb
Member Author

cc

@bittermandel
@krzysied

@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from b2eb881 to 6c2ae22 Compare June 1, 2026 23:28
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch 2 times, most recently from ed0efb5 to a3e4be4 Compare June 2, 2026 23:28
@vicentefb vicentefb requested review from barney-s and janetkuo June 2, 2026 23:31
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch 2 times, most recently from dc8a08e to 29611a0 Compare June 2, 2026 23:51
@vicentefb
Member Author

/retest

@janetkuo janetkuo moved this from Backlog to Linked in Agent Sandbox Jun 3, 2026
Comment thread extensions/controllers/sandboxclaim_controller.go
Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
Comment thread extensions/controllers/sandboxclaim_controller.go
Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2026
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from 29611a0 to f21caba Compare June 4, 2026 22:03
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a37b3c0-8884-4e3d-a42d-0932f98bda7b

📥 Commits

Reviewing files that changed from the base of the PR and between ed93960 and d027475.

📒 Files selected for processing (11)
  • api/v1beta1/sandbox_types.go
  • controllers/sandbox_controller.go
  • controllers/sandbox_controller_test.go
  • docs/api.md
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/queue/simple_sandbox_queue_test.go
  • extensions/controllers/sandboxclaim_controller.go
  • extensions/controllers/sandboxclaim_controller_test.go
  • helm/crds/agents.x-k8s.io_sandboxes.yaml
  • k8s/crds/agents.x-k8s.io_sandboxes.yaml
  • test/e2e/framework/predicates/sandbox.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • api/v1beta1/sandbox_types.go
  • k8s/crds/agents.x-k8s.io_sandboxes.yaml
  • test/e2e/framework/predicates/sandbox.go
  • extensions/controllers/queue/simple_sandbox_queue_test.go
  • docs/api.md
  • helm/crds/agents.x-k8s.io_sandboxes.yaml
  • controllers/sandbox_controller.go
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/sandboxclaim_controller_test.go

📝 Walkthrough

Walkthrough

Adds a nodeName status field to Sandbox, propagates it from Pod.Spec.NodeName during reconciliation, adds a strategy-based queue pop, and integrates a smart selector to pick warm-pool sandboxes preferring ready pods and balancing by node.
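
The propagation step in this walkthrough can be pictured with a small sketch. It uses stand-in types rather than corev1.Pod and the project's SandboxStatus, and syncNodeName is a hypothetical helper name; treating a missing pod as "clear the field" is an assumption based on the set/clear wording in the summary.

```go
package main

// Stand-in types; the real controller works with corev1.Pod and SandboxStatus.
type podSpec struct{ NodeName string }
type pod struct{ Spec podSpec }
type sandboxStatus struct{ NodeName string }

// syncNodeName copies the scheduled node name into the status and reports
// whether the status changed. A nil pod (not yet created, or already deleted)
// clears the field, as does a pod that has not been scheduled yet.
func syncNodeName(status *sandboxStatus, p *pod) bool {
	want := ""
	if p != nil {
		want = p.Spec.NodeName
	}
	if status.NodeName == want {
		return false // nothing to update
	}
	status.NodeName = want
	return true
}
```

Returning a changed flag lets a reconciler skip a status write when the value is already current.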

Changes

Warm Pool Adoption with Node Awareness

Layer / File(s) Summary
NodeName Status Field and Propagation
api/v1beta1/sandbox_types.go, controllers/sandbox_controller.go, controllers/sandbox_controller_test.go, docs/api.md, helm/crds/agents.x-k8s.io_sandboxes.yaml, k8s/crds/agents.x-k8s.io_sandboxes.yaml, test/e2e/framework/predicates/sandbox.go
SandboxStatus gains optional nodeName; controller sets/clears it from pod.Spec.NodeName; CRD/docs updated and tests/predicates adjusted to account for the new field.
Queue Strategy-Based Dequeue
extensions/controllers/queue/simple_sandbox_queue.go, extensions/controllers/queue/simple_sandbox_queue_test.go
SandboxQueue adds GetWithStrategy; SimpleSandboxQueue delegates to PopWithStrategy, which snapshots items, applies a caller pick function, and removes the selected key with lock-and-retry semantics; new unit test validates selection behavior.
Smart Warm Pool Adoption Selection
extensions/controllers/sandboxclaim_controller.go, extensions/controllers/sandboxclaim_controller_test.go
getCandidate builds a name→sandbox cache and per-node counts, then uses WarmSandboxQueue.GetWithStrategy with smartSelector.pick that filters by namespace, returns cleanup keys for missing/invalid sandboxes, prefers Ready sandboxes, balances by node counts, and breaks ties by creation timestamp; tests added/updated for selector behavior.
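
The snapshot, pick, and lock-and-retry flow summarized for PopWithStrategy can be sketched as follows. This is an illustration under assumptions, not the code in simple_sandbox_queue.go: strategyQueue, pickFunc, and the maxAttempts bound are hypothetical, keys are plain strings, and running the pick function outside the lock is a design choice of the sketch.

```go
package main

import "sync"

// pickFunc inspects a snapshot of queued keys and returns the key to remove.
// ok=false means no key in the snapshot is suitable.
type pickFunc func(snapshot []string) (key string, ok bool)

// strategyQueue is a hypothetical stand-in for the project's queue type.
type strategyQueue struct {
	mu    sync.Mutex
	items []string
}

// Add appends a key to the queue.
func (q *strategyQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, key)
}

// PopWithStrategy snapshots the queue, lets pick choose a key without holding
// the lock, then removes that key under the lock. If the key has disappeared
// in the meantime, it retries with a fresh snapshot, up to maxAttempts times.
func (q *strategyQueue) PopWithStrategy(pick pickFunc, maxAttempts int) (string, bool) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		q.mu.Lock()
		snapshot := append([]string(nil), q.items...)
		q.mu.Unlock()

		if len(snapshot) == 0 {
			return "", false
		}
		key, ok := pick(snapshot)
		if !ok {
			return "", false
		}

		q.mu.Lock()
		removed := q.remove(key)
		q.mu.Unlock()
		if removed {
			return key, true
		}
		// Another goroutine took the key between snapshot and removal; retry.
	}
	return "", false
}

// remove deletes key from the queue. The caller must hold q.mu.
func (q *strategyQueue) remove(key string) bool {
	for i, k := range q.items {
		if k == key {
			q.items = append(q.items[:i], q.items[i+1:]...)
			return true
		}
	}
	return false
}
```

A pick function that returns a stale or invalid key with ok=true gets that key discarded from the queue, which matches the cleanup behavior described for missing sandboxes.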

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Warm pools now choose with wisdom true,
Node names hum where pods will rest,
Ready hops first, then balance spread,
Keys picked kindly, ties addressed,
A clever queue makes adoption best!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat: warmpool selection strategy' clearly and concisely describes the main change: introducing a smart selection strategy for warm pool sandbox adoption. |
| Description check | ✅ Passed | The PR description comprehensively covers all required sections: clear explanation of changes, linked issue reference (Fixes #491), and detailed release notes covering API updates, smart selection strategy, queue extensions, and performance improvements. |
| Linked Issues check | ✅ Passed | The PR fully implements all coding objectives from issue #491: adds a thread-safe queue interface with GetWithStrategy method, implements smart sandbox selection with readiness priority, node-aware load spreading via nodeName field, creation-time tie-breaking, and namespace filtering. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support the warmpool selection strategy objective: SandboxStatus.NodeName enables node-aware scheduling, GetWithStrategy enables pluggable selection logic, and adoption controller implements the smart selection strategy without unrelated modifications. |

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
extensions/controllers/sandboxclaim_controller.go (1)

614-618: 💤 Low value

Consider wrapping the error with context for better error messages.

When r.List fails, the returned error lacks reconciler-specific context. While the error is logged immediately (line 616), wrapping it would improve error messages if it propagates up the call stack.

📝 Suggested improvement
 if err := r.List(ctx, &sandboxList, client.InNamespace(claim.Namespace)); err != nil {
     logger.Error(err, "Failed to list sandboxes for smart selection node counting")
-    return nil, queue.SandboxKey{}, err
+    return nil, queue.SandboxKey{}, fmt.Errorf("failed to list sandboxes in namespace %s for smart selection: %w", claim.Namespace, err)
 }

As per coding guidelines, wrap errors with context to help downstream error handlers identify the failure point.
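
To illustrate why the %w verb matters in the suggestion above, here is a self-contained sketch. The listSandboxes stub, the errAPIUnavailable sentinel, and the countCandidates and isUnavailable helpers are all hypothetical; only fmt.Errorf with %w and errors.Is are standard library behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errAPIUnavailable is a hypothetical sentinel standing in for a transient
// API server failure.
var errAPIUnavailable = errors.New("api server unavailable")

// listSandboxes is a stub for the client List call; it fails on demand.
func listSandboxes(namespace string, fail bool) ([]string, error) {
	if fail {
		return nil, errAPIUnavailable
	}
	return []string{"sandbox-a", "sandbox-b"}, nil
}

// countCandidates wraps the list error with %w, adding context while keeping
// the original cause reachable for callers.
func countCandidates(namespace string, fail bool) (int, error) {
	names, err := listSandboxes(namespace, fail)
	if err != nil {
		return 0, fmt.Errorf("failed to list sandboxes in namespace %s for smart selection: %w", namespace, err)
	}
	return len(names), nil
}

// isUnavailable shows that errors.Is still matches the wrapped cause.
func isUnavailable(err error) bool {
	return errors.Is(err, errAPIUnavailable)
}
```

Because the cause stays in the error chain, a caller further up the stack can still decide to requeue on a transient failure while the log message names the namespace and the operation that failed.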

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/controllers/sandboxclaim_controller.go` around lines 614 - 618,
Wrap the error returned by r.List with reconciler-specific context before
returning so callers get clearer failure info; e.g., when listing into
v1beta1.SandboxList via r.List(ctx, &sandboxList,
client.InNamespace(claim.Namespace)), create a wrapped error (using fmt.Errorf
or errors.Wrapf) that includes the operation and identifying info such as the
claim name/namespace, call logger.Error with that wrapped error, and return that
wrapped error instead of the raw err (references: r.List, v1beta1.SandboxList,
logger.Error, claim.Namespace, queue.SandboxKey{}).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c975b7c1-2ad1-4602-ae1d-5442d93c690b

📥 Commits

Reviewing files that changed from the base of the PR and between 98952fd and f21caba.

📒 Files selected for processing (11)
  • api/v1beta1/sandbox_types.go
  • controllers/sandbox_controller.go
  • controllers/sandbox_controller_test.go
  • docs/api.md
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/queue/simple_sandbox_queue_test.go
  • extensions/controllers/sandboxclaim_controller.go
  • extensions/controllers/sandboxclaim_controller_test.go
  • helm/crds/agents.x-k8s.io_sandboxes.yaml
  • k8s/crds/agents.x-k8s.io_sandboxes.yaml
  • test/e2e/framework/predicates/sandbox.go

@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from f21caba to ed93960 Compare June 4, 2026 22:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2026
updated based on copilot

added test to sandbox controller

lint fixed

fixed e2e tests

addressed Janet's comments

autogent

fixed docs

addressed Barni's question

updated docs

revert capitalization:

update

addressed comments on fallback logic

linter
@vicentefb vicentefb force-pushed the warmpoolSelectionStrategy branch from ed93960 to d027475 Compare June 4, 2026 22:35
Member

@janetkuo janetkuo left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2026
@k8s-ci-robot k8s-ci-robot merged commit ea6f51a into kubernetes-sigs:main Jun 5, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from Linked to Done in Agent Sandbox Jun 5, 2026
@janetkuo janetkuo moved this from Done to Linked in Agent Sandbox Jun 5, 2026

Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • ready-for-review
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Linked

Development

Successfully merging this pull request may close these issues.

[Feature request] Strategy for picking the warm pool pods

6 participants