[Feature Request] Support Idle Lifecycle Policy for Sandboxes #849

@ctm8788

Summary

I would like agent-sandbox to support an idle lifecycle policy for long-lived interactive workspaces.

The goal is to let active workspaces remain available while in use, automatically stop idle workspaces while retaining their recoverable state, and eventually delete abandoned retained workspaces.

This is useful for browser IDEs, notebooks, and agent workspaces where rebuilding the environment is expensive, but keeping pods running indefinitely is wasteful.

For this use case, Suspend and Retain produce effectively the same user-facing outcome: stop active compute, keep enough state to resume the workspace later, and avoid deleting the workspace immediately.

Desired Lifecycle

A workspace should be able to follow this lifecycle:

  1. A Sandbox is created in operationMode: Running with an active TTL, for example 24 hours.
  2. If the active TTL expires, the Sandbox transitions to a retained non-running state. In v1beta1, this may be represented as operationMode: Suspended, but the important behavior is retention.
  3. When retained, a new retention TTL is stamped, for example 30 days, with expiration action Delete.
  4. If the user resumes the workspace before the retention TTL expires, the Sandbox returns to operationMode: Running.
  5. On resume, the active TTL is reset to 24 hours and the active expiration action becomes retain/suspend again.
  6. If the workspace is never resumed and the retention TTL expires, the Sandbox is deleted.
  7. Ideally, active TTL renewal should happen when a client connection is created or renewed, such as a browser IDE WebSocket connection.

Created
  -> operationMode: Running
  -> active TTL: 24h
  -> active expiration action: Retain/Suspend

Active TTL expires
  -> retained non-running state
  -> operationMode: Suspended, if that is the canonical v1beta1 representation
  -> retention TTL: 30d
  -> retained expiration action: Delete

User resumes before 30d
  -> operationMode: Running
  -> active TTL reset to 24h
  -> active expiration action: Retain/Suspend

No resume before 30d
  -> Deleted
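
The transitions above can be sketched as a small state machine. This is only an illustration, not the agent-sandbox controller: the `LifecyclePolicy` and `Sandbox` classes and the `create`, `reconcile`, and `resume` helpers are hypothetical names, "Deleted" is modeled as a mode purely for the example, and the retention TTL is assumed to start when the expiry is observed (which is one of the open questions).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LifecyclePolicy:
    # Hypothetical policy; defaults follow the 24h / 30d example in this issue.
    active_ttl: timedelta = timedelta(hours=24)
    retained_ttl: timedelta = timedelta(days=30)


@dataclass(frozen=True)
class Sandbox:
    operation_mode: str   # "Running", "Suspended", or "Deleted" (illustrative)
    expires_at: datetime  # deadline for the current phase


def create(policy: LifecyclePolicy, now: datetime) -> Sandbox:
    # A new Sandbox starts Running with the active TTL stamped.
    return Sandbox("Running", now + policy.active_ttl)


def reconcile(sb: Sandbox, policy: LifecyclePolicy, now: datetime) -> Sandbox:
    # Apply at most one expiration transition per call.
    if sb.operation_mode == "Deleted" or now < sb.expires_at:
        return sb
    if sb.operation_mode == "Running":
        # Active TTL expired: stop compute, keep state, stamp the retention TTL.
        # Assumption: the retention TTL starts at the time the expiry is observed.
        return Sandbox("Suspended", now + policy.retained_ttl)
    # Retention TTL expired without a resume: the workspace is deleted.
    return Sandbox("Deleted", sb.expires_at)


def resume(sb: Sandbox, policy: LifecyclePolicy, now: datetime) -> Sandbox:
    # Resuming is only meaningful from the retained non-running state.
    if sb.operation_mode != "Suspended":
        raise ValueError(f"cannot resume from {sb.operation_mode}")
    # On resume the active TTL is reset and the next expiry retains again.
    return Sandbox("Running", now + policy.active_ttl)
```

Keeping a single `expires_at` per phase means the same reconcile loop handles both the active and the retained deadline; only the action taken at expiry differs.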

API Scope and Ownership

This lifecycle policy should be supported directly on the core Sandbox API and enforced by the core Sandbox controller.

SandboxClaim should be able to accept the same lifecycle policy for template-driven workflows, but it should mirror/pass that policy through to the generated Sandbox rather than independently enforcing a separate lifecycle state machine.

Desired ownership model:

  • Sandbox.spec.lifecycle defines the actual runtime lifecycle policy.
  • The Sandbox controller enforces active TTL, retained non-running transitions, resume behavior, retention TTL, and deletion.
  • SandboxClaim.spec.lifecycle may expose the same fields for convenience.
  • The SandboxClaim controller passes those lifecycle fields through when creating or reconciling the backing Sandbox.
  • Direct Sandbox users get the same lifecycle behavior without needing SandboxClaim.

Why Existing Lifecycle Support Is Not Enough

Current lifecycle support appears centered on absolute shutdownTime and shutdownPolicy. Retain is close to the desired first-stage idle behavior, and operationMode: Suspended may be the right v1beta1 representation for the non-running retained state. However, the lifecycle policy cannot yet express:

  • Active TTL expiration should retain the workspace state and stop active compute rather than delete immediately.
  • Retained non-running resources should have a separate retention TTL.
  • Resume should reset the active TTL.
  • Connection/activity renewal should extend the active TTL.
  • The controller should own the state transitions among running, suspended, resumed, and deleted.

Possible API Shape

This is one possible shape, not a fixed proposal:

apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
spec:
  operationMode: Running
  lifecycle:
    activeTTLSeconds: 86400
    activeExpirationPolicy: Retain
    retainedTTLSeconds: 2592000
    retainedExpirationPolicy: Delete

For SandboxClaim, the same policy could be accepted and mirrored to the generated Sandbox:

apiVersion: extensions.agents.x-k8s.io/v1beta1
kind: SandboxClaim
spec:
  lifecycle:
    activeTTLSeconds: 86400
    activeExpirationPolicy: Retain
    retainedTTLSeconds: 2592000
    retainedExpirationPolicy: Delete

The generated Sandbox would receive the lifecycle policy and the Sandbox controller would enforce it.
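
A rough sketch of that pass-through, using plain dictionaries in place of real API objects. The helper name `build_sandbox_from_claim`, the `template_spec` argument, and the `metadata.name` handling are assumptions made for illustration; the lifecycle field names are the hypothetical ones from the shape above, and letting the claim's policy take precedence over a template's is also an assumption.

```python
import copy

# Hypothetical lifecycle fields from the API shape proposed in this issue.
LIFECYCLE_FIELDS = (
    "activeTTLSeconds",
    "activeExpirationPolicy",
    "retainedTTLSeconds",
    "retainedExpirationPolicy",
)


def build_sandbox_from_claim(claim: dict, template_spec: dict) -> dict:
    """Build a Sandbox manifest, mirroring the claim's lifecycle policy."""
    # Copy the template so the caller's object is never mutated.
    spec = copy.deepcopy(template_spec)
    lifecycle = claim.get("spec", {}).get("lifecycle")
    if lifecycle is not None:
        # Copy only known lifecycle fields. The claim does not interpret
        # them; enforcement stays with the Sandbox controller.
        spec["lifecycle"] = {
            key: lifecycle[key] for key in LIFECYCLE_FIELDS if key in lifecycle
        }
    return {
        "apiVersion": "agents.x-k8s.io/v1beta1",
        "kind": "Sandbox",
        "metadata": {"name": claim["metadata"]["name"]},
        "spec": spec,
    }
```

Because the claim side only copies fields, there is a single lifecycle state machine to reason about, and direct Sandbox users see identical behavior.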

Activity Renewal

For browser workspaces, activity could come from the gateway/router/client layer.

A Kubernetes-native option might be a Lease associated with the Sandbox. The router or client could renew the Lease periodically while a WebSocket/session is active. The controller could use the Lease renewal time as the source of activity without requiring frequent writes to the Sandbox object.

This would allow active workspaces to remain running while a user is connected, without causing high write volume on the Sandbox object.
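
As a sketch of how a controller might fold Lease renewals into the active deadline: the Kubernetes Lease object does carry a `spec.renewTime` field, but associating a Lease with a Sandbox and the helper names `active_deadline` and `is_idle` are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Optional


def active_deadline(
    started_at: datetime,
    lease_renew_time: Optional[datetime],
    active_ttl: timedelta,
) -> datetime:
    """Return when the active TTL expires, given the latest known activity."""
    # Activity is the later of the last start/resume and the last Lease renewal.
    last_activity = started_at
    if lease_renew_time is not None and lease_renew_time > started_at:
        last_activity = lease_renew_time
    return last_activity + active_ttl


def is_idle(
    now: datetime,
    started_at: datetime,
    lease_renew_time: Optional[datetime],
    active_ttl: timedelta,
) -> bool:
    # The workspace counts as idle once the deadline has passed with no renewal.
    return now >= active_deadline(started_at, lease_renew_time, active_ttl)
```

Deriving the deadline from the Lease at reconcile time means the router only writes to the Lease, so the Sandbox object itself changes only on real state transitions.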

Open Questions

  • Should this extend the existing lifecycle fields or introduce a new lifecycle policy struct?
  • Should this build on existing Retain semantics, or introduce separate active/retained expiration policies?
  • If operationMode: Suspended is the representation for the retained non-running state, should the retention TTL start when operationMode is set to Suspended, or when the Suspended=True condition is observed?
  • Should activity be represented by a Lease, annotation, status field, or subresource?
  • Should activity renewal be optional, gated by a field such as renewOnActivity: true?
  • How should this interact with future suspend implementations such as freeze or hibernate?

Desired Outcome

Users can define a lifecycle policy where active workspaces remain running while in use, idle workspaces automatically stop active compute while retaining recoverable state, and abandoned retained workspaces are eventually deleted without manual cleanup.

Metadata

Assignees: No one assigned
Labels: priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
Type: No type
Project status: Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests