Summary
I would like agent-sandbox to support an idle lifecycle policy for long-lived interactive workspaces.
The goal is to let active workspaces remain available while in use, automatically stop idle workspaces while retaining their recoverable state, and eventually delete abandoned retained workspaces.
This is useful for browser IDEs, notebooks, and agent workspaces where rebuilding the environment is expensive, but keeping pods running indefinitely is wasteful.
In this proposal, Suspend and Retain have the same user-facing outcome: stop active compute, keep enough state to resume the workspace later, and avoid deleting the workspace immediately.
Desired Lifecycle
A workspace should be able to follow this lifecycle:
- A Sandbox is created in operationMode: Running with an active TTL, for example 24 hours.
- If the active TTL expires, the Sandbox transitions to a retained non-running state. In v1beta1, this may be represented as operationMode: Suspended, but the important behavior is retention.
- When retained, a new retention TTL is stamped, for example 30 days, with expiration action Delete.
- If the user resumes the workspace before the retention TTL expires, the Sandbox returns to operationMode: Running.
- On resume, the active TTL is reset to 24 hours and the active expiration action becomes retain/suspend again.
- If the workspace is never resumed and the retention TTL expires, the Sandbox is deleted.
- Ideally, active TTL renewal should happen when a client connection is created or renewed, such as a browser IDE WebSocket connection.
Created
  -> operationMode: Running
  -> active TTL: 24h
  -> active expiration action: Retain/Suspend

Active TTL expires
  -> retained non-running state
  -> operationMode: Suspended, if that is the canonical v1beta1 representation
  -> retention TTL: 30d
  -> retained expiration action: Delete

User resumes before 30d
  -> operationMode: Running
  -> active TTL reset to 24h
  -> active expiration action: Retain/Suspend

No resume before 30d
  -> Deleted
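The transitions above can be sketched as a small state machine. This is only an illustration, not the controller's implementation: the `Phase`, `Policy`, and `Workspace` types and the `Create`, `Reconcile`, and `Resume` functions are hypothetical names, times are plain Unix seconds, and the retention TTL is assumed to start when the transition is observed (which is one of the open questions below).

```go
package main

import "fmt"

// Phase names are illustrative; "Suspended" stands in for the retained
// non-running state.
type Phase string

const (
	PhaseRunning   Phase = "Running"
	PhaseSuspended Phase = "Suspended"
	PhaseDeleted   Phase = "Deleted"
)

// Policy holds the two TTLs in seconds (hypothetical field names).
type Policy struct {
	ActiveTTLSeconds   int64
	RetainedTTLSeconds int64
}

// Workspace is a simplified stand-in for a Sandbox: its phase and the Unix
// time (seconds) at which the current TTL expires.
type Workspace struct {
	Phase    Phase
	Deadline int64
}

// Create starts a workspace in Running and stamps the active TTL.
func Create(p Policy, now int64) Workspace {
	return Workspace{Phase: PhaseRunning, Deadline: now + p.ActiveTTLSeconds}
}

// Reconcile applies TTL expiry at time now. Assumption: the retention TTL
// is stamped from the moment the expiry is observed.
func Reconcile(w Workspace, p Policy, now int64) Workspace {
	if now < w.Deadline {
		return w // current TTL has not expired yet
	}
	switch w.Phase {
	case PhaseRunning:
		// Active TTL expired: stop compute, keep state, start retention TTL.
		return Workspace{Phase: PhaseSuspended, Deadline: now + p.RetainedTTLSeconds}
	case PhaseSuspended:
		// Retention TTL expired without a resume: delete.
		return Workspace{Phase: PhaseDeleted}
	default:
		return w
	}
}

// Resume returns a retained workspace to Running and resets the active TTL.
func Resume(w Workspace, p Policy, now int64) Workspace {
	if w.Phase != PhaseSuspended {
		return w
	}
	return Workspace{Phase: PhaseRunning, Deadline: now + p.ActiveTTLSeconds}
}

func main() {
	p := Policy{ActiveTTLSeconds: 86400, RetainedTTLSeconds: 2592000} // 24h, 30d
	w := Create(p, 0)
	w = Reconcile(w, p, 90000) // past the 24h active TTL
	fmt.Println(w.Phase)
	w = Resume(w, p, 100000) // user comes back within 30d
	fmt.Println(w.Phase)
}
```

Keeping the transition logic in pure functions like this makes the ordering of the two TTLs easy to unit test independently of any Kubernetes client.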
API Scope and Ownership
This lifecycle policy should be supported directly on the core Sandbox API and enforced by the core Sandbox controller.
SandboxClaim should be able to accept the same lifecycle policy for template-driven workflows, but it should mirror/pass that policy through to the generated Sandbox rather than independently enforcing a separate lifecycle state machine.
Desired ownership model:
- Sandbox.spec.lifecycle defines the actual runtime lifecycle policy.
- The Sandbox controller enforces active TTL, retained non-running transitions, resume behavior, retention TTL, and deletion.
- SandboxClaim.spec.lifecycle may expose the same fields for convenience.
- The SandboxClaim controller passes those lifecycle fields through when creating or reconciling the backing Sandbox.
- Direct Sandbox users get the same lifecycle behavior without needing SandboxClaim.
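A minimal sketch of the pass-through idea, assuming simplified stand-in types rather than the real agent-sandbox API types: the claim side only copies the policy onto the backing Sandbox spec and never evaluates TTLs itself. `SyncLifecycle` and the struct names are hypothetical, and the field names follow the possible API shape given later in this proposal.

```go
package main

import "fmt"

// LifecyclePolicy is a simplified, hypothetical stand-in for the proposed
// lifecycle fields.
type LifecyclePolicy struct {
	ActiveTTLSeconds         int64
	ActiveExpirationPolicy   string
	RetainedTTLSeconds       int64
	RetainedExpirationPolicy string
}

// SandboxSpec and SandboxClaimSpec are reduced to the fields needed here.
type SandboxSpec struct {
	OperationMode string
	Lifecycle     *LifecyclePolicy
}

type SandboxClaimSpec struct {
	Lifecycle *LifecyclePolicy
}

// SyncLifecycle copies the claim's lifecycle policy onto the Sandbox spec and
// reports whether the Sandbox spec changed. It does not enforce anything:
// enforcement is left to the Sandbox controller.
// Assumption: a claim without a policy leaves the Sandbox policy untouched.
func SyncLifecycle(claim SandboxClaimSpec, sandbox *SandboxSpec) bool {
	if claim.Lifecycle == nil {
		return false
	}
	if sandbox.Lifecycle != nil && *sandbox.Lifecycle == *claim.Lifecycle {
		return false // already in sync
	}
	copied := *claim.Lifecycle // copy so the two specs do not share a pointer
	sandbox.Lifecycle = &copied
	return true
}

func main() {
	claim := SandboxClaimSpec{Lifecycle: &LifecyclePolicy{
		ActiveTTLSeconds:         86400,
		ActiveExpirationPolicy:   "Retain",
		RetainedTTLSeconds:       2592000,
		RetainedExpirationPolicy: "Delete",
	}}
	sandbox := SandboxSpec{OperationMode: "Running"}
	changed := SyncLifecycle(claim, &sandbox)
	fmt.Println(changed, sandbox.Lifecycle.ActiveTTLSeconds)
}
```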
Why Existing Lifecycle Support Is Not Enough
Current lifecycle support appears centered on absolute shutdownTime and shutdownPolicy. Retain is close to the desired first-stage idle behavior, and operationMode: Suspended may be the right v1beta1 representation for the non-running retained state. However, the lifecycle policy cannot yet express:
- Active TTL expiration should retain the workspace state and stop active compute rather than delete immediately.
- Retained non-running resources should have a separate retention TTL.
- Resume should reset the active TTL.
- Connection/activity renewal should extend the active TTL.
- The controller should own the state transition between running, suspended, resumed, and deleted.
Possible API Shape
This is one possible shape, not a fixed proposal:
apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
spec:
  operationMode: Running
  lifecycle:
    activeTTLSeconds: 86400
    activeExpirationPolicy: Retain
    retainedTTLSeconds: 2592000
    retainedExpirationPolicy: Delete
For SandboxClaim, the same policy could be accepted and mirrored to the generated Sandbox:
apiVersion: extensions.agents.x-k8s.io/v1beta1
kind: SandboxClaim
spec:
  lifecycle:
    activeTTLSeconds: 86400
    activeExpirationPolicy: Retain
    retainedTTLSeconds: 2592000
    retainedExpirationPolicy: Delete
The generated Sandbox would receive the lifecycle policy and the Sandbox controller would enforce it.
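For illustration, the lifecycle block might map to Go API types roughly as follows. This is a sketch under the same "one possible shape" caveat: the type names, the `ExpirationPolicy` enum, and the `SamplePolicy` and `Encode` helpers are hypothetical and not taken from the agent-sandbox codebase. Only the serialized field names come from the examples above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExpirationPolicy is a hypothetical enum for what happens when a TTL expires.
type ExpirationPolicy string

const (
	ExpirationPolicyRetain ExpirationPolicy = "Retain"
	ExpirationPolicyDelete ExpirationPolicy = "Delete"
)

// LifecyclePolicy mirrors the fields of the example manifests; the JSON tags
// match the field names used there.
type LifecyclePolicy struct {
	ActiveTTLSeconds         int64            `json:"activeTTLSeconds"`
	ActiveExpirationPolicy   ExpirationPolicy `json:"activeExpirationPolicy"`
	RetainedTTLSeconds       int64            `json:"retainedTTLSeconds"`
	RetainedExpirationPolicy ExpirationPolicy `json:"retainedExpirationPolicy"`
}

// SamplePolicy returns the 24h active / 30d retained policy from the examples.
func SamplePolicy() LifecyclePolicy {
	return LifecyclePolicy{
		ActiveTTLSeconds:         86400,
		ActiveExpirationPolicy:   ExpirationPolicyRetain,
		RetainedTTLSeconds:       2592000,
		RetainedExpirationPolicy: ExpirationPolicyDelete,
	}
}

// Encode serializes a policy to JSON, the wire form of the lifecycle block.
func Encode(p LifecyclePolicy) string {
	b, err := json.Marshal(p)
	if err != nil {
		panic(err) // cannot happen for this plain struct
	}
	return string(b)
}

func main() {
	fmt.Println(Encode(SamplePolicy()))
}
```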
Activity Renewal
For browser workspaces, activity could come from the gateway/router/client layer.
A Kubernetes-native option might be a Lease associated with the Sandbox. The router or client could renew the Lease periodically while a WebSocket/session is active. The controller could use the Lease renewal time as the source of activity without requiring frequent writes to the Sandbox object.
This would allow active workspaces to remain running while a user is connected, without causing high write volume on the Sandbox object.
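As a rough sketch of how the controller could fold Lease renewals into the active TTL: the deadline is computed from whichever is later, the last start/resume time or the most recent renewal (a Kubernetes Lease records this in `spec.renewTime`). The function names and the use of plain Unix seconds are assumptions made for this example. A real controller would read the Lease through the API client.

```go
package main

import "fmt"

// ActiveDeadline returns the Unix time (seconds) at which the active TTL
// expires. startedAt is the last create/resume time; lastRenewal is the most
// recent Lease renewal, or nil if no renewal has been observed.
func ActiveDeadline(startedAt int64, lastRenewal *int64, activeTTLSeconds int64) int64 {
	last := startedAt
	if lastRenewal != nil && *lastRenewal > last {
		last = *lastRenewal // activity after start pushes the deadline out
	}
	return last + activeTTLSeconds
}

// IsIdle reports whether the active TTL has expired at time now.
func IsIdle(now, startedAt int64, lastRenewal *int64, activeTTLSeconds int64) bool {
	return now >= ActiveDeadline(startedAt, lastRenewal, activeTTLSeconds)
}

func main() {
	const ttl = 86400 // 24h
	renewed := int64(3600)
	// Without renewals the workspace is idle 24h after start.
	fmt.Println(IsIdle(86400, 0, nil, ttl))
	// A renewal one hour in keeps it active for one more hour.
	fmt.Println(IsIdle(86400, 0, &renewed, ttl))
}
```

Because only the Lease is written on each renewal, the Sandbox object itself changes only on actual state transitions.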
Open Questions
- Should this extend the existing lifecycle fields or introduce a new lifecycle policy struct?
- Should this build on existing Retain semantics, or introduce separate active/retained expiration policies?
- If operationMode: Suspended is the representation for the retained non-running state, should the retention TTL start when operationMode is set to Suspended, or when the Suspended=True condition is observed?
- Should activity be represented by a Lease, annotation, status field, or subresource?
- Should activity renewal be optional, gated by a field such as renewOnActivity: true?
- How should this interact with future suspend implementations such as freeze or hibernate?
Desired Outcome
Users can define a lifecycle policy where active workspaces remain running while in use, idle workspaces automatically stop active compute while retaining recoverable state, and abandoned retained workspaces are eventually deleted without manual cleanup.