Roadmap

What's shipped and what's coming next.

What's ready

Core platform

  • Autonomous agent execution - Isolated MicroVM (AgentCore Runtime) per task with shell, filesystem, and git access
  • CLI and REST API - Submit, list, get, cancel tasks; view audit events; Cognito auth with token caching
  • Durable orchestrator - Lambda Durable Functions with checkpoint/resume; survives transient failures up to 9 hours
  • Task state machine - SUBMITTED → HYDRATING → RUNNING → COMPLETED / FAILED / CANCELLED / TIMED_OUT
  • Concurrency control - Per-user limits (default 3) with atomic admission and automated drift reconciliation
  • Idempotency - Idempotency-Key header on POST requests (24-hour TTL)
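
The task state machine above can be sketched as a transition table. This is a minimal illustration, not the orchestrator's code: the state names come from the list above, but which early states may move directly to FAILED or CANCELLED is an assumption made here.

```python
from enum import Enum


class TaskState(str, Enum):
    SUBMITTED = "SUBMITTED"
    HYDRATING = "HYDRATING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMED_OUT = "TIMED_OUT"


# Terminal states have no outgoing transitions.
TERMINAL = frozenset(
    {TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELLED, TaskState.TIMED_OUT}
)

# Assumed transition table: the happy path follows the roadmap; letting
# SUBMITTED and HYDRATING fail or be cancelled early is a guess.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.HYDRATING, TaskState.FAILED, TaskState.CANCELLED},
    TaskState.HYDRATING: {TaskState.RUNNING, TaskState.FAILED, TaskState.CANCELLED},
    TaskState.RUNNING: set(TERMINAL),
}


def can_transition(current: TaskState, target: TaskState) -> bool:
    """Return True if moving from current to target is allowed by the table."""
    return target in ALLOWED.get(current, set())
```

Keeping the table as data makes it straightforward to reject illegal writes (for example, reviving a COMPLETED task) in one place.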

Task types

  • new_task - Branch, implement, build/test, open PR
  • pr_iteration - Check out PR branch, read review feedback, address it, push
  • pr_review - Read-only structured code review via GitHub Reviews API (no Write/Edit tools)

Onboarding and customization

  • Blueprint construct - Per-repo CDK configuration (model, turns, budget, prompt overrides, egress, GitHub token)
  • Repo-level project config - Agent loads CLAUDE.md, .claude/rules/, .claude/settings.json, .mcp.json
  • Per-repo overrides - Model ID, max turns, max budget, system prompt overrides, poll interval, dedicated token

Security

  • Network isolation - VPC with private subnets, HTTPS-only egress, VPC endpoints for AWS services
  • DNS Firewall - Domain allowlist with observation mode and path to enforcement
  • Input guardrails - Bedrock Guardrails screen task descriptions and PR/issue content (fail-closed)
  • Output screening - Regex-based secret/PII scanner with PostToolUse hook redaction
  • Content sanitization - HTML stripping, injection pattern neutralization, control character removal
  • Cedar policy engine - Tool-call governance with fail-closed default and per-repo custom policies
  • WAF - Managed rule groups + rate-based rule (1,000 req/5 min/IP)
  • Pre-flight checks - GitHub API reachability, repo access, token permissions (fail-closed)
  • Model invocation logging - Full prompt/response audit trail (90-day retention)
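
As a rough illustration of the regex-based output screening above, the sketch below redacts a few secret and PII shapes from tool output. The pattern set, the placeholder format, and the function name `redact` are assumptions for illustration; the shipped scanner's rules are not described in this roadmap.

```python
import re

# Illustrative patterns only; a real scanner would carry a much larger set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str):
    """Replace every match with a labeled placeholder.

    Returns the redacted text and the list of labels that matched,
    in the order the patterns are defined.
    """
    findings = []
    for label, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            findings.append(label)
    return text, findings
```

A PostToolUse-style hook could call `redact` on each tool result and log the returned labels without logging the secret itself.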

Memory and learning

  • AgentCore Memory - Semantic (repo knowledge) and episodic (task episodes) strategies with namespace templates
  • Content integrity - SHA-256 hashing, source provenance tracking, schema v3
  • Fail-open design - Memory never blocks task execution; 2,000-token budget

Context hydration

  • Rich prompt assembly - Task description + GitHub issue/PR content + memory context (~100K token budget)
  • Token budget management - Oldest comments trimmed first; title/body always preserved
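
The trimming rule above (oldest comments dropped first, title and body always kept) can be sketched as follows. The 4-characters-per-token estimate and the function names are assumptions; the real hydrator presumably uses a proper tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def fit_to_budget(title: str, body: str, comments: list, budget: int) -> list:
    """Return the comments that fit, given comments ordered oldest first.

    Title and body are always charged against the budget and never trimmed.
    """
    used = estimate_tokens(title) + estimate_tokens(body)
    kept = []
    # Walk newest to oldest so the oldest comments are the first to be dropped.
    for comment in reversed(comments):
        cost = estimate_tokens(comment)
        if used + cost > budget:
            break
        kept.append(comment)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Stopping at the first comment that does not fit keeps the retained thread contiguous, which avoids showing the agent replies without their context.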

Webhooks

  • HMAC-SHA256 webhooks - External systems create tasks without Cognito credentials
  • Webhook management - Create, list, revoke with soft delete (30-day TTL)
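
For readers integrating an external system, HMAC-SHA256 request signing generally looks like the sketch below. The hex encoding and the function names are assumptions; the header name and exact signing scheme used by the platform are not specified in this roadmap.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature_hex)
```

The signature must be computed over the exact bytes received; re-serializing the JSON before verifying will usually change the digest.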

Cost and limits

  • Turn caps - Per-task max turns (1-500, default 100) with Blueprint defaults
  • Cost budget - Per-task max budget in USD ($0.01-$100)
  • Data retention - Automatic TTL-based cleanup (default 90 days)
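
The limits above can be expressed as a small validation sketch. The ranges and the default of 100 turns come from the list; the precedence order (request, then Blueprint default, then platform default) and the function names are assumptions.

```python
from decimal import Decimal

PLATFORM_DEFAULT_TURNS = 100  # default stated in the list above


def resolve_max_turns(requested=None, blueprint_default=None) -> int:
    """Pick the first value that is set, then enforce the 1-500 range."""
    value = PLATFORM_DEFAULT_TURNS
    for candidate in (requested, blueprint_default):
        if candidate is not None:
            value = candidate
            break
    if not 1 <= value <= 500:
        raise ValueError(f"max turns must be between 1 and 500, got {value}")
    return value


def validate_budget_usd(amount) -> Decimal:
    """Enforce the $0.01-$100 range using exact decimal arithmetic."""
    value = Decimal(str(amount))
    if not Decimal("0.01") <= value <= Decimal("100"):
        raise ValueError(f"budget must be between 0.01 and 100 USD, got {value}")
    return value
```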

Observability

  • OpenTelemetry - Custom spans for pipeline phases with CloudWatch querying
  • Operator dashboard - Task success rate, cost, duration, build/lint pass rates, AgentCore metrics
  • Alarms - Stuck tasks, orchestration failures, counter drift, crash rate, guardrail failures
  • Audit trail - TaskEvents table with chronological event log per task
  • Runtime error classifier - Pattern-matching classifier that categorizes task errors (auth/network/concurrency/compute/agent/guardrail/config/timeout/unknown) with human-readable titles, descriptions, remedies, and retryability flags. Computed at API response time; powers structured CLI error display and CloudWatch alarm routing
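
A pattern-matching classifier of the kind described in the last item might be structured like this. The category names are taken from the list above; the specific patterns, titles, remedies, and retryability values are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorClass:
    category: str
    title: str
    remedy: str
    retryable: bool


# First matching rule wins; patterns here are illustrative assumptions.
RULES = [
    (re.compile(r"\b(401|403)\b|bad credentials", re.I),
     ErrorClass("auth", "Authentication failed", "Check the GitHub token", False)),
    (re.compile(r"timed out|ETIMEDOUT", re.I),
     ErrorClass("timeout", "Operation timed out", "Retry the task", True)),
    (re.compile(r"ECONNRESET|connection refused", re.I),
     ErrorClass("network", "Network error", "Check egress rules, then retry", True)),
]

UNKNOWN = ErrorClass("unknown", "Unclassified error", "Inspect the task events", False)


def classify(message: str) -> ErrorClass:
    """Map a raw error message to a structured, human-readable class."""
    for pattern, result in RULES:
        if pattern.search(message):
            return result
    return UNKNOWN
```

Because the classification is a pure function of the stored message, it can be computed at API response time without a schema migration when rules change.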

Agent harness

  • Default branch detection - Dynamic detection via gh repo view
  • Uncommitted work safety net - Auto-commit before PR creation
  • Build/lint verification - Pre- and post-agent baselines in PR body
  • Prompt versioning - SHA-256 hash for A/B comparison
  • Per-commit attribution - Task-Id and Prompt-Version git trailers
  • Persistent session storage - /mnt/workspace for npm and config caches
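
Prompt versioning and per-commit attribution can be combined as in the sketch below. Truncating the SHA-256 digest to 12 characters and the function names are assumptions; only the trailer names Task-Id and Prompt-Version come from the list above.

```python
import hashlib


def prompt_version(prompt_text: str) -> str:
    """Derive a stable version id from the prompt content (assumed 12 hex chars)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def commit_message(subject: str, task_id: str, prompt_text: str) -> str:
    """Build a commit message whose final paragraph holds the git trailers."""
    return (
        f"{subject}\n\n"
        f"Task-Id: {task_id}\n"
        f"Prompt-Version: {prompt_version(prompt_text)}\n"
    )
```

Trailers in this position can be read back with `git log --format='%(trailers)'`, which makes it possible to group commits by prompt version.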

Docs and DX

  • Quick start guide - Zero to first PR in ~30 minutes
  • Prompt guide - Best practices, anti-patterns, examples
  • Claude Code plugin - Interactive skills for setup, deploy, submit, troubleshoot

What's next

Planned capabilities, grouped by theme. Items are independent and may ship in any order.

Credentials and authorization

Capability | Description
Per-session IAM scoping | Generate short-lived, scoped credentials per task via sts:AssumeRole with session tags (user_id, repo, task_id). DynamoDB leading-key conditions restrict each session to its own partition. Bedrock model access scoped to an explicit ARN allowlist instead of *. Eliminates cross-tenant blast radius from a compromised agent session.
Per-repo GitHub credentials | GitHub App per org/repo via AgentCore Token Vault. Auto-refresh for long sessions. Sets the pattern for GitLab, Jira, Slack integrations.
Principal-to-repo authorization | Map Cognito identities to allowed repository sets. Users can only trigger work on authorized repos.
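
To make the per-session IAM scoping idea concrete, the sketch below builds the arguments for an STS AssumeRole call with session tags and an inline session policy. It is an assumption-laden illustration: the table's partition key is assumed to be the user ID, the action list is a guess, and the function name is hypothetical. `RoleArn`, `RoleSessionName`, `DurationSeconds`, `Policy`, and `Tags` are real AssumeRole parameters, and `dynamodb:LeadingKeys` is a real IAM condition key.

```python
import json


def build_assume_role_request(role_arn, table_arn, user_id, repo, task_id):
    """Return keyword arguments suitable for an STS AssumeRole call."""
    # Session policy: DynamoDB access limited to items whose partition key
    # equals this user's id (assumes user_id is the partition key).
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {"dynamodb:LeadingKeys": [user_id]}
                },
            }
        ],
    }
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"task-{task_id}"[:64],  # STS caps names at 64 chars
        "DurationSeconds": 3600,
        "Policy": json.dumps(session_policy),
        "Tags": [
            {"Key": "user_id", "Value": user_id},
            {"Key": "repo", "Value": repo},
            {"Key": "task_id", "Value": task_id},
        ],
    }
```

With boto3 installed, the result could be passed as `boto3.client("sts").assume_role(**request)`; the sketch stops short of that call so it runs without AWS credentials.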

Agent quality

Capability | Description
Autonomous feedback loop | Extend the orchestrator state machine beyond PR_OPENED with a PR watcher phase. Auto-resume the agent when CI fails (inject failure logs), merge conflicts arise (rebase instructions), or reviewers request changes (inline comments). Continue the loop until the PR is merged or a human cancels. Optionally auto-merge when CI passes and review is approved. Transforms ABCA from "open PR" to "merge PR".
Tiered validation pipeline | Three post-agent tiers: tool validation (build/test/lint), code quality (DRY/SOLID/complexity), risk and blast radius analysis.
In-pipeline build/lint fix-up loop | Today the agent path is linear (clone → code → build → lint → PR); a post-change verify_build / verify_lint failure fails the task. Instead, loop back into the agent with the failure output as extra context, up to a configurable retry count, then fail only if fixes are exhausted, while still respecting the existing max_turns budget. Likely implementable in pipeline.py (after run_agent(), on verification failure re-invoke the agent) without orchestrator changes; distinct from the Autonomous feedback loop (PR/CI after the PR exists).
In-pipeline pre-PR self-review | Post-hooks already run build / lint, but the LLM is not prompted to self-review its own diff before the PR. Add an optional in-pipeline step: surface the change set (diff), have the model critique it (bugs, style, edge cases, test gaps), then iterate on fixes, within the same max_turns / budget constraints. Aims to improve first-pass PR quality before human or CI review; implementable alongside other pipeline.py phases.
PR risk classification | Rule-based risk classifier at submission. Drives model selection, budget defaults, approval requirements.
PR scope creep check (pr_review) | Add an advisory-first scope analysis in pr_review that compares declared intent (task description / issue / PR narrative) to the actual diff and touched areas. Return structured output with scope_rating (within_scope/mild_expansion/significant_expansion/likely_scope_creep), confidence, and rationale (files, API/schema/config changes, unrelated dependency churn). Start as non-blocking reviewer guidance; optional policy gates can be enabled later for high-risk repos.
Review feedback memory loop | Capture PR review comments via webhook, extract rules via LLM, persist as searchable memory.
PR outcome tracking | Track merge/reject via GitHub webhooks. Positive/negative signals feed evaluation and memory.
Evaluation pipeline | Failure categorization, memory effectiveness metrics (merge rate, revision cycles, CI pass rate).
A/B prompt experiments | Assign prompt variants per task or cohort; compare merge rate, failure rate, and token usage with statistical guardrails.
LLM-assisted trace analysis | Automated deep dive on failed trajectories (logs + spans) to surface recurring reasoning and tool-use failure modes.
Validation and risk analytics | Dashboards for PR risk labels, validation outcomes, and trends by repo, user, and prompt_version; eventually feed learned memory rules into Tier 2 when the tiered pipeline ships.
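
The in-pipeline build/lint fix-up loop described in the table could take roughly the following shape. This is a hypothetical sketch, not the contents of pipeline.py: the callable signatures (an agent runner that returns the turns it consumed, a verifier that returns failure output or None) and the name `run_with_fixups` are assumptions.

```python
from typing import Callable, Optional


def run_with_fixups(
    run_agent: Callable[[Optional[str]], int],
    verify: Callable[[], Optional[str]],
    max_turns: int,
    max_fix_attempts: int = 2,
) -> bool:
    """Run the agent, then re-invoke it with failure output until checks pass.

    run_agent(extra_context) returns the number of turns it consumed.
    verify() returns None on success, or the build/lint failure output.
    """
    turns_used = run_agent(None)
    for attempt in range(max_fix_attempts + 1):
        failure = verify()
        if failure is None:
            return True
        # Stop when retries are exhausted or the turn budget is spent.
        if attempt == max_fix_attempts or turns_used >= max_turns:
            return False
        turns_used += run_agent(failure)
    return False
```

The turn counter is shared across attempts, so the fix-up loop cannot exceed the task's existing max_turns budget.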

Memory security

Capability | Description
Trust-aware retrieval | Weight memories by freshness, source type, pattern consistency.
Temporal decay | Configurable per-entry TTL with faster decay for unverified content.
Anomaly detection | CloudWatch metrics on write patterns; alarms for burst writes or suspicious content.
Quarantine and rollback | Operator API for isolating suspicious entries and restoring pre-task snapshots.
Write-ahead validation | Route proposed memory writes through a guardian model.
Review feedback quorum | Promote review-derived rules to persistent memory only after corroboration (e.g. pattern seen across trusted reviewers and PRs), reducing single-comment poisoning. Complements Review feedback memory loop.
Memory backup to S3 | Scheduled export of AgentCore Memory namespaces to versioned S3 for disaster recovery and pre-poisoning restore (see design: SECURITY.md).
Memory extraction replay | Operator API (e.g. start_memory_extraction_job) to re-run failed PR-review extraction after webhook or Lambda errors.
Structured knowledge graph (tier 4) | Optional long-term direction if semantic + episodic memory proves insufficient for repo-specific query patterns.
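
One way to combine trust-aware retrieval with temporal decay is to scale each memory's similarity score by a source weight and an exponential decay. The weights, half-lives, and names below are assumptions chosen for illustration; the roadmap does not specify a formula.

```python
# Assumed trust weights and half-lives; unverified content decays faster.
SOURCE_TRUST = {"verified": 1.0, "unverified": 0.5}
HALF_LIFE_DAYS = {"verified": 90.0, "unverified": 30.0}


def trust_weight(similarity: float, age_days: float, source: str) -> float:
    """Scale a similarity score by source trust and exponential age decay."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[source])
    return similarity * SOURCE_TRUST[source] * decay
```

Under these assumed values, a 30-day-old unverified memory with perfect similarity scores 0.25, while a verified one of the same age scores about 0.79, so unverified content is ranked lower and ages out sooner.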

Security (execution guardrails)

Capability | Description
Behavioral circuit breaker | Per-session limits on tool-call rate, cumulative cost, consecutive failures, and file churn; pause or terminate when thresholds are exceeded. Configurable per repo via Blueprint (design: SECURITY.md, REPO_ONBOARDING.md).
Tool capability tiers | Opt-in extended tool profile per repo: MCP servers, plugins, and additional Gateway-mediated tools beyond the default minimal surface (COMPUTE.md). Enforced at Gateway and policy layers.
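
A behavioral circuit breaker could be sketched as a small per-session accumulator. Only two of the listed signals (cumulative cost and consecutive failures) are modeled here, and the choice of which threshold pauses versus terminates is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    max_cost_usd: float
    max_consecutive_failures: int
    cost_usd: float = 0.0
    consecutive_failures: int = 0

    def record(self, cost_usd: float, ok: bool) -> str:
        """Record one tool call and return 'continue', 'pause', or 'terminate'."""
        self.cost_usd += cost_usd
        # A success resets the failure streak; a failure extends it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        if self.cost_usd > self.max_cost_usd:
            return "terminate"
        if self.consecutive_failures >= self.max_consecutive_failures:
            return "pause"
        return "continue"
```

For example, with `CircuitBreaker(max_cost_usd=1.0, max_consecutive_failures=3)`, three failed calls in a row produce "pause", and any call that pushes cumulative cost above $1.00 produces "terminate".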

Channels and integrations

Capability | Description
Multi-modal input | Accept images in task payload (screenshots, UI mockups, design specs).
Additional git providers | GitLab (and optionally Bitbucket). Same workflow, provider-specific API adapters.
Slack integration | Submit tasks, check status, receive notifications from Slack. Block Kit rendering.
Control panel | Web UI: task list, task detail with logs/traces, cancel, metrics dashboards, cost attribution.
Real-time event streaming | WebSocket API for live task updates. Replaces polling for CLI, control panel, Slack.
Outbound notification pipeline | Canonical internal notification schema emitted on task lifecycle events; channel adapters (Slack, email, CLI) render and deliver. Complements polling and WebSocket.
Per-user notification preferences | DynamoDB (or equivalent) store for preferred channels, per-channel config, and event filters (INPUT_GATEWAY.md).
Browser extension channel | Lightweight extension to open tasks from GitHub issue/PR pages using existing webhook or OAuth-issued JWT; same internal message contract as other channels.

Compute and performance

Capability | Description
Adaptive model router | Per-turn model selection by complexity. Cheaper models for reads, Opus for complex reasoning. Targets a ~30-40% cost reduction.
Alternative compute | ECS/Fargate or EKS via ComputeStrategy interface. For workloads exceeding AgentCore's 2 GB image limit or requiring GPU.
Environment pre-warming | Pre-build container layers per repo. Snapshot-on-schedule (rebuild on push). Reduces cold start from minutes to seconds.

Onboarding and repo lifecycle

Capability | Description
Automated re-onboarding | Event-driven refresh of blueprint-related artifacts when the default branch changes materially (GitHub webhook); optional EventBridge schedule for periodic drift checks. Distinct from Scheduled triggers (task creation).
Dynamic onboarding artifacts | When repo hygiene is weak, generate attachments for the agent context: codebase summaries, dependency graphs, suggested rules from layout (REPO_ONBOARDING.md).

Cost governance

Capability | Description
Org and team budgets | Per-user and per-team monthly token or USD budgets with alerting (e.g. 80%) and optional hard stop at 100%.
Automated model downgrade | When approaching budget limits, shift to a cheaper default model for new work (e.g. Sonnet → Haiku) per policy.
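
The budget thresholds in the table can be expressed as a small decision function. The 80% alert ratio and the 100% hard stop come from the table; the function name and return values are hypothetical.

```python
def budget_action(spent: float, limit: float,
                  alert_ratio: float = 0.8, hard_stop: bool = True) -> str:
    """Return 'ok', 'alert', or 'block' for a monthly budget."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = spent / limit
    if ratio >= 1.0 and hard_stop:
        return "block"
    if ratio >= alert_ratio:
        # An automated model downgrade could also be triggered at this point.
        return "alert"
    return "ok"
```

With the hard stop disabled, spending past 100% keeps returning "alert" rather than blocking new work.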

Observability and safe deploy

Capability | Description
Admission backlog observability | Metric and alarm when SUBMITTED task depth exceeds an operator threshold (capacity and admission health).
Safe orchestrator deploys | Pre-deploy checks for active tasks (drain or warn); blue-green or canary Lambda deploy for the durable orchestrator with rollback on error regressions (OBSERVABILITY.md).

Scale and collaboration

Capability | Description
Multi-user and teams | Team visibility, shared approval queues, team concurrency/cost budgets, memory isolation.
Agent swarm | Planner-worker architecture for complex multi-file tasks. DAG of subtasks, merge orchestrator, one consolidated PR.
Configurable human-in-the-loop (operator gate) | Blueprint- or task-level workflow with multiple gates per run (for example: draft plan → operator review → implement → operator review). Each gate is configured as human-review (pause until the operator responds) or auto (continue through that checkpoint without blocking). Gates can be mixed in one run, for example plan review enforced and implementation review auto. Resume, timeout, and cancel stay well-defined at every human-review wait. Complements Iterative feedback (soft inject without a hard pause).
Iterative feedback | Follow-up instructions to running tasks. Multiple users inject context. Per-prompt commit attribution.
Scheduled triggers | Cron-based task creation via EventBridge (dependency updates, nightly flaky test checks).

Platform maturity

Capability | Description
Unified liveness decision model (follow-up design ticket) | Normalize task health evaluation across compute backends so heartbeat, compute session status, and DynamoDB state are handled through a single typed decision path. Define explicit backend capabilities (for example, heartbeat support), deterministic precedence rules for terminal outcomes, and regression tests that prevent cross-runtime false failures like ECS heartbeat mismatch.
Pure decision function orchestrator refactor | Extract orchestrator decision logic into pure functions that take a frozen snapshot and return a typed action. Side-effectful execution applies actions with CAS (compare-and-swap) guards on DynamoDB updated_at to prevent stale writes. Makes the orchestrator exhaustively unit-testable without mocking I/O, eliminates competing-worker race conditions, and is a prerequisite for the autonomous feedback loop.
Blueprint custom steps and step sequences | Lambda-backed pre-agent / post-agent steps and optional step_sequence overrides with CDK synth + runtime validation and INVALID_STEP_SEQUENCE on misconfiguration (REPO_ONBOARDING.md, ORCHESTRATOR.md).
Blueprint RepoConfig parity | Extend the Blueprint construct to persist per-repo default max_budget_usd and memory_token_budget in DynamoDB (orchestrator already merges max_budget_usd when present; hydration uses a fixed memory token cap today).
Orchestrator DLQ | Dead-letter path for task orchestration after retry exhaustion so operators can inspect and replay failed durable executions (ORCHESTRATOR.md).
Automated stuck-task reconciliation | Scheduled job beyond passive alarms: detect tasks stuck in non-terminal states longer than policy allows and drive explicit resume, fail, or notify (ORCHESTRATOR.md).
Task lifecycle push notifications | On terminal transitions, publish events (e.g. EventBridge/SNS) for email, chat, or Per-user notification preferences without requiring clients to poll.
CDK constructs library | Publish reusable constructs to Construct Hub with semver versioning.
Centralized policy framework | Unified Cedar-based framework with PolicyDecisionEvent audit schema. Three enforcement modes with observe-before-enforce rollout.
Formal verification | TLA+ specification of task state machine, concurrency, cancellation races, reconciler interleavings.
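
The pure decision function refactor described in the table might look like the sketch below: a pure `decide` function over a frozen snapshot, and a separate apply step guarded by a compare-and-swap on updated_at. The field names, the 300-second heartbeat timeout, and the in-memory dict standing in for DynamoDB are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


@dataclass(frozen=True)
class Snapshot:
    state: str
    heartbeat_age_s: float
    updated_at: str


@dataclass(frozen=True)
class Action:
    kind: str  # "noop" or "fail" in this sketch
    reason: Optional[str] = None


def decide(snapshot: Snapshot, heartbeat_timeout_s: float = 300.0) -> Action:
    """Pure function: no I/O, same snapshot always yields the same action."""
    if snapshot.state in TERMINAL_STATES:
        return Action("noop")
    if snapshot.state == "RUNNING" and snapshot.heartbeat_age_s > heartbeat_timeout_s:
        return Action("fail", "heartbeat_timeout")
    return Action("noop")


def apply_with_cas(store: dict, task_id: str, snapshot: Snapshot,
                   action: Action, now: str) -> bool:
    """Apply an action only if the record is unchanged since the snapshot.

    The dict stands in for a DynamoDB conditional update on updated_at.
    """
    item = store[task_id]
    if item["updated_at"] != snapshot.updated_at:
        return False  # stale snapshot: another worker wrote first
    if action.kind == "fail":
        item["state"] = "FAILED"
        item["updated_at"] = now
    return True
```

Because `decide` touches no I/O, its full input space can be unit-tested directly, and the CAS guard makes a second worker holding an outdated snapshot a harmless no-op.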

Design docs to keep in sync: ARCHITECTURE.md, ORCHESTRATOR.md, API_CONTRACT.md, INPUT_GATEWAY.md, REPO_ONBOARDING.md, MEMORY.md, OBSERVABILITY.md, COMPUTE.md, SECURITY.md, EVALUATION.md.