
12. Operations, Testing, and Roadmap

Startup and Runtime Operations

Primary startup flow:

  • bun run start

This runs preflight checks in scripts/start.ts before launching the full stack, verifying (a sketch of the fail-fast pattern follows the list):

  • required config files are present,
  • the LLM endpoint is ready (with an optional LM Studio bootstrap),
  • the integration branch/worktree is in a safe state,
  • the worker Docker image is ready,
  • the startup warmup job path is exercised.
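
As a rough illustration of the fail-fast pattern (not the actual scripts/start.ts implementation; the check names, config path, and endpoint URL below are placeholders), a preflight runner executes each check in order and aborts with a labeled error on the first failure:

```ts
// preflight-sketch.ts — minimal sketch of a fail-fast preflight runner.
// Check names, file paths, and the LLM endpoint URL are illustrative only.
import { existsSync } from "node:fs";

type Check = { name: string; run: () => Promise<void> };

const checks: Check[] = [
  {
    name: "required config files",
    run: async () => {
      for (const f of ["config.json"]) { // placeholder path
        if (!existsSync(f)) throw new Error(`missing ${f}`);
      }
    },
  },
  {
    name: "LLM endpoint readiness",
    run: async () => {
      const res = await fetch("http://localhost:1234/v1/models"); // placeholder URL
      if (!res.ok) throw new Error(`endpoint returned ${res.status}`);
    },
  },
];

for (const check of checks) {
  try {
    await check.run();
    console.log(`ok: ${check.name}`);
  } catch (err) {
    console.error(`preflight failed: ${check.name}:`, err);
    process.exit(1); // abort before launching the stack
  }
}
console.log("all preflights passed; launching full stack");
```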

Useful alternatives:

  • bun run dev:full for direct multi-service launch.
  • individual *:only scripts for targeted debugging.

Fast Runbook Commands

  • Full stack with preflights:
    • bun run start
  • Full stack without preflight wrapper:
    • bun run dev:full
  • Integration harness:
    • bun run test:integration
  • Eval harness:
    • bun run test:integration:eval
  • VS Code extension package/lint:
    • bun run vscode:client:lint
    • bun run vscode:client:package

Local Environment Expectations

Baseline tooling (a PATH check sketch follows the list):

  • Bun
  • Python 3.12+
  • Docker (for default worker flow)
  • Git (and optionally GitHub CLI for PR workflows)
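
A quick way to confirm this tooling is available is a small Bun script; a minimal sketch, assuming the usual binary names (python3 in particular may differ on your system):

```ts
// env-check.ts — verify baseline tooling is installed and on PATH.
const required = ["bun", "python3", "docker", "git"]; // binary names assumed

// Bun.which resolves an executable name to its path, or null if absent.
const missing = required.filter((bin) => Bun.which(bin) === null);

if (missing.length > 0) {
  console.error(`missing tools: ${missing.join(", ")}`);
  process.exit(1);
}
console.log("baseline tooling present");
```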

Logging and Debugging

Where to look first:

  • service terminal logs from dev:full or start.
  • server queue snapshots (/requests, /jobs, /completions).
  • WorkerPals logs and job logs via the server's job log endpoints.
  • integration logs from SourceControlManager.

For session behavior:

  • inspect the event stream (/sessions/:id/events, which supports cursor-based replay); a client-side sketch follows.
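
A minimal client-side sketch of cursor replay, assuming the endpoint accepts a cursor query parameter and returns a { events, cursor } JSON payload (the base URL, parameter name, and response shape are assumptions, not the documented API):

```ts
// replay-events.ts — sketch of cursor-based event replay for one session.
// The base URL, query parameter name, and response shape are assumptions.
const base = "http://localhost:3000"; // placeholder server address
const sessionId = process.argv[2] ?? "demo";

let cursor: string | null = null;
while (true) {
  const url = new URL(`${base}/sessions/${sessionId}/events`);
  if (cursor) url.searchParams.set("cursor", cursor);

  const res = await fetch(url);
  if (!res.ok) throw new Error(`events request failed: ${res.status}`);

  const body = (await res.json()) as { events: unknown[]; cursor: string | null };
  for (const event of body.events) console.log(JSON.stringify(event));

  if (!body.cursor || body.events.length === 0) break; // caught up
  cursor = body.cursor; // resume from where the last page ended
}
```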

Incident Triage Order

When the system is "stuck", diagnose in this order (a scripted snapshot of steps 2–4 follows the list):

  1. Server health and session event progression.
  2. Request queue movement.
  3. Job queue movement and worker heartbeat.
  4. Completion queue movement and SCM processing.
  5. Client transport/reconnect state.
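
Steps 2–4 can be snapshotted in one pass. A hedged sketch that polls the queue endpoints named under Logging and Debugging in triage order (the base URL and response shapes are assumptions):

```ts
// triage-snapshot.ts — poll queue endpoints in triage order and print counts.
// The base URL is an assumption; /requests, /jobs, /completions match the
// endpoints listed in "Logging and Debugging" above.
const base = "http://localhost:3000"; // placeholder server address

for (const path of ["/requests", "/jobs", "/completions"]) {
  try {
    const res = await fetch(base + path);
    const body = await res.json();
    // If the snapshot is an array, report its length; otherwise just status.
    const count = Array.isArray(body) ? body.length : "?";
    console.log(`${path}: status=${res.status} items=${count}`);
  } catch (err) {
    console.error(`${path}: unreachable (${err})`);
  }
}
```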

Testing Layers

  • Unit/integration tests (TypeScript + Python harnesses).
  • End-to-end integration harness:
    • tests/integration/integration_controller.py
    • tests/integration/test_workerpals_e2e.py
  • Eval scenarios:
    • tests/integration/eval_scenarios.swebench_like.json

The integration controller supports two modes:

  • integration: regular flow checks.
  • eval: backend quality benchmark runs with scenario suites and budgets (see the budget sketch below).
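
To make "budgets" concrete: an eval run typically caps each scenario's cost, such as wall-clock time (token or step budgets work the same way). A hypothetical sketch in TypeScript, not the controller's actual logic (the real controller is Python, in tests/integration/integration_controller.py; the Scenario shape and runScenario callback are invented for illustration):

```ts
// eval-budget-sketch.ts — hypothetical per-scenario wall-clock budget gate.
type Scenario = { id: string; budgetMs: number };

async function runWithBudget(
  scenario: Scenario,
  runScenario: (id: string) => Promise<boolean>, // illustrative callback
): Promise<"pass" | "fail" | "over-budget"> {
  // Resolve with "over-budget" if the scenario outlives its budget.
  const timeout = new Promise<"over-budget">((resolve) =>
    setTimeout(() => resolve("over-budget"), scenario.budgetMs),
  );
  const run = runScenario(scenario.id).then(
    (ok): "pass" | "fail" => (ok ? "pass" : "fail"),
  );
  // Whichever settles first decides the scenario's outcome.
  return Promise.race([run, timeout]);
}
```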

Tradeoffs

Pros:

  • strong operational discipline and reproducibility,
  • realistic benchmark path for backend quality.

Cons:

  • setup complexity is higher than that of simple single-agent tools,
  • Docker and multi-service orchestration increase local troubleshooting load.

Future Improvements

  1. Observability
    • distributed trace IDs across request/job/completion lifecycle,
    • richer metrics and dashboards for latency, failure categories, and retries.
  2. Reliability
    • dead-letter queues and replay tools,
    • stronger backpressure and overload controls.
  3. DX
    • one-command diagnostics report,
    • clearer startup failure classification with remediation hints.
  4. Autonomy quality
    • objective outcome attribution loops,
    • model/prompt benchmark gating before production rollout.
  5. Platform hardening
    • stricter schema evolution checks,
    • stronger integration of policy checks into CI.
