
Commit 331a4c6

zmanian, claude, and ilblackdragon authored
Trajectory benchmarks and e2e trace test rig (nearai#553)
* refactor: extract shared assertion helpers to support/assertions.rs. Move 5 assertion helpers from e2e_spot_checks.rs into a shared module; add assert_all_tools_succeeded and assert_tool_succeeded to eliminate false positives in E2E tests.
* feat: add tool output capture via tool_results() accessor. Extract (name, preview) from ToolResult status events in TestChannel and TestRig, enabling content assertions on tool outputs.
* fix: correct tool parameters in 3 broken trace fixtures. tool_time.json and robust_correct_tool.json gain the missing "operation": "now" for the time tool; memory_full_cycle.json changes "path" to "target" for memory_write.
* fix: add tool success and output assertions to eliminate false positives. Every E2E test that exercises tools now calls assert_all_tools_succeeded; tool output content assertions were added where results are predictable (time year, read_file content, memory_read content).
* feat: capture per-tool timing from ToolStarted/ToolCompleted events. Record an Instant on ToolStarted and compute the elapsed duration on ToolCompleted, wiring real timing data into collect_metrics() instead of hardcoded zeros.
* refactor: add RAII CleanupGuard for temp file/dir cleanup in tests. Replace manual cleanup_test_dir() calls and inline remove_file() with a Drop-based CleanupGuard that ensures cleanup even if a test panics.
* fix: add Drop impl and graceful shutdown for TestRig. Wrap agent_handle in Option so Drop can abort leaked tasks; signal channel shutdown before aborting, for future cooperative shutdown.
* fix: replace agent startup sleep with oneshot ready signal. Use a oneshot channel fired in Channel::start() instead of a fixed 100ms sleep, eliminating a race condition on slow systems.
* fix: replace fragile string-matching iteration-limit detection with count-based detection. Compare the tool completion count against max_tool_iterations instead of scanning status messages for "iteration"/"limit" substrings.
* fix: use assert_all_tools_succeeded for the memory_full_cycle test. Remove an incorrect comment claiming memory_tree fails with an empty path (it actually succeeds), omit the empty path from the fixture, and use the standard assert_all_tools_succeeded instead of per-tool assertions.
* refactor: promote benchmark metrics types to library code. Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs; existing tests use a re-export for backward compatibility.
* feat: add Scenario and Criterion types for agent benchmarking. Scenario defines a task with input, success criteria, and resource limits; Criterion is an enum of programmatic checks (tool_used, response_contains, etc.) evaluated without LLM judgment.
* feat: add initial benchmark scenario suite (12 scenarios across 5 categories). Scenarios cover tool_selection, tool_chaining, error_recovery, efficiency, and memory_operations; all are loaded from JSON with a deserialization validation test.
* feat: add benchmark runner with BenchChannel and InstrumentedLlm. BenchChannel is a minimal Channel implementation for benchmarks; InstrumentedLlm wraps any LlmProvider to capture per-call metrics. The runner creates a fresh agent per scenario, evaluates success criteria, and produces a RunResult with timing, token, and cost metrics.
* feat: add baseline management, reports, and a benchmark entry point. baseline.rs loads/saves/promotes benchmark results; report.rs formats comparison reports with regression detection; benchmark_runner.rs is a feature-gated integration test against a real LLM; a benchmark feature flag is added to Cargo.toml.
* style: apply cargo fmt to the benchmark module.
* feat(benchmark): add multi-turn scenario types with setup, judge, and ResponseNotContains. Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup, WorkspaceSetup, and SeedDocument types for multi-turn benchmark scenarios, a ResponseNotContains criterion variant, and a TurnAssertions::to_criteria() converter for backward compatibility with the existing evaluation engine.
* feat(benchmark): add a JSON scenario loader with recursive discovery and tag filtering. Add load_bench_scenarios() for the new BenchScenario format with recursive directory traversal and tag-based filtering, plus 4 initial trajectory scenarios across tool-selection, multi-turn, and efficiency categories.
* feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics. Add run_bench_scenario(), which loops over BenchScenario turns, seeds workspace documents, collects per-turn metrics (tokens, tool calls, wall time), and evaluates per-turn assertions; add TurnMetrics to metrics.rs and clear_for_next_turn() to BenchChannel.
* feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing. Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn; wire it into run_bench_scenario for turns with a judge config, where scores below min_score fail the turn.
* feat(benchmark): add a CLI subcommand (ironclaw benchmark). Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout, and --update-baseline flags; wire it into the Command enum and main.rs dispatch, feature-gated behind the benchmark flag.
* feat(benchmark): per-scenario JSON output with full trajectory. Add save_scenario_results(), which writes per-scenario JSON files alongside the run summary; each scenario gets its own file with a turn_metrics trajectory, and the CLI uses the new output format.
* feat(benchmark): add ToolRegistry::retain_only and wire tool filtering into scenarios. retain_only() filters tools down to a given allowlist; run_bench_scenario() uses it so that when a scenario specifies a tools list in its setup, only those tools are available during the run. Includes two tests: one verifying filtering works and one verifying that empty input is a no-op.
* feat(benchmark): wire identity overrides into the workspace before agent start. Add a seed_identity() helper that writes identity files (IDENTITY.md, USER.md, etc.) into the workspace before the agent starts so that workspace.system_prompt() picks them up; wire it into run_bench_scenario() after workspace seeding, with a test verifying the identity files are written and readable.
* feat(benchmark): add --parallel and --max-cost CLI flags.
* fix(benchmark): use feature-conditional snapshot names for CLI help tests. Prevents snapshot conflicts between default (no benchmark) and all-features (with benchmark) builds by using separate snapshot names per feature set.
* feat(benchmark): parallel execution with JoinSet and budget-cap enforcement. Replace the sequential loop in run_all_bench() with parallel execution using a JoinSet plus a semaphore when config.parallel > 1; add budget-cap enforcement that skips remaining scenarios once max_total_cost_usd is exceeded, tracking the skipped count in RunResult.skipped_scenarios and displaying it in format_report().
* feat(benchmark): add tool-restriction and identity-override test scenarios.
* chore: fix formatting for Phase 3.
* feat(benchmark): add SkillRegistry::retain_only and wire skill filtering into scenarios.
* feat(benchmark): add a --json flag for machine-readable output.
* ci: add a GitHub Actions benchmark workflow (manual trigger).
* refactor(benchmark): remove the in-tree benchmark harness, keeping the retain_only utilities. Moves benchmark-specific code out of ironclaw in preparation for the nearai/benchmarks trajectory adapter. Removed: src/benchmark/ (runner, scenarios, metrics, judge, report, etc.), src/cli/benchmark.rs and the Benchmark CLI subcommand, the benchmarks/ data directory (scenarios + trajectories), .github/workflows/benchmark.yml, and the "benchmark" Cargo feature flag. What remains: ToolRegistry::retain_only() and SkillRegistry::retain_only(), plus test support types (TraceMetrics, InstrumentedLlm) inlined into tests/support/ instead of re-exported from the deleted module.
* docs: add a README for the LLM trace fixture format. Documents the trajectory JSON format, response types, request hints, directory structure, and how to write new traces.
* feat(test): unify the trace format around turns and add multi-turn support. Introduce a TraceTurn type that groups user_input with LLM response steps, making traces self-contained conversation trajectories; add run_trace() to TestRig for automatic multi-turn replay. Backward compatible: flat "steps" JSON is deserialized as a single turn transparently. Includes all trace fixtures (spot, coverage, advanced), plan docs, and new e2e tests for steering, error recovery, long chains, memory, and prompt-injection resilience.
* fix(test): fix CI failures after merging main. Fix the tool_json fixture to use the "data" parameter (not "input") to match the JsonTool schema; remove the status_events assertion for a "time" tool that isn't in the fixture (only "echo" calls are used); allow dead_code in the test support metrics/instrumented_llm modules (utilities for future benchmark tests). [skip-regression-check]
* Working on recording traces and testing them.
* feat(test): add declarative expects to trace fixtures and split out infra tests. Add a TraceExpects struct with 9 optional assertion fields (response_contains, tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON instead of hand-written Rust; add verify_expects() and run_recorded_trace() so recorded trace tests become one-liners; split trace infra tests (deserialization, backward compat) into tests/trace_format.rs, which doesn't require the libsql feature gate.
* refactor(test): add expects to all trace fixtures and simplify e2e tests. Add declarative expects blocks to all 19 trace fixture JSONs across spot/, coverage/, advanced/, and the root directory; update all 8 e2e test files to use verify_trace_expects() / run_and_verify_trace(), replacing ~270 lines of hand-written assertions with fixture-driven verification. Tests that check things beyond expects (file content on disk, metrics, event ordering) keep those extra assertions alongside the declarative ones.
* fix(test): adapt tests to the AppBuilder refactor and fix formatting. Update test files to work with the refactored TestRigBuilder that uses AppBuilder::build_all() (removing the with_tools/with_workspace methods); update the telegram_check fixture to use tool_list instead of echo; fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs.
* refactor(test): deduplicate support unit tests into a single binary. The support modules (assertions, cleanup, test_channel, test_rig, trace_llm) had #[cfg(test)] mod tests blocks that were compiled and run 12 times, once per e2e test binary that declares `mod support;`. All 29 support unit tests are extracted into a dedicated `tests/support_unit_tests.rs` so they run exactly once. [skip-regression-check]
* style: fix trailing newlines in support files.
* refactor(test): unify trace types and fix recorded multi-turn replay. Import the shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint, ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from ironclaw::llm::recording instead of redefining them in trace_llm.rs; fix the flat-steps deserializer to split at UserInput boundaries into multiple turns instead of filtering them out and wrapping everything into a single turn, enabling recorded multi-turn traces to be replayed as proper multi-turn conversations via run_trace(). [skip-regression-check]
* fix(test): fix CI failures from unused imports and missing struct fields. Add #[allow(unused_imports)] to the pub use re-exports in trace_llm.rs (the types are re-exported for downstream test files, not used locally); add `..` to the ToolCompleted pattern in test_channel.rs to match the new `error` and `parameters` fields.
* fix(test): fix CI failures after merging main. Add the missing `error` and `parameters` fields to the ToolCompleted constructors in support_unit_tests.rs, add `..` to its ToolCompleted pattern match, and add #[allow(dead_code)] to CleanupGuard, the LlmTrace impl, and the TraceLlm impl (only used behind #[cfg(feature = "libsql")]).
* Adding a coverage-running script.
* fix(test): address review feedback on the E2E test infrastructure. Switch wait_for_responses polling to exponential backoff (50ms-500ms) and raise the default timeout from 15s to 30s to reduce CI flakiness (nearai#1); strengthen the prompt_injection_resilience test with a positive safety-layer assertion via has_safety_warnings() and enable injection_check (nearai#2); add an assert_tool_order() helper and a tools_order field in TraceExpects for verifying tool execution order in multi-step traces (nearai#3); document the TraceLlm sequential-call assumption for concurrency (nearai#6); clean up CleanupGuard with a PathKind enum instead of shotgunning remove_file + remove_dir_all on every path (nearai#8); fix coverage.sh to default to --lib only, fix the multi-filter syntax, and add a COV_ALL_TARGETS option; add coverage/ to .gitignore; remove the planning docs from the PR. [skip-regression-check]
* fix: address PR review by using HashSet in retain_only and improving the skill test. Use a HashSet for O(N+M) lookup in SkillRegistry::retain_only and ToolRegistry::retain_only instead of a linear scan; strengthen test_retain_only_empty_is_noop in SkillRegistry by pre-populating it with a skill before asserting the no-op behavior. [skip-regression-check]
* fix(test): revert the incorrect safety-layer assertion in the injection test. The safety layer sanitizes tool output, not user input; the injection test sends a malicious user message with no tools called, so the safety layer never fires. Reverted to the original test, which correctly validates that the LLM refuses via trace expects. Also fixed a case-sensitive request hint ("ignore" -> "Ignore") to suppress a noisy warning. [skip-regression-check]
* fix: clean stale profdata before a coverage run. Adds `cargo llvm-cov clean` before each run to prevent "mismatched data" warnings from stale instrumentation profiles. [skip-regression-check]
* style: fix formatting in the retain_only test. [skip-regression-check]

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>
1 parent f355dba commit 331a4c6


68 files changed: +7469 −11 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -16,6 +16,9 @@ target/
 # Benchmark results (local runs, not committed)
 bench-results/

+# Coverage reports (local runs, not committed)
+coverage/
+
 # WASM build artifacts (loaded from disk, not bundled)
 *.wasm

scripts/coverage.sh

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
#!/usr/bin/env bash
# Generate an HTML coverage report for a given set of tests.
#
# Usage:
#   ./scripts/coverage.sh                       # all tests (lib only)
#   ./scripts/coverage.sh safety                # tests matching "safety"
#   ./scripts/coverage.sh safety::sanitizer     # specific module tests
#   ./scripts/coverage.sh test_a test_b test_c  # multiple test filters
#
# Options (env vars):
#   COV_OPEN=1          Auto-open the report in a browser (default: 1)
#   COV_FORMAT=html     Output format: html, text, json, lcov (default: html)
#   COV_OUT=coverage    Output directory (default: coverage/)
#   COV_FEATURES=""     Extra --features to pass (default: none)
#   COV_ALL_TARGETS=0   Set to 1 to include integration tests (default: lib only)
#
# Requires: cargo-llvm-cov (install: cargo install cargo-llvm-cov)

set -euo pipefail

COV_OPEN="${COV_OPEN:-1}"
COV_FORMAT="${COV_FORMAT:-html}"
COV_OUT="${COV_OUT:-coverage}"
COV_FEATURES="${COV_FEATURES:-}"
COV_ALL_TARGETS="${COV_ALL_TARGETS:-0}"

cd "$(git rev-parse --show-toplevel)"

if ! command -v cargo-llvm-cov &>/dev/null; then
  echo "ERROR: cargo-llvm-cov not found. Install with: cargo install cargo-llvm-cov"
  exit 1
fi

# Clean stale profiling data to avoid "mismatched data" warnings.
cargo llvm-cov clean --workspace 2>/dev/null || true

# Build the cargo llvm-cov command
cmd=(cargo llvm-cov)

# Features
if [[ -n "$COV_FEATURES" ]]; then
  cmd+=(--features "$COV_FEATURES")
else
  cmd+=(--all-features)
fi

# By default, only run the lib unit tests (fast, no integration test compilation).
# Set COV_ALL_TARGETS=1 to include integration tests.
if [[ "$COV_ALL_TARGETS" != "1" ]]; then
  cmd+=(--lib)
fi

# Output format
case "$COV_FORMAT" in
  html)
    cmd+=(--html --output-dir "$COV_OUT")
    ;;
  text)
    cmd+=(--text)
    ;;
  json)
    cmd+=(--json --output-path "$COV_OUT/coverage.json")
    ;;
  lcov)
    cmd+=(--lcov --output-path "$COV_OUT/lcov.info")
    ;;
  *)
    echo "ERROR: Unknown format '$COV_FORMAT'. Use: html, text, json, lcov"
    exit 1
    ;;
esac

# Test name filters (passed after -- to cargo test)
if [[ $# -gt 0 ]]; then
  if [[ $# -eq 1 ]]; then
    cmd+=(-- "$1")
  else
    # Join filters with | for regex matching
    filter=$(IFS='|'; echo "$*")
    cmd+=(-- "$filter")
  fi
fi

echo "Running: ${cmd[*]}"
echo ""

"${cmd[@]}"

# Open report
if [[ "$COV_FORMAT" == "html" && "$COV_OPEN" == "1" ]]; then
  index="$COV_OUT/html/index.html"
  if [[ -f "$index" ]]; then
    echo ""
    echo "Report: $index"
    if command -v open &>/dev/null; then
      open "$index"
    elif command -v xdg-open &>/dev/null; then
      xdg-open "$index"
    fi
  fi
fi

src/agent/agent_loop.rs

Lines changed: 2 additions & 0 deletions
@@ -75,6 +75,8 @@ pub struct AgentDeps {
     pub cost_guard: Arc<crate::agent::cost_guard::CostGuard>,
     /// SSE broadcast sender for live job event streaming to the web gateway.
     pub sse_tx: Option<tokio::sync::broadcast::Sender<crate::channels::web::types::SseEvent>>,
+    /// HTTP interceptor for trace recording/replay.
+    pub http_interceptor: Option<Arc<dyn crate::llm::recording::HttpInterceptor>>,
 }

 /// The main agent that coordinates all components.

src/agent/dispatcher.rs

Lines changed: 6 additions & 1 deletion
@@ -127,7 +127,9 @@ impl Agent {
         let mut context_messages = initial_messages;

         // Create a JobContext for tool execution (chat doesn't have a real job)
-        let job_ctx = JobContext::with_user(&message.user_id, "chat", "Interactive chat session");
+        let mut job_ctx =
+            JobContext::with_user(&message.user_id, "chat", "Interactive chat session");
+        job_ctx.http_interceptor = self.deps.http_interceptor.clone();

         let max_tool_iterations = self.config.max_tool_iterations;
         // Force a text-only response on the last iteration to guarantee termination
@@ -1066,6 +1068,7 @@ mod tests {
             hooks: Arc::new(HookRegistry::new()),
             cost_guard: Arc::new(CostGuard::new(CostGuardConfig::default())),
             sse_tx: None,
+            http_interceptor: None,
         };

         Agent::new(
@@ -1805,6 +1808,7 @@ mod tests {
             hooks: Arc::new(HookRegistry::new()),
             cost_guard: Arc::new(CostGuard::new(CostGuardConfig::default())),
             sse_tx: None,
+            http_interceptor: None,
         };

         Agent::new(
@@ -1917,6 +1921,7 @@ mod tests {
             hooks: Arc::new(HookRegistry::new()),
             cost_guard: Arc::new(CostGuard::new(CostGuardConfig::default())),
             sse_tx: None,
+            http_interceptor: None,
         };

         Agent::new(

src/agent/thread_ops.rs

Lines changed: 2 additions & 1 deletion
@@ -734,8 +734,9 @@ impl Agent {
                 }

                 // Execute the approved tool and continue the loop
-                let job_ctx =
+                let mut job_ctx =
                     JobContext::with_user(&message.user_id, "chat", "Interactive chat session");
+                job_ctx.http_interceptor = self.deps.http_interceptor.clone();

                 let _ = self
                     .channels

src/app.rs

Lines changed: 37 additions & 5 deletions
@@ -15,7 +15,7 @@ use crate::context::ContextManager;
 use crate::db::Database;
 use crate::extensions::ExtensionManager;
 use crate::hooks::HookRegistry;
-use crate::llm::{LlmProvider, SessionManager};
+use crate::llm::{LlmProvider, RecordingLlm, SessionManager};
 use crate::safety::SafetyLayer;
 use crate::secrets::SecretsStore;
 use crate::skills::SkillRegistry;
@@ -48,6 +48,7 @@ pub struct AppComponents {
     pub skill_registry: Option<Arc<std::sync::RwLock<SkillRegistry>>>,
     pub skill_catalog: Option<Arc<SkillCatalog>>,
     pub cost_guard: Arc<crate::agent::cost_guard::CostGuard>,
+    pub recording_handle: Option<Arc<RecordingLlm>>,
     pub session: Arc<SessionManager>,
     pub catalog_entries: Vec<crate::extensions::RegistryEntry>,
     pub dev_loaded_tool_names: Vec<String>,
@@ -71,6 +72,9 @@ pub struct AppBuilder {
     db: Option<Arc<dyn Database>>,
     secrets_store: Option<Arc<dyn SecretsStore + Send + Sync>>,

+    // Test overrides
+    llm_override: Option<Arc<dyn LlmProvider>>,
+
     // Backend-specific handles needed by secrets store
     #[cfg(feature = "postgres")]
     pg_pool: Option<deadpool_postgres::Pool>,
@@ -99,18 +103,34 @@ impl AppBuilder {
             log_broadcaster,
             db: None,
             secrets_store: None,
+            llm_override: None,
             #[cfg(feature = "postgres")]
             pg_pool: None,
             #[cfg(feature = "libsql")]
             libsql_db: None,
         }
     }

+    /// Inject a pre-created database, skipping `init_database()`.
+    pub fn with_database(&mut self, db: Arc<dyn Database>) {
+        self.db = Some(db);
+    }
+
+    /// Inject a pre-created LLM provider, skipping `init_llm()`.
+    pub fn with_llm(&mut self, llm: Arc<dyn LlmProvider>) {
+        self.llm_override = Some(llm);
+    }
+
     /// Phase 1: Initialize database backend.
     ///
     /// Creates the database connection, runs migrations, reloads config
     /// from DB, attaches DB to session manager, and cleans up stale jobs.
     pub async fn init_database(&mut self) -> Result<(), anyhow::Error> {
+        if self.db.is_some() {
+            tracing::debug!("Database already provided, skipping init_database()");
+            return Ok(());
+        }
+
         if self.flags.no_db {
             tracing::warn!("Running without database connection");
             return Ok(());
@@ -297,10 +317,17 @@ impl AppBuilder {
     #[allow(clippy::type_complexity)]
     pub fn init_llm(
         &self,
-    ) -> Result<(Arc<dyn LlmProvider>, Option<Arc<dyn LlmProvider>>), anyhow::Error> {
-        let (llm, cheap_llm) =
+    ) -> Result<
+        (
+            Arc<dyn LlmProvider>,
+            Option<Arc<dyn LlmProvider>>,
+            Option<Arc<RecordingLlm>>,
+        ),
+        anyhow::Error,
+    > {
+        let (llm, cheap_llm, recording_handle) =
             crate::llm::build_provider_chain(&self.config.llm, self.session.clone())?;
-        Ok((llm, cheap_llm))
+        Ok((llm, cheap_llm, recording_handle))
     }

     /// Phase 4: Initialize safety, tools, embeddings, and workspace.
@@ -653,7 +680,11 @@ impl AppBuilder {
         self.init_database().await?;
         self.init_secrets().await?;

-        let (llm, cheap_llm) = self.init_llm()?;
+        let (llm, cheap_llm, recording_handle) = if let Some(llm) = self.llm_override.take() {
+            (llm, None, None)
+        } else {
+            self.init_llm()?
+        };
         let (safety, tools, embeddings, workspace) = self.init_tools(&llm).await?;

         // Create hook registry early so runtime extension activation can register hooks.
@@ -765,6 +796,7 @@ impl AppBuilder {
             skill_registry,
             skill_catalog,
             cost_guard,
+            recording_handle,
             session: self.session,
             catalog_entries,
             dev_loaded_tool_names,

src/config/agent.rs

Lines changed: 20 additions & 0 deletions
@@ -30,6 +30,26 @@ pub struct AgentConfig {
 }

 impl AgentConfig {
+    /// Create a test-friendly config without reading env vars.
+    #[cfg(feature = "libsql")]
+    pub fn for_testing() -> Self {
+        Self {
+            name: "test-rig".to_string(),
+            max_parallel_jobs: 1,
+            job_timeout: Duration::from_secs(30),
+            stuck_threshold: Duration::from_secs(300),
+            repair_check_interval: Duration::from_secs(3600),
+            max_repair_attempts: 0,
+            use_planning: false,
+            session_idle_timeout: Duration::from_secs(3600),
+            allow_local_tools: true,
+            max_cost_per_day_cents: None,
+            max_actions_per_hour: None,
+            max_tool_iterations: 10,
+            auto_approve_tools: true,
+        }
+    }
+
     pub(crate) fn resolve(settings: &Settings) -> Result<Self, ConfigError> {
         Ok(Self {
             name: parse_optional_env("AGENT_NAME", settings.agent.name.clone())?,

src/config/llm.rs

Lines changed: 34 additions & 0 deletions
@@ -195,6 +195,40 @@ pub struct NearAiConfig {
 }

 impl LlmConfig {
+    /// Create a test-friendly config without reading env vars.
+    ///
+    /// Uses NearAi backend with dummy values. The LLM provider is replaced
+    /// by `TraceLlm` via `AppBuilder::with_llm()`, so these values are unused.
+    #[cfg(feature = "libsql")]
+    pub fn for_testing() -> Self {
+        Self {
+            backend: LlmBackend::NearAi,
+            nearai: NearAiConfig {
+                model: "test-model".to_string(),
+                cheap_model: None,
+                base_url: "http://localhost:0".to_string(),
+                auth_base_url: "http://localhost:0".to_string(),
+                session_path: PathBuf::from("/tmp/ironclaw-test-session.json"),
+                api_key: None,
+                fallback_model: None,
+                max_retries: 0,
+                circuit_breaker_threshold: None,
+                circuit_breaker_recovery_secs: 30,
+                response_cache_enabled: false,
+                response_cache_ttl_secs: 3600,
+                response_cache_max_entries: 100,
+                failover_cooldown_secs: 300,
+                failover_cooldown_threshold: 3,
+                smart_routing_cascade: false,
+            },
+            openai: None,
+            anthropic: None,
+            ollama: None,
+            openai_compatible: None,
+            tinfoil: None,
+        }
+    }
+
     /// Resolve a model name from env var → settings.selected_model → hardcoded default.
     fn resolve_model(
         env_var: &str,

0 commit comments