Skip to content

Commit 068c10e

Browse files
AlexMikhalevAlexforge-admin
authored
fix(rlm): harden executor surface, agent search envelope, and OpenCode hook (#870)
* chore(workspace): unblock pre-commit hook for hardening branch Three pre-existing failures on main blocking the pre-commit hook; this branch needs them resolved before adding RLM hardening commits. 1. Five orchestrator integration tests construct OrchestratorConfig literally and were not updated when GiteaSkillRepoConfig was added in 2d41798. Add gitea_skill_repo: None to each fixture. 2. main.rs:4240 was added in 2d41798 (#1486) calling api.get_thesaurus(&role_name) where role_name is &RoleName but the method expects &str. Also moved role_name twice into SearchQuery. Smallest fix: clone() at the SearchQuery moves and pass role_name.as_str() to get_thesaurus. The whole block is replaced wholesale in the upcoming concepts_matched helper extraction, so this is a temporary unblocker. 3. terraphim_service llm::router_config tests share process env: test_env_overrides sets LLM_PROXY_URL without serialising against test_merged_config_from_role, so parallel execution caused the latter to read the override and fail. Mark both #[serial] and add a defensive remove_var at the start of test_merged_config_from_role. cargo check --workspace --all-targets and cargo test -p terraphim_service --lib llm::router_config are clean after these changes. * fix(rlm-local): honour ctx.timeout_ms, kill_on_drop, NotSupported snapshots LocalExecutor now respects the caller's ExecutionContext.timeout_ms instead of a hardcoded 30s, and spawns children with kill_on_drop(true) so timed-out processes are reaped instead of leaking. Snapshot operations now return RlmError::NotSupported instead of fabricating a SnapshotId for a no-op, giving callers an honest signal that VM-style state versioning is unavailable on this backend. Adds ExecutionEnvironment::end_session(&SessionId) trait method with a default Ok(()) impl, so backends with per-session resources (Docker) can override without disturbing existing impls. Removes the unused output_dir field and the unnecessary SessionId clone. Tests: 7 unit tests in local.rs (timeout-honoured, child-killed, snapshots-not-supported, end-session-default), 1 in error.rs (NotSupported error code, display, retryability). Refs review of e4d896d..4442671 * fix(rlm-docker): per-session lock, lifecycle hook, resource limits Closes the TOCTOU race in ensure_container by replacing the parking_lot::RwLock<HashMap<...>> with a DashMap<SessionId, Arc<tokio::sync::Mutex<Option<String>>>>. Concurrent execute_* calls for the same session_id now serialise on the per-session Mutex; only one container is created. Adds DockerExecutor::release_session_container(&SessionId) inherent method, mirroring FirecrackerExecutor::release_session_vm. Also wires ExecutionEnvironment::end_session to call it, so the supervisor can release containers via the trait when needed. Adds default_host_config() applying the permissive profile per 2026-05-15 design decision: 512 MiB memory cap, 256 PIDs cap, all caps dropped, network=bridge (LLM bridge / pip use), readonly_rootfs=false (Python /tmp writes). Replaces sleep 3600 keepalive with sleep infinity so the container outlives any reasonable session and is torn down only by release_session_container, cleanup, or Drop. Replaces snapshot method bodies with RlmError::NotSupported. Replaces `as i32` cast on bollard's i64 exit code with i32::try_from. Removes the #[allow(dead_code)] attribute on the struct (project rule: no dead code) and the now-redundant config field. Tests: 8 in docker.rs (6 daemon-free, 2 gated on --ignored): host config profile, snapshots-not-supported, release-unknown-returns-none, release-removes-and-recreates, concurrent-ensure-no-leak. Refs review of e4d896d..4442671 * fix(rlm-selector): degrade gracefully, drop E2B fallthrough lie Three fixes to select_executor: 1. The E2B arm previously logged "Selected E2B backend" then fell through without continuing, misleading operators reading logs. Now logs at debug! and explicitly continues to the next backend. 2. The Docker arm previously short-circuited the entire fallback chain if DockerExecutor::new returned Err (e.g. CLI present, daemon unreachable). Now logs at warn!, pushes to `tried`, and continues to the next backend so Local remains selectable. 3. The Local arm previously logged "Selected Local backend (no isolation)" at info! - falling back to no-isolation is a security-posture downgrade and should be visible in production logs. Now logs at warn! including the `tried` context so operators can see why the secure backends were unavailable. Also caches `is_docker_available()` once per call (avoiding repeated ~50-100ms shell-outs to `docker --version`) and tightens the helper log to "CLI not present" since it does not actually probe the daemon. Tests: 2 new unit tests in mod.rs (Local-preference returns Local, E2B-unimplemented falls through to Local). Refs review of e4d896d..4442671 * fix(agent): replace duplicated concepts_matched logic with helper Adds two public helpers to terraphim_automata::matcher: - compute_concepts_matched(query, &Thesaurus) -> Vec<String>: the small ergonomic wrapper around find_matches that the agent's robot-search envelope needs in two places. - thesaurus_from_terms(&RoleName, impl IntoIterator<Item=&str>) -> Thesaurus: builds a minimal Thesaurus from a slice of plain term strings, assigning each a unique id via NormalizedTerm::with_auto_id. Replaces inline blocks at main.rs:2177-2182 (offline) and 4247-4268 (server) with the new helpers. The server-mode call site previously assigned 1u64 to every reconstructed NormalizedTerm; the new with_auto_id reconstruction is semantically honest. Both sites log thesaurus failures at debug! instead of silently swallowing them. The /thesaurus/{role} HTTP API stays untouched. Tests: 5 in automata::compute_concepts_matched_tests including a parity test guarding offline/server thesaurus shapes producing the same concepts_matched output. Refs review of e4d896d..4442671 * fix(rlm-hook): portable timeout, safe JSON, smoke tests terraphim-rlm-hook.sh: - Replace string-interpolated JSON with jq -n --arg/--argjson so prompts containing double quotes, backslashes, or other shell-meta chars no longer corrupt the JSON-RPC request body. - Replace `timeout 30` (GNU coreutils, absent on macOS by default) with a portable POSIX subshell timeout wrapper. The hook now works on macOS without `brew install coreutils`. - Drop the `2>/dev/null` redirect on the MCP invocation so operators retain stderr for debugging failed calls. install.sh now warns (does not block) if `jq` is missing. README.md documents the `jq` dependency, macOS portability, the known-limitation of the JS plugin's per-call MCP-server spawn, and the new tests/test_hook.sh smoke-test entry point. tests/test_hook.sh exercises the hook with a quoted prompt (asserts the captured payload is valid JSON), a passthrough non-RLM command, and the portable timeout wrapper on a stripped PATH that excludes GNU timeout / gtimeout. Tests stub $TERRAPHIM_MCP so no live MCP server is needed. Refs review of e4d896d..4442671 * fix(orchestrator-config): redact GiteaSkillRepoConfig.token, set cache_dir default Implements Debug manually so the optional `token` field appears as "***REDACTED***" instead of leaking via panic backtraces or trace dumps. The struct no longer derives Debug. Provides a sensible default for `cache_dir` instead of an empty PathBuf: prefer $XDG_CACHE_HOME, then $HOME/.cache, then $TMPDIR, always under terraphim/skills/. Tests: 2 unit tests in terraphim_orchestrator::config::tests covering the redacted Debug output and the cache_dir default. Refs review of e4d896d..4442671 * fix(rlm-hardening): self-review follow-ups (P1+P2) Addresses every finding from the structural self-review of the hardening branch. P1 fixes: 1. Wire end_session trait method through TerraphimRlm::destroy_session (rlm.rs:194). FirecrackerExecutor overrides end_session to call the inherent release_session_vm, so destroying a session now releases per-backend resources via a single trait call. Adds a regression test (test_destroy_session_calls_executor_end_session) that uses an instrumented MockExecutor with an AtomicUsize counter. 2. Fix run_with_timeout PID-reuse race in terraphim-rlm-hook.sh by running the child in its own process group (setsid where available) and targeting kill at the negative pgid instead of a recyclable pid. Falls back to plain pid kill on systems without setsid. P2 fixes: 3. Drop #[ignore] from the two real-Docker integration tests (test_docker_concurrent_ensure_no_leak and test_docker_release_session_container_removes). Add a skip_unless_image_ready helper that gates on both daemon presence AND the default image being cached locally; tests skip cleanly with a hint to `docker pull` if the image is missing. 4. Hook smoke test now also asserts prompt content fidelity (prompt_content_roundtrip) by extracting params.arguments.prompt from the captured payload and comparing it byte-for-byte to the original input. Replaces the awk-based RLM args extractor with bash parameter expansion that preserves whitespace and literal quotes. 5. Force run_with_timeout test 3 to actually exercise the portable wrapper on Linux by staging a private $PATH (only sleep, kill, bash, sh, dash, setsid, grep, cut, head; no timeout / gtimeout) and sourcing the function definitions directly from the hook script. Test no longer self-skips on systems shipping GNU coreutils. 5a. Add DockerExecutor::with_host_config(HostConfig) builder so callers can override the permissive default profile (network=bridge, readonly_rootfs=false). Test confirms override propagates. 5c. Replace the 200ms fixed sleep in test_local_kills_on_timeout with a bounded pgrep poll (50ms backoff up to 2s). Marker now includes a Ulid so concurrent test runs don't collide. 5d clippy: convert five `let mut x = T::default(); x.field = ...` patterns to struct-update syntax, silencing the field_reassign_with_default warnings. Test count: 128 in terraphim_rlm lib (was 126), 4 in hook smoke tests (was 3). Zero clippy warnings on touched crates. cargo fmt clean. * docs(rlm-hardening): add Phase 1 research and Phase 2 design Audit trail for the disciplined-development workflow that produced this branch. Captures: - Vital-few constraints (no new deps, no trait churn, no API change). - Eliminated alternatives (capability-matcher rewrite, /thesaurus/ API enrichment, container pooling) with rationale per item. - Five risks with assigned mitigations and three open questions resolved during implementation. - File-level change list, function signatures, test strategy, and step sequencing matching the seven commits on this branch. * fix(orchestrator): prevent cron re-triggering within same schedule window Add last_cron_fire HashMap to track per-agent last fire time, preventing agents from firing multiple times within the same cron schedule window. Fixes: agents firing 130+ times instead of ~11 times per schedule cycle Test: 11 synthetic time tests covering AC1-AC5, compound schedules, and boundary conditions Refs #IDX * docs(rlm-hardening): add session handover and quality eval docs Refs #1488 * docs(handover): Echo verification session for #1488 Refs #1488 * docs(handover): Echo re-verification session for #1488 Refs #1488 --------- Co-authored-by: Alex <alex@example.com> Co-authored-by: forge-admin <admin@terraphim-forge.local>
1 parent 9832fc9 commit 068c10e

32 files changed

Lines changed: 3098 additions & 141 deletions
Lines changed: 211 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,211 @@
1+
# Design & Implementation Plan: Synthetic Time Testing for ADF Cron Scheduler
2+
3+
## 1. Summary of Target Behavior
4+
5+
Enable synthetic time testing for the ADF orchestrator's cron scheduling logic to verify that agents fire exactly once per schedule occurrence and do not re-trigger rapidly within the same schedule window.
6+
7+
After implementation:
8+
- Tests can manipulate `last_tick_time` directly to simulate tick boundaries
9+
- Tests can verify `last_cron_fire` correctly prevents re-triggering
10+
- The cron scheduling fix can be validated without waiting hours for real-time verification
11+
12+
## 2. Key Invariants and Acceptance Criteria
13+
14+
### Invariants
15+
1. `last_cron_fire` is set AFTER agent spawn, never before
16+
2. `last_tick_time` represents the tick boundary used for cron comparison
17+
3. `schedule.after(&last_tick_time).next()` returns the next fire time at or after last_tick
18+
4. An agent is skipped if `next_fire <= last_fire_time` for that agent
19+
20+
### Acceptance Criteria
21+
22+
| ID | Criterion | Test Type | Test Location |
23+
|----|-----------|-----------|---------------|
24+
| AC1 | Agent fires when `last_tick_time` is just before schedule fire time and agent is inactive | Unit | scheduler_tests.rs |
25+
| AC2 | Agent does NOT fire again within same schedule occurrence (re-trigger prevention) | Unit | scheduler_tests.rs |
26+
| AC3 | Agent fires again at next schedule occurrence after `last_tick_time` advances past previous fire | Unit | scheduler_tests.rs |
27+
| AC4 | Agent is skipped if already in `active_agents` | Unit | scheduler_tests.rs |
28+
| AC5 | `set_last_tick_time` helper correctly updates internal state | Unit | scheduler_tests.rs |
29+
30+
## 3. High-Level Design and Boundaries
31+
32+
### Approach: Direct Field Manipulation (Test-Only Helpers)
33+
34+
```
35+
┌─────────────────────────────────────────────────────────────┐
36+
│ Test Context Only │
37+
│ ┌─────────────────────────────────────────────────────┐ │
38+
│ │ TestOrchestrator (wraps AgentOrchestrator) │ │
39+
│ │ - set_last_tick_time(time) │ │
40+
│ │ - set_last_cron_fire(agent, time) │ │
41+
│ │ - check_cron_schedules() -> Vec<AgentDefinition> │ │
42+
│ └─────────────────────────────────────────────────────┘ │
43+
└─────────────────────────────────────────────────────────────┘
44+
45+
│ Uses existing
46+
47+
┌─────────────────────────────────────────────────────────────┐
48+
│ Production: AgentOrchestrator │
49+
│ - last_tick_time: DateTime<Utc> │
50+
│ - last_cron_fire: HashMap<String, DateTime<Utc>> │
51+
│ - check_cron_schedules() │
52+
└─────────────────────────────────────────────────────────────┘
53+
```
54+
55+
### Boundaries
56+
- **Changes inside existing components**: Add `#[cfg(test)]` helper method only
57+
- **No new components**: Use existing test infrastructure
58+
- **No production code changes**: All synthetic time handling is test-only
59+
60+
## 4. File/Module-Level Change Plan
61+
62+
| File/Module | Action | Before | After | Dependencies |
63+
|-------------|--------|--------|-------|--------------|
64+
| `crates/terraphim_orchestrator/src/lib.rs` | Modify | No `set_last_tick_time` helper | Add `#[cfg(test)] set_last_tick_time(&mut self, time: DateTime<Utc>)` | chrono |
65+
| `crates/terraphim_orchestrator/tests/scheduler_tests.rs` | Modify | 3 scheduler tests | Add 4-5 cron scheduling tests using synthetic time | TestModTimeScheduler |
66+
67+
### State Changes
68+
69+
**`src/lib.rs` - AgentOrchestrator:**
70+
- Add line ~7625: `#[cfg(test)] pub fn set_last_tick_time(&mut self, time: chrono::DateTime<chrono::Utc>)`
71+
- Sets `self.last_tick_time` for test control
72+
73+
**`tests/scheduler_tests.rs`:**
74+
- Create `TestModTimeScheduler` struct wrapping `TimeScheduler`
75+
- Add helper methods for time manipulation
76+
- Add test cases for AC1-AC5
77+
78+
## 5. Step-by-Step Implementation Sequence
79+
80+
### Step 1: Add test helper `set_last_tick_time`
81+
**Purpose**: Allow tests to control `last_tick_time` field
82+
**Deployable State**: Yes - purely additive, #[cfg(test)]
83+
84+
```rust
85+
#[cfg(test)]
86+
/// Test helper: set last_tick_time for synthetic time testing.
87+
pub fn set_last_tick_time(&mut self, time: chrono::DateTime<chrono::Utc>) {
88+
self.last_tick_time = time;
89+
}
90+
```
91+
92+
### Step 2: Create TestModTimeScheduler wrapper
93+
**Purpose**: Provide controlled time environment for scheduler tests
94+
**Deployable State**: Yes - new test infrastructure
95+
96+
```rust
97+
struct TestModTimeScheduler {
98+
scheduler: TimeScheduler,
99+
last_tick_time: chrono::DateTime<chrono::Utc>,
100+
}
101+
102+
impl TestModTimeScheduler {
103+
fn new(agents: &[AgentDefinition], last_tick_time: chrono::DateTime<chrono::Utc>) -> Self { ... }
104+
fn set_tick_time(&mut self, time: chrono::DateTime<chrono::Utc>) { ... }
105+
fn get_scheduled_fire_time(&self, agent_name: &str) -> Option<DateTime<Utc>> { ... }
106+
}
107+
```
108+
109+
### Step 3: Add AC1 - Agent fires when time is right
110+
**Purpose**: Verify basic cron firing works
111+
**Deployable State**: Yes - new test
112+
113+
```rust
114+
#[test]
115+
fn test_cron_fires_when_time_matches_schedule() {
116+
// Set last_tick_time to T-31s where fire is at T
117+
// Call check logic
118+
// Assert agent is in to_spawn
119+
}
120+
```
121+
122+
### Step 4: Add AC2 - Re-trigger prevention
123+
**Purpose**: Verify `last_cron_fire` prevents re-trigger
124+
**Deployable State**: Yes - core fix verification
125+
126+
```rust
127+
#[test]
128+
fn test_cron_no_retrigger_same_occurrence() {
129+
// First call: agent fires, last_cron_fire is set
130+
// Second call with same time window: agent should NOT be in to_spawn
131+
}
132+
```
133+
134+
### Step 5: Add AC3 - Fires at next occurrence
135+
**Purpose**: Verify agent can fire again at next schedule
136+
**Deployable State**: Yes - regression prevention
137+
138+
```rust
139+
#[test]
140+
fn test_cron_fires_next_occurrence() {
141+
// First occurrence fires, last_cron_fire = T1
142+
// Advance last_tick_time past T1 (to T2 where next fire occurs)
143+
// Call check: agent should fire again
144+
}
145+
```
146+
147+
### Step 6: Add AC4 - Skip active agents
148+
**Purpose**: Verify existing active agent check still works
149+
**Deployable State**: Yes - existing behavior regression test
150+
151+
### Step 7: Add AC5 - Helper method validation
152+
**Purpose**: Ensure test helper correctly updates state
153+
**Deployable State**: Yes - infrastructure validation
154+
155+
## 6. Testing & Verification Strategy
156+
157+
| Acceptance Criteria | Test Type | Test Location | Test Name |
158+
|---------------------|-----------|--------------|-----------|
159+
| AC1 | Unit | scheduler_tests.rs | `test_cron_fires_when_time_matches_schedule` |
160+
| AC2 | Unit | scheduler_tests.rs | `test_cron_no_retrigger_same_occurrence` |
161+
| AC3 | Unit | scheduler_tests.rs | `test_cron_fires_next_occurrence` |
162+
| AC4 | Unit | scheduler_tests.rs | `test_cron_skips_active_agents` |
163+
| AC5 | Unit | scheduler_tests.rs | `test_set_last_tick_time_helper` |
164+
165+
### Test Execution
166+
```bash
167+
cargo test -p terraphim_orchestrator scheduler_tests
168+
```
169+
170+
## 7. Risk & Complexity Review
171+
172+
| Risk | Mitigation | Residual Risk |
173+
|------|------------|---------------|
174+
| Test doesn't reflect production | Use actual cron expressions from terraphim.toml | Low - expressions are real |
175+
| last_tick_time timing semantics unclear | Add detailed comments | Low - code review |
176+
| Async spawn complications | Test only checks to_spawn, not actual spawn | Low - isolated test |
177+
178+
### Complexity Assessment
179+
- **Cyclomatic complexity**: Low - each test is straightforward sequence
180+
- **Coupling**: Low - uses existing scheduler infrastructure
181+
- **Lines of change**: ~100-150 total (2-3 helpers + 5 tests)
182+
183+
## 8. Open Questions / Decisions for Human Review
184+
185+
1. **Test cron expressions**: Use production schedules (`30 0-10 * * *`) or simplified test schedules (`0 * * * *`)?
186+
187+
2. **Integration with existing scheduler_tests.rs**: Add to existing file or create new `cron_schedule_tests.rs`?
188+
189+
3. **Test spawn vs. to_spawn**: Should tests verify the actual `to_spawn` list (current approach) or mock full spawn?
190+
191+
4. **Compound schedule testing**: Should the compound review schedule (`0 2 * * *`) also have synthetic time tests?
192+
193+
5. **Edge case coverage**: Should we test what happens if `last_tick_time` is set to exactly the fire time?
194+
195+
## Implementation Notes
196+
197+
### Location of set_last_tick_time in lib.rs
198+
After existing test helpers (~line 7620):
199+
```rust
200+
/// Test helper: set last_tick_time for synthetic time testing.
201+
#[cfg(test)]
202+
pub fn set_last_tick_time(&mut self, time: chrono::DateTime<chrono::Utc>) {
203+
self.last_tick_time = time;
204+
}
205+
```
206+
207+
### Test Data
208+
Use actual schedules from `/opt/ai-dark-factory/conf.d/terraphim.toml`:
209+
- spec-validator: `30 0-10 * * *` (fires at XX:30)
210+
- test-guardian: `35 0-10 * * *` (fires at XX:35)
211+
- implementation-swarm: `45 0-10 * * *` (fires at XX:45)

0 commit comments

Comments
 (0)