feat(hooks): production gate-ref factory (successor to #3573)#3633
feat(hooks): production gate-ref factory (successor to #3573)#3633zmanian wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a design document for the RouterBackedHookGateRefFactory, which aims to integrate hook approval and authentication with the host's existing gateways. The feedback highlights a logical contradiction regarding identifier synthesis in the threat model, suggests passing the LoopRunContext as a method argument to allow for a long-lived factory instance, and recommends adding an optional TTL override to the minting methods for greater flexibility.
| `LoopGateRef` is the gateway's chosen identifier — not synthesized | ||
| in the factory. |
There was a problem hiding this comment.
There is a logical contradiction between the requirement that the identifier is not synthesized in the factory and the threat model consideration on line 61, which suggests the factory might wrap a v4 UUID inside if the gateway's format is guessable. If the factory adds entropy to the ID, it is participating in the synthesis. The design should clarify whether the gateway is solely responsible for generating secure identifiers or if the factory provides a client-side token to be incorporated by the gateway.
References
- Documentation for complex logic, such as security policies, must precisely match the code implementation.
| pub struct RouterBackedHookGateRefFactory { | ||
| run_context: LoopRunContext, | ||
| approval_gateway: Arc<dyn ApprovalGateway>, | ||
| auth_gateway: Arc<dyn AuthGateway>, | ||
| reservation_ttl: Duration, | ||
| } |
There was a problem hiding this comment.
Binding LoopRunContext directly to the factory struct makes the factory instance single-use and requires recreation for every loop run. A more flexible and efficient approach, consistent with other host services like LoopModelGateway, is to pass the LoopRunContext as an argument to the mint methods. This allows the factory to be a long-lived, shared component.
| pub struct RouterBackedHookGateRefFactory { | |
| run_context: LoopRunContext, | |
| approval_gateway: Arc<dyn ApprovalGateway>, | |
| auth_gateway: Arc<dyn AuthGateway>, | |
| reservation_ttl: Duration, | |
| } | |
| pub struct RouterBackedHookGateRefFactory { | |
| approval_gateway: Arc<dyn ApprovalGateway>, | |
| auth_gateway: Arc<dyn AuthGateway>, | |
| default_reservation_ttl: Duration, | |
| } |
|
|
||
| #[async_trait] | ||
| impl HookGateRefFactory for RouterBackedHookGateRefFactory { | ||
| async fn mint_approval_ref(&self, reason: &str) -> Result<LoopGateRef, AgentLoopHostError> { |
There was a problem hiding this comment.
The mint_approval_ref signature currently only accepts a reason. Since different hooks may require different expiration policies (e.g., a quick confirmation vs. a long-running manual review), the design would be improved by allowing an optional TTL override or a policy hint in the method signature.
| async fn mint_approval_ref(&self, reason: &str) -> Result<LoopGateRef, AgentLoopHostError> { | |
| async fn mint_approval_ref(&self, run_context: &LoopRunContext, reason: &str, ttl: Option<Duration>) -> Result<LoopGateRef, AgentLoopHostError> { |
Successor PR scope doc. Until this lands, hook PauseApproval/PauseAuth decisions surface as Denied in production because the middleware default is FailClosedHookGateRefFactory (PR #3573 / henrypark133 Critical #3). This PR carries the scope doc only; implementation follows after review of the design (cross-crate seam to approval gateway is the load-bearing decision).
6c04f1c to
f75f779
Compare
Adds two addenda from codex's design-review pass: - Critical: actor/session binding requirement (prevents same-tenant wrong-user approval-bypass). Gateway reservation must carry the actor/session id and reject cross-actor consumption. - Recommendation: include capability id + arguments digest in the reservation, not just free-form reason. Lets the approval UI show the exact gated call AND defeats a future-call digest-mismatch replay vector.
Codex review addressed (commit fd50759)Verdict was REQUEST CHANGES; scope doc updated with two addenda:
|
Cites Codex review addenda: bind hook gate reservations to actor/session identity and carry capability plus arguments digest before handing refs to the approval/auth router.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3c68cd76e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| run_context: self.run_context.clone(), | ||
| actor: self.actor.clone(), | ||
| capability_id: metadata.capability_id, |
There was a problem hiding this comment.
Mint reservations from the active run/actor context
RouterBackedHookGateRefFactory always copies self.run_context and self.actor into each reservation, so a factory instance is permanently pinned to whichever run/user was passed to try_new. Because host factories commonly reuse a single HookGateRefFactory across multiple builds, later turns can mint gate refs bound to stale context, which then fail resolution (RunMismatch/ActorMismatch) or are attributed to the wrong principal. The reservation context should be sourced per host build/invocation rather than stored once on the factory object.
Useful? React with 👍 / 👎.
| consumed_at, | ||
| }); | ||
| } | ||
| if request.resolved_at >= state.reservation.expires_at { |
There was a problem hiding this comment.
Validate gate expiry against router time, not caller time
Expiry enforcement trusts request.resolved_at, which is caller-provided and public on HookGateResolutionRequest. A client can backdate that timestamp and successfully consume a gate that has already expired in real time, bypassing TTL semantics. The router should compare expires_at to its own clock (Utc::now()) and treat resolved_at as audit metadata only.
Useful? React with 👍 / 👎.
SummaryReviewed PR #3633 only. Base PR adds production gate-ref factory for hooks and improves run/thread/actor binding. Merge stance: blocking authority bug remains in TTL/consume timestamp handling. Findings
Security/data-flow notes
Correctness/invariant notes
Missing tests
|
#3633) serrrfirat HIGH on PR #3633: `HookGateResolutionRequest.resolved_at` was a caller-controllable timestamp. Any adapter wiring the request from external input — or a buggy router that trusted it — could backdate it and consume an expired approval/auth gate ref. The same caller-supplied timestamp was persisted as the reservation's `consumed_at`, so forged values also corrupted the one-shot audit trail. The `InMemoryHookGateRouter` reference implementation already reads its own wall clock (`Utc::now()`) inside `resolve_gate` for both expiry checks and `consumed_at` — but the public request struct still exposed `resolved_at` as a `pub` field, encouraging future router impls to trust it and leaving the field as ambient trust- boundary surface. Fix: remove `resolved_at` from `HookGateResolutionRequest` entirely. Time authority for resolution and consumption lives exclusively on the router's wall clock; the only place a resolution timestamp surfaces is `HookGateResolution::resolved_at` (the *result*), which is router-supplied. The `for_kind` / `for_invocation` constructors no longer take or set a timestamp. Tests: - `router_backed_pause_approval_gate_ref_rejects_backdated_resolution_after_ttl` reframed: it no longer mutates `request.resolved_at` (the field is gone). Instead it relies on the router's own clock — TTL = 1ms, sleep 5ms, resolve must surface `Expired`. The property is now statically enforced by the absence of the field rather than dynamically asserted, but the regression test still exercises the router-owned-time code path.
Code Review — PR #3633 (production gate-ref factory)Verdict: REQUEST CHANGES — structurally sound with good test coverage on the critical paths, but several medium/high issues need addressing before promotion. Note: the PR is filed as "Draft — scope doc only" but the diff ships the complete implementation (+1217 lines). Flagging in case the draft status is intentional. OverviewRouter-backed IssuesMust fix before merge:
Non-blocking:
Missing test coverage
Security notes
|
Four items from the 5-15 review: **#1 (must-fix) MAX_RESERVATION_TTL cap** `RouterBackedHookGateRefFactory::try_new` now caps `reservation_ttl` at 24h. Without the cap, an operator misconfiguring a year-long TTL accumulates unresolved reservations in `InMemoryHookGateRouter` state for the full window — a long-tail memory leak. 24h is plenty for human-in-the-loop approval flows. **#2 (must-fix) MAX_REASON_BYTES cap** `mint(...)` rejects `reason` strings longer than 4 KiB. Without the cap, a buggy or malicious caller could push arbitrarily large strings through the approval store; the reason is operator-facing and may be persisted. **#3 (must-fix) Split `InvalidDigest` from `InvalidToken`** `validate_token` previously returned `HookGateError::InvalidDigest` for failures on actor / session ids — confusing because those values aren't digests. Add a new `InvalidToken { field, reason }` variant and route `validate_token` to it. `InvalidDigest` stays for actual sha256-digest shape failures. **#5 (must-fix) Collapse consumption-failure oracle** `From<HookGateError> for AgentLoopHostError` previously preserved Display text for every variant, so a probing caller could distinguish "this gate ref doesn't exist" from "this gate ref belongs to another run/actor/capability" — an oracle for liveness detection on foreign gate refs. Now collapses the entire consumption-failure family (`UnknownGate` / `AlreadyConsumed` / `Expired` / `KindMismatch` / `RunMismatch` / `ActorMismatch` / `CapabilityMismatch` / `ArgumentsDigestMismatch`) to a single opaque "hook gate consumption denied" surface. Misuse / availability variants still surface details — they signal config bugs and need operator visibility. Internal variants stay distinct for test assertions and operator-visible tracing. **Bonus** (henrypark133 non-blocking #9): Add a `tracing::warn!` at the conversion site so operators can still distinguish security rejections from availability failures in logs even though the public `AgentLoopHostError` no longer carries that information. All 23 hooks_integration tests + 43 reborn lib tests still pass.
serrrfirat
left a comment
There was a problem hiding this comment.
Findings
Medium: RouterBackedHookGateRefFactory still depends on an external Fn() -> HookGateReservationContext for active run/actor binding, while RebornLoopDriverHostFactory::with_hook_gate_ref_factory(...) stores and reuses one shared Arc<dyn HookGateRefFactory> across host builds. That means production call sites can still accidentally capture stale LoopRunContext / actor state and mint later gate refs for the wrong run or principal. The new reuse test proves the router can reject stale context, but it does not remove the integration footgun in host composition because the host itself never supplies the current build context to the factory.
I would make this host-owned instead of caller-owned: add a per-build hook-gate factory callback that receives the current LoopRunContext plus actor/session, or otherwise change the minting path so a gate ref cannot be minted without active request context.
I did not find other blocking issues in the current head. The prior TTL backdating bug appears fixed: HookGateResolutionRequest no longer carries caller-controlled resolution time, and resolve_gate now uses router time.
Verification run against current head 7d4469ac from a detached worktree:
cargo test -p ironclaw_reborn --test hooks_integration router_backed -- --nocapturecargo clippy -p ironclaw_reborn --test hooks_integration -- -D warnings
…t MEDIUM on PR #3633) `RouterBackedHookGateRefFactory::try_new` takes a caller-supplied `Fn() -> HookGateReservationContext` closure, and the host factory's `with_hook_gate_ref_factory(Arc<dyn ...>)` stored ONE factory instance that was reused for every host build. That instance carried whatever run/actor context its closure captured at construction time — so a second host build could mint a gate ref against the FIRST build's `LoopRunContext`. The verification test that backdated `resolved_at` proved the router rejects stale timestamps, but didn't address the host-side wiring footgun. Added per-build callback path: - New `HookGateRefFactoryBuilder` type alias for `Arc<dyn Fn(&LoopRunContext) -> Arc<dyn HookGateRefFactory>>`. - `with_hook_gate_ref_factory_builder(F)` on `RebornLoopDriverHostFactory` installs the callback. It runs once per `build_text_only_host*` call with the active `LoopRunContext`, so production callers wire `move |run_ctx| Arc::new(RouterBackedHookGateRefFactory::try_new(..., ttl, || HookGateReservationContext::new(run_ctx.clone(), actor.clone()))?)` and the factory is constructed fresh per host with no stale capture. - Build path consults the builder first, falls back to the shared `hook_gate_ref_factory` if only the older API is wired. - `with_hook_gate_ref_factory(Arc<dyn ...>)` is marked `#[deprecated]` pointing to the builder. The method body is unchanged for back-compat. Tests: - Existing integration tests migrated to the builder API (`with_hook_gate_ref_factory_builder({ let f = Arc::new(factory); move |_| Arc::clone(&f) })`) so they exercise the same logical wiring against the new seam. All 23 pass; clippy clean with `-D warnings`. The trait/router types and `validate_token` route are unchanged from the previous fix in this PR.
|
Addressed in New .with_hook_gate_ref_factory_builder(move |run_ctx| {
Arc::new(RouterBackedHookGateRefFactory::try_new(
Arc::clone(&router),
ttl,
{
let run_ctx = run_ctx.clone();
let actor = actor.clone();
move || HookGateReservationContext::new(run_ctx.clone(), actor.clone())
},
).expect("ttl ok"))
})The old shared Build path consults the builder first, falls back to the shared factory if only the older API is wired. All 23 hooks integration tests still pass; clippy clean with |
Union resolution: keep both hook_gate_refs additions (HEAD) and driver_registry imports plus loop_support re-exports (base). loop_driver_host kept public to preserve HEAD's gate-ref factory API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Successor PR from #3573. Draft — scope doc only, no implementation yet.
Problem
Hook `PauseApproval` / `PauseAuth` decisions currently fail-closed in production because the middleware default is `FailClosedHookGateRefFactory`. The dev-only `UuidHookGateRefFactory` mints syntactically-valid but router-unregistered refs — the loop would park on a ref the approval gateway has never heard of.
Scope
Implement a `HookGateRefFactory` that mints refs the host's existing approval/auth router can resolve, with TTL + one-shot consumption + cross-run isolation.
Design doc
`crates/ironclaw_hooks/docs/successors/01-prod-gate-ref-factory.md`
Status
Draft for design review. Promote to ready when the cross-crate seam to the approval gateway is agreed.