Skip to content

feat(hooks): production gate-ref factory (successor to #3573)#3633

Open
zmanian wants to merge 8 commits into
hooks-foundation-01from
hooks-fu-prod-gate-ref-factory
Open

feat(hooks): production gate-ref factory (successor to #3573)#3633
zmanian wants to merge 8 commits into
hooks-foundation-01from
hooks-fu-prod-gate-ref-factory

Conversation

@zmanian
Copy link
Copy Markdown
Collaborator

@zmanian zmanian commented May 14, 2026

Successor PR from #3573. Draft — scope doc only, no implementation yet.

Problem

Hook `PauseApproval` / `PauseAuth` decisions currently fail-closed in production because the middleware default is `FailClosedHookGateRefFactory`. The dev-only `UuidHookGateRefFactory` mints syntactically-valid but router-unregistered refs — the loop would park on a ref the approval gateway has never heard of.

Scope

Implement a `HookGateRefFactory` that mints refs the host's existing approval/auth router can resolve, with TTL + one-shot consumption + cross-run isolation.

Design doc

`crates/ironclaw_hooks/docs/successors/01-prod-gate-ref-factory.md`

Status

Draft for design review. Promote to ready when the cross-crate seam to the approval gateway is agreed.

@github-actions github-actions Bot added size: XS < 10 changed lines (excluding docs) scope: docs Documentation risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels May 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design document for the RouterBackedHookGateRefFactory, which aims to integrate hook approval and authentication with the host's existing gateways. The feedback highlights a logical contradiction regarding identifier synthesis in the threat model, suggests passing the LoopRunContext as a method argument to allow for a long-lived factory instance, and recommends adding an optional TTL override to the minting methods for greater flexibility.

Comment on lines +19 to +20
`LoopGateRef` is the gateway's chosen identifier — not synthesized
in the factory.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

There is a logical contradiction between the requirement that the identifier is not synthesized in the factory and the threat model consideration on line 61, which suggests the factory might wrap a v4 UUID inside if the gateway's format is guessable. If the factory adds entropy to the ID, it is participating in the synthesis. The design should clarify whether the gateway is solely responsible for generating secure identifiers or if the factory provides a client-side token to be incorporated by the gateway.

References
  1. Documentation for complex logic, such as security policies, must precisely match the code implementation.

Comment on lines +35 to +40
pub struct RouterBackedHookGateRefFactory {
run_context: LoopRunContext,
approval_gateway: Arc<dyn ApprovalGateway>,
auth_gateway: Arc<dyn AuthGateway>,
reservation_ttl: Duration,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Binding LoopRunContext directly to the factory struct makes the factory instance single-use and requires recreation for every loop run. A more flexible and efficient approach, consistent with other host services like LoopModelGateway, is to pass the LoopRunContext as an argument to the mint methods. This allows the factory to be a long-lived, shared component.

Suggested change
pub struct RouterBackedHookGateRefFactory {
run_context: LoopRunContext,
approval_gateway: Arc<dyn ApprovalGateway>,
auth_gateway: Arc<dyn AuthGateway>,
reservation_ttl: Duration,
}
pub struct RouterBackedHookGateRefFactory {
approval_gateway: Arc<dyn ApprovalGateway>,
auth_gateway: Arc<dyn AuthGateway>,
default_reservation_ttl: Duration,
}


#[async_trait]
impl HookGateRefFactory for RouterBackedHookGateRefFactory {
async fn mint_approval_ref(&self, reason: &str) -> Result<LoopGateRef, AgentLoopHostError> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The mint_approval_ref signature currently only accepts a reason. Since different hooks may require different expiration policies (e.g., a quick confirmation vs. a long-running manual review), the design would be improved by allowing an optional TTL override or a policy hint in the method signature.

Suggested change
async fn mint_approval_ref(&self, reason: &str) -> Result<LoopGateRef, AgentLoopHostError> {
async fn mint_approval_ref(&self, run_context: &LoopRunContext, reason: &str, ttl: Option<Duration>) -> Result<LoopGateRef, AgentLoopHostError> {

Successor PR scope doc. Until this lands, hook PauseApproval/PauseAuth
decisions surface as Denied in production because the middleware
default is FailClosedHookGateRefFactory (PR #3573 / henrypark133
Critical #3).

This PR carries the scope doc only; implementation follows after
review of the design (cross-crate seam to approval gateway is the
load-bearing decision).
@zmanian zmanian force-pushed the hooks-fu-prod-gate-ref-factory branch from 6c04f1c to f75f779 Compare May 14, 2026 13:18
@zmanian zmanian changed the base branch from reborn-integration to hooks-foundation-01 May 14, 2026 13:18
Adds two addenda from codex's design-review pass:

- Critical: actor/session binding requirement (prevents same-tenant
  wrong-user approval-bypass). Gateway reservation must carry the
  actor/session id and reject cross-actor consumption.
- Recommendation: include capability id + arguments digest in the
  reservation, not just free-form reason. Lets the approval UI show
  the exact gated call AND defeats a future-call digest-mismatch
  replay vector.
@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented May 14, 2026

Codex review addressed (commit fd50759)

Verdict was REQUEST CHANGES; scope doc updated with two addenda:

  • Critical: actor/session binding — reservations must include the actor/session id and the gateway must reject cross-actor consumption. Added explicit test requirement: ext-A's gate-ref issued in user-A's session is rejected when user-B resolves it.
  • Recommendation: capability + arguments digest in reservation — lets the approval UI show the exact gated call AND defeats a future-call digest-mismatch replay vector. The reservation payload now includes both fields alongside reason.

Cites Codex review addenda: bind hook gate reservations to actor/session identity and carry capability plus arguments digest before handing refs to the approval/auth router.
@zmanian zmanian marked this pull request as ready for review May 14, 2026 17:06
@github-actions github-actions Bot added size: XL 500+ changed lines and removed size: XS < 10 changed lines (excluding docs) labels May 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3c68cd76e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +270 to +272
run_context: self.run_context.clone(),
actor: self.actor.clone(),
capability_id: metadata.capability_id,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Mint reservations from the active run/actor context

RouterBackedHookGateRefFactory always copies self.run_context and self.actor into each reservation, so a factory instance is permanently pinned to whichever run/user was passed to try_new. Because host factories commonly reuse a single HookGateRefFactory across multiple builds, later turns can mint gate refs bound to stale context, which then fail resolution (RunMismatch/ActorMismatch) or are attributed to the wrong principal. The reservation context should be sourced per host build/invocation rather than stored once on the factory object.

Useful? React with 👍 / 👎.

consumed_at,
});
}
if request.resolved_at >= state.reservation.expires_at {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate gate expiry against router time, not caller time

Expiry enforcement trusts request.resolved_at, which is caller-provided and public on HookGateResolutionRequest. A client can backdate that timestamp and successfully consume a gate that has already expired in real time, bypassing TTL semantics. The router should compare expires_at to its own clock (Utc::now()) and treat resolved_at as audit metadata only.

Useful? React with 👍 / 👎.

@serrrfirat
Copy link
Copy Markdown
Collaborator

Summary

Reviewed PR #3633 only. Base 5793e4d90e1316adb93ec9c7edf6511d85f8873e → head c3c68cd76e3bad66a580fbd1bce8b86a8401500e.

PR adds production gate-ref factory for hooks and improves run/thread/actor binding. Merge stance: blocking authority bug remains in TTL/consume timestamp handling.

Findings

# Sev Category File:Line Issue Fix suggestion
1 High Security / Correctness crates/ironclaw_reborn/src/hook_gate_refs.rs:140, crates/ironclaw_reborn/src/hook_gate_refs.rs:147, crates/ironclaw_reborn/src/hook_gate_refs.rs:447, crates/ironclaw_reborn/src/hook_gate_refs.rs:484 TTL enforcement trusts HookGateResolutionRequest.resolved_at, and that request type is public/caller-supplied. Any resolver path that builds this request from external input or a buggy adapter can backdate resolved_at and consume an expired approval/auth gate ref. The same caller-supplied timestamp is persisted as consumed_at, so forged past/future values also corrupt one-shot audit semantics. Make resolve time router-owned. Remove resolved_at from public request or ignore it in resolve_gate(); compute current time inside router before expiry and consume checks. Add regression: create expired reservation, resolve with forged past timestamp, assert rejection and no consumption.

Security/data-flow notes

  • Good: gate refs bind run/thread/tenant via run context checks, actor, capability, and arguments digest.
  • Good: one-shot consumption checked before mutation.
  • Remaining trust boundary bug: time authority belongs to router/host clock, not request caller.

Correctness/invariant notes

  • hook_gate_refs.rs:166-182: helper initializes resolved_at = Utc::now(), but field remains mutable public struct data.
  • hook_gate_refs.rs:447-451: expiry check uses request timestamp.
  • hook_gate_refs.rs:484: consumed timestamp also uses request timestamp.

Missing tests

  • Backdated resolved_at cannot bypass TTL.
  • Cross-run/actor mismatch plus PauseAuth parity tests remain useful boundary coverage.

#3633)

serrrfirat HIGH on PR #3633: `HookGateResolutionRequest.resolved_at`
was a caller-controllable timestamp. Any adapter wiring the request
from external input — or a buggy router that trusted it — could
backdate it and consume an expired approval/auth gate ref. The same
caller-supplied timestamp was persisted as the reservation's
`consumed_at`, so forged values also corrupted the one-shot audit
trail.

The `InMemoryHookGateRouter` reference implementation already reads
its own wall clock (`Utc::now()`) inside `resolve_gate` for both
expiry checks and `consumed_at` — but the public request struct
still exposed `resolved_at` as a `pub` field, encouraging future
router impls to trust it and leaving the field as ambient trust-
boundary surface.

Fix: remove `resolved_at` from `HookGateResolutionRequest` entirely.
Time authority for resolution and consumption lives exclusively on
the router's wall clock; the only place a resolution timestamp
surfaces is `HookGateResolution::resolved_at` (the *result*), which
is router-supplied. The `for_kind` / `for_invocation` constructors
no longer take or set a timestamp.

Tests:
- `router_backed_pause_approval_gate_ref_rejects_backdated_resolution_after_ttl`
  reframed: it no longer mutates `request.resolved_at` (the field is
  gone). Instead it relies on the router's own clock — TTL = 1ms,
  sleep 5ms, resolve must surface `Expired`. The property is now
  statically enforced by the absence of the field rather than
  dynamically asserted, but the regression test still exercises the
  router-owned-time code path.
@henrypark133
Copy link
Copy Markdown
Collaborator

Code Review — PR #3633 (production gate-ref factory)

Verdict: REQUEST CHANGES — structurally sound with good test coverage on the critical paths, but several medium/high issues need addressing before promotion.

Note: the PR is filed as "Draft — scope doc only" but the diff ships the complete implementation (+1217 lines). Flagging in case the draft status is intentional.

Overview

Router-backed HookGateRefFactory with TTL, one-shot consumption, actor binding, and capability+args digest. tokio::task_local! for per-invocation context, HookGateError variants for targeted assertions, fail-closed mapping to AgentLoopHostError::Unavailable.

Issues

Must fix before merge:

  1. No MAX_RESERVATION_TTL cap — an operator setting Duration::days(365) creates reservations that persist in InMemoryHookGateRouter forever (memory leak). Add a cap (e.g. 24h) enforced in try_new.

  2. reason field has no length capmint_approval_ref/mint_auth_ref accept an unbounded reason string stored as-is. Add MAX_REASON_BYTES (e.g. 4096) and validate in mint.

  3. validate_token returns HookGateError::InvalidDigest for session-id validation failures — confusing. Add InvalidToken { field, reason } or rename the variant.

  4. RouterBackedHookGateRefFactory + HookGateInvocationScopePort pairing is an invisible integration constraint — calling mint without the scope port silently fails with MissingInvocationMetadata. Return both as a tuple from new to enforce the pairing at construction time.

  5. UnknownGate vs RunMismatch oracle — leaks whether a gate_ref exists vs. belongs to another run. Collapse into an opaque error at the public HookGateRouter trait boundary; preserve the distinction only in internal logs.

Non-blocking:

  1. InMemoryHookGateRouter never purges expired entries — add purge_expired() for tests, or document it's dev/test only.
  2. std::sync::Mutex used inside async fn — held briefly with no I/O, so not a deadlock risk, but prefer tokio::sync::Mutex or add a "held briefly, no I/O" comment.
  3. 1ms TTL / 5ms sleep in the expiry test is flaky on loaded CI — widen to 50ms / 200ms.
  4. From<HookGateError> for AgentLoopHostError loses observability of security rejections vs. availability failures — add a tracing::warn! at the conversion site.
  5. hook_gate_arguments_digest includes surface_version — document whether a surface version bump mid-run invalidates outstanding gates.

Missing test coverage

  • PauseAuth / HookGateKind::Auth path (no tests at all for auth refs)
  • MissingInvocationMetadata failure path (factory used without scope port)
  • Gateway-unavailable → Denied integration test
  • invoke_capability_batch suspension semantics
  • validate_token boundary inputs (empty, at-limit, over-limit, control chars)
  • Concurrent mint to verify task-local isolation per-task

Security notes

  • Check ordering in resolve_gate (consumed → expired → kind → run → actor → digest) is correct; cross-run attempt rejected before actor identity evaluated ✅
  • input_ref immutability assumption in hook_gate_arguments_digest should be documented — if input_ref is mutable, TOCTOU exists between approval display and resume
  • No cancel_gate / explicit cleanup on run abort — orphaned reservations accumulate for the full TTL window

Four items from the 5-15 review:

**#1 (must-fix) MAX_RESERVATION_TTL cap**
`RouterBackedHookGateRefFactory::try_new` now caps `reservation_ttl`
at 24h. Without the cap, an operator misconfiguring a year-long TTL
accumulates unresolved reservations in `InMemoryHookGateRouter` state
for the full window — a long-tail memory leak. 24h is plenty for
human-in-the-loop approval flows.

**#2 (must-fix) MAX_REASON_BYTES cap**
`mint(...)` rejects `reason` strings longer than 4 KiB. Without the
cap, a buggy or malicious caller could push arbitrarily large strings
through the approval store; the reason is operator-facing and may be
persisted.

**#3 (must-fix) Split `InvalidDigest` from `InvalidToken`**
`validate_token` previously returned `HookGateError::InvalidDigest`
for failures on actor / session ids — confusing because those values
aren't digests. Add a new `InvalidToken { field, reason }` variant
and route `validate_token` to it. `InvalidDigest` stays for actual
sha256-digest shape failures.

**#5 (must-fix) Collapse consumption-failure oracle**
`From<HookGateError> for AgentLoopHostError` previously preserved
Display text for every variant, so a probing caller could distinguish
"this gate ref doesn't exist" from "this gate ref belongs to another
run/actor/capability" — an oracle for liveness detection on foreign
gate refs. Now collapses the entire consumption-failure family
(`UnknownGate` / `AlreadyConsumed` / `Expired` / `KindMismatch` /
`RunMismatch` / `ActorMismatch` / `CapabilityMismatch` /
`ArgumentsDigestMismatch`) to a single opaque
"hook gate consumption denied" surface. Misuse / availability
variants still surface details — they signal config bugs and need
operator visibility. Internal variants stay distinct for test
assertions and operator-visible tracing.

**Bonus** (henrypark133 non-blocking #9):
Add a `tracing::warn!` at the conversion site so operators can still
distinguish security rejections from availability failures in logs
even though the public `AgentLoopHostError` no longer carries that
information.

All 23 hooks_integration tests + 43 reborn lib tests still pass.
Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

Medium: RouterBackedHookGateRefFactory still depends on an external Fn() -> HookGateReservationContext for active run/actor binding, while RebornLoopDriverHostFactory::with_hook_gate_ref_factory(...) stores and reuses one shared Arc<dyn HookGateRefFactory> across host builds. That means production call sites can still accidentally capture stale LoopRunContext / actor state and mint later gate refs for the wrong run or principal. The new reuse test proves the router can reject stale context, but it does not remove the integration footgun in host composition because the host itself never supplies the current build context to the factory.

I would make this host-owned instead of caller-owned: add a per-build hook-gate factory callback that receives the current LoopRunContext plus actor/session, or otherwise change the minting path so a gate ref cannot be minted without active request context.

I did not find other blocking issues in the current head. The prior TTL backdating bug appears fixed: HookGateResolutionRequest no longer carries caller-controlled resolution time, and resolve_gate now uses router time.

Verification run against current head 7d4469ac from a detached worktree:

  • cargo test -p ironclaw_reborn --test hooks_integration router_backed -- --nocapture
  • cargo clippy -p ironclaw_reborn --test hooks_integration -- -D warnings

…t MEDIUM on PR #3633)

`RouterBackedHookGateRefFactory::try_new` takes a caller-supplied
`Fn() -> HookGateReservationContext` closure, and the host factory's
`with_hook_gate_ref_factory(Arc<dyn ...>)` stored ONE factory instance
that was reused for every host build. That instance carried whatever
run/actor context its closure captured at construction time — so a
second host build could mint a gate ref against the FIRST build's
`LoopRunContext`. The verification test that backdated `resolved_at`
proved the router rejects stale timestamps, but didn't address the
host-side wiring footgun.

Added per-build callback path:
- New `HookGateRefFactoryBuilder` type alias for
  `Arc<dyn Fn(&LoopRunContext) -> Arc<dyn HookGateRefFactory>>`.
- `with_hook_gate_ref_factory_builder(F)` on `RebornLoopDriverHostFactory`
  installs the callback. It runs once per `build_text_only_host*` call
  with the active `LoopRunContext`, so production callers wire
  `move |run_ctx| Arc::new(RouterBackedHookGateRefFactory::try_new(...,
  ttl, || HookGateReservationContext::new(run_ctx.clone(), actor.clone()))?)`
  and the factory is constructed fresh per host with no stale capture.
- Build path consults the builder first, falls back to the shared
  `hook_gate_ref_factory` if only the older API is wired.
- `with_hook_gate_ref_factory(Arc<dyn ...>)` is marked `#[deprecated]`
  pointing to the builder. The method body is unchanged for back-compat.

Tests:
- Existing integration tests migrated to the builder API
  (`with_hook_gate_ref_factory_builder({ let f = Arc::new(factory); move |_| Arc::clone(&f) })`)
  so they exercise the same logical wiring against the new seam. All
  23 pass; clippy clean with `-D warnings`.

The trait/router types and `validate_token` route are unchanged from
the previous fix in this PR.
@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented May 15, 2026

Addressed in 0bb58eaa7.

New HookGateRefFactoryBuilder type alias + with_hook_gate_ref_factory_builder(F) on RebornLoopDriverHostFactory. The closure receives the active LoopRunContext per host build, so production callers construct the factory fresh per host:

.with_hook_gate_ref_factory_builder(move |run_ctx| {
    Arc::new(RouterBackedHookGateRefFactory::try_new(
        Arc::clone(&router),
        ttl,
        {
            let run_ctx = run_ctx.clone();
            let actor = actor.clone();
            move || HookGateReservationContext::new(run_ctx.clone(), actor.clone())
        },
    ).expect("ttl ok"))
})

The old shared with_hook_gate_ref_factory(Arc<dyn ...>) is #[deprecated] with a note pointing to the builder; body unchanged so existing callers don't break, they just see the warning.

Build path consults the builder first, falls back to the shared factory if only the older API is wired.

All 23 hooks integration tests still pass; clippy clean with -D warnings.

Union resolution: keep both hook_gate_refs additions (HEAD) and
driver_registry imports plus loop_support re-exports (base).
loop_driver_host kept public to preserve HEAD's gate-ref factory API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: docs Documentation size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants