feat(hooks): event-triggered hooks Phase 5 (successor to #3573)#3640
feat(hooks): event-triggered hooks Phase 5 (successor to #3573)#3640zmanian wants to merge 11 commits into
Conversation
Successor PR from #3573. Adds a new EventTriggered hook point that subscribes to RuntimeEvents asynchronously, outside the loop's inline tick. Observer-only by construction (no Allow/Deny/Patch); typed against a narrowed HookObservableEvent projection to keep the cross-crate boundary clean. Scope doc only; design questions about cursor/replay semantics and per-extension event-rate caps need design review before implementation.
There was a problem hiding this comment.
Code Review
This pull request introduces a design document for event-triggered hooks, which are asynchronous, observer-only hooks designed to react to runtime events without blocking the main loop. The review feedback highlights several internal inconsistencies and contradictions within the document that need clarification. Specifically, the reviewer noted discrepancies regarding the use of narrowed event projections versus the full RuntimeEvent type, the available methods on the ObserverSink trait, the policy for replaying missed events during downtime, and the current state of crate dependencies.
| ```rust | ||
| // in ironclaw_hooks::points::event_triggered | ||
| pub struct EventHookContext<'a> { | ||
| pub event: &'a RuntimeEvent, |
There was a problem hiding this comment.
The code snippet uses the full RuntimeEvent type, which contradicts the recommendation in the Cross-cutting constraints section (line 105) to pick a narrowed projection (e.g., HookObservableEvent). Using a narrowed projection is preferred to maintain clean crate boundaries. Documentation for complex logic must precisely match the intended implementation.
References
- Documentation for complex logic, such as security policies, must precisely match the code implementation.
| pub tenant_id: TenantId, | ||
| } | ||
|
|
||
| // Sink mirrors ObserverSink — `note_fact`, `emit_audit`. No `allow`, |
There was a problem hiding this comment.
The documentation mentions note_fact and emit_audit as methods on the sink, but the existing ObserverSink trait only defines a note method. Align this documentation with the actual trait definition or clearly distinguish these as future or planned states. Documentation must precisely match the code implementation and distinguish between current and future states.
| // Sink mirrors ObserverSink — `note_fact`, `emit_audit`. No `allow`, | |
| // Sink mirrors ObserverSink — note. No allow, |
References
- Documentation for complex logic, such as security policies, must precisely match the code implementation.
- In documentation, clearly distinguish between the current implementation state and future or planned states.
| (tick-driven or stream-driven, TBD). | ||
| - **Cursor / replay**: subscriptions are cursor-keyed so a restarted | ||
| host can resume from the last-seen `EventCursor`. Lost events | ||
| during downtime is acceptable for observer-only semantics; |
There was a problem hiding this comment.
There is a contradiction between the statement that Lost events during downtime is acceptable and the later requirement for Cursor resume (line 133) which specifies replaying missed events. The design should clarify which semantic is intended to ensure the documentation precisely matches the implementation logic.
References
- Documentation for complex logic, such as security policies, must precisely match the code implementation.
|
|
||
| ## Cross-cutting constraints | ||
|
|
||
| - **Cross-crate boundary**: `ironclaw_hooks` already forbids `events` |
There was a problem hiding this comment.
The statement that ironclaw_hooks already forbids a dependency on events is contradicted by the Risk section (line 159), which notes that ironclaw_events is already a dependency. Clarify the constraint to ensure the documentation precisely matches the system architecture.
References
- Documentation for complex logic, such as security policies, must precisely match the code implementation.
Cites crates/ironclaw_hooks/docs/successors/04-event-triggered-hooks.md as the scope contract. Adds the EventTriggered observer hook point, durable RuntimeEvent dispatch path, and Reborn pull-driven subscription wiring with caller-level coverage for matching, replay, scope filtering, observer-only authority, and backpressure.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce610736dc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !binding | ||
| .scope | ||
| .permits(binding.owning_extension.as_ref(), event.provider.as_ref()) | ||
| { |
There was a problem hiding this comment.
Preserve OwnCapabilities matching for providerless hook events
Scope filtering for event-triggered hooks is based on event.provider, but RuntimeEvent::hook_failed and RuntimeEvent::hook_decision_emitted populate provider as None (see crates/ironclaw_events/src/runtime_event.rs constructors). As a result, Installed hooks using the default OwnCapabilities scope are always filtered out for those event kinds, so a hook-failure subscription silently never fires unless it is widened to Global/SameTenant. This breaks the expected default behavior for extension-scoped failure observers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in two commits on this branch:
d7f43a3ff— adds ahook_id-based fallback inscope_provider_for_runtime_eventso that whenevent.providerisNoneonHookDispatched/HookDecisionEmitted/HookFailedevents, the dispatcher resolves the owning extension by looking upevent.hook_idagainst the registry's hex index.OwnCapabilitiesmatching is preserved for the default Installed-hook case.4bb06c93a— the durable fix: plumbsowning_extension: Option<ExtensionId>end-to-end throughLoopHostMilestoneKind::{HookDispatched, HookDecisionEmitted, HookFailed}and theRuntimeEvent::hook_*constructors, soevent.provideris populated at emit time and the primary path resolves without any fallback. Checkpoint payloads round-trip unchanged (#[serde(default, skip_serializing_if = "Option::is_none")]).
Regression tests in crates/ironclaw_reborn/tests/hooks_integration.rs:
event_triggered_own_capabilities_matches_hook_failed_with_carried_provider(primary path)event_triggered_own_capabilities_scope_resolves_hook_failed_owner_from_hook_id(legacy/None-provider fallback)
SummaryReviewed PR #3640 only. Base PR adds event-triggered hooks over durable event replay. Merge stance: blocking security/correctness findings at subscription boundary and replay/reentrancy behavior. Findings
Security/data-flow notes
Correctness/invariant notes
Missing tests
|
henrypark133
left a comment
There was a problem hiding this comment.
What looks good:
- Event-triggered hooks are observer-only at the sink/type boundary.
- Subscription runs from the durable log on a background task, so event emitters are not blocked by hook execution.
- Cursor replay, kind filtering, scope filtering, and task cleanup all have useful coverage.
Findings:
- High -
crates/ironclaw_hooks/src/dispatch.rs:887: event-triggered scope filtering usesevent.provider, butRuntimeEvent::hook_decision_emittedandRuntimeEvent::hook_failedsetprovider: Noneincrates/ironclaw_events/src/runtime_event.rs:550and:576. Why it matters: installed hooks default toOwnCapabilities, so the common/default event-triggered hook silently never fires for the hook-failure/decision events this PR is explicitly meant to observe. Expected fix direction: carry the originating provider into hook milestone runtime events, or resolve provider during replay/dispatch before applyingOwnCapabilities, with a regression test forOwnCapabilities + HookFailed.
Summary:
- Recommended verdict: Request changes
- Prior feedback status: current Codex P1 is confirmed.
- Residual risk: failure is silent non-observation, not loop corruption, but it undermines the Phase 5 default use case.
henrypark133 HIGH + codex P1 on PR #3640: `OwnCapabilities`-scoped event-triggered subscriptions silently never fired for `HookFailed`/`HookDecisionEmitted`/`HookDispatched` events because those `RuntimeEvent` constructors hardcoded `provider: None`. Since Installed hooks default to `OwnCapabilities`, the very events that Phase 5 was designed to observe (hook-failure / decision alerting) never reached their default-configured subscriber. A prior fix added a hook_id-based fallback in `scope_provider_for_runtime_event` that resolves the owning extension through the registry's hex index when `event.provider` is `None`. That covers the case where the failing hook is still registered at replay time, but the durable fix is to stamp the originating provider into the event at emit time so the primary `event.provider` path resolves without any fallback. Plumbed `owning_extension: Option<ExtensionId>` end-to-end: - `LoopHostMilestoneKind::{HookDispatched, HookDecisionEmitted, HookFailed}` gain the field (with `#[serde(default, skip_serializing_if = "Option::is_none")]` so pre-existing checkpoint payloads and the L3 schema-snapshot tests round-trip unchanged when no owner is set). - `RuntimeEvent::hook_{dispatched, decision_emitted, failed}` constructors accept the owner and stamp it into `provider`. - `milestone_events.rs` threads the field through the projection. - `HookDispatcher::emit_dispatched/emit_decision` pass `binding.owning_extension.clone()` directly. - `HookDispatcher::emit_failure` (no binding handy on the failure path) looks the owner up via the registry's existing `owning_extension_for_hook_hex` index. Tests: - `event_triggered_own_capabilities_matches_hook_failed_with_carried_provider`: primary-path regression — two `HookFailed` events with `provider: Some(ext_a|ext_b)` against an `OwnCapabilities` subscription scoped to ext_a; only the own-provider event fires and `event.provider == Some(ext_a)`. - Existing `event_triggered_own_capabilities_scope_resolves_hook_failed_owner_from_hook_id` remains green: passes `None` for the new arg so the fallback path is still exercised for legacy payloads. All other call sites updated to pass `None` (no owner available) or the resolved owner where applicable.
…rfirat HIGH #1 on PR #3640) `EventTriggeredHookSubscription` accepted a caller-supplied `EventStreamKey` + `ReadScope` and used `run_context.scope.tenant_id` as the hook context's tenant — with no validation that the two agreed. A caller wiring tenant A's host with tenant B's stream would cause hooks to observe B's events while the hook context claimed tenant A. Cross-tenant trust-boundary break. Add `EventTriggeredHookSubscription::validate_against_run_scope` and call it from `build_text_only_host_with_capabilities` before spawning. Validation: - Stream `(tenant_id, user_id, agent_id)` must equal `(run_context.scope.tenant_id, thread_scope.owner_user_id, run_context.scope.agent_id)`. - Thread without `owner_user_id` cannot bind any subscription — the user dimension is required to verify stream identity. - Every `Some(want)` in `ReadScope` must equal the corresponding run/thread scope value (project/mission/thread). `None` is permissive (run scope owns the dimension authoritatively). Failures surface as `RebornLoopDriverHostError::ScopeMismatch` with a specific reason naming the offending dimension. Tests: - `event_triggered_subscription_with_foreign_tenant_stream_fails_host_build` - `event_triggered_subscription_with_foreign_user_stream_fails_host_build` The integration fixture's `ThreadScope` now sets `owner_user_id: Some(...)` so it passes validation; previously it was `None`, which the new check (correctly) refuses. Existing tests continue to pass.
…rrfirat MED on PR #3640) When the durable event log returned `EventError::ReplayGap`, the event-triggered subscription's background task previously logged a `tracing::warn!` and broke out of the poll loop — silently killing all future hook event delivery for the run with no operator-visible signal. A scoped audit hook that mattered to compliance would just stop, and nobody downstream would know. Surface the termination through the host's milestone sink: - New `LoopDriverNoteKind::EventSubscriptionTerminated` variant. - The subscription's `spawn`/`run` now takes the host's `Arc<dyn LoopHostMilestoneSink>` and the active `LoopRunContext`. On `ReplayGap`, it constructs a `DriverNote` milestone with that kind plus a `LoopSafeSummary` describing the gap, publishes it through the same sink that carries every other host milestone, and *then* breaks (fail-closed: the at-most-once contract is already broken; resuming from `earliest` would silently lose the gap). - Log level bumped from `warn` to `error` to match the severity. - A best-effort send: failures to publish the milestone are logged but do not stall the subscription teardown. Tests: - `event_triggered_replay_gap_emits_subscription_terminated_milestone`: appends 3 events, `truncate_before_or_at` to cursor 2 to force a replay gap, starts the subscription from cursor origin (now stale), and asserts a `DriverNote { kind: EventSubscriptionTerminated, .. }` shows up on the host's milestone sink within a 2s deadline. Self-emit reentrancy (serrrfirat MED #3 on the same PR) is intentionally not addressed here — that fix needs a design call (task-local re-entry flag vs. removing RuntimeEvent emit capability from event-hook execution contexts) and is a follow-up.
on PR #3640) A hook that subscribes to one of the hook-lifecycle event kinds (`HookDispatched`/`HookDecisionEmitted`/`HookFailed`) with a scope that matches its own provider would otherwise be dispatched for events describing its OWN executions. The dispatcher emits those events itself when running the hook, so a hook subscribing to `HookFailed` with `OwnCapabilities` against its own extension would fail → emit HookFailed → re-dispatch → fail → emit → … storm. `dispatch_event_triggered_at` now skips events whose `event.hook_id` equals the binding's own hook id when the event kind is a hook- lifecycle kind (`is_hook_lifecycle_kind`). The check is intentionally narrow: - It only fires for hook-lifecycle events. Subscriptions to other event kinds are unaffected. - It only suppresses literal self-observation; events about other hooks (even hooks from the same extension) still dispatch. This does NOT cover the broader case of a hook that captures an `Arc<DurableEventLog>` and mints arbitrary `RuntimeEvent`s from inside its `observe()`. That requires architectural restriction on what hook impls can capture — tracked separately as a follow-up. Tests: - `event_triggered_self_lifecycle_event_does_not_redispatch`: appends two `HookFailed` events with the same provider — one targeting the subscriber's own hook id, one targeting a different hook. Asserts only the OTHER hook's failure fires (proves the filter is narrow, not blanket).
Code Review — PR #3640 (event-triggered hooks, Phase 5)Verdict: COMMENT (draft) — design and implementation are sound and well-structured. Several items should be resolved before promotion from draft. OverviewNew IssuesShould fix before promotion from draft:
Non-blocking nits:
Missing test coverage
Security
|
Four items from the 5-15 review (#4 DoS budget and #5 narrowed projection deferred — see below): **#1 (should-fix) Invariant: EventTriggered ↔ event_kind_filter** `HookRegistry::insert` now enforces the biconditional at install time: an `EventTriggered` binding must declare an `event_kind_filter` (otherwise the dispatcher's kind match would silently never fire — a no-op binding), and conversely only `EventTriggered` bindings may declare a filter (other points are kind-agnostic and would ignore the field). Misconfigured bindings fail loud at install. **#2 (should-fix) Remove `Clone` derive on EventTriggeredHookSubscription** `Clone` on a spawn-semantics type was a footgun: external callers cloning + spawning twice would create two consumers reading from the same `start_cursor`, each dispatching every hook. Replace with an explicit `clone_for_independent_spawn(&self)` method named verbosely so the property is visible at the seam. Internal use updated in the factory's host-build path; external callers can no longer accidentally construct a dual-consumer pattern. **#3 (should-fix) catch_unwind around the background `run()` task** The subscription's tokio task body now runs inside `AssertUnwindSafe(...).catch_unwind()`; a panic in `run()` emits the same `EventSubscriptionTerminated` `DriverNote` milestone the `ReplayGap` path already emits, instead of silently terminating with no operator-visible signal. **#6 (should-fix) Replay semantics in rustdoc on public API** Added a "Replay semantics" section to `EventTriggeredHookSubscription` rustdoc: at-least-once, caller-owned cursor persistence, the restart-from-start_cursor replay pattern. Previously only in the design doc; now load-bearing API contract is visible at the type. **#4 (deferred) Per-hook DoS budget for Installed tier** Henry's recommendation was to gate `Installed`-tier event-triggered hooks entirely until the budget design lands, allowing only Builtin/Trusted. That breaks 11 existing tests + the primary use case. Instead: documented the existing first-line throttle (`batch_limit` × `poll_interval`) as the current bound on indirect- recursion fanout, and tracked the full per-hook rate cap with poisoning + milestone-on-overrun as a follow-up. The self-trigger guard (committed earlier in this PR) catches the most common direct pattern; the throttle here bounds the indirect pattern until the proper budget lands. **#5 (deferred) Narrowed `HookObservableEvent` projection** Would prevent full `RuntimeEvent` surface from reaching Installed- tier hooks. Project-wide impact (events crate types, projection glue). Tracked as a follow-up; the existing sanitized-event projection bounds the surface to closed-vocab labels. All 156 hooks lib + 30 reborn integration tests pass.
|
@henrypark133 thanks for the review. Addressing the High finding ( Fixed in two commits, in the order you suggested:
Regression tests in
Full Codex P1 on |
Bundle three nit-tier review items into a single commit: **#9 Replace author-internal tags with NOTE(#3640)** The Phase-5 PR (#3640) had several `serrrfirat HIGH/MED #N on PR #3640` comment tags in this PR's diff. These are review-internal scaffolding, not load-bearing for future readers. Replaced with `NOTE(#3640)` in: - crates/ironclaw_hooks/src/dispatch.rs (self-observation guard) - crates/ironclaw_reborn/src/loop_driver_host.rs (scope validation, replay-gap milestone, subscription binding) - crates/ironclaw_reborn/tests/hooks_integration.rs (three regression tests covering scope validation, self-observation suppression, and replay-gap surfacing) - crates/ironclaw_turns/src/run_profile/host.rs (`EventSubscriptionTerminated` doc) **#10 Replace 10ms spin-poll with tokio::sync::Notify** `wait_for_seen_events` polled the shared `Mutex<Vec<SeenRuntimeEvent>>` every 10 ms until the expected count was reached. Replaced with a `SeenLog` newtype that pairs the events vec with a `Notify`; the hook's `observe()` calls `seen.push(...)` which signals `Notify::notify_one`, and `wait_for_seen_events` parks on `notified().await` under a `tokio::time::timeout`. `notify_one` is a permit-store, so an event landing between snapshot and wait still wakes the waiter immediately. Test latency drops from ~10 ms median to sub-ms and is no longer rate-limited by the polling cadence. All 30 hooks_integration tests still pass. **#11 Remove unused Clone derive on EventTriggeredHookContext** No call site clones the context — it's passed by reference. Dropped the derive to make the borrow contract clearer.
…reality Address gemini-code-assist review on `04-event-triggered-hooks.md`: - L50 (Likely surface): annotated the sketch's full `RuntimeEvent` use with a pointer to the narrowed-projection follow-up so the snippet no longer reads as a recommendation contradicting L119–121. - L55 (sink methods): replaced `note_fact` / `emit_audit` (which never shipped on `ObserverSink`) with the actual `note(category, summary)` primitive and cross-referenced Reborn's `EventTriggeredObserverSink`. - L95 (cursor / replay): "lost events during downtime acceptable" contradicted the at-least-once replay semantics described in the Phase 5 implementation notes. Rewrote the bullet to say replay is at-least-once from the persisted cursor and to spell out the operator obligation around cursor persistence before shutdown. - L100/115 (forbids events dep): the original doc claimed `ironclaw_hooks` forbids an `ironclaw_events` dep, but the Risk section noted the dep is already established via PR #3573. Updated both passages to reflect that the dep direction is set; Phase 5 adds the *consumer* side. The narrowed `HookObservableEvent` projection is now framed as a follow-up tracked in #3690.
|
@henrypark133 @serrrfirat addressing the 5/15 review batch. Status of each finding below. serrrfirat (MED)MED #2 ( MED #3 (no source tag / self-author suppression beyond hook-id exact match): resolved earlier in this PR — henrypark133should-fix #1 ( should-fix #2 ( should-fix #3 (panic in should-fix #4 (per-hook DoS dispatch budget for Installed-tier event hooks): deferred to follow-up issue #3689. Henry's stricter recommendation was to gate Installed-tier event hooks entirely until the budget lands, which breaks the primary use case + 11 tests. Interim mitigations remain: the subscription throttle ( should-fix #5 (narrowed should-fix #6 (replay semantics on nit #7 (author-internal tags): resolved in nit #8 ( nit #9 (unused nit #10 ( gemini-code-assist (doc fixes)All four resolved in
|
|
Thanks for the update, @zmanian. The resolutions for the identified issues and the follow-up tracking for the DoS budget and narrowed event projection look correct and address the concerns raised in the review. The use of |
Resolves conflicts from the latest base advance: - crates/ironclaw_hooks/src/dispatch.rs: union HEAD's owning_extension plumbing with base's audit_reason field in emit_decision_with_audit - crates/ironclaw_hooks/src/registry.rs: keep both new install-time validations — HEAD's event-kind-filter biconditional (henrypark133 should-fix #1) and base's scope-vs-point coherence check. Extend point_has_capability_context to treat EventTriggered as capability-context-bearing (provider resolved per-event at dispatch time) - crates/ironclaw_turns/src/run_profile/milestones.rs: union HookDecisionEmitted to carry both owning_extension AND audit_reason - crates/ironclaw_reborn/src/milestone_events.rs: project owning_extension into RuntimeEvent::hook_decision_emitted, ignore audit_reason (it stays on the in-memory milestone sink, never crosses the durable boundary) - crates/ironclaw_reborn/src/lib.rs: re-export EventTriggeredHookSubscription* alongside the base's new LoopCapabilityPortFactory - crates/ironclaw_reborn/src/loop_driver_host.rs: keep both the new ProfiledCapabilityHostRuntime/LoopCapabilityPortFactory and the event-subscription handle field; drop unused tokio::sync::Notify - crates/ironclaw_hooks/src/dispatch.rs: reject EventTriggered in the observer install path explicitly (it is installed via the event subscription path, not the generic observer entry point) - propagate event_kind_filter: None to remaining HookBinding test literals exposed by the merge henrypark133 #1 (HIGH): the originally-cited `event.provider` always being `None` for hook milestone events is already addressed on this branch — commit 4bb06c9 / d7f43a3 carry owning_extension end-to-end through both `RuntimeEvent::hook_failed` and `RuntimeEvent::hook_decision_emitted`. The merge preserves and extends that plumbing rather than reverting it. cargo fmt + clippy (-D warnings) green; unit and hooks_integration tests green.
|
@henrypark133 Rebased onto latest Re: HIGH finding #1 ( This was already addressed on the branch in commits The merge preserved that plumbing — I had to union it with the new Could you take another look when you have a moment? |
Successor PR from #3573. Draft — scope doc only, no implementation yet.
Scope
New `EventTriggered` hook point that subscribes to the runtime event bus and reacts to durable `RuntimeEvent`s asynchronously, outside the inline dispatch tick. Observer-only by construction (no Allow/Deny/Patch decisions; sink mirrors `ObserverSink`).
Motivation
Inline hook points are synchronous against the loop. Some legitimate hook use cases don't fit:
Design doc
`crates/ironclaw_hooks/docs/successors/04-event-triggered-hooks.md`
Open design questions
Coordination
Status
Draft for design review. Promote when the cursor/replay + DoS-budget design lands.