-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat(hooks): persistent predicate counter backend (successor to #3573) #3635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: hooks-foundation-01
Are you sure you want to change the base?
Changes from 8 commits
d4dd47a
d5f68de
9f9513c
df505ae
e6df47d
4a6c7cf
0ea35d8
7c83c3d
b4d8a35
c9c5fb0
f632d22
2b03832
33521f7
8660d65
7ff90ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,157 @@ | ||
| # Successor PR: persistent predicate counter | ||
|
|
||
| > Successor work from PR #3573. Current sliding-window counter state is | ||
| > in-memory only and resets on every restart. This PR adds a durable | ||
| > backend so rate-limit / value-cap predicates survive process | ||
| > lifecycles. | ||
|
|
||
| ## Scope | ||
|
|
||
| Add a `PredicateStateBackend` trait + Postgres / libSQL impls so | ||
| `PredicateEvaluator` can persist its counter state. Threat-model D5 | ||
| (LRU eviction at 8192 keys per map) still applies in-memory; the | ||
| durable backend is the source of truth. | ||
|
|
||
| ## Required behavior | ||
|
|
||
| 1. **Cross-process consistency**: two host processes against the same | ||
| tenant share counter state. Run-1 increments → run-2 sees the | ||
| updated count. | ||
| 2. **Restart survival**: counter state survives process restart. | ||
| 3. **Replay refusal**: re-emitting a recorded invocation must NOT | ||
| double-count. The backend dedupes on `event_id` (a sanitized | ||
| `RuntimeEventId` hex) so duplicate detection works across restart | ||
| / replay. | ||
| 4. **Backend-agnostic**: the predicate evaluator depends on the trait, | ||
| not a specific store. | ||
|
|
||
| ## Likely surface | ||
|
|
||
| ```rust | ||
| #[async_trait] | ||
| pub trait PredicateStateBackend: Send + Sync { | ||
| /// Atomic record-and-read: the implementation MUST perform the | ||
| /// insert AND the in-window count read under a single | ||
| /// lock/transaction. Splitting them is a race that lets the cap | ||
| /// drift past `max` (codex Critical from #3635). | ||
| async fn record_invocation( | ||
| &self, | ||
| key: &InvocationKey, // (tenant, hook_id, capability) | ||
| timestamp: DateTime<Utc>, // project-wide convention; see src/db/mod.rs | ||
| event_id: &PredicateEventId, // opaque hex; durable backends UNIQUE-constraint on it | ||
| window: Duration, | ||
| ) -> Result<u32, PredicateBackendError>; | ||
|
|
||
| async fn record_value( | ||
| &self, | ||
| key: &ValueKey, // (tenant, hook_id, capability, field) | ||
| timestamp: DateTime<Utc>, | ||
| event_id: &PredicateEventId, | ||
| value: Decimal, | ||
| window: Duration, | ||
| ) -> Result<Decimal, PredicateBackendError>; | ||
|
|
||
| /// Garbage-collect rows older than `cutoff`. Operator runs this as | ||
| /// a reaper task, typically at the slowest configured window. | ||
| async fn evict_older_than( | ||
| &self, | ||
| cutoff: DateTime<Utc>, | ||
| ) -> Result<u64, PredicateBackendError>; | ||
| } | ||
| ``` | ||
|
|
||
| **Clock note (responding to gemini's review on the prior draft):** the | ||
| in-memory backend that already shipped in PR #3635 uses `Instant` | ||
| because it's process-local and monotonic. Durable backends serialize | ||
| across processes, so they must use `chrono::DateTime<Utc>` to match | ||
| the rest of the project (`src/db/mod.rs`, `ironclaw_events`). The | ||
| trait will accept `DateTime<Utc>` and the existing in-memory backend | ||
| will gain a thin shim mapping `Instant`-driven callers to a fixed | ||
| reference point. | ||
|
|
||
| **Run scope:** the trait does NOT carry `run_id` directly. The | ||
| sliding-window state is per-tenant + per-hook-id; replay refusal is | ||
| driven by `event_id` (which is `RuntimeEventId` from `ironclaw_events`, | ||
| itself already keyed to the current run's emission). An earlier draft | ||
| mentioned storing `run_id` alongside; that's redundant given the event | ||
| id's uniqueness contract and was removed in this revision. | ||
|
|
||
| ## Backends | ||
|
|
||
| - **`InMemoryPredicateStateBackend`**: keeps current behavior; used | ||
| for tests + the standalone `ironclaw_hooks` integration tests that | ||
| don't want to spin up a database. | ||
| - **`PostgresPredicateStateBackend`**: production. Schema: | ||
| ```sql | ||
| CREATE TABLE hook_invocation_events ( | ||
| tenant_id text NOT NULL, | ||
| hook_id bytea NOT NULL, | ||
| capability text NOT NULL, | ||
| occurred_at timestamptz NOT NULL, | ||
| event_id uuid PRIMARY KEY -- for replay dedup | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The schema dedupes
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in Schema PKs changed to composite uniqueness scoped to the counter key:
The trait's replay-refusal docs in |
||
| ); | ||
| CREATE INDEX ix_hook_invocation_events_window ON hook_invocation_events | ||
| (tenant_id, hook_id, capability, occurred_at DESC); | ||
|
|
||
| CREATE TABLE hook_value_events ( | ||
| tenant_id text NOT NULL, | ||
| hook_id bytea NOT NULL, | ||
| capability text NOT NULL, | ||
| field text NOT NULL, | ||
| occurred_at timestamptz NOT NULL, | ||
| value numeric NOT NULL, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the libSQL implementation, the
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in doc revision (lines 120-128). The libSQL impl is now explicitly called out as having two material differences from the Postgres impl mandated by
|
||
| event_id uuid PRIMARY KEY | ||
| ); | ||
| CREATE INDEX ix_hook_value_events_window ON hook_value_events | ||
| (tenant_id, hook_id, capability, field, occurred_at DESC); | ||
| ``` | ||
| - **`LibSqlPredicateStateBackend`**: parity with the rest of the | ||
| dual-backend story (per `src/db/CLAUDE.md`). Same shape as the | ||
| Postgres impl, with two material differences mandated by | ||
| `src/db/CLAUDE.md`: | ||
| - `value` column is `TEXT NOT NULL` (not numeric), because libSQL's | ||
| integer/real types can't preserve `rust_decimal` precision; the | ||
| backend serializes via `Decimal::to_string()` / `from_str()` at | ||
| the row boundary. | ||
| - `occurred_at` is stored as ISO-8601 `TEXT` (libSQL convention). | ||
|
|
||
| ## Migration / coexistence | ||
|
|
||
| - `PredicateEvaluator::new()` keeps the in-memory default. | ||
| - `PredicateEvaluator::with_backend(backend)` opt-in for production. | ||
| - The registrar / Reborn factory wires the backend per host. | ||
|
|
||
| ## Required tests | ||
|
|
||
| 1. **Replay refusal**: record the same `event_id` twice → second call | ||
| no-ops, count unchanged. | ||
| 2. **Cross-process**: two backends pointing at the same database, | ||
| one increments, the other reads the updated count. | ||
| 3. **Restart survival**: backend roundtrip after a connection cycle. | ||
| 4. **Eviction policy**: `evict_older_than` clears expired rows; called | ||
| periodically by a reaper task. | ||
| 5. **Tenant isolation**: tenant A's writes don't show up in tenant B's | ||
| reads (already pinned for in-memory by `HistoryKey` — confirm | ||
| backend-side). | ||
|
|
||
| ## Risk | ||
|
|
||
| - Touches `src/db/` migrations machinery. See `.claude/rules/database.md` | ||
| for the dual-backend conventions. | ||
| - Performance: every predicate evaluation becomes a DB round-trip. | ||
| An earlier draft suggested batching writes at the dispatcher's tick | ||
| boundary. **That conflicts with requirement #1 (cross-process | ||
| consistency)**: a deferred write from host A wouldn't be visible to | ||
| host B's read until the next tick, so two concurrent hosts could | ||
| each see "under cap" simultaneously and both proceed past `max` | ||
| (gemini's review on the prior draft). Resolution: the v1 production | ||
| backend keeps writes synchronous (read-your-own-writes within the | ||
| call); the in-process cache stays per-dispatch-only. A future | ||
| optimization could batch *reads* — collapse N predicate evaluations | ||
| in one dispatch into a single batch SELECT — but never writes. | ||
|
|
||
| ## Effort | ||
|
|
||
| Medium-Large. Schema migration + backend impls are mechanical; the | ||
| performance-conscious read/write batching is the design-discussion | ||
| piece. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trait uses
std::time::SystemTime, but the existing database layer in this repository (e.g.,src/db/mod.rs) consistently useschrono::DateTime<Utc>. UsingDateTime<Utc>would be more idiomatic for this project and simplify integration with the existing persistence and migration logic.pub trait PredicateStateBackend: Send + Sync { async fn record_invocation( &self, key: &InvocationKey, // (tenant, hook_id, capability) timestamp: DateTime<Utc>, event_id: RuntimeEventId, ) -> Result<(), PredicateBackendError>; async fn count_in_window( &self, key: &InvocationKey, window: Duration, ) -> Result<u32, PredicateBackendError>; async fn record_value( &self, key: &ValueKey, // (tenant, hook_id, capability, field) timestamp: DateTime<Utc>, value: Decimal, event_id: RuntimeEventId, ) -> Result<(), PredicateBackendError>; async fn sum_in_window( &self, key: &ValueKey, window: Duration, ) -> Result<Decimal, PredicateBackendError>; async fn evict_older_than(&self, cutoff: DateTime<Utc>) -> Result<u64, PredicateBackendError>; }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in doc revision. The current trait sketch (line 41-49) uses
DateTime<Utc>for the durable backend, and the rationale is called out explicitly on lines 63-70: