feat(state): track per-skill cron invocations + EMA query for labyrinth#19508
feat(state): track per-skill cron invocations + EMA query for labyrinth#19508Brecht-H wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds per-skill cron invocation tracking into the SQLite state store and exposes an EMA query API for downstream observability (e.g., Labyrinth dashboards) to compare quality/cost/duration over time.
Changes:
- Add
skill_invocationstable + supporting indexes andskill_stats_dailyview; bumpSCHEMA_VERSIONto 12. - Add
SessionDB.record_skill_invocation()writer andSessionDB.query_skill_ema()reader API. - Hook
cron/scheduler.py::run_job()to write one invocation row per skill at job completion (success/failure).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| hermes_state.py | Introduces the new skill invocation schema + daily aggregation view, and adds writer + EMA query methods. |
| cron/scheduler.py | Records per-skill invocation rows at the end of each cron run using data sourced from the session row. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| DEFAULT_DB_PATH = get_hermes_home() / "state.db" | ||
|
|
||
| SCHEMA_VERSION = 11 | ||
| SCHEMA_VERSION = 12 |
| groups: Dict[Tuple[str, Optional[str]], List[Dict[str, Any]]] = {} | ||
| for r in rows: | ||
| key = (r["skill_name"], r["model"]) | ||
| groups.setdefault(key, []).append(r) | ||
|
|
|
|
||
| ema_success_rate = sum( | ||
| w * ( | ||
| (int(r["success_count"] or 0) / max(1, int(r["invocation_count"] or 0))) |
| "SELECT skill_name, model, provider, day, " | ||
| "invocation_count, success_count, failure_count, " | ||
| "avg_duration_s, avg_cost_usd, avg_quality_score, last_invoked_at " | ||
| "FROM skill_stats_daily " | ||
| "WHERE last_invoked_at >= ? " | ||
| "ORDER BY skill_name, model, day" | ||
| ) | ||
| with self._lock: | ||
| cur = self._conn.execute(sql, (cutoff_ts,)) | ||
| rows = [dict(r) for r in cur.fetchall()] | ||
|
|
| raw_weights = [alpha * (1 - alpha) ** (n - 1 - i) for i in range(n)] | ||
| wsum = sum(raw_weights) | ||
| weights = ( | ||
| [w / wsum for w in raw_weights] | ||
| if wsum > 0 | ||
| else [1.0 / n] * n | ||
| ) | ||
|
|
||
| sample_count = sum(int(r["invocation_count"] or 0) for r in rs) | ||
| success_count = sum(int(r["success_count"] or 0) for r in rs) | ||
| failure_count = sum(int(r["failure_count"] or 0) for r in rs) | ||
|
|
||
| ema_success_rate = sum( | ||
| w * ( | ||
| (int(r["success_count"] or 0) / max(1, int(r["invocation_count"] or 0))) | ||
| ) | ||
| for w, r in zip(weights, rs) | ||
| ) | ||
| ema_duration = sum( | ||
| w * float(r["avg_duration_s"] or 0.0) | ||
| for w, r in zip(weights, rs) | ||
| ) | ||
| ema_cost = sum( | ||
| w * float(r["avg_cost_usd"] or 0.0) | ||
| for w, r in zip(weights, rs) | ||
| ) |
| _completed_at = _hermes_now().timestamp() | ||
| _sess = _session_db.get_session(_cron_session_id) or {} | ||
| _success_flag, _end_reason_val = _skill_outcome | ||
| _duration = _completed_at - _invoked_at | ||
| for _skill_name in _job_skills: | ||
| _sn = str(_skill_name).strip() | ||
| if not _sn: | ||
| continue | ||
| _session_db.record_skill_invocation( | ||
| skill_name=_sn, | ||
| invoked_at=_invoked_at, | ||
| session_id=_cron_session_id, | ||
| cron_id=job_id, | ||
| completed_at=_completed_at, | ||
| duration_seconds=_duration, | ||
| model=_sess.get("model"), | ||
| provider=_sess.get("billing_provider"), | ||
| input_tokens=int(_sess.get("input_tokens") or 0), | ||
| output_tokens=int(_sess.get("output_tokens") or 0), | ||
| cache_read_tokens=int(_sess.get("cache_read_tokens") or 0), | ||
| cache_write_tokens=int(_sess.get("cache_write_tokens") or 0), | ||
| estimated_cost_usd=_sess.get("estimated_cost_usd"), | ||
| success=_success_flag, | ||
| end_reason=_end_reason_val, | ||
| ) |
Mac 3-lens review — REQUEST-CHANGESRan 3 parallel reviewers (architect / code-review / security). Architect + code-review converged on 3× P1 correctness bugs that defeat the EMA telemetry's stated purpose. Security clear. Full feedback at Must-fix before merge (P1 × 3)
Should-fix (P2 × 3)
Nits (P3)
What's goodSchema additivity, parameterized SQL, keyword-only API, security clear, v1 scope is right. Estimated fix effort: ~30-45 min for P1s, ~15 min more for P2s. Smoke tests should still pass. Tag me when ready for re-review. |
Mac 3-lens review (architect / code-review / security) flagged 3 P1
correctness bugs. Fixes:
P1.1 — Calendar-aware EMA weighting (hermes_state.py)
Was: raw_weights = [alpha * (1-alpha)**(n-1-i) for i in range(n)]
Now: weights by (today_utc - day).days. Days with no data correctly
get zero weight (absent from result); gaps don't compress older data.
Defeats the index-based bias toward sparsely-running skills.
P1.2 — Volume-weighted cost / duration EMA (hermes_state.py + view)
Was: ema_X = sum(w * avg_X) — biased equal-weight across days
regardless of invocation count.
Now: skill_stats_daily exposes total_duration_s / total_cost_usd
alongside avg_*. EMA computed as
sum(w_d * total_d) / sum(w_d * count_d)
so a day with 100 fast calls correctly outweighs a day with 1 slow
call within the same EMA term.
P1.3 — Surface telemetry write failures (cron/scheduler.py)
Was: except: logger.debug(...) — silent swallow at DEBUG; broken
writer invisible until dashboard goes empty.
Now: logger.warning(..., exc_info=True). Operator sees regressions;
cron still cannot fail (warning never raises).
Plus P2.1 (multi-skill cron cost split — divides session cost evenly
when len(job["skills"]) > 1, preventing per-skill EMA cost-doubling),
P2.2 (DROP VIEW IF EXISTS before CREATE for migration idempotency),
and P3 nits (dead conditional on error_msg slice; docstring half-life
math corrected — α=0.3 ≈ 1.94d, not 5d; α≈0.129 for true 5-day).
Smoke-tested off-tree:
skill-a-gappy (days 0,1,7) success_rate=0.954 ← gap correctly
down-weights day-7
skill-b-tight (days 0,1,2) success_rate=1.000
skill-c-vol (1 slow + 100 fast) duration=1.69s cost=$0.0024/call
← volume-weighted
Constraint: pure additive view change — DROP+CREATE on every connect
is safe because skill_stats_daily has no rowids/triggers depending on
its identity. Existing rows in skill_invocations untouched.
Confidence: high (3 scenarios validated)
Scope-risk: narrow (same file boundaries as original PR)
Not-tested: multi-skill cron live-fire (smoke uses synthetic data)
Machine: orion-terminal
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@Brecht-H ready for re-review. P1 + P2 fixes pushed as 52ff992 (additive commit, no rebase — kept original 50bdf20 intact for diff trail). P1.1 Calendar-aware EMA weighting → smoke verified: 3-day-gap skill correctly down-weights day-7 failure (success_rate=0.954 vs naive 0.667). P1.2 Volume-weighted cost/duration EMA via new P1.3 P2.1 Multi-skill cron cost split (even division when P2.2 P3 nits: dead conditional removed on Diff: +98 / −27 across the same 2 files. No new tests added (existing test plan still applicable). Once you approve + merge, I'll cherry-pick locally → validate Step 3 cron-fire emits |
Mac re-review v2 — APPROVED ✅All 3 P1 fixes verified clean against original review feedback:
Both P2 fixes also landed:
Smoke evidence in re-review post is comprehensive (3 scenarios validating calendar-awareness baseline, gap handling, and volume weighting). Ship it. Merging now. |
|
@Brecht-H Bugs A + B + C from Orion CC's RED validation report fixed and validated. Pushed as commit Validation reports:
Bug C correctly downgraded both polymarket and miniflux to Pass 1 erratum added at top of Tagging for re-review of
|
Adds a `skill_invocations` table to state.db and writes one row per skill listed on a cron job at completion (success or failure paths). Tokens, cost and duration are sourced from the existing session row. Includes a `skill_stats_daily` view that buckets invocations by day and (skill_name, model, provider), and a new `SessionDB.query_skill_ema()` method that applies exponential weighting (default alpha=0.3, ~5d half-life) so the dashboard can A/B local Qwen against external models once analyzer crons start firing. SCHEMA_VERSION 11 → 12. Pure additive: existing rows untouched, new table created on next connection-open via the existing executescript() path. No Alembic. Slash-command and ad-hoc skill_view invocations are NOT tracked in v1. Multi-skill crons over-account: each skill in `job["skills"]` gets the full session cost. Both are acceptable for the analyzer-cron use case (1 skill per cron) and can iterate later. Constraint: no new external dependencies — uses sqlite3 + stdlib only. Rejected: per-skill cost split (would require model attribution inside a single agent run, which Hermes does not currently track) | Reason: defer to v2 once NousResearch#87 surfaces real-world skew. Confidence: high (smoke-tested end-to-end on tmp DB) Scope-risk: narrow (additive table, no existing-row touchpoints) Not-tested: live cron fire (validated by Step 3 — Pass 2 v2 handover) Machine: orion-terminal Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mac 3-lens review (architect / code-review / security) flagged 3 P1
correctness bugs. Fixes:
P1.1 — Calendar-aware EMA weighting (hermes_state.py)
Was: raw_weights = [alpha * (1-alpha)**(n-1-i) for i in range(n)]
Now: weights by (today_utc - day).days. Days with no data correctly
get zero weight (absent from result); gaps don't compress older data.
Defeats the index-based bias toward sparsely-running skills.
P1.2 — Volume-weighted cost / duration EMA (hermes_state.py + view)
Was: ema_X = sum(w * avg_X) — biased equal-weight across days
regardless of invocation count.
Now: skill_stats_daily exposes total_duration_s / total_cost_usd
alongside avg_*. EMA computed as
sum(w_d * total_d) / sum(w_d * count_d)
so a day with 100 fast calls correctly outweighs a day with 1 slow
call within the same EMA term.
P1.3 — Surface telemetry write failures (cron/scheduler.py)
Was: except: logger.debug(...) — silent swallow at DEBUG; broken
writer invisible until dashboard goes empty.
Now: logger.warning(..., exc_info=True). Operator sees regressions;
cron still cannot fail (warning never raises).
Plus P2.1 (multi-skill cron cost split — divides session cost evenly
when len(job["skills"]) > 1, preventing per-skill EMA cost-doubling),
P2.2 (DROP VIEW IF EXISTS before CREATE for migration idempotency),
and P3 nits (dead conditional on error_msg slice; docstring half-life
math corrected — α=0.3 ≈ 1.94d, not 5d; α≈0.129 for true 5-day).
Smoke-tested off-tree:
skill-a-gappy (days 0,1,7) success_rate=0.954 ← gap correctly
down-weights day-7
skill-b-tight (days 0,1,2) success_rate=1.000
skill-c-vol (1 slow + 100 fast) duration=1.69s cost=$0.0024/call
← volume-weighted
Constraint: pure additive view change — DROP+CREATE on every connect
is safe because skill_stats_daily has no rowids/triggers depending on
its identity. Existing rows in skill_invocations untouched.
Confidence: high (3 scenarios validated)
Scope-risk: narrow (same file boundaries as original PR)
Not-tested: multi-skill cron live-fire (smoke uses synthetic data)
Machine: orion-terminal
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mac's Pass 2 v2 deploy validation (Orion CC, RED report
2026-05-04) caught two scaffolding bugs that prevented the
analyzer crons from firing end-to-end. Both fixed here.
Bug A — skill_view() didn't match by frontmatter `name:`
Was: skill_view fell through directory-name match only.
Skills with dir/frontmatter mismatch (e.g. dir
`analyze/derivatives-anomaly`, frontmatter
`analyze-derivatives-anomaly`) returned "not found"
even though `hermes skills list` showed them. The
cron's _build_job_prompt then logged
"skill not found, skipping" and the activation
banner never reached the agent.
Now: tools/skills_tool.py adds a frontmatter-name fallback
after the existing dir-name match, restoring parity
with the LIST command's keying. ~17 lines.
Bug C — runner success classifier accepted scaffolding
failures as success
Was: cron/scheduler.py set
_skill_outcome = (True, "complete") whenever the
agent returned a non-FATAL response. A cron whose
skill resolution failed and whose script never ran
was reported success=True, polluting the EMA's
A/B comparison foundation.
Now: scan output + final_response for known scaffolding
failure markers ("skill not found, skipping",
"Blocked: script path resolves outside",
"permission denied", "security check failed",
"Skill(s) not found and skipped"). When matched,
downgrade to (False, marker_excerpt) so the
skill_invocations row records the real outcome.
(Bug B — symlink security check rejecting
~/.hermes/profiles/<p>/scripts/* symlinks whose targets
live under ~/.hermes/skills/<x>/scripts/ — is a
filesystem-only fix: replace the symlinks with `cp` real
files. Not in this commit; documented in
feedback_hermes-profile-mode-script-paths.md memory.)
Validation:
After applying A+B+C and restarting the dashboard, all 3
analyzer crons fired via run_job() and wrote
skill_invocations rows with correctly-classified
outcomes:
- derivatives-anomaly: success=1, model=qwen3.6-27b
- polymarket-regime: success=0, model=deepseek-v4-flash
end_reason="HTTP 401: API key invalid" (genuine
downstream credential failure, NOT scaffolding)
- miniflux-digest: success=0,
model=moonshotai/kimi-k2.6 end_reason="HTTP 402:
credit budget exceeded" (genuine downstream OpenRouter
limit, NOT scaffolding)
Confidence: high (3 live cron-fires, table inserts and
skill_invocations writes confirmed).
Scope-risk: narrow (additive only — Bug A is a new
fallback after existing matches; Bug C wraps the existing
success path in a marker scan).
Not-tested: the 65536-max_tokens setting in agent runtime
that triggered miniflux's 402 — likely a separate
agent-side config issue, not in this PR's scope.
Machine: orion-terminal
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
50a1d51 to
050f4d0
Compare
|
@Brecht-H rebased onto upstream main (was 201 commits behind base Conflict resolved:
Smoke test on rebased branch (off-tree fresh DB): New SHAs:
Force-pushed via Tagging for re-review of the resolved conflict + final approve. Local Hermes-agent install on Orion is unaffected (cherry-picked commits live there independent of upstream merge state). |
|
Mac re-review post-rebase — APPROVE OK Rebase verified clean:
The PR is ready for upstream maintainer review. Down-stream consumer (Mac/Allaert's Hermes deployment) has been running cherry-picked commits ccdb7a951 + c3d8a1016 since 2026-05-04 — Bug A/B/C fixes verified working in production for 24h+. skill_invocations table populating cleanly, derivatives-anomaly cron green, polymarket-regime + miniflux-digest now also green after the post-#19508 routing fix. No blockers from our side. Waiting on NousResearch maintainer. — Mac |
Summary
Adds a
skill_invocationstable tostate.db, wires the cron runner to write one row per skill listed on a cron job at completion (success and failure paths), and exposes aSessionDB.query_skill_ema()method so observability dashboards can A/B local Qwen against external models.This is the data layer for per-skill EMA tracking — needed before installing analyzer crons that route to paid providers (DeepSeek, Kimi-via-OpenRouter, etc.) without flying blind on quality vs cost.
Coexists cleanly with #18253's
bump_use()cron-side curator integration (auto-merged on cherry-pick).Changes
Schema (
hermes_state.py)skill_invocations(session_id, cron_id, skill_name, model, provider, duration, tokens, cost, success, end_reason, …)skill_stats_dailyaggregating per (skill_name, model, provider, day)SCHEMA_VERSION11 → 12. Pure additive — no Alembic, no destructive ops on existing rows.Writer (
hermes_state.py)SessionDB.record_skill_invocation(...)keyword-only API, uses the existing_execute_writeWAL-safe helper.Cron hook (
cron/scheduler.py)run_job()records timestamps around the agent run and writes oneskill_invocationsrow per name injob["skills"]after the agent exits (both success and failure paths). Tokens / cost / duration are sourced from the existingsessionsrow, so no new instrumentation in the agent loop.Read API (
hermes_state.py)SessionDB.query_skill_ema(window_days=14, alpha=0.3)returns per-(skill_name, model) exponentially-weighted moving averages for success rate, cost-per-call, and duration. Default α≈5d half-life.Smoke test (local, off-tree)
Inserts validated for success=True / success=False / success=None paths. View aggregates cleanly. EMA query returns expected exponential weighting (verified 8-row dataset across 7 days, recent failures correctly down-weighted as they age out).
v1 scope notes
skill_view()calls are not tracked yet — out of scope for v1, additive in v2 if needed.job["skills"]gets the full session cost. The current analyzer pattern is one skill per cron (Pass 2 v2 Steps 3–5), so this corner case is rare.query_skill_ema) lands here. The plumbing inplugins/hermes-labyrinth/dashboard/plugin_api.py(a small@router.get('/skills/ema')wrapper) is a follow-up because labyrinth is upstreamed atstainlu/hermes-labyrinth— separate PR target.Test plan
record_skill_invocation()insert +skill_stats_dailyaggregationquery_skill_ema()returns expected EMA values across multi-day databump_use()integration (auto-merge clean)analyze:derivatives-anomaly)History
Supersedes #19507, which carried an extra unrelated commit from local main. This PR is rebased clean onto current upstream main (363cc93).
🤖 Generated with Claude Code