feat(skills): smart ranking, usage tracking, and lifecycle management#4406

Closed
fathah wants to merge 3 commits into NousResearch:main from fathah:skills-overflow-fix

@fathah fathah commented Apr 1, 2026

What does this PR do?

Skills in the system prompt are now ranked by usage frequency + keyword relevance to the user's message, replacing the alphabetical dump that buried the right skills.

Also adds usage tracking, opt-in token budgets, auto-archival of stale skills, and CLI commands to manage skill health.

Problem

Every skill is injected into the system prompt alphabetically with no limits. With 98 skills, ml-paper-writing sits at position 86 and systematic-debugging at 95. The LLM scans through dozens of irrelevant skills before finding the one that matches — or gives up and improvises.

The system prompt is immune to context compression, so this gets worse over time as skills accumulate.

How it works

  1. Usage tracking — skill_usage table (schema v7) records every view, invoke, and slash command. Scored with recency-weighted frequency in a single SQL query.
  2. Keyword relevance — Jaccard similarity between user message and skill metadata (name, description, tags), expanded with suffix stemming and a domain synonym map (tweet -> twitter, bug -> debug).
  3. Normalized merge — both signals normalized to 0-1 before combining. Relevance weighted 3x so query-relevant skills beat daily-driver habits.
  4. Flat output — when scores are active, skills listed in score order instead of grouped by category.
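
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the suffix list, the helper names (`stem`, `canonical`, `tokens`, `rank`), and max-based normalization are all assumptions. Only the two synonym pairs and the 3x relevance weight come from the description.

```python
import re

# Only these two pairs appear in the PR description; a real map would be larger.
SYNONYMS = {"tweet": "twitter", "bug": "debug"}
# Hypothetical suffix list for a crude stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def canonical(word: str) -> str:
    """Stem first, then map through the synonym table."""
    base = stem(word)
    return SYNONYMS.get(base, base)


def tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and canonicalize each word."""
    return {canonical(w) for w in re.findall(r"[a-z0-9]+", text.lower())}


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size divided by union size (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores into 0-1 by dividing by the maximum (an assumed scheme)."""
    top = max(scores.values(), default=0.0)
    return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}


def rank(message: str, skills: dict[str, str], usage: dict[str, float],
         relevance_weight: float = 3.0) -> list[str]:
    """Order skill names by weighted relevance plus normalized usage."""
    query = tokens(message)
    relevance = normalize(
        {name: jaccard(query, tokens(f"{name} {meta}")) for name, meta in skills.items()}
    )
    usage_norm = normalize({name: usage.get(name, 0.0) for name in skills})
    combined = {
        name: relevance_weight * relevance[name] + usage_norm[name] for name in skills
    }
    # Ties fall back to alphabetical order so the output is deterministic.
    return sorted(skills, key=lambda name: (-combined[name], name))
```

With `skills` mapping each name to its description and tags, a heavily used but irrelevant skill still ranks below a query match, because the relevance term can contribute up to 3.0 while usage contributes at most 1.0.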

Related Issue

#4356 #4379 #4319 #4391 #4404

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • agent/prompt_builder.py — keyword relevance scoring, suffix stemmer, synonym map, token budget, normalized merge, flat ranked output
  • hermes_state.py — schema v7 migration with skill_usage table, ranking/stats/last-used queries, self-cleaning purge
  • tools/skill_manager_tool.py — archive/restore, bundled skill detection, dedup check on create, find_archivable_skills()
  • tools/skills_tool.py — usage tracking on skill_view, .archive exclusion, include_archived param, archive fallback with restore hint
  • agent/skill_commands.py — usage tracking on slash command invocations
  • agent/skill_utils.py — .archive added to EXCLUDED_SKILL_DIRS
  • hermes_cli/config.py — skills config block (token_budget, max_prompt_skills, pinned_skills, auto_archive_days)
  • hermes_cli/main.py — argparse for stats/archive/restore/prune subcommands
  • hermes_cli/skills_config.py — CLI implementations for stats, archive, restore, prune
  • run_agent.py — loads skills config, computes usage scores, passes user_message to prompt builder, background auto-archive
  • tests/test_skills_overflow.py — 47 tests covering all new features

All config defaults preserve existing behavior (0 = unlimited/disabled). No breaking changes.
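
As a sketch of how the opt-in caps could behave, the function below applies token_budget, max_prompt_skills, and pinned_skills to an already ranked list, treating 0 as "disabled" as stated above. The function name `select_skills`, the `token_cost` mapping, and the choice to skip an oversized skill and keep trying smaller ones are assumptions, not the PR's confirmed logic.

```python
def select_skills(ranked: list[str], token_cost: dict[str, int],
                  token_budget: int = 0, max_prompt_skills: int = 0,
                  pinned_skills: tuple[str, ...] = ()) -> tuple[list[str], int]:
    """Return (skills to include in rank order, number of omitted skills).

    A cap of 0 means that cap is disabled, matching the config defaults.
    """
    pinned = set(pinned_skills)
    kept: set[str] = set()
    spent = 0

    # Pinned skills are admitted first and always survive the cuts.
    for name in ranked:
        if name in pinned:
            kept.add(name)
            spent += token_cost[name]

    # Remaining skills are admitted in rank order while both caps allow it.
    for name in ranked:
        if name in pinned:
            continue
        if max_prompt_skills and len(kept) >= max_prompt_skills:
            break
        if token_budget and spent + token_cost[name] > token_budget:
            continue  # too large; a cheaper, lower-ranked skill may still fit
        kept.add(name)
        spent += token_cost[name]

    ordered = [name for name in ranked if name in kept]
    return ordered, len(ranked) - len(ordered)
```

The second return value is what a footer such as the "omitted count" mentioned in the test steps could report.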

How to Test

  1. pytest tests/test_skills_overflow.py -v — 47 tests, all pass
  2. pytest tests/ -k skill -q — full skill test suite, 0 new regressions
  3. Start hermes with default config — all skills appear as before
  4. Set skills.token_budget: 4000 — skills section capped, footer shows omitted count
  5. hermes skills stats — shows usage data after interacting with skills
  6. hermes skills archive <name> then hermes skills restore <name>
  7. hermes skills prune --days 90 — lists unused skills, prompts for confirmation

Benchmark (98 real skills)

| Query | Before (skill, position) | After (position) |
| --- | --- | --- |
| "write a research paper for NeurIPS" | ml-paper-writing 86, arxiv 82 | 2, 3 |
| "set up a vector database for RAG" | qdrant 73, pinecone 72, chroma 70 | 5, 7, 8 |
| "post a tweet about my project" | xitter 90 | 2 |
| "debug my python code that crashes" | systematic-debugging 95 | 9 |
| "find a restaurant nearby" | find-nearby 27 | 1 |

Right skill in top 20: 29% -> 93%

End-to-end with gemma-3-4b: LLM picked the correct skill 6/6 vs 4/6 on alphabetical ordering.

fathah added 3 commits April 1, 2026 10:05
… Skills in the system prompt are now ranked by a combination of usage frequency and keyword relevance to the user's message, replacing the previous alphabetical dump. Adds a skill_usage table (schema v7) that tracks views, invocations, and management actions, feeding a normalized scoring system that surfaces the right skill for the task.

New capabilities:
- Token budget and max_prompt_skills caps (opt-in, defaults unchanged)
- Pinned skills that survive budget cuts
- Suffix stemming and domain synonym expansion for keyword matching
- Auto-archival of stale skills (background thread, opt-in)
- CLI: hermes skills stats/archive/restore/prune
- Deduplication warnings on skill creation
- Archived skills discoverable via skills_list(include_archived=True)

Benchmark on 98 real skills: correct skill in top 20 improved from 29% to 93%. Verified end-to-end with LLM picking the right skill 6/6 vs 4/6 on alphabetical ordering.
@fathah fathah changed the title from "feat(skills): smart ranking, usage tracking, and lifecycle management…" to "feat(skills): smart ranking, usage tracking, and lifecycle management" on Apr 1, 2026
@alexferrari88
Contributor

nice!

@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/agent Core agent loop, run_agent.py, prompt builder tool/skills Skills system (list, view, manage) labels May 2, 2026
@alt-glitch
Collaborator

Related to #11425 (skills lifecycle management feature request) and RFC #16077 (Curator background skill maintenance).

@teknium1
Contributor

teknium1 commented May 3, 2026

Thanks for the thorough write-up and benchmark, @fathah — closing this one, but the problem framing was useful.

Most of what this PR builds has since shipped via the curator (commit bc79e22, `feat(curator): background skill maintenance`).

Two specific reasons not to salvage the rest:

  1. Keyword-relevance ranking in the system prompt would break prompt caching. Hermes treats the system prompt as immutable across a session (see AGENTS.md: "Prompt Caching Integrity"). Adding `user_message` to `build_skills_system_prompt()` and re-ordering skills per turn would invalidate the cache on every message, materially raising cost for every user. Skill ranking would need a different delivery channel (e.g. a runtime skill selector) — not the system prompt.

  2. Schema/CLI overlap with the shipped curator is now too large to rebase cleanly. A cherry-pick onto current main would conflict across all 11 files touched, and the PR's data model (separate `.archive` directory, its own schema v7) doesn't match the curator's approach.
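
The caching concern in point 1 can be illustrated with a toy model. The sketch below stands in for a provider-side prompt cache by hashing the full system prompt, which is a simplification: real caches typically match on a shared prefix, but reordering the skill list changes the prefix either way. The skill list, `build_prompt`, and both ordering functions are hypothetical and are not Hermes code.

```python
import hashlib

# Hypothetical skill names, borrowed from the benchmark table for flavor.
SKILLS = ["arxiv", "find-nearby", "systematic-debugging", "xitter"]


def build_prompt(skills: list[str]) -> str:
    """Render a toy system prompt with the skills in the given order."""
    return "You are an assistant.\nSkills:\n" + "\n".join(f"- {s}" for s in skills)


def cache_key(prompt: str) -> str:
    """Stand-in for a cache lookup key: a hash of the whole prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def static_order(_message: str) -> list[str]:
    """Session-stable ordering that ignores the user message."""
    return sorted(SKILLS)


def per_turn_order(message: str) -> list[str]:
    """Toy ranking: skills whose name shares a word with the message go first."""
    words = set(message.lower().split())
    return sorted(SKILLS, key=lambda s: (not (set(s.split("-")) & words), s))


def distinct_keys(order_fn, messages: list[str]) -> int:
    """Count how many different cache keys a session would produce."""
    return len({cache_key(build_prompt(order_fn(m))) for m in messages})
```

For a two-message session such as "find a restaurant nearby" followed by "search arxiv for papers", the static ordering yields a single key while the per-turn ordering yields two, meaning every turn after the first would miss the cache in this model.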

The `hermes skills stats / archive / restore / prune` CLI surface is the one piece that's genuinely net-new and worth keeping — tracked in #19384, crediting this PR.
