feat(skills): smart ranking, usage tracking, and lifecycle management #4406
fathah wants to merge 3 commits into
… Skills in the system prompt are now ranked by a combination of usage frequency and keyword relevance to the user's message, replacing the previous alphabetical dump. Adds a skill_usage table (schema v7) that tracks views, invocations, and management actions, feeding a normalized scoring system that surfaces the right skill for the task.
New capabilities:
- Token budget and max_prompt_skills caps (opt-in, defaults unchanged)
- Pinned skills that survive budget cuts
- Suffix stemming and domain synonym expansion for keyword matching
- Auto-archival of stale skills (background thread, opt-in)
- CLI: hermes skills stats/archive/restore/prune
- Deduplication warnings on skill creation
- Archived skills discoverable via skills_list(include_archived=True)
Benchmark on 98 real skills: correct skill in top 20 improved from 29% to 93%. Verified end-to-end with LLM picking the right skill 6/6 vs 4/6 on alphabetical ordering.
…s-agent into skills-overflow-fix
nice!
Thanks for the thorough write-up and benchmark, @fathah — closing this one, but the problem framing was useful. Most of what this PR builds has since shipped via the curator (commit bc79e22, `feat(curator): background skill maintenance`):
Two specific reasons not to salvage the rest:
The `hermes skills stats / archive / restore / prune` CLI surface is the one piece that's genuinely net-new and worth keeping; it is tracked in #19384, crediting this PR.
What does this PR do?
Skills in the system prompt are now ranked by usage frequency + keyword relevance to the user's message, replacing the alphabetical dump that buried the right skills.
Also adds usage tracking, opt-in token budgets, auto-archival of stale skills, and CLI commands to manage skill health.
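As a rough illustration of the ranking idea, the sketch below blends a normalized usage count with keyword overlap against the user's message. It is not the code in `agent/prompt_builder.py`: the function names, the suffix list, the weighting, and the synonym map are all assumptions made for this example (only the `tweet` -> `twitter` and `bug` -> `debug` pairs come from the PR text).

```python
import re

# Hypothetical synonym map; the two pairs are the examples quoted in the PR.
SYNONYMS = {"tweet": "twitter", "bug": "debug"}
# Hypothetical suffix list for a very small stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, stem, then map synonyms."""
    out = set()
    for raw in re.findall(r"[a-z0-9]+", text.lower()):
        stemmed = stem(raw)
        out.add(SYNONYMS.get(stemmed, stemmed))
    return out


def rank_skills(skills, usage, user_message, usage_weight=0.5):
    """Return skill names, best first.

    skills: {name: description}; usage: {name: raw usage count}.
    Both signals are scaled to [0, 1] before they are blended.
    """
    query = tokens(user_message)
    max_usage = max(usage.values(), default=0) or 1
    scored = []
    for name, description in skills.items():
        skill_tokens = tokens(name + " " + description)
        relevance = len(query & skill_tokens) / len(query) if query else 0.0
        frequency = usage.get(name, 0) / max_usage
        score = usage_weight * frequency + (1 - usage_weight) * relevance
        scored.append((score, name))
    # Highest score first; ties fall back to alphabetical order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


skills = {
    "api-docs": "write api documentation",
    "systematic-debugging": "debug failing tests step by step",
    "twitter-post": "post to twitter",
}
print(rank_skills(skills, {}, "help me fix this bug"))
```

With no usage data, the message "help me fix this bug" matches only the debugging skill (through the `bug` -> `debug` synonym), so that skill moves to the front while the others keep their alphabetical order.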
Problem
Every skill is injected into the system prompt alphabetically with no limits. With 98 skills, `ml-paper-writing` sits at position 86 and `systematic-debugging` at 95. The LLM scans through dozens of irrelevant skills before finding the one that matches, or gives up and improvises.
The system prompt is immune to context compression, so this gets worse over time as skills accumulate.
How it works
- `skill_usage` table (schema v7) records every view, invoke, and slash command. Scored with recency-weighted frequency in a single SQL query.
- Keyword matching uses suffix stemming and domain synonym expansion (`tweet` -> `twitter`, `bug` -> `debug`).
Related Issue
#4356 #4379 #4319 #4391 #4404
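For readers wondering what "recency-weighted frequency in a single SQL query" (under How it works) could look like, here is a minimal sketch using the standard `sqlite3` module. The column names, the action labels, and the `1 / (1 + age_in_days)` decay are assumptions for illustration; the PR text does not state the actual v7 schema or weighting.

```python
import sqlite3

# Hypothetical schema: the real skill_usage columns are not shown in the PR.
SCHEMA = """
CREATE TABLE skill_usage (
    skill_name TEXT NOT NULL,
    action     TEXT NOT NULL,   -- e.g. 'view', 'invoke', 'slash'
    used_at    REAL NOT NULL    -- Unix timestamp in seconds
)
"""

# Each event contributes 1 / (1 + age_in_days), so recent events weigh more.
RANKING_QUERY = """
SELECT skill_name,
       SUM(1.0 / (1.0 + (:now - used_at) / 86400.0)) AS score
FROM skill_usage
GROUP BY skill_name
ORDER BY score DESC, skill_name ASC
"""


def usage_scores(conn, now):
    """Return {skill_name: score}, highest score first."""
    rows = conn.execute(RANKING_QUERY, {"now": now}).fetchall()
    return {name: score for name, score in rows}


DAY = 86400.0
now = 100 * DAY  # fixed "current time" so the example is reproducible

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO skill_usage VALUES (?, ?, ?)",
    [
        ("systematic-debugging", "invoke", now - 1 * DAY),
        ("systematic-debugging", "view", now - 2 * DAY),
        ("ml-paper-writing", "view", now - 60 * DAY),
        ("ml-paper-writing", "view", now - 90 * DAY),
        ("ml-paper-writing", "invoke", now - 90 * DAY),
    ],
)
scores = usage_scores(conn, now)
```

Here the skill with two recent events outranks the one with three old events, which is the behaviour a recency weighting is meant to produce. The hyperbolic decay was chosen because it needs only arithmetic, so it does not rely on optional SQLite math functions.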
Type of Change
Changes Made
- `agent/prompt_builder.py`: keyword relevance scoring, suffix stemmer, synonym map, token budget, normalized merge, flat ranked output
- `hermes_state.py`: schema v7 migration with `skill_usage` table, ranking/stats/last-used queries, self-cleaning purge
- `tools/skill_manager_tool.py`: archive/restore, bundled skill detection, dedup check on create, `find_archivable_skills()`
- `tools/skills_tool.py`: usage tracking on skill_view, `.archive` exclusion, `include_archived` param, archive fallback with restore hint
- `agent/skill_commands.py`: usage tracking on slash command invocations
- `agent/skill_utils.py`: `.archive` added to `EXCLUDED_SKILL_DIRS`
- `hermes_cli/config.py`: `skills` config block (token_budget, max_prompt_skills, pinned_skills, auto_archive_days)
- `hermes_cli/main.py`: argparse for stats/archive/restore/prune subcommands
- `hermes_cli/skills_config.py`: CLI implementations for stats, archive, restore, prune
- `run_agent.py`: loads skills config, computes usage scores, passes user_message to prompt builder, background auto-archive
- `tests/test_skills_overflow.py`: 47 tests covering all new features
All config defaults preserve existing behavior (0 = unlimited/disabled). No breaking changes.
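To make the "0 = unlimited/disabled" convention concrete, here is a small sketch of how the `token_budget`, `max_prompt_skills`, and `pinned_skills` settings might interact. The function name, the per-skill token estimates, and the choice to stop at the first skill that overflows the budget are assumptions, not the behaviour of the actual prompt builder.

```python
def select_for_prompt(ranked, token_costs, token_budget=0,
                      max_prompt_skills=0, pinned=()):
    """Pick which ranked skills go into the prompt.

    ranked: skill names, best first.
    token_costs: {name: estimated tokens} (hypothetical estimates).
    A cap of 0 means "no limit", matching the PR's stated defaults.
    Returns (selected_names_in_rank_order, omitted_count).
    """
    pinned_set = set(pinned)
    # Assumption: pinned skills are always kept and count toward both caps.
    kept = {name for name in ranked if name in pinned_set}
    spent = sum(token_costs[name] for name in kept)
    for name in ranked:
        if name in kept:
            continue
        if max_prompt_skills and len(kept) >= max_prompt_skills:
            break
        if token_budget and spent + token_costs[name] > token_budget:
            break  # strict rank order: stop at the first skill that overflows
        kept.add(name)
        spent += token_costs[name]
    selected = [name for name in ranked if name in kept]
    return selected, len(ranked) - len(selected)


ranked = ["a", "b", "c", "d"]
costs = {"a": 100, "b": 100, "c": 100, "d": 100}
print(select_for_prompt(ranked, costs))                     # caps disabled
print(select_for_prompt(ranked, costs, token_budget=250))   # budget only
print(select_for_prompt(ranked, costs, token_budget=250, pinned=["d"]))
```

The omitted count in the return value is the kind of number a footer such as "N skills omitted" could report. Skipping an oversized skill and continuing down the list would fit more skills into the budget, at the cost of no longer following rank order strictly.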
How to Test
- `pytest tests/test_skills_overflow.py -v`: 47 tests, all pass
- `pytest tests/ -k skill -q`: full skill test suite, 0 new regressions
- `skills.token_budget: 4000`: skills section capped, footer shows omitted count
- `hermes skills stats`: shows usage data after interacting with skills
- `hermes skills archive <name>` then `hermes skills restore <name>`
- `hermes skills prune --days 90`: lists unused skills, prompts for confirmation
Benchmark (98 real skills)
Right skill in top 20: 29% -> 93%
End-to-end with gemma-3-4b: LLM picked the correct skill 6/6 vs 4/6 on alphabetical ordering.