Problem
Every API call injects full tool schemas for ALL enabled tools (~14,000 tokens measured on a default hermes-cli setup with 30+ tools), regardless of whether those tools are relevant to the current message. Existing proposal #6839 addresses this with lazy loading (two-pass: name list first, full schema on model request), but it adds an extra LLM round trip every time a tool is needed.
Proposed Solution: Hybrid Tool Pre-Selection (Semantic + Keyword)
Before sending the user message to the LLM, run a fast hybrid search against a precomputed tool index to select only the top-K relevant tool schemas. Inject only those schemas into the prompt — single LLM call, no extra round trip.
Pure semantic search has a blind spot: if the user says "use browser_navigate" or "run terminal command", embeddings may rank other tools higher due to paraphrasing. Pure keyword search misses intent: "check what's running" won't match `process` on keywords alone. Hybrid search catches both.
Flow
- User sends message
- Hermes runs hybrid search against precomputed tool index:
- Keyword (BM25): exact tool name mentions, parameter names, direct intent words — pure Python, <1ms
- Semantic (embeddings): fuzzy intent, synonyms, natural language — ~50ms embed query
- Score fusion via RRF: `final_score = 1/(k + semantic_rank) + 1/(k + keyword_rank)`
- Top-K tools selected (e.g. K=8), plus a small fixed set of always-included core tools
- Only those schemas injected into the system prompt
- Single LLM call as normal — no extra round trip
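The fusion step above can be illustrated with a tiny worked example. The tool names below are hypothetical, ranks are 0-based, and `rrf` is an illustrative helper, not Hermes code:

```python
def rrf(keyword_ranked, semantic_ranked, k=60):
    """Reciprocal Rank Fusion: each leg contributes 1/(k + rank)."""
    scores = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, name in enumerate(ranked):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "terminal" tops the keyword leg, "process_list" tops the semantic leg;
# RRF rewards tools that rank well on both legs.
keyword_ranked = ["terminal", "read_file", "process_list"]
semantic_ranked = ["process_list", "terminal", "search_files"]
print(rrf(keyword_ranked, semantic_ranked))
# → ['terminal', 'process_list', 'read_file', 'search_files']
```

With k=60, "terminal" scores 1/60 + 1/61 and edges out "process_list" (1/62 + 1/60); a tool appearing on only one leg trails both.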
Index Storage
- Two lightweight indexes built from the same tool name + description corpus
- Keyword: inverted index (BM25 via rank-bm25, ~500 lines, no model needed)
- Semantic: precomputed vectors in `~/.hermes/tool_embeddings.npz` (~77KB for 50 tools × 384 dims)
- Both built once on startup, loaded into memory in milliseconds
- Re-indexed automatically on checksum mismatch (tool names+descriptions change)
- Re-index triggers: `hermes tools enable/disable`, MCP server added/removed, Hermes update
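The checksum trigger could be sketched as below; the tool-dict shape and function name are assumptions for illustration, not the actual Hermes API:

```python
import hashlib
import json

def corpus_checksum(tools):
    """Stable digest over tool names + descriptions; any change to the
    enabled tool set or its descriptions yields a new checksum."""
    corpus = sorted((t["name"], t["description"]) for t in tools)
    blob = json.dumps(corpus).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

tools = [
    {"name": "terminal", "description": "Run a shell command"},
    {"name": "read_file", "description": "Read a file from disk"},
]
before = corpus_checksum(tools)
tools[0]["description"] = "Run a shell command in the workspace"
assert corpus_checksum(tools) != before  # description change forces re-index
```

Comparing the stored digest against a freshly computed one on startup decides whether both indexes are rebuilt or loaded from disk.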
Implementation Sketch
```python
from collections import defaultdict
import numpy as np

# At startup: build/load both indexes (rebuilt on checksum mismatch)
bm25_index, tool_vectors, tool_names = load_or_build_indexes()

# Per turn: hybrid retrieval
keyword_ranked = bm25_index.rank(user_message)   # tool names, best first
q = embed(user_message)                          # <50ms
order = np.argsort(tool_vectors @ q)[::-1]       # row indices, best first
semantic_ranked = [tool_names[i] for i in order]

# Reciprocal Rank Fusion
k = 60  # RRF constant
scores = defaultdict(float)
for ranked in (keyword_ranked, semantic_ranked):
    for rank, name in enumerate(ranked):
        scores[name] += 1 / (k + rank)

top_k = sorted(scores, key=scores.get, reverse=True)[:K]
schemas = [registry.get_schema(n) for n in top_k]
# inject schemas into the system prompt as normal
```
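In practice the keyword leg would come from rank-bm25; as a self-contained illustration of what BM25 does with the tool name + description corpus, here is a minimal pure-Python stand-in (corpus entries and the `bm25_rank` helper are hypothetical):

```python
import math
from collections import Counter

def bm25_rank(corpus, query, k1=1.5, b=0.75):
    """Rank tools by BM25 over name + description; best first.
    Pure-Python stand-in for rank-bm25, for illustration only."""
    docs = {name: (name + " " + desc).lower().split()
            for name, desc in corpus.items()}
    N = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / N
    df = Counter()                      # document frequency per term
    for toks in docs.values():
        df.update(set(toks))

    def score(toks):
        tf = Counter(toks)
        s = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        return s

    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

corpus = {
    "terminal": "run a shell command",
    "browser_navigate": "navigate the browser to a url",
    "read_file": "read a file from disk",
}
print(bm25_rank(corpus, "use browser_navigate"))
```

Because the tool name itself is a token in its own document, an exact mention like "use browser_navigate" ranks that tool first regardless of how the embedding leg scores it.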
Config
```yaml
tools:
  selection: eager        # current default — all tools always injected
  # selection: hybrid     # semantic + keyword fusion (recommended)
  # selection: semantic   # embedding only
  # selection: keyword    # BM25 only
  # rag_top_k: 8
  # rag_embed_model: nomic-embed-text  # or "auxiliary" to reuse existing provider
```
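One way the `selection` setting could gate the retrieval path, assuming both legs return ranked lists of tool names (the function is hypothetical; mode names mirror the config values above):

```python
def select_tool_names(mode, all_names, keyword_ranked, semantic_ranked, top_k=8):
    """Pick which tool schemas to inject, based on the selection mode."""
    if mode == "eager":
        return list(all_names)          # current behaviour: inject everything
    if mode == "keyword":
        return keyword_ranked[:top_k]   # BM25 leg only
    if mode == "semantic":
        return semantic_ranked[:top_k]  # embedding leg only
    if mode == "hybrid":                # RRF fusion of both legs
        k = 60
        scores = {}
        for ranked in (keyword_ranked, semantic_ranked):
            for rank, name in enumerate(ranked):
                scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
    raise ValueError(f"unknown selection mode: {mode}")
```

Keeping `eager` as the default makes the feature strictly opt-in, so existing setups see no behaviour change.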
Comparison with Existing Proposals
| | Current | #6839 Lazy Loading | This proposal (Hybrid) |
|---|---|---|---|
| Schema tokens/call | ~14,000 | ~500 + full schema on request | ~1,400 (top-8 schemas) |
| Extra LLM round trip | 0 | +1 per tool use | 0 |
| Latency penalty | 0 | ~1-2s per tool use | ~50ms embed + <1ms BM25 |
| Token savings | 0% | ~70% schema, but extra call cost | ~90% schema, no extra call |
| Handles exact tool name | yes | yes | yes (keyword leg) |
| Handles fuzzy intent | n/a | n/a | yes (semantic leg) |
The two approaches are also composable: hybrid pre-selects likely tools, lazy loading (#6839) handles edge cases where the model needs a tool outside the top-K.
Trade-offs
- Pro: ~90% schema token reduction with zero latency penalty
- Pro: No change to agent loop structure
- Pro: No external DB — BM25 inverted index + numpy vectors, both tiny
- Pro: Automatic re-indexing on tool set changes
- Pro: Hybrid beats pure semantic or pure keyword on both exact and fuzzy queries
- Con: Requires an embedding model (can reuse auxiliary provider already in Hermes)
- Con: Risk of missing a needed tool if K is too small (mitigated by always including a pinned core set: terminal, read_file, search_files)
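The pinned-core mitigation can be sketched as a merge of the fixed core set with the retrieved ranking (function name hypothetical; core set as listed above):

```python
# Core tools that are always injected, regardless of retrieval ranking.
CORE_TOOLS = ["terminal", "read_file", "search_files"]

def final_selection(ranked, k=8, core=CORE_TOOLS):
    """Core tools first, then the top-ranked non-core tools up to k extras."""
    selected = list(core)
    for name in ranked:
        if len(selected) >= k + len(core):
            break
        if name not in selected:
            selected.append(name)
    return selected

print(final_selection(["terminal", "browser_navigate", "process_list"], k=2))
# core set first, then top-ranked non-core tools, no duplicates
```

Since the core set is always present, a retrieval miss on a core tool costs nothing; only non-core tools depend on the ranking.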
Related