Problem
Every API call injects full tool schemas for ALL enabled tools (~14,000 tokens measured on a default hermes-cli setup with 30+ tools), regardless of whether those tools are relevant to the current message. Existing proposal #6839 addresses this with lazy loading (two-pass: name list first, full schema on model request), but it adds an extra LLM round trip every time a tool is needed.
Proposed Solution: Hybrid Tool Pre-Selection (Semantic + Keyword)
Before sending the user message to the LLM, run a fast hybrid search against a precomputed tool index to select only the top-K relevant tool schemas. Inject only those schemas into the prompt — single LLM call, no extra round trip.
Pure semantic search has a blind spot: if the user says "use browser_navigate" or "run terminal command", embeddings may rank other tools higher due to paraphrasing. Pure keyword search misses intent: "check what's running" won't match `process` on keywords alone. Hybrid search catches both.
Flow
- User sends message
- Hermes runs hybrid search against precomputed tool index:
- Keyword (BM25): exact tool name mentions, parameter names, direct intent words — pure Python, <1ms
- Semantic (embeddings): fuzzy intent, synonyms, natural language — ~50ms embed query
- Score fusion via RRF: `final_score = 1/(k + semantic_rank) + 1/(k + keyword_rank)`
- Top-K tools selected (e.g. K=8), plus a small fixed set of always-included core tools
- Only those schemas injected into the system prompt
- Single LLM call as normal — no extra round trip
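The fusion step above can be illustrated with a tiny worked example. The tool names below are hypothetical, ranks are 0-based, and `rrf` is an illustrative helper, not Hermes code:

```python
def rrf(keyword_ranked, semantic_ranked, k=60):
    """Reciprocal Rank Fusion: each leg contributes 1/(k + rank)."""
    scores = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, name in enumerate(ranked):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "terminal" tops the keyword leg, "process_list" tops the semantic leg;
# RRF rewards tools that rank well on both legs.
keyword_ranked = ["terminal", "read_file", "process_list"]
semantic_ranked = ["process_list", "terminal", "search_files"]
print(rrf(keyword_ranked, semantic_ranked))
# → ['terminal', 'process_list', 'read_file', 'search_files']
```

With k=60, "terminal" scores 1/60 + 1/61 and edges out "process_list" (1/62 + 1/60); a tool appearing on only one leg trails both.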
Index Storage
- Two lightweight indexes built from the same tool name + description corpus
- Keyword: inverted index (BM25 via rank-bm25, ~500 lines, no model needed)
- Semantic: precomputed vectors in `~/.hermes/tool_embeddings.npz` (~77KB for 50 tools × 384 dims)
- Both built once on startup, loaded into memory in milliseconds
- Re-indexed automatically on checksum mismatch (tool names+descriptions change)
- Re-index triggers: `hermes tools enable/disable`, MCP server added/removed, Hermes update
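The checksum trigger could be sketched as below; the tool-dict shape and function name are assumptions for illustration, not the actual Hermes API:

```python
import hashlib
import json

def corpus_checksum(tools):
    """Stable digest over tool names + descriptions; any change to the
    enabled tool set or its descriptions yields a new checksum."""
    corpus = sorted((t["name"], t["description"]) for t in tools)
    blob = json.dumps(corpus).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

tools = [
    {"name": "terminal", "description": "Run a shell command"},
    {"name": "read_file", "description": "Read a file from disk"},
]
before = corpus_checksum(tools)
tools[0]["description"] = "Run a shell command in the workspace"
assert corpus_checksum(tools) != before  # description change forces re-index
```

Comparing the stored digest against a freshly computed one on startup decides whether both indexes are rebuilt or loaded from disk.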
Implementation Sketch
```python
from collections import defaultdict
import numpy as np

# At startup: build/load both indexes (rebuilt on checksum mismatch)
bm25_index, tool_vectors, tool_names = load_or_build_indexes()

# Per turn: hybrid retrieval
keyword_ranked = bm25_index.rank(user_message)   # tool names, best first
q = embed(user_message)                          # <50ms
order = np.argsort(tool_vectors @ q)[::-1]       # row indices, best first
semantic_ranked = [tool_names[i] for i in order]

# Reciprocal Rank Fusion
k = 60  # RRF constant
scores = defaultdict(float)
for ranked in (keyword_ranked, semantic_ranked):
    for rank, name in enumerate(ranked):
        scores[name] += 1 / (k + rank)

top_k = sorted(scores, key=scores.get, reverse=True)[:K]
schemas = [registry.get_schema(n) for n in top_k]
# inject schemas into the system prompt as normal
```
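In practice the keyword leg would come from rank-bm25; as a self-contained illustration of what BM25 does with the tool name + description corpus, here is a minimal pure-Python stand-in (corpus entries and the `bm25_rank` helper are hypothetical):

```python
import math
from collections import Counter

def bm25_rank(corpus, query, k1=1.5, b=0.75):
    """Rank tools by BM25 over name + description; best first.
    Pure-Python stand-in for rank-bm25, for illustration only."""
    docs = {name: (name + " " + desc).lower().split()
            for name, desc in corpus.items()}
    N = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / N
    df = Counter()                      # document frequency per term
    for toks in docs.values():
        df.update(set(toks))

    def score(toks):
        tf = Counter(toks)
        s = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        return s

    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

corpus = {
    "terminal": "run a shell command",
    "browser_navigate": "navigate the browser to a url",
    "read_file": "read a file from disk",
}
print(bm25_rank(corpus, "use browser_navigate"))
```

Because the tool name itself is a token in its own document, an exact mention like "use browser_navigate" ranks that tool first regardless of how the embedding leg scores it.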
Config
```yaml
tools:
  selection: eager        # current default — all tools always injected
  # selection: hybrid     # semantic + keyword fusion (recommended)
  # selection: semantic   # embedding only
  # selection: keyword    # BM25 only
  # rag_top_k: 8
  # rag_embed_model: nomic-embed-text  # or "auxiliary" to reuse existing provider
```
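One way the `selection` setting could gate the retrieval path, assuming both legs return ranked lists of tool names (the function is hypothetical; mode names mirror the config values above):

```python
def select_tool_names(mode, all_names, keyword_ranked, semantic_ranked, top_k=8):
    """Pick which tool schemas to inject, based on the selection mode."""
    if mode == "eager":
        return list(all_names)          # current behaviour: inject everything
    if mode == "keyword":
        return keyword_ranked[:top_k]   # BM25 leg only
    if mode == "semantic":
        return semantic_ranked[:top_k]  # embedding leg only
    if mode == "hybrid":                # RRF fusion of both legs
        k = 60
        scores = {}
        for ranked in (keyword_ranked, semantic_ranked):
            for rank, name in enumerate(ranked):
                scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
    raise ValueError(f"unknown selection mode: {mode}")
```

Keeping `eager` as the default makes the feature strictly opt-in, so existing setups see no behaviour change.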
Comparison with Existing Proposals
| | Current | #6839 Lazy Loading | This proposal (Hybrid) |
|---|---|---|---|
| Schema tokens/call | ~14,000 | ~500 + full schema on request | ~1,400 (top-8 schemas) |
| Extra LLM round trip | 0 | +1 per tool use | 0 |
| Latency penalty | 0 | ~1-2s per tool use | ~50ms embed + <1ms BM25 |
| Token savings | 0% | ~70% schema, but extra call cost | ~90% schema, no extra call |
| Handles exact tool name | yes | yes | yes (keyword leg) |
| Handles fuzzy intent | n/a | n/a | yes (semantic leg) |
The two approaches are also composable: hybrid pre-selects likely tools, lazy loading (#6839) handles edge cases where the model needs a tool outside the top-K.
Trade-offs
- Pro: ~90% schema token reduction with zero latency penalty
- Pro: No change to agent loop structure
- Pro: No external DB — BM25 inverted index + numpy vectors, both tiny
- Pro: Automatic re-indexing on tool set changes
- Pro: Hybrid beats pure semantic or pure keyword on both exact and fuzzy queries
- Con: Requires an embedding model (can reuse auxiliary provider already in Hermes)
- Con: Risk of missing a needed tool if K is too small (mitigated by always including a pinned core set: terminal, read_file, search_files)
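The pinned-core mitigation can be sketched as a merge of the fixed core set with the retrieved ranking (function name hypothetical; core set as listed above):

```python
# Core tools that are always injected, regardless of retrieval ranking.
CORE_TOOLS = ["terminal", "read_file", "search_files"]

def final_selection(ranked, k=8, core=CORE_TOOLS):
    """Core tools first, then the top-ranked non-core tools up to k extras."""
    selected = list(core)
    for name in ranked:
        if len(selected) >= k + len(core):
            break
        if name not in selected:
            selected.append(name)
    return selected

print(final_selection(["terminal", "browser_navigate", "process_list"], k=2))
# core set first, then top-ranked non-core tools, no duplicates
```

Since the core set is always present, a retrieval miss on a core tool costs nothing; only non-core tools depend on the ranking.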
Related