
Feature: Hybrid Tool Pre-Selection (Semantic + Keyword) — RAG-style schema injection to reduce token overhead without extra LLM round trips #13332

@jack2684

Description

Problem

Every API call injects full tool schemas for ALL enabled tools (~14,000 tokens measured on a default hermes-cli setup with 30+ tools), regardless of whether those tools are relevant to the current message. Existing proposal #6839 addresses this with lazy loading (two-pass: name list first, full schema on model request), but it adds an extra LLM round trip every time a tool is needed.

Proposed Solution: Hybrid Tool Pre-Selection (Semantic + Keyword)

Before sending the user message to the LLM, run a fast hybrid search against a precomputed tool index to select only the top-K relevant tool schemas. Inject only those schemas into the prompt — single LLM call, no extra round trip.

Pure semantic search alone has a blind spot: if the user says "use browser_navigate" or "run terminal command", embeddings may rank other tools higher due to paraphrasing. Pure keyword search misses intent: "check what's running" won't match process on keywords alone. Hybrid catches both.

Flow

  1. User sends message
  2. Hermes runs hybrid search against precomputed tool index:
    • Keyword (BM25): exact tool name mentions, parameter names, direct intent words — pure Python, <1ms
    • Semantic (embeddings): fuzzy intent, synonyms, natural language — ~50ms embed query
    • Score fusion via RRF: final_score = 1/(k + semantic_rank) + 1/(k + keyword_rank)
  3. Top-K tools selected (e.g. K=8), plus a small fixed set of always-included core tools
  4. Only those schemas injected into the system prompt
  5. Single LLM call as normal — no extra round trip
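To make the RRF step concrete, here is a toy fusion of two ranked lists with the proposal's constant k=60. The tool names are illustrative; ranks are 0-based positions in each list:

```python
# Toy Reciprocal Rank Fusion over two ranked lists of tool names.
k = 60  # RRF constant from the proposal

keyword_ranks = ["browser_navigate", "terminal", "read_file"]
semantic_ranks = ["terminal", "process_list", "browser_navigate"]

scores = {}
for ranking in (keyword_ranks, semantic_ranks):
    for rank, name in enumerate(ranking):
        # A tool absent from one list simply gets no contribution from it.
        scores[name] = scores.get(name, 0.0) + 1 / (k + rank)

fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # "terminal" wins: rank 1 + rank 0 beats rank 0 + rank 2
```

Note that `terminal` (1/61 + 1/60) edges out `browser_navigate` (1/60 + 1/62): consistent mid-ranking in both legs beats a single top spot, which is exactly the behavior hybrid fusion wants.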

Index Storage

  • Two lightweight indexes built from the same tool name + description corpus
  • Keyword: inverted index (BM25 via rank-bm25, ~500 lines, no model needed)
  • Semantic: precomputed vectors in ~/.hermes/tool_embeddings.npz (~77KB for 50 tools × 384 dims)
  • Both built once on startup, loaded into memory in milliseconds
  • Re-indexed automatically on checksum mismatch (tool names+descriptions change)
  • Re-index triggers: hermes tools enable/disable, MCP server added/removed, Hermes update
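One way the checksum trigger could be wired, as a sketch only: the function names, signatures, and the `(name, description)` pair format below are hypothetical, not existing Hermes APIs.

```python
import hashlib
import json

def tool_corpus_checksum(tools):
    """Stable digest over the tool corpus; any name/description change forces a rebuild.
    `tools` is a hypothetical list of (name, description) pairs."""
    blob = json.dumps(sorted(tools), separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

def load_or_build_indexes(tools, cached_checksum):
    # Hypothetical wiring: rebuild both indexes only when the corpus changed.
    current = tool_corpus_checksum(tools)
    if current != cached_checksum:
        return build_indexes(tools), current      # build_indexes assumed elsewhere
    return load_cached_indexes(), current         # load_cached_indexes assumed elsewhere
```

Sorting before hashing makes the digest independent of registration order, so enabling the same tool set in a different order does not trigger a spurious rebuild.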

Implementation Sketch

```python
import numpy as np
from collections import defaultdict

# At startup: build/load both indexes (rebuilds on checksum mismatch)
bm25_index, tool_vectors, tool_names = load_or_build_indexes()

# Per turn: hybrid retrieval
keyword_ranks = bm25_index.rank(user_message)        # tool names, best first
q = embed(user_message)                              # ~50ms
semantic_order = np.argsort(tool_vectors @ q)[::-1]  # indices, best first
semantic_ranks = [tool_names[i] for i in semantic_order]

# Reciprocal Rank Fusion
k = 60  # RRF constant
scores = defaultdict(float)
for ranking in (keyword_ranks, semantic_ranks):
    for rank, name in enumerate(ranking):
        scores[name] += 1 / (k + rank)

top_k = sorted(scores, key=scores.get, reverse=True)[:K]
schemas = [registry.get_schema(n) for n in top_k]
# inject schemas into the prompt as normal
```
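The keyword leg above assumes rank-bm25. For illustration, a stdlib-only stand-in shows the same shape (inverted index in, ranked tool names out) without the dependency; this is a simplified TF-IDF scorer, not real BM25, and the class and method names are hypothetical:

```python
import math
import re
from collections import defaultdict

class KeywordIndex:
    """Minimal TF-IDF-style inverted index standing in for rank-bm25 (illustrative only)."""

    def __init__(self, docs):
        # docs: {tool_name: "name + description text"}
        self.postings = defaultdict(dict)          # term -> {tool_name: term_freq}
        self.n_docs = len(docs)
        for name, text in docs.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[term][name] = self.postings[term].get(name, 0) + 1

    def rank(self, query):
        scores = defaultdict(float)
        for term in re.findall(r"\w+", query.lower()):
            hits = self.postings.get(term, {})
            if not hits:
                continue
            idf = math.log(1 + self.n_docs / len(hits))  # rarer terms weigh more
            for name, tf in hits.items():
                scores[name] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)
```

Because `\w+` keeps underscores inside a token, an exact mention like "use browser_navigate" matches the tool name as a single term, which is the blind spot the keyword leg exists to cover.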

Config

```yaml
tools:
  selection: eager      # current default — all tools always injected
  # selection: hybrid   # semantic + keyword fusion (recommended)
  # selection: semantic # embedding only
  # selection: keyword  # BM25 only
  # rag_top_k: 8
  # rag_embed_model: nomic-embed-text  # or "auxiliary" to reuse existing provider
```
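A sketch of how the `selection` value might dispatch to a selector; the function names below are illustrative stubs, not actual Hermes APIs:

```python
# Hypothetical dispatch on the tools.selection config value.
def keyword_select(msg, tools):   return list(tools)  # stub: BM25 ranking goes here
def semantic_select(msg, tools):  return list(tools)  # stub: embedding ranking
def hybrid_select(msg, tools):    return list(tools)  # stub: RRF fusion of both legs

def make_selector(cfg):
    mode = cfg.get("selection", "eager")
    selectors = {
        "eager":    lambda msg, tools: list(tools),   # current default: inject all
        "keyword":  keyword_select,
        "semantic": semantic_select,
        "hybrid":   hybrid_select,
    }
    if mode not in selectors:
        raise ValueError(f"unknown tools.selection: {mode}")
    return selectors[mode]
```

Keeping `eager` as the default makes the feature strictly opt-in, so existing setups see no behavior change.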

Comparison with Existing Proposals

| | Current | #6839 Lazy Loading | This proposal (Hybrid) |
|---|---|---|---|
| Schema tokens/call | ~14,000 | ~500 + full schema on request | ~1,400 (top-8 schemas) |
| Extra LLM round trip | 0 | +1 per tool use | 0 |
| Latency penalty | 0 | ~1-2s per tool use | ~50ms embed + <1ms BM25 |
| Token savings | 0% | ~70% schema, but extra call cost | ~90% schema, no extra call |
| Handles exact tool name | yes | yes | yes (keyword leg) |
| Handles fuzzy intent | n/a | n/a | yes (semantic leg) |

The two approaches are also composable: hybrid pre-selects likely tools, lazy loading (#6839) handles edge cases where the model needs a tool outside the top-K.

Trade-offs

  • Pro: ~90% schema token reduction with zero latency penalty
  • Pro: No change to agent loop structure
  • Pro: No external DB — BM25 inverted index + numpy vectors, both tiny
  • Pro: Automatic re-indexing on tool set changes
  • Pro: Hybrid beats pure semantic or pure keyword on both exact and fuzzy queries
  • Con: Requires an embedding model (can reuse auxiliary provider already in Hermes)
  • Con: Risk of missing a needed tool if K is too small (mitigated by always including a pinned core set: terminal, read_file, search_files)
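The pinned-core mitigation from the last point can be sketched as a simple merge that always front-loads the core set and fills the remaining slots from the hybrid ranking; the function name and signature are hypothetical:

```python
# Sketch: merge a pinned core set with the hybrid top-K, preserving retrieval order.
CORE_TOOLS = ["terminal", "read_file", "search_files"]

def select_tools(ranked_tools, k=8, core=CORE_TOOLS):
    selected = list(core)                  # core tools are always included
    for name in ranked_tools:
        if len(selected) >= k + len(core):
            break
        if name not in selected:           # a core tool ranking high isn't duplicated
            selected.append(name)
    return selected
```

Even if the retriever whiffs entirely, the model still sees the core set, which bounds the worst case at "behaves like a small fixed toolset" rather than "tool unavailable".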


Labels

P3 (Low — cosmetic, nice to have), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/tools (Tool registry, model_tools, toolsets), type/feature (New feature or request)
