Summary
Refactor web tools from a monolithic single-backend architecture to a capability-based provider system where search and extract backends are independently selectable (crawl follows in a later phase).
This enables combinations like SearXNG for search + Firecrawl for extract — currently impossible because `_get_backend()` returns a single string shared by all three capabilities.
Problem
`tools/web_tools.py` (2153 lines) uses a single `_get_backend()` function that returns one of `"firecrawl"`, `"parallel"`, `"tavily"`, `"exa"`. This same string drives the dispatch in `web_search_tool()`, `web_extract_tool()`, AND `web_crawl_tool()`. Every provider must support all three capabilities or break whichever ones it can't.
This makes it impossible to integrate search-only providers (SearXNG, DuckDuckGo, Brave) or extract-only providers (native HTTP, Bright Data) without either stubbing out the capabilities they lack or breaking the tools that call them.
The community has noticed. Comments on #11562 show users defaulting to the `mcp-searxng` MCP workaround because the native integration can't mix providers. Multiple PRs proposing new web backends (#6065, #3416, #6826, #2668) all hit the same architectural wall.
Current coupling points
| Location | What it does |
| --- | --- |
| `web_tools.py:121` `_get_backend()` | Single selector for ALL capabilities |
| `web_tools.py:1131` `web_search_tool()` | 60-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1287` `web_extract_tool()` | 80-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1589` `web_crawl_tool()` | Dispatch to Firecrawl/Tavily only |
| `web_tools.py:1987` `check_web_api_key()` | Single gate for both `web_search` and `web_extract` registration |
| `hermes_cli/tools_config.py:240` | Provider picker writes single `web.backend` config key |
Research
We studied how three major frameworks handle the search/extract/browse separation:
| Framework | Architecture | What they lack |
| --- | --- | --- |
| browser-use | Monolithic `Tools` class; "search" = navigate browser to DuckDuckGo URL. Flat action registry with decorator-based registration + parameter-injection DI. | No pluggable search/extract backends at all. Every capability is a hardcoded action in one `__init__()` method. |
| Stagehand | Clean `observe`/`act`/`extract` as separate handler classes (`ActHandler`, `ExtractHandler`, `ObserveHandler`) with shared snapshot abstraction. Provider/client pattern for LLM backends. | No provider interfaces for web capabilities — one implementation per primitive. Search is not even a concept. |
| Hermes (current) | `CloudBrowserProvider` ABC with registry dict (`_PROVIDER_REGISTRY`) for browser backends. Clean pattern. | No equivalent for search/extract — inline if/elif chains with a single `_get_backend()`. |
Key finding: None of these frameworks solves the problem. But Hermes already has the right pattern in `tools/browser_providers/base.py` — the `CloudBrowserProvider` ABC with a provider registry dict. We extend this pattern to web capabilities.
Proposed Architecture
```
┌──────────────────────────────────────────────────────┐
│                    Agent Toolset                     │
│  web_search()    web_extract()    browser_navigate() │
└───────┬────────────────┬───────────────────┬─────────┘
        │                │                   │
  ┌─────▼─────┐    ┌─────▼──────┐      ┌─────▼───────┐
  │  Search   │    │  Extract   │      │   Browser   │
  │  Router   │    │   Router   │      │   Router    │
  └─────┬─────┘    └─────┬──────┘      └─────┬───────┘
        │                │                   │  (already exists)
┌───────▼───────┐ ┌──────▼────────┐ ┌────────▼──────┐
│ Providers:    │ │ Providers:    │ │ Providers:    │
│ • firecrawl   │ │ • firecrawl   │ │ • local       │
│ • tavily      │ │ • tavily     │ │ • browserbase │
│ • exa         │ │ • native      │ │ • browser-use │
│ • parallel    │ │ • parallel    │ │ • camofox     │
│ • searxng     │ │ • exa         │ │               │
│ • duckduckgo  │ │               │ │               │
└───────────────┘ └───────────────┘ └───────────────┘
```
Config (final state)
```yaml
# Mix free providers — no API keys at all
web:
  search_backend: "searxng"
  extract_backend: "native"
```

```yaml
# SearXNG search + Firecrawl scraping
web:
  search_backend: "searxng"
  extract_backend: "firecrawl"
```

```yaml
# Single provider for everything (backward compatible, current behavior)
web:
  backend: "tavily"
```
Selection priority per capability:
1. `web.search_backend` / `web.extract_backend` (explicit per-capability)
2. `web.backend` (legacy shared fallback)
3. Auto-detect from env vars (existing behavior)
Implementation Plan
Phase 1: Provider ABCs and Modules (foundation — no behavior change)
Task 1: Create provider ABCs
Create `tools/web_providers/base.py` with `WebSearchProvider` and `WebExtractProvider` ABCs:
```python
# tools/web_providers/base.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class WebSearchProvider(ABC):
    """Interface for web search backends.

    Returns normalized:
    {"success": True, "data": {"web": [{"title", "url", "description", "position"}]}}
    """

    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    def search(self, query: str, limit: int = 5) -> Dict[str, Any]: ...


class WebExtractProvider(ABC):
    """Interface for web content extraction backends.

    Returns normalized: [{"url", "title", "content", "raw_content", "metadata"}]
    """

    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    async def extract(self, urls: List[str], **kwargs) -> List[Dict[str, Any]]: ...
```
This mirrors the existing `CloudBrowserProvider` pattern in `tools/browser_providers/base.py`.
Tasks 2-3: Extract existing vendors into provider modules
Move inline vendor code from `web_tools.py` into dedicated modules. Each vendor's search/extract logic becomes a class implementing the appropriate ABC:
| Module | Classes | Wraps |
| --- | --- | --- |
| `tools/web_providers/firecrawl.py` | `FirecrawlSearchProvider`, `FirecrawlExtractProvider` | `_get_firecrawl_client().search()`, `.scrape()` |
| `tools/web_providers/tavily.py` | `TavilySearchProvider`, `TavilyExtractProvider` | `_tavily_request()`, `_normalize_tavily_*()` |
| `tools/web_providers/exa.py` | `ExaSearchProvider`, `ExaExtractProvider` | `_exa_search()`, `_exa_extract()` |
| `tools/web_providers/parallel.py` | `ParallelSearchProvider`, `ParallelExtractProvider` | `_parallel_search()`, `_parallel_extract()` |
These are moves, not rewrites — the existing normalization functions become methods on the provider classes.
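For illustration, the Tavily search side of such a move might take the following shape. The helper signatures are assumptions — the real `_tavily_request()` and `_normalize_tavily_*()` live in `web_tools.py` today and would be lifted verbatim; they are stubbed here (along with the ABC) only so the sketch is self-contained:

```python
# Hypothetical shape of tools/web_providers/tavily.py after the move
import os
from abc import ABC, abstractmethod
from typing import Any, Dict


class WebSearchProvider(ABC):  # stand-in for tools/web_providers/base.py
    @abstractmethod
    def provider_name(self) -> str: ...
    @abstractmethod
    def is_configured(self) -> bool: ...
    @abstractmethod
    def search(self, query: str, limit: int = 5) -> Dict[str, Any]: ...


def _tavily_request(endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    return {"results": []}  # stub; the real helper does the HTTP call


def _normalize_tavily_search(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {"success": True, "data": {"web": raw.get("results", [])}}  # stub


class TavilySearchProvider(WebSearchProvider):
    def provider_name(self) -> str:
        return "tavily"

    def is_configured(self) -> bool:
        return bool(os.getenv("TAVILY_API_KEY", "").strip())

    def search(self, query: str, limit: int = 5) -> Dict[str, Any]:
        raw = _tavily_request("search", {"query": query, "max_results": limit})
        return _normalize_tavily_search(raw)
```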
Task 4: SearXNG search provider (new — search only)
```python
# tools/web_providers/searxng.py
import os

import httpx

from .base import WebSearchProvider


class SearXNGSearchProvider(WebSearchProvider):
    def provider_name(self) -> str:
        return "searxng"

    def is_configured(self) -> bool:
        return bool(os.getenv("SEARXNG_URL", "").strip())

    def search(self, query: str, limit: int = 5) -> dict:
        base_url = os.getenv("SEARXNG_URL", "").strip().rstrip("/")
        resp = httpx.get(
            f"{base_url}/search",
            params={"q": query, "format": "json"},
            timeout=15,
        )
        resp.raise_for_status()
        data = resp.json()
        results = sorted(
            data.get("results", []),
            key=lambda r: r.get("score", 0),
            reverse=True,
        )[:limit]
        return {
            "success": True,
            "data": {"web": [
                {
                    "title": r.get("title", ""),
                    "url": r.get("url", ""),
                    "description": r.get("content", ""),
                    "position": i + 1,
                }
                for i, r in enumerate(results)
            ]},
        }
```
SearXNG is search-only. It does NOT implement `WebExtractProvider`. Users pair it with a separate extract provider.
Task 5: Native HTTP extract provider (new — extract only, zero API key)
```python
# tools/web_providers/native.py
from .base import WebExtractProvider


class NativeExtractProvider(WebExtractProvider):
    def provider_name(self) -> str:
        return "native"

    def is_configured(self) -> bool:
        return True  # always available — no API key, no external service

    async def extract(self, urls, **kwargs):
        # httpx fetch + html-to-markdown conversion
        ...
```
Users who just want `web_extract` without paying for Firecrawl/Tavily can use this alongside SearXNG for search.
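A fuller sketch of what `extract()` could look like. The tag-stripping parser below is a stdlib placeholder for the real `html-to-markdown` conversion, and `httpx` is imported lazily so the module loads without it:

```python
# Minimal sketch of the native extract provider (conversion step simplified)
from html.parser import HTMLParser
from typing import Any, Dict, List


class _TextExtractor(HTMLParser):
    """Crude tag stripper standing in for proper markdown conversion."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)


class NativeExtractProvider:  # would subclass WebExtractProvider in the real module
    def provider_name(self) -> str:
        return "native"

    def is_configured(self) -> bool:
        return True  # no API key needed

    async def extract(self, urls: List[str], **kwargs) -> List[Dict[str, Any]]:
        import httpx  # lazy import: only needed when the provider is used

        results = []
        async with httpx.AsyncClient(follow_redirects=True, timeout=15) as client:
            for url in urls:
                resp = await client.get(url)
                resp.raise_for_status()
                results.append({
                    "url": url,
                    "title": "",
                    "content": html_to_text(resp.text),
                    "raw_content": resp.text,
                    "metadata": {"status_code": resp.status_code},
                })
        return results
```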
Phase 2: Registries and Per-Capability Config
Task 6: Provider registries and per-capability backend selection
Replace `_get_backend()` with `_get_search_provider()` and `_get_extract_provider()`:
```python
# tools/web_tools.py
_SEARCH_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlSearchProvider,
    "tavily": TavilySearchProvider,
    "exa": ExaSearchProvider,
    "parallel": ParallelSearchProvider,
    "searxng": SearXNGSearchProvider,
}

_EXTRACT_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlExtractProvider,
    "tavily": TavilyExtractProvider,
    "exa": ExaExtractProvider,
    "parallel": ParallelExtractProvider,
    "native": NativeExtractProvider,
}


def _get_search_provider() -> WebSearchProvider:
    cfg = _load_web_config()
    name = (cfg.get("search_backend") or cfg.get("backend") or "").lower().strip()
    if name in _SEARCH_PROVIDERS:
        p = _SEARCH_PROVIDERS[name]()
        if p.is_configured():
            return p
    # Auto-detect fallback (existing behavior)
    for name, cls in _SEARCH_PROVIDERS.items():
        p = cls()
        if p.is_configured():
            return p
    return FirecrawlSearchProvider()  # last resort
```
Task 7: Rewire tool dispatch
Replace the inline if/elif chains with provider calls:
```python
# Before (60+ lines of dispatch):
def web_search_tool(query, limit=5):
    backend = _get_backend()
    if backend == "parallel":
        response_data = _parallel_search(query, limit)
    elif backend == "exa":
        response_data = _exa_search(query, limit)
    elif backend == "tavily":
        ...
    else:
        ...


# After:
def web_search_tool(query, limit=5):
    provider = _get_search_provider()
    response_data = provider.search(query, limit)
```
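`web_extract_tool()` gets the same treatment. A runnable sketch, with a stub standing in for the provider that the Task 6-style resolver would return:

```python
import asyncio


class _StubExtractProvider:  # stands in for the resolved extract provider
    async def extract(self, urls, **kwargs):
        return [
            {"url": u, "title": "", "content": "", "raw_content": "", "metadata": {}}
            for u in urls
        ]


def _get_extract_provider():
    return _StubExtractProvider()


async def web_extract_tool(urls, **kwargs):
    # The whole 80-line if/elif chain collapses to one provider call
    provider = _get_extract_provider()
    return await provider.extract(urls, **kwargs)


results = asyncio.run(web_extract_tool(["https://example.com"]))
print(results[0]["url"])  # https://example.com
```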
Task 8: Split `check_fn` per capability
Today both tools share `check_fn=check_web_api_key`. Split it so a SearXNG-only user sees `web_search` available without needing a Firecrawl key:
```python
# web_search registration
registry.register(name="web_search", ..., check_fn=check_web_search_available)

# web_extract registration
registry.register(name="web_extract", ..., check_fn=check_web_extract_available)
```
Tasks 9-10: Config + `hermes tools` UI
- Add `web.search_backend` and `web.extract_backend` to `DEFAULT_CONFIG` (no config version bump needed — `_deep_merge` handles new keys)
- Add `SEARXNG_URL` to `OPTIONAL_ENV_VARS`
- Add SearXNG and Native HTTP as provider options in the `hermes tools` picker
- Update the post-setup handler to write per-capability config keys
Phase 3: Cleanup
Task 11: Unify LLM summarization
Extract the duplicated LLM post-processing (`process_content_with_llm()` in `web_tools.py` and `_extract_relevant_content()` in `browser_tool.py`) into a shared `tools/web_summarize.py`.
Task 12: Delete dead inline vendor code
Once vendors are routed through providers, remove ~500-700 lines of inline `_parallel_search()`, `_exa_search()`, `_tavily_request()`, etc. from `web_tools.py`.
Task 13: Documentation
Update `configuration.md`, `environment-variables.md`, and the skills catalog with SearXNG setup and the new per-capability config keys.
What This Unblocks
| PR | Author | What it adds | Currently blocked because |
| --- | --- | --- | --- |
| #11562 | @kshitijk4poor | SearXNG search | Can't use a search-only provider without breaking extract |
| #6065 | @meirk-brd | Bright Data extract | Would need to implement search too (it's scrape-only) |
| #3416 | @Local-First | webclaw local extract | Same — scrape-only provider |
| #6826 | @? | fastCRW search+extract | Would be a clean drop-in with provider ABCs |
| #2668 | @edmundman | Brave Search | Search-only provider |
| Future | — | DuckDuckGo native provider | Currently a skill; could become a zero-key `WebSearchProvider` |
| Future | — | MCP-backed search | Wrap any MCP search server as a `WebSearchProvider` |
Risk Assessment
| Risk | Mitigation |
| --- | --- |
| Breaking existing `web.backend` config | `web.backend` stays as fallback — per-capability keys only override when explicitly set |
| Provider module import overhead | Lazy imports in registry — modules only loaded when selected |
| `html-to-markdown` new dependency for native extract | Optional — guard with `try/except ImportError`, degrade gracefully |
| `web_crawl_tool` still needs inline logic | Keep crawl dispatch inline for now; extract a `WebCrawlProvider` ABC in a follow-up |
| Test suite changes | Provider ABCs make testing cleaner — mock the provider, not HTTP calls |
Related
UX Design: hermes tools Provider Picker
Design Principle: Progressive Disclosure
The 90% path (single provider for both search + extract) stays identical to today — one pick, done. The split is revealed only to users who explicitly opt into the advanced flow.
Flow
```
hermes tools → Web Search & Extract → Select provider

  [1] Nous Subscription        (search + extract)
  [2] Firecrawl Cloud ★        (search + extract)
  [3] Exa                      (search + extract)
  [4] Parallel                 (search + extract)
  [5] Tavily                   (search + extract)
  [6] Firecrawl Self-Hosted    (search + extract)
  [7] ⚙️ Advanced: configure search & extract separately

→ User picks [2]: sets web.backend = "firecrawl". Done.
  (Identical to current behavior. No split exposed.)

→ User picks [7]:
    "Search backend:"  [SearXNG / Firecrawl / Tavily / Exa / Parallel]
    "Extract backend:" [Firecrawl / Tavily / Exa / Parallel / Native HTTP]
  → Sets web.search_backend + web.extract_backend. Done.
```
Principles
- Default path unchanged — Single-provider users never see the split. Zero friction increase.
- Advanced is last — The split option is at the bottom, clearly badged "advanced". Discoverable but not confusing.
- No redundant entry — Picking "Firecrawl" from the main list sets `web.backend` once. Users never type/pick the same provider twice.
- Undo is natural — Picking any main-list provider later clears `search_backend`/`extract_backend` and resets to shared mode.
- Config.yaml is source of truth — Power users can always edit `web.search_backend` / `web.extract_backend` directly without the picker.
Implementation (tools_config.py)
```python
# Last entry in TOOL_CATEGORIES["web"]["providers"]:
{
    "name": "Advanced: configure search & extract separately",
    "badge": "advanced",
    "tag": "Pick different providers for search vs extract (e.g. SearXNG + Firecrawl)",
    "advanced_split": True,
    "env_vars": [],
},
```
The post-setup handler detects `advanced_split: True` and runs a two-step sub-picker:
1. Show search-capable providers → writes `config["web"]["search_backend"]`
2. Show extract-capable providers → writes `config["web"]["extract_backend"]`
When any NON-advanced provider is picked, the handler clears per-capability overrides:
```python
if not provider.get("advanced_split"):
    web_cfg = config.setdefault("web", {})
    web_cfg["backend"] = provider["web_backend"]
    web_cfg.pop("search_backend", None)
    web_cfg.pop("extract_backend", None)
```
This ensures switching back from split mode to unified mode is seamless.