refactor: capability-based provider architecture for web tools (search/extract/browse split) #19198

@kshitijk4poor

Description

Summary

Refactor web tools from a monolithic single-backend architecture to a capability-based provider system where search, extract, and crawl backends are independently selectable.

This enables combinations like SearXNG for search + Firecrawl for extract — currently impossible because _get_backend() returns a single string shared by all three capabilities.

Problem

tools/web_tools.py (2153 lines) uses a single _get_backend() function that returns one of "firecrawl", "parallel", "tavily", "exa". This same string drives the dispatch in web_search_tool(), web_extract_tool(), AND web_crawl_tool(). Every provider must support all three capabilities or break whichever it can't do.

This makes it impossible to integrate search-only providers (SearXNG, DuckDuckGo, Brave) or extract-only providers (native HTTP, Bright Data) without either stubbing out the capabilities they lack or breaking whichever tool dispatches to them.

The community has noticed. Comments on #11562 show users defaulting to the mcp-searxng MCP workaround because the native integration can't mix providers. Multiple PRs proposing new web backends (#6065, #3416, #6826, #2668) all hit the same architectural wall.

Current coupling points

| Location | Symbol | What it does |
|---|---|---|
| `web_tools.py:121` | `_get_backend()` | Single selector for ALL capabilities |
| `web_tools.py:1131` | `web_search_tool()` | 60-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1287` | `web_extract_tool()` | 80-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1589` | `web_crawl_tool()` | Dispatch to Firecrawl/Tavily only |
| `web_tools.py:1987` | `check_web_api_key()` | Single gate for both web_search and web_extract registration |
| `hermes_cli/tools_config.py:240` | Provider picker | Writes single `web.backend` config key |

Research

We studied how three major frameworks handle the search/extract/browse separation:

| Framework | Architecture | What they lack |
|---|---|---|
| browser-use | Monolithic `Tools` class; "search" = navigate browser to a DuckDuckGo URL. Flat action registry with decorator-based registration + parameter injection DI. | No pluggable search/extract backends at all. Every capability is a hardcoded action in one `__init__()` method. |
| Stagehand | Clean observe/act/extract as separate handler classes (`ActHandler`, `ExtractHandler`, `ObserveHandler`) with shared snapshot abstraction. Provider/client pattern for LLM backends. | No provider interfaces for web capabilities — one implementation per primitive. Search is not even a concept. |
| Hermes (current) | `CloudBrowserProvider` ABC with registry dict (`_PROVIDER_REGISTRY`) for browser backends. Clean pattern. | No equivalent for search/extract — inline if/elif chains with a single `_get_backend()`. |

Key finding: None of these frameworks solve the problem. But Hermes already has the right pattern in tools/browser_providers/base.py — the CloudBrowserProvider ABC with a provider registry dict. We extend this pattern to web capabilities.

Proposed Architecture

```
┌──────────────────────────────────────────────────────┐
│                    Agent Toolset                      │
│   web_search()    web_extract()    browser_navigate() │
└───────┬────────────────┬───────────────────┬─────────┘
        │                │                   │
   ┌────▼─────┐    ┌─────▼──────┐    ┌──────▼───────┐
   │  Search   │    │  Extract   │    │   Browser    │
   │  Router   │    │  Router    │    │   Router     │
   └────┬──────┘    └─────┬──────┘    └──────┬───────┘
        │                 │                   │  (already exists)
   ┌────▼──────────┐ ┌───▼──────────┐ ┌─────▼─────────┐
   │ Providers:    │ │ Providers:   │ │ Providers:     │
   │ • firecrawl   │ │ • firecrawl  │ │ • local        │
   │ • tavily      │ │ • tavily     │ │ • browserbase  │
   │ • exa         │ │ • native     │ │ • browser-use  │
   │ • parallel    │ │ • parallel   │ │ • camofox      │
   │ • searxng     │ │ • exa        │ │                │
   │ • duckduckgo  │ │              │ │                │
   └───────────────┘ └──────────────┘ └────────────────┘
```

Config (final state)

```yaml
# Mix free providers — no API keys at all
web:
  search_backend: "searxng"
  extract_backend: "native"

# SearXNG search + Firecrawl scraping
web:
  search_backend: "searxng"
  extract_backend: "firecrawl"

# Single provider for everything (backward compatible, current behavior)
web:
  backend: "tavily"
```

Selection priority per capability:

  1. web.search_backend / web.extract_backend (explicit per-capability)
  2. web.backend (legacy shared fallback)
  3. Auto-detect from env vars (existing behavior)
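This resolution order can be sketched as a tiny pure function (the name `resolve_backend` and its signature are illustrative, not the actual implementation):

```python
from typing import Optional

def resolve_backend(web_cfg: dict, capability: str,
                    detected: Optional[str]) -> Optional[str]:
    """Illustrative resolution order for one capability."""
    return (
        web_cfg.get(f"{capability}_backend")  # 1. explicit per-capability key
        or web_cfg.get("backend")             # 2. legacy shared fallback
        or detected                           # 3. auto-detect from env vars
    )
```

With `{"search_backend": "searxng", "backend": "tavily"}`, search resolves to `"searxng"` while extract falls through to `"tavily"` — exactly the mixed-provider behavior the refactor targets.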

Implementation Plan

Phase 1: Provider ABCs and Modules (foundation — no behavior change)

Task 1: Create provider ABCs

Create tools/web_providers/base.py with WebSearchProvider and WebExtractProvider ABCs:

```python
# tools/web_providers/base.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class WebSearchProvider(ABC):
    """Interface for web search backends.

    Returns normalized: {"success": True, "data": {"web": [{"title", "url", "description", "position"}]}}
    """
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    def search(self, query: str, limit: int = 5) -> Dict[str, Any]: ...


class WebExtractProvider(ABC):
    """Interface for web content extraction backends.

    Returns normalized: [{"url", "title", "content", "raw_content", "metadata"}]
    """
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    async def extract(self, urls: List[str], **kwargs) -> List[Dict[str, Any]]: ...
```

This mirrors the existing CloudBrowserProvider pattern in tools/browser_providers/base.py.

Task 2-3: Extract existing vendors into provider modules

Move inline vendor code from web_tools.py into dedicated modules. Each vendor's search/extract logic becomes a class implementing the appropriate ABC:

| Module | Classes | Wraps |
|---|---|---|
| `tools/web_providers/firecrawl.py` | `FirecrawlSearchProvider`, `FirecrawlExtractProvider` | `_get_firecrawl_client().search()`, `.scrape()` |
| `tools/web_providers/tavily.py` | `TavilySearchProvider`, `TavilyExtractProvider` | `_tavily_request()`, `_normalize_tavily_*()` |
| `tools/web_providers/exa.py` | `ExaSearchProvider`, `ExaExtractProvider` | `_exa_search()`, `_exa_extract()` |
| `tools/web_providers/parallel.py` | `ParallelSearchProvider`, `ParallelExtractProvider` | `_parallel_search()`, `_parallel_extract()` |

These are moves, not rewrites — the existing normalization functions become methods on the provider classes.
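As a sketch of what one such move might look like, here is a Tavily search wrapper; `_tavily_request` and `_normalize_tavily_search` stand in for the existing helpers in `web_tools.py`, with stub bodies so the example is self-contained:

```python
import os
from typing import Any, Dict

# Stand-in stubs for the existing web_tools.py helpers:
def _tavily_request(endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    return {"results": []}  # the real helper performs the HTTP call

def _normalize_tavily_search(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {"success": True, "data": {"web": raw.get("results", [])}}

class TavilySearchProvider:  # would subclass WebSearchProvider
    def provider_name(self) -> str:
        return "tavily"

    def is_configured(self) -> bool:
        return bool(os.getenv("TAVILY_API_KEY", "").strip())

    def search(self, query: str, limit: int = 5) -> Dict[str, Any]:
        # Same code path as today, just relocated into a method.
        raw = _tavily_request("search", {"query": query, "max_results": limit})
        return _normalize_tavily_search(raw)
```

The existing request and normalization functions become the method bodies; only the dispatch location changes.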

Task 4: SearXNG search provider (new — search only)

```python
# tools/web_providers/searxng.py
import os

import httpx

from tools.web_providers.base import WebSearchProvider

class SearXNGSearchProvider(WebSearchProvider):
    def provider_name(self): return "searxng"

    def is_configured(self):
        return bool(os.getenv("SEARXNG_URL", "").strip())

    def search(self, query, limit=5):
        base_url = os.getenv("SEARXNG_URL", "").strip().rstrip("/")
        resp = httpx.get(f"{base_url}/search",
                         params={"q": query, "format": "json"}, timeout=15)
        resp.raise_for_status()
        data = resp.json()
        # Rank by SearXNG's relevance score, then truncate to the limit.
        results = sorted(data.get("results", []),
                         key=lambda r: r.get("score", 0), reverse=True)[:limit]
        return {
            "success": True,
            "data": {"web": [
                {"title": r.get("title", ""), "url": r.get("url", ""),
                 "description": r.get("content", ""), "position": i + 1}
                for i, r in enumerate(results)
            ]},
        }
```

SearXNG is search-only. It does NOT implement WebExtractProvider. Users pair it with a separate extract provider.

Task 5: Native HTTP extract provider (new — extract only, zero API key)

```python
# tools/web_providers/native.py
import httpx

from tools.web_providers.base import WebExtractProvider

class NativeExtractProvider(WebExtractProvider):
    def provider_name(self): return "native"
    def is_configured(self): return True  # always available: no key, no service

    async def extract(self, urls, **kwargs):
        # httpx fetch + html-to-markdown conversion.
        # No API key, no external service.
        results = []
        async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
            for url in urls:
                resp = await client.get(url)
                resp.raise_for_status()
                # _html_to_markdown: helper to be added; wraps the optional
                # html-to-markdown dependency and falls back to raw text
                # (see Risk Assessment).
                content = _html_to_markdown(resp.text)
                results.append({"url": url, "title": "", "content": content,
                                "raw_content": resp.text, "metadata": {}})
        return results
```

Users who just want web_extract without paying for Firecrawl/Tavily can use this alongside SearXNG for search.

Phase 2: Registries and Per-Capability Config

Task 6: Provider registries and per-capability backend selection

Replace _get_backend() with _get_search_provider() and _get_extract_provider():

```python
# tools/web_tools.py
from typing import Dict

# Provider classes come from tools/web_providers/*; in practice the registry
# uses lazy imports so unselected vendor modules are never loaded (see Risk
# Assessment).

_SEARCH_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlSearchProvider,
    "tavily":    TavilySearchProvider,
    "exa":       ExaSearchProvider,
    "parallel":  ParallelSearchProvider,
    "searxng":   SearXNGSearchProvider,
}

_EXTRACT_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlExtractProvider,
    "tavily":    TavilyExtractProvider,
    "exa":       ExaExtractProvider,
    "parallel":  ParallelExtractProvider,
    "native":    NativeExtractProvider,
}

def _get_search_provider() -> WebSearchProvider:
    cfg = _load_web_config()
    name = (cfg.get("search_backend") or cfg.get("backend") or "").lower().strip()
    if name in _SEARCH_PROVIDERS:
        p = _SEARCH_PROVIDERS[name]()
        if p.is_configured():
            return p
    # Auto-detect fallback (existing behavior): first configured provider wins.
    for fallback_name, cls in _SEARCH_PROVIDERS.items():
        p = cls()
        if p.is_configured():
            return p
    return FirecrawlSearchProvider()  # last resort
```

Task 7: Rewire tool dispatch

Replace the inline if/elif chains with provider calls:

```python
# Before (60+ lines of dispatch):
def web_search_tool(query, limit=5):
    backend = _get_backend()
    if backend == "parallel": response_data = _parallel_search(query, limit)
    elif backend == "exa":    response_data = _exa_search(query, limit)
    elif backend == "tavily": ...
    else: ...

# After:
def web_search_tool(query, limit=5):
    provider = _get_search_provider()
    response_data = provider.search(query, limit)
```

Task 8: Split check_fn per capability

Today both tools share check_fn=check_web_api_key. Split so a SearXNG-only user sees web_search available without needing a Firecrawl key:

```python
# web_search registration
registry.register(name="web_search", ..., check_fn=check_web_search_available)

# web_extract registration
registry.register(name="web_extract", ..., check_fn=check_web_extract_available)
```
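The per-capability check functions could simply ask the registries whether anything is configured. A sketch with dummy providers standing in for the real registry entries:

```python
class _SearXNG:
    def is_configured(self): return True   # pretend SEARXNG_URL is set

class _Firecrawl:
    def is_configured(self): return False  # pretend no FIRECRAWL_API_KEY

_SEARCH_PROVIDERS = {"searxng": _SearXNG, "firecrawl": _Firecrawl}
_EXTRACT_PROVIDERS = {"firecrawl": _Firecrawl}

def check_web_search_available() -> bool:
    return any(cls().is_configured() for cls in _SEARCH_PROVIDERS.values())

def check_web_extract_available() -> bool:
    return any(cls().is_configured() for cls in _EXTRACT_PROVIDERS.values())
```

With this split, the SearXNG-only user gets `web_search` registered even though no paid extract key is present.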

Task 9-10: Config + hermes tools UI

  • Add web.search_backend and web.extract_backend to DEFAULT_CONFIG (no config version bump needed — _deep_merge handles new keys)
  • Add SEARXNG_URL to OPTIONAL_ENV_VARS
  • Add SearXNG and Native HTTP as provider options in the hermes tools picker
  • Update post-setup handler to write per-capability config keys
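The DEFAULT_CONFIG addition might look like this (the surrounding structure is assumed; an empty string means "unset, fall through to the legacy key and then auto-detect"):

```python
# Hypothetical shape of the new keys inside DEFAULT_CONFIG:
DEFAULT_CONFIG = {
    "web": {
        "backend": "",          # legacy shared selector, kept for fallback
        "search_backend": "",   # new: per-capability override for web_search
        "extract_backend": "",  # new: per-capability override for web_extract
    },
}
```

Because `_deep_merge` fills in missing keys, existing user configs pick these up without a version bump, as noted above.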

Phase 3: Cleanup

Task 11: Unify LLM summarization

Extract the duplicated LLM post-processing (process_content_with_llm() in web_tools.py and _extract_relevant_content() in browser_tool.py) into a shared tools/web_summarize.py.

Task 12: Delete dead inline vendor code

Once vendors are routed through providers, remove ~500-700 lines of inline _parallel_search(), _exa_search(), _tavily_request(), etc. from web_tools.py.

Task 13: Documentation

Update configuration.md, environment-variables.md, and the skills catalog with SearXNG setup and the new per-capability config keys.


What This Unblocks

| PR | Author | What it adds | Currently blocked because |
|---|---|---|---|
| #11562 | @kshitijk4poor | SearXNG search | Can't use search-only provider without breaking extract |
| #6065 | @meirk-brd | Bright Data extract | Would need to implement search too (it's scrape-only) |
| #3416 | @Local-First | webclaw local extract | Same — scrape-only provider |
| #6826 | @? | fastCRW search+extract | Would be a clean drop-in with provider ABCs |
| #2668 | @edmundman | Brave Search | Search-only provider |
| Future | | DuckDuckGo native provider | Currently a skill; could become a zero-key `WebSearchProvider` |
| Future | | MCP-backed search | Wrap any MCP search server as a `WebSearchProvider` |

Risk Assessment

| Risk | Mitigation |
|---|---|
| Breaking existing `web.backend` config | `web.backend` stays as fallback — per-capability keys only override when explicitly set |
| Provider module import overhead | Lazy imports in registry — modules only loaded when selected |
| html-to-markdown new dependency for native extract | Optional — guard with try/except ImportError, degrade gracefully |
| `web_crawl_tool` still needs inline logic | Keep crawl dispatch inline for now; extract `WebCrawlProvider` ABC in a follow-up |
| Test suite changes | Provider ABCs make testing cleaner — mock the provider, not HTTP calls |

Related

UX Design: hermes tools Provider Picker

Design Principle: Progressive Disclosure

The 90% path (single provider for both search + extract) stays identical to today — one pick, done. The split is revealed only to users who explicitly opt into the advanced flow.

Flow

```
hermes tools → Web Search & Extract → Select provider

  [1] Nous Subscription          (search + extract)
  [2] Firecrawl Cloud ★          (search + extract)
  [3] Exa                        (search + extract)
  [4] Parallel                   (search + extract)
  [5] Tavily                     (search + extract)
  [6] Firecrawl Self-Hosted      (search + extract)
  [7] ⚙️  Advanced: configure search & extract separately

  → User picks [2]: sets web.backend = "firecrawl". Done.
    (Identical to current behavior. No split exposed.)

  → User picks [7]:
    "Search backend:"  [SearXNG / Firecrawl / Tavily / Exa / Parallel]
    "Extract backend:" [Firecrawl / Tavily / Exa / Parallel / Native HTTP]
    → Sets web.search_backend + web.extract_backend. Done.
```

Principles

  1. Default path unchanged — Single-provider users never see the split. Zero friction increase.
  2. Advanced is last — The split option is at the bottom, clearly badged "advanced". Discoverable but not confusing.
  3. No redundant entry — Picking "Firecrawl" from the main list sets web.backend once. Users never type/pick the same provider twice.
  4. Undo is natural — Picking any main-list provider later clears search_backend/extract_backend and resets to shared mode.
  5. Config.yaml is source of truth — Power users can always edit web.search_backend / web.extract_backend directly without the picker.

Implementation (tools_config.py)

```python
# Last entry in TOOL_CATEGORIES["web"]["providers"]:
{
    "name": "Advanced: configure search & extract separately",
    "badge": "advanced",
    "tag": "Pick different providers for search vs extract (e.g. SearXNG + Firecrawl)",
    "advanced_split": True,
    "env_vars": [],
},
```

The post-setup handler detects advanced_split: True and runs a two-step sub-picker:

  1. Show search-capable providers → writes config["web"]["search_backend"]
  2. Show extract-capable providers → writes config["web"]["extract_backend"]
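That two-step sub-picker might be sketched as follows; `prompt_choice` is a hypothetical stand-in for whatever prompt helper `tools_config.py` actually uses, and the capability lists mirror the provider tables above:

```python
def prompt_choice(label: str, options: list) -> str:
    # Stand-in for the picker's real prompt helper; always picks the first
    # option here so the sketch is self-contained.
    return options[0]

SEARCH_CAPABLE = ["searxng", "firecrawl", "tavily", "exa", "parallel"]
EXTRACT_CAPABLE = ["firecrawl", "tavily", "exa", "parallel", "native"]

def handle_advanced_split(config: dict) -> None:
    web_cfg = config.setdefault("web", {})
    web_cfg["search_backend"] = prompt_choice("Search backend:", SEARCH_CAPABLE)
    web_cfg["extract_backend"] = prompt_choice("Extract backend:", EXTRACT_CAPABLE)
```

The legacy `web.backend` key can be left untouched here, since the per-capability keys take priority in the resolution order.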

When any NON-advanced provider is picked, the handler clears per-capability overrides:

```python
if not provider.get("advanced_split"):
    web_cfg = config.setdefault("web", {})
    web_cfg["backend"] = provider["web_backend"]
    web_cfg.pop("search_backend", None)
    web_cfg.pop("extract_backend", None)
```

This ensures switching back from split mode to unified mode is seamless.

Metadata

Labels: `P3` (low — cosmetic, nice to have), `comp/tools` (tool registry, model_tools, toolsets), `tool/web` (web search and extraction), `type/feature` (new feature or request), `type/refactor` (code restructuring, no behavior change)