refactor: capability-based provider architecture for web tools (search/extract/browse split) #19198

@kshitijk4poor

Description

Summary

Refactor web tools from a monolithic single-backend architecture to a capability-based provider system where search, extract, and crawl backends are independently selectable.

This enables combinations like SearXNG for search + Firecrawl for extract — currently impossible because _get_backend() returns a single string shared by all three capabilities.

Problem

tools/web_tools.py (2153 lines) uses a single _get_backend() function that returns one of "firecrawl", "parallel", "tavily", "exa". This same string drives the dispatch in web_search_tool(), web_extract_tool(), AND web_crawl_tool(). Every provider must support all three capabilities or break whichever it can't do.

This makes it impossible to integrate search-only providers (SearXNG, DuckDuckGo, Brave) or extract-only providers (native HTTP, Bright Data) without either stubbing out the capabilities they lack or breaking whichever tool dispatches to them.

The community has noticed. Comments on #11562 show users defaulting to the mcp-searxng MCP workaround because the native integration can't mix providers. Multiple PRs proposing new web backends (#6065, #3416, #6826, #2668) all hit the same architectural wall.

Current coupling points

| Location | Symbol | What it does |
|---|---|---|
| `web_tools.py:121` | `_get_backend()` | Single selector for ALL capabilities |
| `web_tools.py:1131` | `web_search_tool()` | 60-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1287` | `web_extract_tool()` | 80-line if/elif dispatch to 4 vendor implementations |
| `web_tools.py:1589` | `web_crawl_tool()` | Dispatch to Firecrawl/Tavily only |
| `web_tools.py:1987` | `check_web_api_key()` | Single gate for both web_search and web_extract registration |
| `hermes_cli/tools_config.py:240` | Provider picker | Writes single `web.backend` config key |

Research

We studied how three major frameworks handle the search/extract/browse separation:

| Framework | Architecture | What they lack |
|---|---|---|
| browser-use | Monolithic `Tools` class; "search" = navigate browser to a DuckDuckGo URL. Flat action registry with decorator-based registration + parameter injection DI. | No pluggable search/extract backends at all. Every capability is a hardcoded action in one `__init__()` method. |
| Stagehand | Clean observe/act/extract as separate handler classes (`ActHandler`, `ExtractHandler`, `ObserveHandler`) with shared snapshot abstraction. Provider/client pattern for LLM backends. | No provider interfaces for web capabilities — one implementation per primitive. Search is not even a concept. |
| Hermes (current) | `CloudBrowserProvider` ABC with registry dict (`_PROVIDER_REGISTRY`) for browser backends. Clean pattern. | No equivalent for search/extract — inline if/elif chains with a single `_get_backend()`. |

Key finding: None of these frameworks solve the problem. But Hermes already has the right pattern in tools/browser_providers/base.py — the CloudBrowserProvider ABC with a provider registry dict. We extend this pattern to web capabilities.

Proposed Architecture

```
┌──────────────────────────────────────────────────────┐
│                    Agent Toolset                      │
│   web_search()    web_extract()    browser_navigate() │
└───────┬────────────────┬───────────────────┬─────────┘
        │                │                   │
   ┌────▼─────┐    ┌─────▼──────┐    ┌──────▼───────┐
   │  Search   │    │  Extract   │    │   Browser    │
   │  Router   │    │  Router    │    │   Router     │
   └────┬──────┘    └─────┬──────┘    └──────┬───────┘
        │                 │                   │  (already exists)
   ┌────▼──────────┐ ┌───▼──────────┐ ┌─────▼─────────┐
   │ Providers:    │ │ Providers:   │ │ Providers:     │
   │ • firecrawl   │ │ • firecrawl  │ │ • local        │
   │ • tavily      │ │ • tavily     │ │ • browserbase  │
   │ • exa         │ │ • native     │ │ • browser-use  │
   │ • parallel    │ │ • parallel   │ │ • camofox      │
   │ • searxng     │ │ • exa        │ │                │
   │ • duckduckgo  │ │              │ │                │
   └───────────────┘ └──────────────┘ └────────────────┘
```

Config (final state)

```yaml
# Mix free providers — no API keys at all
web:
  search_backend: "searxng"
  extract_backend: "native"

# SearXNG search + Firecrawl scraping
web:
  search_backend: "searxng"
  extract_backend: "firecrawl"

# Single provider for everything (backward compatible, current behavior)
web:
  backend: "tavily"
```

Selection priority per capability:

  1. web.search_backend / web.extract_backend (explicit per-capability)
  2. web.backend (legacy shared fallback)
  3. Auto-detect from env vars (existing behavior)
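This resolution order can be sketched as a tiny pure function (the name `resolve_backend` and its signature are illustrative, not the actual implementation):

```python
from typing import Optional

def resolve_backend(web_cfg: dict, capability: str,
                    detected: Optional[str]) -> Optional[str]:
    """Illustrative resolution order for one capability."""
    return (
        web_cfg.get(f"{capability}_backend")  # 1. explicit per-capability key
        or web_cfg.get("backend")             # 2. legacy shared fallback
        or detected                           # 3. auto-detect from env vars
    )
```

With `{"search_backend": "searxng", "backend": "tavily"}`, search resolves to `"searxng"` while extract falls through to `"tavily"` — exactly the mixed-provider behavior the refactor targets.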

Implementation Plan

Phase 1: Provider ABCs and Modules (foundation — no behavior change)

Task 1: Create provider ABCs

Create tools/web_providers/base.py with WebSearchProvider and WebExtractProvider ABCs:

```python
# tools/web_providers/base.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class WebSearchProvider(ABC):
    """Interface for web search backends.

    Returns normalized: {"success": True, "data": {"web": [{"title", "url", "description", "position"}]}}
    """
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    def search(self, query: str, limit: int = 5) -> Dict[str, Any]: ...


class WebExtractProvider(ABC):
    """Interface for web content extraction backends.

    Returns normalized: [{"url", "title", "content", "raw_content", "metadata"}]
    """
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def is_configured(self) -> bool: ...

    @abstractmethod
    async def extract(self, urls: List[str], **kwargs) -> List[Dict[str, Any]]: ...
```

This mirrors the existing CloudBrowserProvider pattern in tools/browser_providers/base.py.

Task 2-3: Extract existing vendors into provider modules

Move inline vendor code from web_tools.py into dedicated modules. Each vendor's search/extract logic becomes a class implementing the appropriate ABC:

| Module | Classes | Wraps |
|---|---|---|
| `tools/web_providers/firecrawl.py` | `FirecrawlSearchProvider`, `FirecrawlExtractProvider` | `_get_firecrawl_client().search()`, `.scrape()` |
| `tools/web_providers/tavily.py` | `TavilySearchProvider`, `TavilyExtractProvider` | `_tavily_request()`, `_normalize_tavily_*()` |
| `tools/web_providers/exa.py` | `ExaSearchProvider`, `ExaExtractProvider` | `_exa_search()`, `_exa_extract()` |
| `tools/web_providers/parallel.py` | `ParallelSearchProvider`, `ParallelExtractProvider` | `_parallel_search()`, `_parallel_extract()` |

These are moves, not rewrites — the existing normalization functions become methods on the provider classes.
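As a sketch of what one such move might look like, here is a Tavily search wrapper; `_tavily_request` and `_normalize_tavily_search` stand in for the existing helpers in `web_tools.py`, with stub bodies so the example is self-contained:

```python
import os
from typing import Any, Dict

# Stand-in stubs for the existing web_tools.py helpers:
def _tavily_request(endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    return {"results": []}  # the real helper performs the HTTP call

def _normalize_tavily_search(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {"success": True, "data": {"web": raw.get("results", [])}}

class TavilySearchProvider:  # would subclass WebSearchProvider
    def provider_name(self) -> str:
        return "tavily"

    def is_configured(self) -> bool:
        return bool(os.getenv("TAVILY_API_KEY", "").strip())

    def search(self, query: str, limit: int = 5) -> Dict[str, Any]:
        # Same code path as today, just relocated into a method.
        raw = _tavily_request("search", {"query": query, "max_results": limit})
        return _normalize_tavily_search(raw)
```

The existing request and normalization functions become the method bodies; only the dispatch location changes.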

Task 4: SearXNG search provider (new — search only)

```python
# tools/web_providers/searxng.py
import os

import httpx

from tools.web_providers.base import WebSearchProvider

class SearXNGSearchProvider(WebSearchProvider):
    def provider_name(self): return "searxng"

    def is_configured(self):
        return bool(os.getenv("SEARXNG_URL", "").strip())

    def search(self, query, limit=5):
        base_url = os.getenv("SEARXNG_URL", "").strip().rstrip("/")
        resp = httpx.get(f"{base_url}/search",
                         params={"q": query, "format": "json"}, timeout=15)
        resp.raise_for_status()
        data = resp.json()
        # Rank by SearXNG's relevance score, then truncate to the limit.
        results = sorted(data.get("results", []),
                         key=lambda r: r.get("score", 0), reverse=True)[:limit]
        return {
            "success": True,
            "data": {"web": [
                {"title": r.get("title", ""), "url": r.get("url", ""),
                 "description": r.get("content", ""), "position": i + 1}
                for i, r in enumerate(results)
            ]},
        }
```

SearXNG is search-only. It does NOT implement WebExtractProvider. Users pair it with a separate extract provider.

Task 5: Native HTTP extract provider (new — extract only, zero API key)

```python
# tools/web_providers/native.py
import httpx

from tools.web_providers.base import WebExtractProvider

class NativeExtractProvider(WebExtractProvider):
    def provider_name(self): return "native"
    def is_configured(self): return True  # always available: no key, no service

    async def extract(self, urls, **kwargs):
        # httpx fetch + html-to-markdown conversion.
        # No API key, no external service.
        results = []
        async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
            for url in urls:
                resp = await client.get(url)
                resp.raise_for_status()
                # _html_to_markdown: helper to be added; wraps the optional
                # html-to-markdown dependency and falls back to raw text
                # (see Risk Assessment).
                content = _html_to_markdown(resp.text)
                results.append({"url": url, "title": "", "content": content,
                                "raw_content": resp.text, "metadata": {}})
        return results
```

Users who just want web_extract without paying for Firecrawl/Tavily can use this alongside SearXNG for search.

Phase 2: Registries and Per-Capability Config

Task 6: Provider registries and per-capability backend selection

Replace _get_backend() with _get_search_provider() and _get_extract_provider():

```python
# tools/web_tools.py
from typing import Dict

# Provider classes come from tools/web_providers/*; in practice the registry
# uses lazy imports so unselected vendor modules are never loaded (see Risk
# Assessment).

_SEARCH_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlSearchProvider,
    "tavily":    TavilySearchProvider,
    "exa":       ExaSearchProvider,
    "parallel":  ParallelSearchProvider,
    "searxng":   SearXNGSearchProvider,
}

_EXTRACT_PROVIDERS: Dict[str, type] = {
    "firecrawl": FirecrawlExtractProvider,
    "tavily":    TavilyExtractProvider,
    "exa":       ExaExtractProvider,
    "parallel":  ParallelExtractProvider,
    "native":    NativeExtractProvider,
}

def _get_search_provider() -> WebSearchProvider:
    cfg = _load_web_config()
    name = (cfg.get("search_backend") or cfg.get("backend") or "").lower().strip()
    if name in _SEARCH_PROVIDERS:
        p = _SEARCH_PROVIDERS[name]()
        if p.is_configured():
            return p
    # Auto-detect fallback (existing behavior): first configured provider wins.
    for fallback_name, cls in _SEARCH_PROVIDERS.items():
        p = cls()
        if p.is_configured():
            return p
    return FirecrawlSearchProvider()  # last resort
```

Task 7: Rewire tool dispatch

Replace the inline if/elif chains with provider calls:

```python
# Before (60+ lines of dispatch):
def web_search_tool(query, limit=5):
    backend = _get_backend()
    if backend == "parallel": response_data = _parallel_search(query, limit)
    elif backend == "exa":    response_data = _exa_search(query, limit)
    elif backend == "tavily": ...
    else: ...

# After:
def web_search_tool(query, limit=5):
    provider = _get_search_provider()
    response_data = provider.search(query, limit)
```

Task 8: Split check_fn per capability

Today both tools share check_fn=check_web_api_key. Split so a SearXNG-only user sees web_search available without needing a Firecrawl key:

```python
# web_search registration
registry.register(name="web_search", ..., check_fn=check_web_search_available)

# web_extract registration
registry.register(name="web_extract", ..., check_fn=check_web_extract_available)
```
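The per-capability check functions could simply ask the registries whether anything is configured. A sketch with dummy providers standing in for the real registry entries:

```python
class _SearXNG:
    def is_configured(self): return True   # pretend SEARXNG_URL is set

class _Firecrawl:
    def is_configured(self): return False  # pretend no FIRECRAWL_API_KEY

_SEARCH_PROVIDERS = {"searxng": _SearXNG, "firecrawl": _Firecrawl}
_EXTRACT_PROVIDERS = {"firecrawl": _Firecrawl}

def check_web_search_available() -> bool:
    return any(cls().is_configured() for cls in _SEARCH_PROVIDERS.values())

def check_web_extract_available() -> bool:
    return any(cls().is_configured() for cls in _EXTRACT_PROVIDERS.values())
```

With this split, the SearXNG-only user gets `web_search` registered even though no paid extract key is present.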

Task 9-10: Config + hermes tools UI

  • Add web.search_backend and web.extract_backend to DEFAULT_CONFIG (no config version bump needed — _deep_merge handles new keys)
  • Add SEARXNG_URL to OPTIONAL_ENV_VARS
  • Add SearXNG and Native HTTP as provider options in the hermes tools picker
  • Update post-setup handler to write per-capability config keys
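The DEFAULT_CONFIG addition might look like this (the surrounding structure is assumed; an empty string means "unset, fall through to the legacy key and then auto-detect"):

```python
# Hypothetical shape of the new keys inside DEFAULT_CONFIG:
DEFAULT_CONFIG = {
    "web": {
        "backend": "",          # legacy shared selector, kept for fallback
        "search_backend": "",   # new: per-capability override for web_search
        "extract_backend": "",  # new: per-capability override for web_extract
    },
}
```

Because `_deep_merge` fills in missing keys, existing user configs pick these up without a version bump, as noted above.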

Phase 3: Cleanup

Task 11: Unify LLM summarization

Extract the duplicated LLM post-processing (process_content_with_llm() in web_tools.py and _extract_relevant_content() in browser_tool.py) into a shared tools/web_summarize.py.

Task 12: Delete dead inline vendor code

Once vendors are routed through providers, remove ~500-700 lines of inline _parallel_search(), _exa_search(), _tavily_request(), etc. from web_tools.py.

Task 13: Documentation

Update configuration.md, environment-variables.md, and the skills catalog with SearXNG setup and the new per-capability config keys.


What This Unblocks

| PR | Author | What it adds | Currently blocked because |
|---|---|---|---|
| #11562 | @kshitijk4poor | SearXNG search | Can't use search-only provider without breaking extract |
| #6065 | @meirk-brd | Bright Data extract | Would need to implement search too (it's scrape-only) |
| #3416 | @Local-First | webclaw local extract | Same — scrape-only provider |
| #6826 | @? | fastCRW search+extract | Would be a clean drop-in with provider ABCs |
| #2668 | @edmundman | Brave Search | Search-only provider |
| Future | | DuckDuckGo native provider | Currently a skill; could become a zero-key `WebSearchProvider` |
| Future | | MCP-backed search | Wrap any MCP search server as a `WebSearchProvider` |

Risk Assessment

| Risk | Mitigation |
|---|---|
| Breaking existing `web.backend` config | `web.backend` stays as fallback — per-capability keys only override when explicitly set |
| Provider module import overhead | Lazy imports in registry — modules only loaded when selected |
| html-to-markdown new dependency for native extract | Optional — guard with try/except ImportError, degrade gracefully |
| `web_crawl_tool` still needs inline logic | Keep crawl dispatch inline for now; extract `WebCrawlProvider` ABC in a follow-up |
| Test suite changes | Provider ABCs make testing cleaner — mock the provider, not HTTP calls |

Related

UX Design: hermes tools Provider Picker

Design Principle: Progressive Disclosure

The 90% path (single provider for both search + extract) stays identical to today — one pick, done. The split is revealed only to users who explicitly opt into the advanced flow.

Flow

```
hermes tools → Web Search & Extract → Select provider

  [1] Nous Subscription          (search + extract)
  [2] Firecrawl Cloud ★          (search + extract)
  [3] Exa                        (search + extract)
  [4] Parallel                   (search + extract)
  [5] Tavily                     (search + extract)
  [6] Firecrawl Self-Hosted      (search + extract)
  [7] ⚙️  Advanced: configure search & extract separately

  → User picks [2]: sets web.backend = "firecrawl". Done.
    (Identical to current behavior. No split exposed.)

  → User picks [7]:
    "Search backend:"  [SearXNG / Firecrawl / Tavily / Exa / Parallel]
    "Extract backend:" [Firecrawl / Tavily / Exa / Parallel / Native HTTP]
    → Sets web.search_backend + web.extract_backend. Done.
```

Principles

  1. Default path unchanged — Single-provider users never see the split. Zero friction increase.
  2. Advanced is last — The split option is at the bottom, clearly badged "advanced". Discoverable but not confusing.
  3. No redundant entry — Picking "Firecrawl" from the main list sets web.backend once. Users never type/pick the same provider twice.
  4. Undo is natural — Picking any main-list provider later clears search_backend/extract_backend and resets to shared mode.
  5. Config.yaml is source of truth — Power users can always edit web.search_backend / web.extract_backend directly without the picker.

Implementation (tools_config.py)

```python
# Last entry in TOOL_CATEGORIES["web"]["providers"]:
{
    "name": "Advanced: configure search & extract separately",
    "badge": "advanced",
    "tag": "Pick different providers for search vs extract (e.g. SearXNG + Firecrawl)",
    "advanced_split": True,
    "env_vars": [],
},
```

The post-setup handler detects advanced_split: True and runs a two-step sub-picker:

  1. Show search-capable providers → writes config["web"]["search_backend"]
  2. Show extract-capable providers → writes config["web"]["extract_backend"]
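That two-step sub-picker might be sketched as follows; `prompt_choice` is a hypothetical stand-in for whatever prompt helper `tools_config.py` actually uses, and the capability lists mirror the provider tables above:

```python
def prompt_choice(label: str, options: list) -> str:
    # Stand-in for the picker's real prompt helper; always picks the first
    # option here so the sketch is self-contained.
    return options[0]

SEARCH_CAPABLE = ["searxng", "firecrawl", "tavily", "exa", "parallel"]
EXTRACT_CAPABLE = ["firecrawl", "tavily", "exa", "parallel", "native"]

def handle_advanced_split(config: dict) -> None:
    web_cfg = config.setdefault("web", {})
    web_cfg["search_backend"] = prompt_choice("Search backend:", SEARCH_CAPABLE)
    web_cfg["extract_backend"] = prompt_choice("Extract backend:", EXTRACT_CAPABLE)
```

The legacy `web.backend` key can be left untouched here, since the per-capability keys take priority in the resolution order.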

When any NON-advanced provider is picked, the handler clears per-capability overrides:

```python
if not provider.get("advanced_split"):
    web_cfg = config.setdefault("web", {})
    web_cfg["backend"] = provider["web_backend"]
    web_cfg.pop("search_backend", None)
    web_cfg.pop("extract_backend", None)
```

This ensures switching back from split mode to unified mode is seamless.

Metadata

Labels: `P3` (low — cosmetic, nice to have), `comp/tools` (tool registry, model_tools, toolsets), `tool/web` (web search and extraction), `type/feature` (new feature or request), `type/refactor` (code restructuring, no behavior change)