feat: remote endpoint support for QMD (embed, expand, rerank, generate) #1

Closed
unithejerk wants to merge 18 commits into main from feat/remote-endpoints

unithejerk (Owner) commented Jun 10, 2026

feat: remote endpoint support for QMD (embed, expand, rerank, generate)

Summary

This PR adds remote endpoint support across QMD's embed, expand, rerank, and generate paths.

It introduces role-specific endpoint configuration plus protocol adapters for OpenAI-compatible, Cohere, Anthropic, Ollama, and vLLM-style APIs. Remote mode activates only when endpoints are configured, so existing local-only behavior remains the default.

Current branch shape:

  • 18 commits
  • 41 files changed
  • dedicated remote endpoint docs
  • remote/runtime, retrieval-gate, local-config, and CLI regression coverage

What it adds

Remote runtime

  • src/remote/ module with per-role endpoint configuration for embed, expand, rerank, and generate
  • protocol-specific handlers for:
    • OpenAI chat completions
    • OpenAI completions
    • OpenAI responses
    • Anthropic messages
    • Cohere embed / rerank
    • Ollama embed / chat / generate
    • vLLM pooling / score
  • circuit breakers, health probing, retry/timeout handling, and HTTP transport helpers
  • remote tokenizer support for chunking when local tokenization is unavailable

QMD integration

  • remote wiring in src/store.ts, src/llm.ts, src/index.ts, and src/cli/qmd.ts
  • qmd status / qmd doctor model and connection reporting updates
  • programmatic createStore({ llm }) support preserved for remote/custom LLM injection
  • remote-only setups avoid instantiating local LlamaCpp during tokenizer detection

Docs and tests

  • docs/remote-endpoints.md with provider roles, formats, examples, tokenizer notes, and troubleshooting
  • one-line README.md pointer to the dedicated remote docs
  • test/remote.test.ts for adapter/runtime behavior
  • test/retrieval-gate.test.ts for retrieval-quality regression gating
  • local config and CLI regression coverage for status output and collection-add behavior

Small CLI fix included

  • qmd collection add with no path argument now exits with usage guidance instead of implicitly indexing CWD
  • fixes #684

Related work

This overlaps with earlier remote-endpoint efforts, but differs in scope and current integration point:

  • #705: focuses on OpenAI-compatible endpoints; this PR adds multiple protocol families
  • #629: provides earlier scaffolding; this PR carries that forward into adapter/runtime integration
  • #603: earlier closed remote-endpoint PR; this PR is rebased onto current main
  • #517: adds OpenAI-compatible remote embedding/reranking with a hybrid local/remote split; this PR extends remote support across all four roles

Architecture

flowchart TD
    Q["qmd embed / query"] --> R["RemoteLLM"]
    R --> E["embed.ts → /v1/embeddings, /v2/embed, /pooling, /api/embed"]
    R --> EX["expand.ts → chat / completions / responses / messages / Ollama"]
    R --> RK["rerank.ts → /rerank or /score"]
    R --> G["generate.ts → chat / completions / responses / messages / Ollama"]
    R --> P["probe.ts → metadata / startup checks"]
    R --> T["transport.ts → Node http/https"]
    R --> CB["circuit-breaker.ts → per-role breaker"]
    R --> A["adapters/ → protocol handlers"]

Validation

Ran on this branch:

  • npm run test:types
  • node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/remote.test.ts test/retrieval-gate.test.ts test/local-config.test.ts
  • node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/cli.test.ts test/local-config.test.ts

Most recent focused results:

  • 198/198 passed for remote + retrieval-gate + local-config
  • 148/148 passed for cli + local-config

Manual / live validation performed during development:

  • vLLM embedding endpoint with Qwen3-Embedding-0.6B
  • OpenRouter-based expand path
  • Cohere rerank via OpenRouter
  • local-only QMD behavior checked as unchanged when no remote endpoints are configured

Configuration

Remote mode activates only when endpoint config is present.

YAML (~/.config/qmd/index.yml)

models:
  embed_api_url: http://embed-host:8000/v1
  embed_api_model: Qwen3-Embedding-0.6B
  embed_api_key: sk-...
  embed_api_format: vllm_pooling

  expand_api_url: https://openrouter.ai/api/v1
  expand_api_model: google/gemini-2.0-flash-lite-001
  expand_api_key: sk-or-...
  expand_api_format: openai_chat_completions

  rerank_api_url: https://openrouter.ai/api/v1
  rerank_api_model: cohere/rerank-v3.5
  rerank_api_key: sk-or-...
  rerank_api_format: cohere_v2_rerank

Environment variables

QMD_EMBED_BASE_URL=http://embed-host:8000/v1
QMD_EMBED_MODEL=Qwen3-Embedding-0.6B
QMD_EXPAND_BASE_URL=https://openrouter.ai/api/v1
QMD_EXPAND_MODEL=google/gemini-2.0-flash-lite-001
# ... etc

Precedence (highest first):

  • env vars
  • YAML config
  • backward-compat fallbacks such as OPENAI_BASE_URL for embed
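
The precedence order above can be sketched as a small lookup. This is a minimal illustration, not the PR's implementation: the function name `resolveEmbedUrl` and the result shape are hypothetical, and only the key names (QMD_EMBED_BASE_URL, embed_api_url, OPENAI_BASE_URL) come from this description.

```typescript
// Hypothetical shapes; only the key names are taken from the PR description.
type Env = Record<string, string | undefined>;
type YamlModels = Record<string, string | undefined>;

interface ResolvedValue {
  value: string;
  source: "env" | "yaml" | "fallback";
}

// Return the first non-empty candidate, highest precedence first.
function resolveEmbedUrl(env: Env, yaml: YamlModels): ResolvedValue | undefined {
  const candidates: Array<[string | undefined, ResolvedValue["source"]]> = [
    [env["QMD_EMBED_BASE_URL"], "env"],
    [yaml["embed_api_url"], "yaml"],
    [env["OPENAI_BASE_URL"], "fallback"],
  ];
  for (const [value, source] of candidates) {
    if (value !== undefined && value.trim() !== "") {
      return { value: value.trim(), source };
    }
  }
  return undefined; // nothing configured: the embed role stays local-only
}
```

Returning `undefined` when nothing is set mirrors the stated rule that remote mode activates only when an endpoint is configured.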

See docs/remote-endpoints.md for provider roles, protocol formats, tokenizer behavior, and example setups.

Backward compatibility

  • no *_api_url configured: QMD stays local-only
  • existing qmd collection add <path> behavior is unchanged
  • programmatic consumers can inject a custom llm without losing direct expandQuery() behavior

Out of scope

  • broader CLI UX changes such as interactive confirmation / --dry-run for collection add
  • additional batching-policy work beyond the current remote embedding support

unithejerk added 16 commits June 9, 2026 19:13
…dling

Why:
- Rebase the feature branch onto v2.5.1 while preserving RemoteLLM work.
- Avoid Node 24 undici ByteString failures on Unicode response payloads.

What:
- Ported RemoteLLM integration across CLI, store, index, and llm wiring.
- Replaced fragile fetch path with JSON helpers and nodePost/nodeGet-compatible flow.
- Kept upstream v2.5.1 interface widening and remote detection semantics intact.

Risk/Notes:
- Rebase + transport compatibility update; behavior should match pre-rebase intent.
…nd timeout config

Why:
- remote embedding needed basic failure containment and retry control before broader provider support could be added
- embedding batch behavior and vector dimension mismatches needed explicit handling to avoid silent corruption or brittle runtime failures

What:
- add a circuit breaker for remote endpoint health and half-open retry behavior
- add batch embedding controls, timeout configuration, and embedding dimension validation
- update the remote embedding path and tests to cover retry, failure, and dimension handling

Risk/Notes:
- this intentionally excludes local workspace metadata that was accidentally captured during earlier branch history
- runtime behavior changes primarily affect remote embedding flows
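
To illustrate the dimension validation described in this commit, here is a minimal sketch. The helper name `validateEmbeddingBatch` and its error messages are hypothetical; the commit only states that dimension mismatches are handled explicitly.

```typescript
// Hypothetical helper, not taken from the PR: reject a remote embedding batch
// whose vectors disagree with each other or with the expected index dimension.
function validateEmbeddingBatch(vectors: number[][], expectedDim?: number): number {
  const first = vectors[0];
  if (first === undefined) {
    throw new Error("embedding batch is empty");
  }
  // Use the index dimension when known, otherwise the first vector's length.
  const dim = expectedDim ?? first.length;
  vectors.forEach((vector, i) => {
    if (vector.length !== dim) {
      throw new Error(`embedding ${i} has dimension ${vector.length}, expected ${dim}`);
    }
    if (!vector.every((x) => Number.isFinite(x))) {
      throw new Error(`embedding ${i} contains a non-finite value`);
    }
  });
  return dim;
}
```

Failing loudly before any vector is written is one way to avoid the silent corruption the commit message mentions.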
Why:
- remote endpoint configuration was split across several follow-up commits even though it is one coherent feature surface
- the PR branch should present remote activation and config semantics as a single reviewable change

What:
- add YAML config support for expand, rerank, generate, and embed remote endpoints and models
- make remote endpoint defaults local-first so no remote URL is assumed unless explicitly configured
- update auto-detection and docs so env vars and YAML config consistently activate remote mode

Risk/Notes:
- local behavior remains the default when no endpoint URLs are configured
- env vars still take precedence over YAML values
- Show all 4 endpoints: Embed, Expand, Rerank, Generate (Expand was missing)
- Read config + env vars directly, no auto-writes to index.yml
- Provider labels: OpenRouter, OpenAI, Ollama, xAI, host:port, local (GGUF)
- Source tags: (index.yml), (env QMD_*_MODEL), (default)
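
One way the provider labels above could be derived from a configured base URL is sketched below. The function name, the host names, and the Ollama default-port check are all assumptions made for illustration; the commit lists only the labels themselves.

```typescript
// Hypothetical mapping from a configured base URL to the labels listed above.
// Host names and the port-11434 check are assumptions, not the PR's logic.
function providerLabel(baseUrl: string | undefined): string {
  if (baseUrl === undefined || baseUrl.trim() === "") return "local (GGUF)";
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)(?::(\d+))?/i.exec(baseUrl.trim());
  if (match === null) return baseUrl.trim(); // not URL-shaped: show it verbatim
  const host = (match[1] ?? "").toLowerCase();
  const port = match[2];
  if (host === "openrouter.ai") return "OpenRouter";
  if (host === "api.openai.com") return "OpenAI";
  if (host === "api.x.ai") return "xAI";
  if (port === "11434") return "Ollama"; // Ollama's default listen port
  return port === undefined ? host : `${host}:${port}`;
}
```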
Why:
- the remote implementation introduced a new subsystem boundary that is easier to review as one unit than as a refactor followed by an immediate config follow-up
- endpoint-specific protocol selection needs an explicit validated contract before provider adapters can be layered on top

What:
- extract remote transport, logging, config, embed, expand, rerank, generate, probe, and RemoteLLM core modules
- add endpoint format validation and config typing for per-role remote protocol selection
- add core remote test coverage around transport, config, and RemoteLLM behavior

Risk/Notes:
- this is a structural extraction with behavior intended to match the prior remote path, plus the new validated format contract
- later commits layer registry and provider-specific adapters onto this core
Why:
- Adding OpenAI/Anthropic/Cohere variants directly in `RemoteLLM` would increase complexity.
- Adapter selection and wiring needed to move out of orchestration internals.

What:
- Added adapter contracts and endpoint contexts in `src/remote/adapters/types.ts`.
- Added behavior-preserving legacy adapters around existing embed/expand/rerank/generate modules.
- Added format-driven registry resolution for endpoint adapter bundles.
- Rewired `RemoteLLM` to orchestrate through adapters while preserving breaker/timeout/fallback behavior.

Risk/Notes:
- Internal architecture refactor only; no intended functional behavior change.
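
Format-driven registry resolution with a legacy fallback might look roughly like the sketch below. The format names and the two Anthropic adapter ids appear elsewhere in this PR; every other adapter id, the `AdapterBundle` shape, and the function names are hypothetical.

```typescript
// Format names come from the PR; the registry shape and most ids are assumed.
const TEXT_FORMATS = [
  "auto",
  "openai_chat_completions",
  "openai_completions",
  "openai_responses",
  "anthropic_messages",
] as const;
type TextFormat = (typeof TEXT_FORMATS)[number];

interface AdapterBundle {
  expand: string; // adapter id used for query expansion
  generate: string; // adapter id used for generation
}

const LEGACY: AdapterBundle = { expand: "legacy/expand", generate: "legacy/generate" };

const REGISTRY: Partial<Record<TextFormat, AdapterBundle>> = {
  openai_chat_completions: { expand: "openai/chat-expand", generate: "openai/chat-generate" },
  openai_completions: { expand: "openai/completions-expand", generate: "openai/completions-generate" },
  openai_responses: { expand: "openai/responses-expand", generate: "openai/responses-generate" },
  anthropic_messages: { expand: "anthropic/messages-expand", generate: "anthropic/messages-generate" },
};

// Validate the configured string before it reaches adapter selection.
function parseFormat(raw: string | undefined): TextFormat {
  const value = (raw ?? "auto").trim().toLowerCase();
  const found = TEXT_FORMATS.find((f) => f === value);
  if (found === undefined) {
    throw new Error(`unknown format "${value}"; expected one of ${TEXT_FORMATS.join(", ")}`);
  }
  return found;
}

// Formats without a dedicated entry (here: auto) fall back to the legacy bundle.
function resolveAdapters(raw: string | undefined): AdapterBundle {
  return REGISTRY[parseFormat(raw)] ?? LEGACY;
}
```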
Add concrete adapters for three OpenAI-style generation protocols:
- /v1/chat/completions (openai_chat_completions)
- /v1/completions (openai_completions)
- /v1/responses (openai_responses)

Each protocol gets its own ExpandAdapter + GenerateAdapter with
protocol-specific request/response shapes. Shared normalization helpers
extract text consistently from the different response formats.

Why: enables expand/generate to work across OpenAI-compatible variants
(vLLM, OpenRouter, Ollama, OpenAI) with explicit format selection. Users
can now set expand_api_format=openai_completions for legacy endpoints.

Risk: low. auto and anthropic_messages stay on legacy code paths.
Existing behavior is preserved via the registry fallback pattern.

Validation:
- 120 tests pass (50 new, zero regressions)
- npm run -s test:types passes
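
The shared normalization idea can be sketched as a single function that pulls text out of the three response shapes. This is an illustration under assumptions: `extractText` and its helpers are hypothetical names, and the field paths follow the publicly documented OpenAI response shapes rather than the PR's actual helpers.

```typescript
type Json = Record<string, unknown>;

function isRecord(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// choices[0] for the two choices-based protocols.
function firstChoice(body: unknown): Json | undefined {
  if (!isRecord(body)) return undefined;
  const choices = body["choices"];
  if (!Array.isArray(choices)) return undefined;
  const first: unknown = choices[0];
  return isRecord(first) ? first : undefined;
}

type OpenAIFormat = "openai_chat_completions" | "openai_completions" | "openai_responses";

// Returns undefined for malformed bodies so the caller can apply its fallback.
function extractText(format: OpenAIFormat, body: unknown): string | undefined {
  if (format === "openai_completions") {
    const text = firstChoice(body)?.["text"];
    return typeof text === "string" ? text : undefined;
  }
  if (format === "openai_chat_completions") {
    const message = firstChoice(body)?.["message"];
    const content = isRecord(message) ? message["content"] : undefined;
    return typeof content === "string" ? content : undefined;
  }
  // openai_responses: concatenate the output_text parts of every output item.
  const output = isRecord(body) ? body["output"] : undefined;
  if (!Array.isArray(output)) return undefined;
  const parts: string[] = [];
  for (const item of output as unknown[]) {
    const content = isRecord(item) ? item["content"] : undefined;
    if (!Array.isArray(content)) continue;
    for (const part of content as unknown[]) {
      if (!isRecord(part) || part["type"] !== "output_text") continue;
      const text = part["text"];
      if (typeof text === "string") parts.push(text);
    }
  }
  return parts.length > 0 ? parts.join("") : undefined;
}
```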
Why:
- anthropic_messages format previously routed to legacy OpenAI chat
  adapters, sending incorrect request shapes (Authorization bearer,
  system-as-message) to Anthropic endpoints
- Phase 3 implements proper Anthropic Messages protocol with x-api-key
  auth, top-level system field, and content-block response extraction

What:
- src/remote/adapters/anthropic-messages.ts (new): ExpandAdapter +
  GenerateAdapter for /v1/messages, with buildHeaders/buildMessagesPayload
  helpers
- src/remote/adapters/normalization.ts: added normalizeAnthropicMessagesText
  for content-block text extraction (handles multi-block, tool_use skip)
- src/remote/adapters/registry.ts: anthropic_messages now routes to
  anthropic/messages-expand and anthropic/messages-generate (was legacy)
- test/remote.test.ts: 26 new tests covering normalization edge cases,
  request shape verification (x-api-key, anthropic-version, system field),
  circuit breaker, fallback on empty/malformed/tool_use-only responses,
  and RemoteLLM integration path

Risk/Notes:
- Backward compatible: auto format and all existing OpenAI protocol
  adapters are unchanged
- The generate adapter sends NO system prompt (unlike expand) — matches
  existing OpenAI generate adapter conventions
- Response text concatenates across multiple content blocks in order;
  non-text blocks (tool_use, image) are silently skipped
- Tests are mock-server-only; live Anthropic contract testing deferred
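
The request and response shapes described in this commit can be sketched as follows. The names buildHeaders, buildMessagesPayload, and normalizeAnthropicMessagesText appear in the commit message, but the signatures, the default max_tokens, the anthropic-version value, and the empty-string join are assumptions made for illustration.

```typescript
interface MessagesPayload {
  model: string;
  max_tokens: number;
  system?: string; // top-level field, not a message with role "system"
  messages: Array<{ role: "user" | "assistant"; content: string }>;
}

// x-api-key auth instead of an Authorization bearer header.
function buildHeaders(apiKey: string): Record<string, string> {
  return {
    "content-type": "application/json",
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01", // assumed version string
  };
}

function buildMessagesPayload(
  model: string,
  prompt: string,
  system?: string,
  maxTokens = 512, // assumed default
): MessagesPayload {
  const payload: MessagesPayload = {
    model,
    max_tokens: maxTokens,
    messages: [{ role: "user", content: prompt }],
  };
  if (system !== undefined) payload.system = system;
  return payload;
}

// Concatenate text blocks in order; skip tool_use, image, and other block types.
function normalizeAnthropicMessagesText(body: unknown): string | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const content = (body as { content?: unknown }).content;
  if (!Array.isArray(content)) return undefined;
  const texts: string[] = [];
  for (const block of content as unknown[]) {
    if (typeof block !== "object" || block === null) continue;
    const { type, text } = block as { type?: unknown; text?: unknown };
    if (type === "text" && typeof text === "string") texts.push(text);
  }
  return texts.length > 0 ? texts.join("") : undefined;
}
```

Returning `undefined` for tool_use-only or malformed bodies leaves room for the fallback behavior the tests in this commit cover.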
Why:
- Cohere-compatible deployments behind vLLM showed endpoint and request-shape drift.

What:
- Added dedicated Cohere embed/rerank adapters and registry wiring.
- Added resilient `/v2/embed` and `/rerank` path fallback plus request-shape fallback.
- Added `input_type` fallback mappings for vLLM-hosted Cohere endpoints.
- Tightened rerank malformed-response handling and expanded integration coverage.
- Updated README with vLLM Cohere configuration guidance.

Risk/Notes:
- Preserves OpenAI v1 compatibility while improving Cohere/vLLM interoperability.
Why:
- Reindex and search paths performed unnecessary work across hydration and embedding flows.
- Remote transport created repeated request setup overhead.

What:
- Added source-metadata fast paths for incremental reindexing and routed CLI indexing via shared reindexer.
- Deferred search body hydration until after fusion/dedupe and reduced repeated context resolution.
- Batched vector-only query embeddings and trimmed unnecessary reordering work.
- Refactored remote transport to shared keep-alive JSON requests and reusable bearer auth helpers.

Risk/Notes:
- Performance-focused change; remote adapter behavior remains covered by tests.
Why:
- Ollama support was split across separate embed and text commits even though they form one provider integration
- reviewers should be able to evaluate the complete Ollama surface in one place

What:
- add native Ollama embed adapter for /api/embed with compatibility fallback behavior
- add native Ollama chat and generate adapters for /api/chat and /api/generate
- wire Ollama formats into adapter resolution, config aliases, and remote tests

Risk/Notes:
- this is provider-specific surface area on top of the existing remote adapter framework
- local workflows remain unaffected unless an Ollama format is explicitly selected
Why:
- Filepath hydration for virtual `qmd://` docs was doing expensive post-filtering.
- Half-open breaker recovery allowed concurrent probe stampedes.

What:
- Rewrote `loadSearchDocumentsByFilepaths` to hydrate virtual candidates via indexed `(collection,path)` predicates.
- Kept compatibility fallback for non-virtual identifiers.
- Restricted half-open breaker recovery to a single in-flight probe.
- Added circuit-breaker coverage for single-probe half-open gating.

Risk/Notes:
- Performance and resilience update with compatibility fallback retained.
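
The single-probe half-open gating can be illustrated with a small state machine. This is a sketch under assumptions, not the code in circuit-breaker.ts: the class API (tryAcquire, onSuccess, onFailure), the failure threshold, and the injected clock are all hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true when the caller may send a request.
  tryAcquire(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = "half-open"; // cooldown elapsed: allow a recovery probe
    }
    // Half-open: admit exactly one probe; concurrent callers are rejected.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  onSuccess(): void {
    this.probeInFlight = false;
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold.
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

Without the `probeInFlight` flag, every caller arriving after the cooldown would be admitted at once, which is the probe stampede this commit describes.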
Show per-endpoint connection state for embed/expand/rerank/generate in qmd status, including latency and model-availability checks when remote endpoints expose model listings.

Probe common metadata endpoints for OpenAI/Ollama-style providers and mark local-only configurations explicitly.

Start remote probes in parallel to keep status responsive when one endpoint is slow or unavailable.
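
Starting the probes in parallel might be sketched as below, so that one slow or failing endpoint does not delay the others. The `Probe` type, the `ProbeResult` shape, and `probeAll` are hypothetical names; real probes would issue HTTP requests to the metadata endpoints.

```typescript
interface ProbeResult {
  role: string;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

// A probe resolves when the endpoint is reachable and rejects otherwise.
type Probe = () => Promise<void>;

async function probeAll(probes: Record<string, Probe>): Promise<ProbeResult[]> {
  // All probes start immediately; each one catches its own failure so a
  // single bad endpoint cannot reject the whole batch.
  const pending = Object.entries(probes).map(async ([role, probe]): Promise<ProbeResult> => {
    const start = Date.now();
    try {
      await probe();
      return { role, ok: true, latencyMs: Date.now() - start };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { role, ok: false, latencyMs: Date.now() - start, error: message };
    }
  });
  return Promise.all(pending);
}
```

Total wall time is then bounded by the slowest probe rather than the sum of all four.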
… support

Why:
- the branch ended with two commits carrying the same subject/body even though they represented one logical tightening pass
- remote embedding needed more robust endpoint normalization for Cohere-compatible and vLLM deployments
- retrieval quality and remote token-bound chunking needed explicit guardrails before a PR against current upstream

What:
- normalize Cohere-compatible embed routing, add vLLM pooling support, and keep host-aware input_type behavior
- improve retrieval candidate selection, expansion filtering, rerank chunk aggregation, and add the retrieval regression gate
- add remote tokenizer support for exact token-aware chunking in remote mode, with fallback behavior and targeted tests
- align the localhost Cohere-adapter test with the generic-host contract while keeping Cohere-host expectations covered separately

Risk/Notes:
- remote tokenizer behavior adds another network-dependent path, but it degrades to character-based chunking unless explicitly forced
- this rewrites only the local PR branch tail; backup/pr-remote-llm-clean-pre-tail-squash preserves the pre-squash history
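
The degrade-to-characters behavior noted above can be sketched as follows. The `Tokenizer` interface, the synchronous calls, and the 4-characters-per-token estimate are assumptions made to keep the example small; an actual remote tokenizer would be asynchronous.

```typescript
// Hypothetical tokenizer contract; a remote tokenizer would be async.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

function chunkByChars(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

function chunkText(
  text: string,
  maxTokens: number,
  tokenizer?: Tokenizer,
  charsPerToken = 4, // rough estimate used only by the fallback path
): string[] {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  if (tokenizer !== undefined) {
    try {
      // Exact path: split on token boundaries.
      const tokens = tokenizer.encode(text);
      const chunks: string[] = [];
      for (let i = 0; i < tokens.length; i += maxTokens) {
        chunks.push(tokenizer.decode(tokens.slice(i, i + maxTokens)));
      }
      return chunks;
    } catch {
      // Tokenizer unavailable: fall through to character-based chunking.
    }
  }
  const maxChars = Math.max(1, Math.floor(maxTokens * charsPerToken));
  return chunkByChars(text, maxChars);
}
```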
Why:
- qmd collection add with no path silently indexed the current working directory, which is a footgun on large folders
- the cleaner fix is to reject the command unless the target path is explicit, rather than layering confirmation UX into this branch

What:
- make qmd collection add fail with usage guidance when the path argument is omitted
- add a CLI regression test covering the no-argument failure path

Risk/Notes:
- existing documented and tested usage with qmd collection add . or an explicit path is unchanged
- this aligns with issue tobi#684 without broadening the PR into prompt or dry-run behavior
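
A minimal sketch of the explicit-path check follows. The function name, the result type, and the exact usage wording are hypothetical; the commit states only that the command fails with usage guidance when the path is omitted.

```typescript
type ParseResult = { ok: true; path: string } | { ok: false; message: string };

// Hypothetical argument check: args are the tokens after "qmd collection add".
function parseCollectionAddArgs(args: string[]): ParseResult {
  const path = args[0];
  if (path === undefined || path.trim() === "") {
    // No implicit CWD: the caller prints this and exits non-zero.
    return { ok: false, message: "Usage: qmd collection add <path>" };
  }
  return { ok: true, path };
}
```

Passing `.` still works because it is an explicit path, which matches the unchanged behavior noted above.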
Why:
- the remote adapter work needs focused user-facing documentation without bloating the main README
- protocol roles, format selection, provider examples, and tokenizer behavior are easier to review in a dedicated doc than as scattered README additions

What:
- add docs/remote-endpoints.md covering remote roles, protocol formats, config precedence, provider examples, runtime behavior, tokenizer support, and troubleshooting
- keep README upstream-friendly by replacing the large remote README expansion with a single pointer line to the dedicated doc
- remove fork-specific wording and identifying IP addresses from the documentation examples

Risk/Notes:
- documentation only; no runtime behavior changed in this commit
- the README delta versus origin/main is intentionally reduced to a single line
unithejerk force-pushed the feat/remote-endpoints branch from ebcb471 to 5625e42 on June 10, 2026 02:57
Squash the follow-up test fixes for status and local-config output after the status model changes.
unithejerk force-pushed the feat/remote-endpoints branch from 06eecc7 to ce64795 on June 10, 2026 03:45
Pass the constructed LLM through createStore().expandQuery, avoid instantiating local LlamaCpp during remote tokenizer detection when a remote-style default LLM is already active, and remove unsolicited startup probe logging from createStore().
unithejerk closed this Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

qmd collection add with no arguments silently indexes CWD as a new collection
