feat: remote endpoint support for QMD (embed, expand, rerank, generate) by unithejerk · Pull Request #1 · unithejerk/qmd

unithejerk · 2026-06-10T02:42:00Z

Summary

This PR adds remote endpoint support across QMD's embed, expand, rerank, and generate paths.

It introduces role-specific endpoint configuration plus protocol adapters for OpenAI-compatible, Cohere, Anthropic, Ollama, and vLLM-style APIs. Remote mode activates only when endpoints are configured, so existing local-only behavior remains the default.

Current branch shape:

18 commits
41 files changed
dedicated remote endpoint docs
remote/runtime, retrieval-gate, local-config, and CLI regression coverage

What it adds

Remote runtime

src/remote/ module with per-role endpoint configuration for embed, expand, rerank, and generate
protocol-specific handlers for:
  - OpenAI chat completions
  - OpenAI completions
  - OpenAI responses
  - Anthropic messages
  - Cohere embed / rerank
  - Ollama embed / chat / generate
  - vLLM pooling / score
circuit breakers, health probing, retry/timeout handling, and HTTP transport helpers
remote tokenizer support for chunking when local tokenization is unavailable

QMD integration

remote wiring in src/store.ts, src/llm.ts, src/index.ts, and src/cli/qmd.ts
qmd status / qmd doctor model and connection reporting updates
programmatic createStore({ llm }) support preserved for remote/custom LLM injection
remote-only setups avoid instantiating local LlamaCpp during tokenizer detection

Docs and tests

docs/remote-endpoints.md with provider roles, formats, examples, tokenizer notes, and troubleshooting
one-line README.md pointer to the dedicated remote docs
test/remote.test.ts for adapter/runtime behavior
test/retrieval-gate.test.ts for retrieval-quality regression gating
local config and CLI regression coverage for status output and collection-add behavior

Small CLI fix included

qmd collection add with no path argument now exits with usage guidance instead of implicitly indexing CWD
fixes #684

Related work

This overlaps with earlier remote-endpoint efforts, but differs in scope and current integration point:

#705: focuses on OpenAI-compatible endpoints; this PR adds multiple protocol families
#629: provides earlier scaffolding; this PR carries that forward into adapter/runtime integration
#603: earlier closed remote-endpoint PR; this PR is rebased onto current main
#517: adds OpenAI-compatible remote embedding/reranking with a hybrid local/remote split; this PR extends remote support across all four roles

Architecture

flowchart TD
    Q["qmd embed / query"] --> R["RemoteLLM"]
    R --> E["embed.ts → /v1/embeddings, /v2/embed, /pooling, /api/embed"]
    R --> EX["expand.ts → chat / completions / responses / messages / Ollama"]
    R --> RK["rerank.ts → /rerank or /score"]
    R --> G["generate.ts → chat / completions / responses / messages / Ollama"]
    R --> P["probe.ts → metadata / startup checks"]
    R --> T["transport.ts → Node http/https"]
    R --> CB["circuit-breaker.ts → per-role breaker"]
    R --> A["adapters/ → protocol handlers"]

Validation

Ran on this branch:

npm run test:types
node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/remote.test.ts test/retrieval-gate.test.ts test/local-config.test.ts
node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/cli.test.ts test/local-config.test.ts

Most recent focused results:

198/198 passed for remote + retrieval-gate + local-config
148/148 passed for cli + local-config

Manual / live validation performed during development:

vLLM embedding endpoint with Qwen3-Embedding-0.6B
OpenRouter-based expand path
Cohere rerank via OpenRouter
local-only QMD behavior checked as unchanged when no remote endpoints are configured

Configuration

Remote mode activates only when endpoint config is present.

YAML (`~/.config/qmd/index.yml`)

models:
  embed_api_url: http://embed-host:8000/v1
  embed_api_model: Qwen3-Embedding-0.6B
  embed_api_key: sk-...
  embed_api_format: vllm_pooling

  expand_api_url: https://openrouter.ai/api/v1
  expand_api_model: google/gemini-2.0-flash-lite-001
  expand_api_key: sk-or-...
  expand_api_format: openai_chat_completions

  rerank_api_url: https://openrouter.ai/api/v1
  rerank_api_model: cohere/rerank-v3.5
  rerank_api_key: sk-or-...
  rerank_api_format: cohere_v2_rerank

Environment variables

QMD_EMBED_BASE_URL=http://embed-host:8000/v1
QMD_EMBED_MODEL=Qwen3-Embedding-0.6B
QMD_EXPAND_BASE_URL=https://openrouter.ai/api/v1
QMD_EXPAND_MODEL=google/gemini-2.0-flash-lite-001
# ... etc

Precedence (highest first):

1. env vars
2. YAML config
3. backward-compat fallbacks such as OPENAI_BASE_URL for embed
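
The precedence order above can be sketched as a small resolver. This is a minimal illustration, not the code in src/remote/: the function name, the `ResolvedUrl` shape, and the assumption that every role follows the `QMD_<ROLE>_BASE_URL` / `<role>_api_url` naming shown for embed and expand are all hypothetical.

```typescript
// Hypothetical types; the real config shape in src/remote/ may differ.
type Role = "embed" | "expand" | "rerank" | "generate";
type Env = Record<string, string | undefined>;
type YamlModels = Record<string, string | undefined>; // e.g. embed_api_url

interface ResolvedUrl {
  url: string;
  source: "env" | "yaml" | "fallback";
}

// Resolve a role's base URL: env var first, then YAML, then a legacy fallback.
// Returns undefined when nothing is configured, which means local-only mode.
function resolveBaseUrl(role: Role, env: Env, yaml: YamlModels): ResolvedUrl | undefined {
  // Assumption: all four roles use the QMD_<ROLE>_BASE_URL naming pattern.
  const fromEnv = env[`QMD_${role.toUpperCase()}_BASE_URL`];
  if (fromEnv) return { url: fromEnv, source: "env" };

  const fromYaml = yaml[`${role}_api_url`];
  if (fromYaml) return { url: fromYaml, source: "yaml" };

  // Assumption: only the embed role honors the legacy OPENAI_BASE_URL fallback.
  const legacy = env["OPENAI_BASE_URL"];
  if (role === "embed" && legacy) return { url: legacy, source: "fallback" };

  return undefined;
}
```

Returning `undefined` rather than a default URL mirrors the local-first rule: no remote endpoint is assumed unless one is explicitly configured.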

See docs/remote-endpoints.md for provider roles, protocol formats, tokenizer behavior, and example setups.

Backward compatibility

no *_api_url configured: QMD stays local-only
existing qmd collection add <path> behavior is unchanged
programmatic consumers can inject a custom llm without losing direct expandQuery() behavior

Out of scope

broader CLI UX changes such as interactive confirmation / --dry-run for collection add
additional batching-policy work beyond the current remote embedding support

…dling
Why:
- Rebase the feature branch onto v2.5.1 while preserving RemoteLLM work.
- Avoid Node 24 undici ByteString failures on Unicode response payloads.
What:
- Ported RemoteLLM integration across CLI, store, index, and llm wiring.
- Replaced fragile fetch path with JSON helpers and nodePost/nodeGet-compatible flow.
- Kept upstream v2.5.1 interface widening and remote detection semantics intact.
Risk/Notes:
- Rebase + transport compatibility update; behavior should match pre-rebase intent.

…nd timeout config
Why:
- remote embedding needed basic failure containment and retry control before broader provider support could be added
- embedding batch behavior and vector dimension mismatches needed explicit handling to avoid silent corruption or brittle runtime failures
What:
- add a circuit breaker for remote endpoint health and half-open retry behavior
- add batch embedding controls, timeout configuration, and embedding dimension validation
- update the remote embedding path and tests to cover retry, failure, and dimension handling
Risk/Notes:
- this intentionally excludes local workspace metadata that was accidentally captured during earlier branch history
- runtime behavior changes primarily affect remote embedding flows
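
To illustrate the embedding dimension validation mentioned in this commit, here is a minimal sketch of a batch check. The helper name, parameters, and error messages are hypothetical and are not taken from the branch; the idea is only that a batch is rejected loudly instead of being written to the index with mismatched vectors.

```typescript
// Hypothetical helper: names and error text are illustrative, not QMD's API.
// Returns the validated dimension so the caller can pin it for later batches.
function validateEmbeddingBatch(
  vectors: number[][],
  expectedCount: number,
  expectedDim?: number, // undefined until the first batch fixes the dimension
): number {
  if (vectors.length !== expectedCount) {
    throw new Error(`expected ${expectedCount} embeddings, got ${vectors.length}`);
  }
  const first = vectors[0];
  const dim = expectedDim ?? (first ? first.length : 0);
  if (dim === 0) throw new Error("empty embedding returned");

  vectors.forEach((v, i) => {
    if (v.length !== dim) {
      throw new Error(`embedding ${i} has dimension ${v.length}, expected ${dim}`);
    }
    if (!v.every((x) => Number.isFinite(x))) {
      throw new Error(`embedding ${i} contains a non-finite value`);
    }
  });
  return dim;
}
```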

Why:
- remote endpoint configuration was split across several follow-up commits even though it is one coherent feature surface
- the PR branch should present remote activation and config semantics as a single reviewable change
What:
- add YAML config support for expand, rerank, generate, and embed remote endpoints and models
- make remote endpoint defaults local-first so no remote URL is assumed unless explicitly configured
- update auto-detection and docs so env vars and YAML config consistently activate remote mode
Risk/Notes:
- local behavior remains the default when no endpoint URLs are configured
- env vars still take precedence over YAML values

- Show all 4 endpoints: Embed, Expand, Rerank, Generate (Expand was missing)
- Read config + env vars directly, no auto-writes to index.yml
- Provider labels: OpenRouter, OpenAI, Ollama, xAI, host:port, local (GGUF)
- Source tags: (index.yml), (env QMD_*_MODEL), (default)

Why:
- the remote implementation introduced a new subsystem boundary that is easier to review as one unit than as a refactor followed by an immediate config follow-up
- endpoint-specific protocol selection needs an explicit validated contract before provider adapters can be layered on top
What:
- extract remote transport, logging, config, embed, expand, rerank, generate, probe, and RemoteLLM core modules
- add endpoint format validation and config typing for per-role remote protocol selection
- add core remote test coverage around transport, config, and RemoteLLM behavior
Risk/Notes:
- this is a structural extraction with behavior intended to match the prior remote path, plus the new validated format contract
- later commits layer registry and provider-specific adapters onto this core

Why:
- Adding OpenAI/Anthropic/Cohere variants directly in `RemoteLLM` would increase complexity.
- Adapter selection and wiring needed to move out of orchestration internals.
What:
- Added adapter contracts and endpoint contexts in `src/remote/adapters/types.ts`.
- Added behavior-preserving legacy adapters around existing embed/expand/rerank/generate modules.
- Added format-driven registry resolution for endpoint adapter bundles.
- Rewired `RemoteLLM` to orchestrate through adapters while preserving breaker/timeout/fallback behavior.
Risk/Notes:
- Internal architecture refactor only; no intended functional behavior change.
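
A format-driven registry with a legacy fallback might look roughly like the sketch below. The `GenerateAdapter` shape, the request paths, and most adapter ids are hypothetical stand-ins for the contracts in `src/remote/adapters/types.ts`; only the `anthropic/messages-generate` id and the format names appear elsewhere in this PR.

```typescript
// Hypothetical contract; the real one lives in src/remote/adapters/types.ts.
type EndpointFormat =
  | "auto"
  | "openai_chat_completions"
  | "openai_completions"
  | "anthropic_messages";

interface GenerateAdapter {
  id: string;
  path: string; // assumed: request path appended to the configured base URL
}

// Behavior-preserving default used when a format has no dedicated adapter.
const legacyGenerate: GenerateAdapter = { id: "legacy/generate", path: "/chat/completions" };

const generateRegistry: Partial<Record<EndpointFormat, GenerateAdapter>> = {
  openai_chat_completions: { id: "openai/chat-generate", path: "/chat/completions" },
  openai_completions: { id: "openai/completions-generate", path: "/completions" },
  anthropic_messages: { id: "anthropic/messages-generate", path: "/messages" },
};

// Unregistered formats (for example "auto") fall back to the legacy adapter.
function resolveGenerateAdapter(format: EndpointFormat): GenerateAdapter {
  return generateRegistry[format] ?? legacyGenerate;
}
```

Keeping the fallback inside the resolver is one way to let new adapters land incrementally without changing behavior for formats that have not been migrated yet.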

Add concrete adapters for three OpenAI-style generation protocols:
- /v1/chat/completions (openai_chat_completions)
- /v1/completions (openai_completions)
- /v1/responses (openai_responses)
Each protocol gets its own ExpandAdapter + GenerateAdapter with protocol-specific request/response shapes. Shared normalization helpers extract text consistently from the different response formats.
Why: enables expand/generate to work across OpenAI-compatible variants (vLLM, OpenRouter, Ollama, OpenAI) with explicit format selection. Users can now set expand_api_format=openai_completions for legacy endpoints.
Risk: low. auto and anthropic_messages stay on legacy code paths. Existing behavior is preserved via the registry fallback pattern.
Validation:
- 120 tests pass (50 new, zero regressions)
- npm run -s test:types passes
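
As a rough illustration of what a shared normalization helper has to handle, the sketch below extracts text from the three response shapes. The function name is hypothetical and this is not the branch's helper; the field paths follow the public OpenAI response formats (`choices[0].message.content`, `choices[0].text`, and `output[].content[]` parts of type `output_text`).

```typescript
type OpenAIFormat = "openai_chat_completions" | "openai_completions" | "openai_responses";

function asString(value: unknown): string {
  return typeof value === "string" ? value : "";
}

// Hypothetical helper: returns "" for malformed bodies so the caller can
// decide whether to fall back.
function normalizeOpenAIText(format: OpenAIFormat, body: unknown): string {
  const b = body as any;
  if (format === "openai_chat_completions") {
    return asString(b?.choices?.[0]?.message?.content).trim();
  }
  if (format === "openai_completions") {
    return asString(b?.choices?.[0]?.text).trim();
  }
  // openai_responses: concatenate output_text parts of message items in order.
  const items: any[] = Array.isArray(b?.output) ? b.output : [];
  let text = "";
  for (const item of items) {
    if (item?.type !== "message" || !Array.isArray(item.content)) continue;
    for (const part of item.content) {
      if (part?.type === "output_text") text += asString(part.text);
    }
  }
  return text.trim();
}
```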

Why:
- anthropic_messages format previously routed to legacy OpenAI chat adapters, sending incorrect request shapes (Authorization bearer, system-as-message) to Anthropic endpoints
- Phase 3 implements proper Anthropic Messages protocol with x-api-key auth, top-level system field, and content-block response extraction
What:
- src/remote/adapters/anthropic-messages.ts (new): ExpandAdapter + GenerateAdapter for /v1/messages, with buildHeaders/buildMessagesPayload helpers
- src/remote/adapters/normalization.ts: added normalizeAnthropicMessagesText for content-block text extraction (handles multi-block, tool_use skip)
- src/remote/adapters/registry.ts: anthropic_messages now routes to anthropic/messages-expand and anthropic/messages-generate (was legacy)
- test/remote.test.ts: 26 new tests covering normalization edge cases, request shape verification (x-api-key, anthropic-version, system field), circuit breaker, fallback on empty/malformed/tool_use-only responses, and RemoteLLM integration path
Risk-Notes:
- Backward compatible: auto format and all existing OpenAI protocol adapters are unchanged
- The generate adapter sends NO system prompt (unlike expand); this matches existing OpenAI generate adapter conventions
- Response text concatenates across multiple content blocks in order; non-text blocks (tool_use, image) are silently skipped
- Tests are mock-server-only; live Anthropic contract testing deferred
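
The request-shape differences called out in this commit can be sketched as follows. The function names, the default `max_tokens`, and the return shape are hypothetical and do not reproduce the helpers in `anthropic-messages.ts`; the header names, the top-level `system` field, and the `2023-06-01` version string follow the public Anthropic Messages API.

```typescript
interface AnthropicRequestSketch {
  headers: Record<string, string>;
  body: {
    model: string;
    max_tokens: number;
    system?: string; // top-level field, not a message with role "system"
    messages: { role: "user" | "assistant"; content: string }[];
  };
}

// Hypothetical builder: x-api-key auth instead of an Authorization bearer token.
function buildAnthropicRequest(
  apiKey: string,
  model: string,
  prompt: string,
  system?: string,
  maxTokens = 512, // assumed default; max_tokens is required by the API
): AnthropicRequestSketch {
  const body: AnthropicRequestSketch["body"] = {
    model,
    max_tokens: maxTokens,
    messages: [{ role: "user", content: prompt }],
  };
  if (system) body.system = system;
  return {
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
    },
    body,
  };
}

// Concatenate text blocks in order; tool_use, image, and other blocks are skipped.
function extractAnthropicText(response: unknown): string {
  const blocks = (response as any)?.content;
  if (!Array.isArray(blocks)) return "";
  let text = "";
  for (const block of blocks) {
    if (block?.type === "text" && typeof block.text === "string") text += block.text;
  }
  return text;
}
```

An empty string from `extractAnthropicText` covers the empty, malformed, and tool_use-only cases, which a caller could treat as the signal to fall back.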

Why:
- Cohere-compatible deployments behind vLLM showed endpoint and request-shape drift.
What:
- Added dedicated Cohere embed/rerank adapters and registry wiring.
- Added resilient `/v2/embed` and `/rerank` path fallback plus request-shape fallback.
- Added `input_type` fallback mappings for vLLM-hosted Cohere endpoints.
- Tightened rerank malformed-response handling and expanded integration coverage.
- Updated README with vLLM Cohere configuration guidance.
Risk/Notes:
- Preserves OpenAI v1 compatibility while improving Cohere/vLLM interoperability.
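
One way to picture stricter malformed-response handling for rerank is a parser that rejects the whole response if any entry is invalid. The sketch below assumes the Cohere-style response shape (`results[]` with `index` and `relevance_score`); the function name, the `RerankHit` type, and the choice to return `null` as a fallback signal are hypothetical.

```typescript
interface RerankHit {
  index: number; // position of the document in the request's documents array
  score: number;
}

// Hypothetical parser: returns null on any malformed entry so the caller can
// fall back instead of reordering results with partial or bogus scores.
function parseRerankResponse(body: unknown, documentCount: number): RerankHit[] | null {
  const results = (body as any)?.results;
  if (!Array.isArray(results)) return null;

  const hits: RerankHit[] = [];
  for (const r of results) {
    const index = r?.index;
    const score = r?.relevance_score;
    if (!Number.isInteger(index) || index < 0 || index >= documentCount) return null;
    if (typeof score !== "number" || !Number.isFinite(score)) return null;
    hits.push({ index, score });
  }
  // Highest relevance first.
  return hits.sort((a, b) => b.score - a.score);
}
```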

Why:
- Reindex and search paths performed unnecessary work across hydration and embedding flows.
- Remote transport created repeated request setup overhead.
What:
- Added source-metadata fast paths for incremental reindexing and routed CLI indexing via shared reindexer.
- Deferred search body hydration until after fusion/dedupe and reduced repeated context resolution.
- Batched vector-only query embeddings and trimmed unnecessary reordering work.
- Refactored remote transport to shared keep-alive JSON requests and reusable bearer auth helpers.
Risk/Notes:
- Performance-focused change; remote adapter behavior remains covered by tests.

Why:
- Ollama support was split across separate embed and text commits even though they form one provider integration
- reviewers should be able to evaluate the complete Ollama surface in one place
What:
- add native Ollama embed adapter for /api/embed with compatibility fallback behavior
- add native Ollama chat and generate adapters for /api/chat and /api/generate
- wire Ollama formats into adapter resolution, config aliases, and remote tests
Risk/Notes:
- this is provider-specific surface area on top of the existing remote adapter framework
- local workflows remain unaffected unless an Ollama format is explicitly selected

Why:
- Filepath hydration for virtual `qmd://` docs was doing expensive post-filtering.
- Half-open breaker recovery allowed concurrent probe stampedes.
What:
- Rewrote `loadSearchDocumentsByFilepaths` to hydrate virtual candidates via indexed `(collection,path)` predicates.
- Kept compatibility fallback for non-virtual identifiers.
- Restricted half-open breaker recovery to a single in-flight probe.
- Added circuit-breaker coverage for single-probe half-open gating.
Risk/Notes:
- Performance and resilience update with compatibility fallback retained.
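
Single-probe half-open gating can be sketched with a small state machine. This is an illustrative breaker, not the one in `circuit-breaker.ts`: the class shape, the failure threshold, the cooldown, and the injectable clock are all assumptions made so the behavior is easy to follow and test.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: threshold, cooldown, and method names are assumptions.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if the caller may send a request right now.
  tryAcquire(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = "half-open"; // cooldown elapsed: allow a recovery probe
    }
    // Half-open: admit exactly one probe; concurrent callers are rejected.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.probeInFlight = false;
    this.state = "closed";
  }

  recordFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    // A failed probe reopens immediately; otherwise open at the threshold.
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Without the `probeInFlight` flag, every caller arriving after the cooldown would be let through at once, which is the probe stampede the commit describes.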

Show per-endpoint connection state for embed/expand/rerank/generate in qmd status, including latency and model-availability checks when remote endpoints expose model listings. Probe common metadata endpoints for OpenAI/Ollama-style providers and mark local-only configurations explicitly. Start remote probes in parallel to keep status responsive when one endpoint is slow or unavailable.
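
Starting probes in parallel with a per-probe timeout could look like the sketch below. The `ProbeResult` shape, the `probeAll` name, and the 3-second default are hypothetical; the real probes in `probe.ts` presumably perform HTTP requests, which are abstracted here as plain async functions.

```typescript
interface ProbeResult {
  role: string;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

// Hypothetical helper: runs every probe concurrently so one slow or dead
// endpoint delays the report by at most timeoutMs.
async function probeAll(
  probes: Record<string, () => Promise<void>>,
  timeoutMs = 3000,
): Promise<ProbeResult[]> {
  const tasks = Object.keys(probes).map(async (role): Promise<ProbeResult> => {
    const started = Date.now();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
    });
    try {
      await Promise.race([probes[role](), timeout]);
      return { role, ok: true, latencyMs: Date.now() - started };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { role, ok: false, latencyMs: Date.now() - started, error };
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  });
  // Each task catches its own errors, so Promise.all never rejects here.
  return Promise.all(tasks);
}
```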

… support
Why:
- the branch ended with two commits carrying the same subject/body even though they represented one logical tightening pass
- remote embedding needed more robust endpoint normalization for Cohere-compatible and vLLM deployments
- retrieval quality and remote token-bound chunking needed explicit guardrails before a PR against current upstream
What:
- normalize Cohere-compatible embed routing, add vLLM pooling support, and keep host-aware input_type behavior
- improve retrieval candidate selection, expansion filtering, rerank chunk aggregation, and add the retrieval regression gate
- add remote tokenizer support for exact token-aware chunking in remote mode, with fallback behavior and targeted tests
- align the localhost Cohere-adapter test with the generic-host contract while keeping Cohere-host expectations covered separately
Risk/Notes:
- remote tokenizer behavior adds another network-dependent path, but it degrades to character-based chunking unless explicitly forced
- this rewrites only the local PR branch tail; backup/pr-remote-llm-clean-pre-tail-squash preserves the pre-squash history

Why:
- qmd collection add with no path silently indexed the current working directory, which is a footgun on large folders
- the cleaner fix is to reject the command unless the target path is explicit, rather than layering confirmation UX into this branch
What:
- make qmd collection add fail with usage guidance when the path argument is omitted
- add a CLI regression test covering the no-argument failure path
Risk/Notes:
- existing documented and tested usage with qmd collection add . or an explicit path is unchanged
- this aligns with issue tobi#684 without broadening the PR into prompt or dry-run behavior
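
The no-argument rejection can be illustrated with a tiny argument check. This is a sketch under stated assumptions, not the code in src/cli/qmd.ts: the function name, the result type, the exit code, and the exact usage text are hypothetical, and option parsing is reduced to skipping anything that starts with a dash.

```typescript
type CollectionAddArgs =
  | { ok: true; path: string }
  | { ok: false; exitCode: number; message: string };

// Hypothetical parser for the arguments that follow `qmd collection add`.
function parseCollectionAddArgs(args: string[]): CollectionAddArgs {
  // First non-flag argument is treated as the target path (simplification).
  const path = args.find((a) => !a.startsWith("-"));
  if (path === undefined) {
    // No implicit default to the current working directory.
    return { ok: false, exitCode: 1, message: "Usage: qmd collection add <path>" };
  }
  return { ok: true, path };
}
```

An explicit `.` still works under this check, so indexing the current directory remains possible but has to be requested.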

Why:
- the remote adapter work needs focused user-facing documentation without bloating the main README
- protocol roles, format selection, provider examples, and tokenizer behavior are easier to review in a dedicated doc than as scattered README additions
What:
- add docs/remote-endpoints.md covering remote roles, protocol formats, config precedence, provider examples, runtime behavior, tokenizer support, and troubleshooting
- keep README upstream-friendly by replacing the large remote README expansion with a single pointer line to the dedicated doc
- remove fork-specific wording and identifying IP addresses from the documentation examples
Risk/Notes:
- documentation only; no runtime behavior changed in this commit
- the README delta versus origin/main is intentionally reduced to a single line

Squash the follow-up test fixes for status and local-config output after the status model changes.

Pass the constructed LLM through createStore().expandQuery, avoid instantiating local LlamaCpp during remote tokenizer detection when a remote-style default LLM is already active, and remove unsolicited startup probe logging from createStore().

unithejerk added 16 commits June 9, 2026 19:13

unithejerk force-pushed the feat/remote-endpoints branch from ebcb471 to 5625e42 Compare June 10, 2026 02:57

test: align status and local-config expectations

ce64795

Squash the follow-up test fixes for status and local-config output after the status model changes.

unithejerk force-pushed the feat/remote-endpoints branch from 06eecc7 to ce64795 Compare June 10, 2026 03:45

unithejerk closed this Jun 10, 2026

unithejerk wants to merge 18 commits into main from feat/remote-endpoints