feat: remote endpoint support for QMD (embed, expand, rerank, generate) #1
Closed
unithejerk wants to merge 18 commits into
…dling

Why:
- Rebase the feature branch onto v2.5.1 while preserving RemoteLLM work.
- Avoid Node 24 undici ByteString failures on Unicode response payloads.

What:
- Ported RemoteLLM integration across CLI, store, index, and llm wiring.
- Replaced fragile fetch path with JSON helpers and nodePost/nodeGet-compatible flow.
- Kept upstream v2.5.1 interface widening and remote detection semantics intact.

Risk/Notes:
- Rebase + transport compatibility update; behavior should match pre-rebase intent.
…nd timeout config

Why:
- remote embedding needed basic failure containment and retry control before broader provider support could be added
- embedding batch behavior and vector dimension mismatches needed explicit handling to avoid silent corruption or brittle runtime failures

What:
- add a circuit breaker for remote endpoint health and half-open retry behavior
- add batch embedding controls, timeout configuration, and embedding dimension validation
- update the remote embedding path and tests to cover retry, failure, and dimension handling

Risk/Notes:
- this intentionally excludes local workspace metadata that was accidentally captured during earlier branch history
- runtime behavior changes primarily affect remote embedding flows
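For reviewers, a minimal sketch of what batch control and dimension validation could look like. The helper names `chunk` and `assertDimensions` are hypothetical and are not taken from the branch; the sketch only illustrates failing loudly on a dimension mismatch instead of storing a corrupt vector.

```typescript
// Hypothetical helpers; names and error messages are illustrative only.

// Split inputs into batches of at most `size` items so one request
// never carries an unbounded payload.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError(`batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Reject any vector whose length differs from the dimension the index expects.
function assertDimensions(vectors: readonly number[][], expected: number): void {
  vectors.forEach((vector, i) => {
    if (vector.length !== expected) {
      throw new Error(
        `embedding ${i} has ${vector.length} dimensions, expected ${expected}`,
      );
    }
  });
}
```

A caller would run `assertDimensions` on every batch response before writing vectors to the store.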
Why:
- remote endpoint configuration was split across several follow-up commits even though it is one coherent feature surface
- the PR branch should present remote activation and config semantics as a single reviewable change

What:
- add YAML config support for expand, rerank, generate, and embed remote endpoints and models
- make remote endpoint defaults local-first so no remote URL is assumed unless explicitly configured
- update auto-detection and docs so env vars and YAML config consistently activate remote mode

Risk/Notes:
- local behavior remains the default when no endpoint URLs are configured
- env vars still take precedence over YAML values
- Show all 4 endpoints: Embed, Expand, Rerank, Generate (Expand was missing)
- Read config + env vars directly, no auto-writes to index.yml
- Provider labels: OpenRouter, OpenAI, Ollama, xAI, host:port, local (GGUF)
- Source tags: (index.yml), (env QMD_*_MODEL), (default)
Why:
- the remote implementation introduced a new subsystem boundary that is easier to review as one unit than as a refactor followed by an immediate config follow-up
- endpoint-specific protocol selection needs an explicit validated contract before provider adapters can be layered on top

What:
- extract remote transport, logging, config, embed, expand, rerank, generate, probe, and RemoteLLM core modules
- add endpoint format validation and config typing for per-role remote protocol selection
- add core remote test coverage around transport, config, and RemoteLLM behavior

Risk/Notes:
- this is a structural extraction with behavior intended to match the prior remote path, plus the new validated format contract
- later commits layer registry and provider-specific adapters onto this core
Why:
- Adding OpenAI/Anthropic/Cohere variants directly in `RemoteLLM` would increase complexity.
- Adapter selection and wiring needed to move out of orchestration internals.

What:
- Added adapter contracts and endpoint contexts in `src/remote/adapters/types.ts`.
- Added behavior-preserving legacy adapters around existing embed/expand/rerank/generate modules.
- Added format-driven registry resolution for endpoint adapter bundles.
- Rewired `RemoteLLM` to orchestrate through adapters while preserving breaker/timeout/fallback behavior.

Risk/Notes:
- Internal architecture refactor only; no intended functional behavior change.
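A rough sketch of format-driven registry resolution with a legacy fallback, to make the shape of the refactor concrete. The interface, the adapter ids for the OpenAI entries, and `resolveGenerateAdapter` are assumptions for illustration; the real contracts live in `src/remote/adapters/types.ts` and may differ.

```typescript
// Hypothetical shapes; not the actual contracts from the branch.
type ApiFormat =
  | "auto"
  | "openai_chat_completions"
  | "openai_completions"
  | "openai_responses";

interface GenerateAdapter {
  readonly id: string;
  readonly path: string;
}

// Behavior-preserving default used whenever no dedicated adapter is registered.
const legacyGenerate: GenerateAdapter = {
  id: "legacy/generate",
  path: "/v1/chat/completions",
};

// Only formats with a dedicated adapter appear here; "auto" is deliberately absent.
const generateRegistry = new Map<ApiFormat, GenerateAdapter>([
  ["openai_completions", { id: "openai/completions-generate", path: "/v1/completions" }],
  ["openai_responses", { id: "openai/responses-generate", path: "/v1/responses" }],
]);

// Resolve by configured format, falling back to the legacy adapter.
function resolveGenerateAdapter(format: ApiFormat): GenerateAdapter {
  return generateRegistry.get(format) ?? legacyGenerate;
}
```

Keeping the fallback in one place means the orchestrator never branches on format itself; it only asks the registry for an adapter.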
Add concrete adapters for three OpenAI-style generation protocols:
- /v1/chat/completions (openai_chat_completions)
- /v1/completions (openai_completions)
- /v1/responses (openai_responses)

Each protocol gets its own ExpandAdapter + GenerateAdapter with protocol-specific request/response shapes. Shared normalization helpers extract text consistently from the different response formats.

Why: enables expand/generate to work across OpenAI-compatible variants (vLLM, OpenRouter, Ollama, OpenAI) with explicit format selection. Users can now set expand_api_format=openai_completions for legacy endpoints.

Risk: low. auto and anthropic_messages stay on legacy code paths. Existing behavior is preserved via the registry fallback pattern.

Validation:
- 120 tests pass (50 new, zero regressions)
- npm run -s test:types passes
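To illustrate why shared normalization helpers are needed, here is a sketch of text extraction for the three response shapes. The function names are hypothetical and this is not the branch's implementation; the field paths follow the publicly documented OpenAI response formats (`choices[0].message.content`, `choices[0].text`, and `output[].content[]` blocks of type `output_text`).

```typescript
// Safe property lookup on unknown JSON; returns undefined for non-objects.
function get(value: unknown, key: string): unknown {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)[key]
    : undefined;
}

function first(value: unknown): unknown {
  return Array.isArray(value) ? value[0] : undefined;
}

// /v1/chat/completions: text lives in choices[0].message.content
function textFromChatCompletions(body: unknown): string {
  const content = get(get(first(get(body, "choices")), "message"), "content");
  return typeof content === "string" ? content : "";
}

// /v1/completions: text lives in choices[0].text
function textFromCompletions(body: unknown): string {
  const text = get(first(get(body, "choices")), "text");
  return typeof text === "string" ? text : "";
}

// /v1/responses: concatenate every output_text block across output items
function textFromResponses(body: unknown): string {
  const output = get(body, "output");
  if (!Array.isArray(output)) return "";
  const parts: string[] = [];
  for (const item of output as unknown[]) {
    const content = get(item, "content");
    if (!Array.isArray(content)) continue;
    for (const block of content as unknown[]) {
      const text = get(block, "text");
      if (get(block, "type") === "output_text" && typeof text === "string") {
        parts.push(text);
      }
    }
  }
  return parts.join("");
}
```

Each helper returns an empty string on a malformed body, which lets a caller treat "no text" as a single fallback condition.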
Why:
- anthropic_messages format previously routed to legacy OpenAI chat adapters, sending incorrect request shapes (Authorization bearer, system-as-message) to Anthropic endpoints
- Phase 3 implements proper Anthropic Messages protocol with x-api-key auth, top-level system field, and content-block response extraction

What:
- src/remote/adapters/anthropic-messages.ts (new): ExpandAdapter + GenerateAdapter for /v1/messages, with buildHeaders/buildMessagesPayload helpers
- src/remote/adapters/normalization.ts: added normalizeAnthropicMessagesText for content-block text extraction (handles multi-block, tool_use skip)
- src/remote/adapters/registry.ts: anthropic_messages now routes to anthropic/messages-expand and anthropic/messages-generate (was legacy)
- test/remote.test.ts: 26 new tests covering normalization edge cases, request shape verification (x-api-key, anthropic-version, system field), circuit breaker, fallback on empty/malformed/tool_use-only responses, and RemoteLLM integration path

Risk-Notes:
- Backward compatible: auto format and all existing OpenAI protocol adapters are unchanged
- The generate adapter sends NO system prompt (unlike expand); this matches existing OpenAI generate adapter conventions
- Response text concatenates across multiple content blocks in order; non-text blocks (tool_use, image) are silently skipped
- Tests are mock-server-only; live Anthropic contract testing deferred
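The request and response shapes described above can be sketched as follows. The helper names match those mentioned in the commit, but the bodies are an independent illustration rather than the code in `anthropic-messages.ts`; the `MessagesRequest` type and the pinned `anthropic-version` value are assumptions.

```typescript
// Hypothetical request type for this sketch.
interface MessagesRequest {
  model: string;
  prompt: string;
  maxTokens: number;
  system?: string; // expand would set this; generate would leave it unset
}

// Anthropic authenticates with x-api-key, not an Authorization bearer token.
function buildHeaders(apiKey: string): Record<string, string> {
  return {
    "content-type": "application/json",
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01", // assumed version pin
  };
}

// The system prompt is a top-level field, not a message with role "system".
function buildMessagesPayload(req: MessagesRequest): Record<string, unknown> {
  const payload: Record<string, unknown> = {
    model: req.model,
    max_tokens: req.maxTokens,
    messages: [{ role: "user", content: req.prompt }],
  };
  if (req.system !== undefined) payload["system"] = req.system;
  return payload;
}

// Concatenate text blocks in order; skip tool_use, image, and malformed blocks.
function normalizeAnthropicMessagesText(body: unknown): string {
  if (typeof body !== "object" || body === null) return "";
  const content: unknown = (body as { content?: unknown }).content;
  if (!Array.isArray(content)) return "";
  const parts: string[] = [];
  for (const block of content as unknown[]) {
    if (typeof block !== "object" || block === null) continue;
    const { type, text } = block as { type?: unknown; text?: unknown };
    if (type === "text" && typeof text === "string") parts.push(text);
  }
  return parts.join("");
}
```

A response that contains only `tool_use` blocks normalizes to an empty string, which is the condition the fallback tests in this commit exercise.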
Why:
- Cohere-compatible deployments behind vLLM showed endpoint and request-shape drift.

What:
- Added dedicated Cohere embed/rerank adapters and registry wiring.
- Added resilient `/v2/embed` and `/rerank` path fallback plus request-shape fallback.
- Added `input_type` fallback mappings for vLLM-hosted Cohere endpoints.
- Tightened rerank malformed-response handling and expanded integration coverage.
- Updated README with vLLM Cohere configuration guidance.

Risk/Notes:
- Preserves OpenAI v1 compatibility while improving Cohere/vLLM interoperability.
Why:
- Reindex and search paths performed unnecessary work across hydration and embedding flows.
- Remote transport created repeated request setup overhead.

What:
- Added source-metadata fast paths for incremental reindexing and routed CLI indexing via shared reindexer.
- Deferred search body hydration until after fusion/dedupe and reduced repeated context resolution.
- Batched vector-only query embeddings and trimmed unnecessary reordering work.
- Refactored remote transport to shared keep-alive JSON requests and reusable bearer auth helpers.

Risk/Notes:
- Performance-focused change; remote adapter behavior remains covered by tests.
Why:
- Ollama support was split across separate embed and text commits even though they form one provider integration
- reviewers should be able to evaluate the complete Ollama surface in one place

What:
- add native Ollama embed adapter for /api/embed with compatibility fallback behavior
- add native Ollama chat and generate adapters for /api/chat and /api/generate
- wire Ollama formats into adapter resolution, config aliases, and remote tests

Risk/Notes:
- this is provider-specific surface area on top of the existing remote adapter framework
- local workflows remain unaffected unless an Ollama format is explicitly selected
Why:
- Filepath hydration for virtual `qmd://` docs was doing expensive post-filtering.
- Half-open breaker recovery allowed concurrent probe stampedes.

What:
- Rewrote `loadSearchDocumentsByFilepaths` to hydrate virtual candidates via indexed `(collection,path)` predicates.
- Kept compatibility fallback for non-virtual identifiers.
- Restricted half-open breaker recovery to a single in-flight probe.
- Added circuit-breaker coverage for single-probe half-open gating.

Risk/Notes:
- Performance and resilience update with compatibility fallback retained.
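The single-probe half-open gating can be sketched as a small state machine. This is an illustrative model, not the contents of `circuit-breaker.ts`: the class shape, the `tryAcquire` name, and the threshold and cooldown parameters are assumptions, and the clock is injectable only so the behavior is easy to test.

```typescript
type BreakerState = "closed" | "open" | "half_open";

// Hypothetical breaker: opens after N consecutive failures, then admits
// exactly one probe request once the cooldown has elapsed.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if the caller may send a request right now.
  tryAcquire(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = "half_open";
    }
    // Half-open: admit one probe and reject everyone else until it resolves.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.probeInFlight = false;
    this.state = "closed";
  }

  recordFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Without the `probeInFlight` flag, every caller arriving after the cooldown would be admitted at once, which is the stampede this commit removes.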
Show per-endpoint connection state for embed/expand/rerank/generate in qmd status, including latency and model-availability checks when remote endpoints expose model listings. Probe common metadata endpoints for OpenAI/Ollama-style providers and mark local-only configurations explicitly. Start remote probes in parallel to keep status responsive when one endpoint is slow or unavailable.
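As a sketch of the parallel probing described above: all probes are started before any is awaited, and `Promise.allSettled` keeps one failing endpoint from hiding the others. The `ProbeResult` shape and the `probeAll` function are hypothetical; the real probe logic lives in the branch's probe module and also checks model listings, which this sketch omits.

```typescript
type Role = "embed" | "expand" | "rerank" | "generate";

interface ProbeResult {
  role: Role;
  ok: boolean;
  latencyMs?: number;
  error?: string;
}

// A probe resolves when the endpoint answers and rejects when it does not.
type Probe = () => Promise<void>;

async function probeAll(
  probes: Partial<Record<Role, Probe>>,
  now: () => number = () => Date.now(),
): Promise<ProbeResult[]> {
  const entries = Object.entries(probes) as [Role, Probe][];
  // map() starts every probe immediately; nothing is awaited one at a time.
  const settled = await Promise.allSettled(
    entries.map(async ([, probe]) => {
      const started = now();
      await probe();
      return now() - started;
    }),
  );
  return settled.map((result, i): ProbeResult => {
    const [role] = entries[i];
    return result.status === "fulfilled"
      ? { role, ok: true, latencyMs: result.value }
      : { role, ok: false, error: String(result.reason) };
  });
}
```

Roles with no configured endpoint are simply absent from `probes`, so a status command can label them as local-only without issuing a request.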
… support

Why:
- the branch ended with two commits carrying the same subject/body even though they represented one logical tightening pass
- remote embedding needed more robust endpoint normalization for Cohere-compatible and vLLM deployments
- retrieval quality and remote token-bound chunking needed explicit guardrails before a PR against current upstream

What:
- normalize Cohere-compatible embed routing, add vLLM pooling support, and keep host-aware input_type behavior
- improve retrieval candidate selection, expansion filtering, rerank chunk aggregation, and add the retrieval regression gate
- add remote tokenizer support for exact token-aware chunking in remote mode, with fallback behavior and targeted tests
- align the localhost Cohere-adapter test with the generic-host contract while keeping Cohere-host expectations covered separately

Risk/Notes:
- remote tokenizer behavior adds another network-dependent path, but it degrades to character-based chunking unless explicitly forced
- this rewrites only the local PR branch tail; backup/pr-remote-llm-clean-pre-tail-squash preserves the pre-squash history
Why:
- qmd collection add with no path silently indexed the current working directory, which is a footgun on large folders
- the cleaner fix is to reject the command unless the target path is explicit, rather than layering confirmation UX into this branch

What:
- make qmd collection add fail with usage guidance when the path argument is omitted
- add a CLI regression test covering the no-argument failure path

Risk/Notes:
- existing documented and tested usage with qmd collection add . or an explicit path is unchanged
- this aligns with issue tobi#684 without broadening the PR into prompt or dry-run behavior
Why:
- the remote adapter work needs focused user-facing documentation without bloating the main README
- protocol roles, format selection, provider examples, and tokenizer behavior are easier to review in a dedicated doc than as scattered README additions

What:
- add docs/remote-endpoints.md covering remote roles, protocol formats, config precedence, provider examples, runtime behavior, tokenizer support, and troubleshooting
- keep README upstream-friendly by replacing the large remote README expansion with a single pointer line to the dedicated doc
- remove fork-specific wording and identifying IP addresses from the documentation examples

Risk/Notes:
- documentation only; no runtime behavior changed in this commit
- the README delta versus origin/main is intentionally reduced to a single line
ebcb471 to 5625e42
Squash the follow-up test fixes for status and local-config output after the status model changes.
06eecc7 to ce64795
Pass the constructed LLM through createStore().expandQuery, avoid instantiating local LlamaCpp during remote tokenizer detection when a remote-style default LLM is already active, and remove unsolicited startup probe logging from createStore().
feat: remote endpoint support for QMD (embed, expand, rerank, generate)
Summary
This PR adds remote endpoint support across QMD's embed, expand, rerank, and generate paths.
It introduces role-specific endpoint configuration plus protocol adapters for OpenAI-compatible, Cohere, Anthropic, Ollama, and vLLM-style APIs. Remote mode activates only when endpoints are configured, so existing local-only behavior remains the default.
Current branch shape:
What it adds
Remote runtime
- `src/remote/` module with per-role endpoint configuration for `embed`, `expand`, `rerank`, and `generate`

QMD integration
- `src/store.ts`, `src/llm.ts`, `src/index.ts`, and `src/cli/qmd.ts`
- `qmd status` / `qmd doctor` model and connection reporting updates
- `createStore({ llm })` support preserved for remote/custom LLM injection
- … `LlamaCpp` during tokenizer detection

Docs and tests
- `docs/remote-endpoints.md` with provider roles, formats, examples, tokenizer notes, and troubleshooting
- `README.md` pointer to the dedicated remote docs
- `test/remote.test.ts` for adapter/runtime behavior
- `test/retrieval-gate.test.ts` for retrieval-quality regression gating

Small CLI fix included
- `qmd collection add` with no path argument now exits with usage guidance instead of implicitly indexing CWD

Related work
This overlaps with earlier remote-endpoint efforts, but differs in scope and current integration point:
… `main`

Architecture
flowchart TD
    Q["qmd embed / query"] --> R["RemoteLLM"]
    R --> E["embed.ts → /v1/embeddings, /v2/embed, /pooling, /api/embed"]
    R --> EX["expand.ts → chat / completions / responses / messages / Ollama"]
    R --> RK["rerank.ts → /rerank or /score"]
    R --> G["generate.ts → chat / completions / responses / messages / Ollama"]
    R --> P["probe.ts → metadata / startup checks"]
    R --> T["transport.ts → Node http/https"]
    R --> CB["circuit-breaker.ts → per-role breaker"]
    R --> A["adapters/ → protocol handlers"]

Validation
Ran on this branch:
- `npm run test:types`
- `node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/remote.test.ts test/retrieval-gate.test.ts test/local-config.test.ts`
- `node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/cli.test.ts test/local-config.test.ts`

Most recent focused results:
- `198/198` passed for `remote + retrieval-gate + local-config`
- `148/148` passed for `cli + local-config`

Manual / live validation performed during development:
Configuration
Remote mode activates only when endpoint config is present.
YAML (`~/.config/qmd/index.yml`)

Environment variables
QMD_EMBED_BASE_URL=http://embed-host:8000/v1
QMD_EMBED_MODEL=Qwen3-Embedding-0.6B
QMD_EXPAND_BASE_URL=https://openrouter.ai/api/v1
QMD_EXPAND_MODEL=google/gemini-2.0-flash-lite-001
# ... etc

Precedence:
- … `OPENAI_BASE_URL` for embed

See docs/remote-endpoints.md for provider roles, protocol formats, tokenizer behavior, and example setups.
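A minimal sketch of the two precedence rules stated in this PR: env vars win over YAML, and with neither set the role stays local. The `resolveEndpoint` function and the `Resolved` type are hypothetical; the `QMD_<ROLE>_BASE_URL` pattern is extrapolated from the examples above and the `<role>_api_url` YAML key from the `*_api_url` naming, so both should be read as assumptions. The `OPENAI_BASE_URL` handling is left out because its position in the precedence list is not shown here.

```typescript
type Role = "embed" | "expand" | "rerank" | "generate";
type StringMap = Record<string, string | undefined>;

interface Resolved {
  url: string;
  source: "env" | "index.yml";
}

// Returns undefined when nothing is configured, meaning the role stays local.
function resolveEndpoint(role: Role, env: StringMap, yaml: StringMap): Resolved | undefined {
  // 1. Environment variable, e.g. QMD_EMBED_BASE_URL (assumed naming pattern).
  const fromEnv = env[`QMD_${role.toUpperCase()}_BASE_URL`]?.trim();
  if (fromEnv) return { url: fromEnv, source: "env" };

  // 2. YAML value, e.g. embed_api_url (assumed key name).
  const fromYaml = yaml[`${role}_api_url`]?.trim();
  if (fromYaml) return { url: fromYaml, source: "index.yml" };

  // 3. No remote URL is assumed by default.
  return undefined;
}
```

Returning the source alongside the URL makes it straightforward for a status command to print tags such as `(index.yml)` next to each endpoint.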
Backward compatibility
- No `*_api_url` configured: QMD stays local-only
- `qmd collection add <path>` behavior is unchanged
- … `llm` without losing direct `expandQuery()` behavior

Out of scope
- `--dry-run` for `collection add`