feat: OpenAI-compatible remote embedding, reranking, and query expansion by rghamilton3 · Pull Request #1 · rghamilton3/qmd

rghamilton3 · 2026-05-27T18:58:33Z

Summary

Cherry-picks the 5 commits from georgelichen's PR #629 (branch georgelichen:merge-pr-517-remote-llm) onto current main, then applies Kaspre's three documented fixes from the tobi#629 comment thread.

What's new

Remote LLM backends (`RemoteLLM`, `HybridLLM`)

- RemoteLLM: HTTP client implementing the LLM interface against any OpenAI-compatible API:
  - /v1/embeddings for embedding (with batch splitting, dimension validation, index-sorting)
  - /v1/rerank for reranking
  - /v1/chat/completions for query expansion (lex:/vec:/hyde: line format)
  - Circuit breaker pattern (closed → open → half-open) per endpoint
- HybridLLM: routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp
- remoteConfigFromEnv(): reads QMD_EMBED_API_URL, QMD_EMBED_API_KEY, QMD_EMBED_API_MODEL, QMD_RERANK_API_*, QMD_EXPAND_API_* (env vars override the YAML models: block)
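
The closed → open → half-open cycle mentioned above can be sketched as a small state machine. This is only an illustration, not the code in `src/remote-llm.ts`: the class name `EndpointBreaker`, the failure threshold, and the cool-down period are all assumptions, and the clock is injected so the transitions are easy to check.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical per-endpoint breaker; the thresholds are illustrative only.
class EndpointBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Current state; an open breaker turns half-open once the cool-down elapses.
  current(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // True if a request may be sent right now.
  canRequest(): boolean {
    return this.current() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    // A failed trial request in half-open re-opens the breaker immediately;
    // in closed state it opens after maxFailures consecutive failures.
    if (this.current() === "half-open" || ++this.failures >= this.maxFailures) {
      this.state = "open";
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}
```

Keeping one such instance per endpoint (embed, rerank, expand) would let a failing reranker trip its own breaker without blocking embedding requests.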

Configuration (YAML `models:` block or env vars)

models:
  embed_api_url: http://vllm-host:8000/v1
  embed_api_model: text-embedding-3-small
  embed_api_key: sk-...          # optional
  rerank_api_url: http://vllm-host:8001/v1
  rerank_api_model: bge-reranker-v2-m3
  expand_api_url: http://vllm-host:8002/v1
  expand_api_model: Qwen3-4B
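
The precedence rule (env vars override the YAML `models:` block) can be sketched for the embedding endpoint as follows. The helper name `resolveEmbedEndpoint`, its return shape, and the choice to return `undefined` when the URL or model is missing are assumptions made for illustration; the real `remoteConfigFromEnv()` may differ.

```typescript
// Subset of the YAML `models:` block shown above.
interface ModelsYaml {
  embed_api_url?: string;
  embed_api_model?: string;
  embed_api_key?: string;
}

interface EmbedEndpoint {
  url: string;
  model: string;
  apiKey: string | undefined;
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: each env var wins over its YAML counterpart.
// Returns undefined when no usable remote embedding endpoint is configured.
function resolveEmbedEndpoint(env: Env, yaml: ModelsYaml = {}): EmbedEndpoint | undefined {
  const url = env["QMD_EMBED_API_URL"] ?? yaml.embed_api_url;
  const model = env["QMD_EMBED_API_MODEL"] ?? yaml.embed_api_model;
  if (!url || !model) return undefined;
  return { url, model, apiKey: env["QMD_EMBED_API_KEY"] ?? yaml.embed_api_key };
}
```

Passing the environment in as a plain record (rather than reading it globally) keeps the sketch easy to test; a caller would hand in its process environment.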

LLM interface additions

- embedModelName, generateModelName?, rerankModelName? readonly properties
- usesRemoteEmbedding?: skips local tokenizer preprocessing when true
- embedBatch() method
- tokenize()/detokenize() on interface (stubs throw for remote-only backends)
- getDefaultLLM() / setDefaultLLM() generalized singleton (getDefaultLlamaCpp() deprecated but kept)
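
To make the additions above concrete, here is a rough sketch of how they might fit together. The property names come from the list above, but every method signature and both type names (`LLMSketch`, `RemoteOnlySketch`) are guesses for illustration and should not be read as the actual declarations in `src/llm.ts`.

```typescript
// Illustrative only: the signatures here are assumptions, not the real LLM interface.
interface LLMSketch {
  readonly embedModelName: string;
  readonly generateModelName?: string;
  readonly rerankModelName?: string;
  readonly usesRemoteEmbedding?: boolean;
  embedBatch(texts: string[]): Promise<number[][]>;
  tokenize(text: string): number[];
  detokenize(tokens: number[]): string;
}

// A remote-only backend has no tokenizer, so its stubs throw.
class RemoteOnlySketch implements LLMSketch {
  readonly embedModelName = "text-embedding-3-small";
  readonly usesRemoteEmbedding = true;

  async embedBatch(texts: string[]): Promise<number[][]> {
    // Placeholder vectors; a real backend would POST to /v1/embeddings.
    return texts.map(() => [0, 0, 0]);
  }

  tokenize(_text: string): number[] {
    throw new Error("tokenize is not available on a remote-only backend");
  }

  detokenize(_tokens: number[]): string {
    throw new Error("detokenize is not available on a remote-only backend");
  }
}

// Callers can branch on the flag instead of calling tokenize() and catching.
function needsLocalTokenizer(llm: LLMSketch): boolean {
  return llm.usesRemoteEmbedding !== true;
}
```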

Remote-aware chunking

When usesRemoteEmbedding is true, chunkDocumentByApproxTokens() uses a character-heuristic (~3 chars/token) instead of the local tokenizer.
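
A character heuristic of this kind might look like the sketch below. It is not the body of `chunkDocumentByApproxTokens()`; the function name `chunkByApproxTokens`, its parameters, and the whitespace-preferring break rule are assumptions chosen to show the idea of budgeting chunks at roughly 3 characters per token.

```typescript
// Hypothetical sketch: split text into chunks of at most maxTokens approximate
// tokens, estimating one token per charsPerToken characters.
function chunkByApproxTokens(text: string, maxTokens: number, charsPerToken = 3): string[] {
  const maxChars = Math.max(1, Math.floor(maxTokens * charsPerToken));
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Prefer to break at the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```

Because no tokenizer is involved, the chunk sizes are only estimates; text that tokenizes densely (code, non-Latin scripts) may exceed the intended token budget.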

Kaspre's fixes (PR tobi#629 comments)

Fix A — Sigmoid normalization in RemoteLLM.rerank

llama.cpp's /v1/rerank emits log-odds (~−10 to +10), not probabilities. Without normalization every score goes negative and --min-score 0.3 fires "No results found" for every query.

const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));
score: sigmoid(r.relevance_score)   // normalises log-odds → [0, 1]

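For rerankers that already emit 0–1 values the transform preserves ranking order but is not a strict no-op: sigmoid maps [0, 1] to roughly [0.5, 0.73], so all such scores clear a --min-score 0.3 threshold.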
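
The interaction with `--min-score` can be seen in a small worked sketch. The helper `normaliseAndFilter` and the `RawRerankResult` shape are assumptions for illustration (only `relevance_score` appears in the snippet above); the sample scores are made up.

```typescript
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

interface RawRerankResult {
  index: number;
  relevance_score: number; // log-odds as returned by the server
}

// Hypothetical helper: normalise scores, drop those below minScore, sort descending.
function normaliseAndFilter(results: RawRerankResult[], minScore: number) {
  return results
    .map((r) => ({ index: r.index, score: sigmoid(r.relevance_score) }))
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// All three raw scores are negative, so a raw comparison against 0.3 would
// reject every result. After normalisation they are about 0.45, 0.05, and 0.38.
const kept = normaliseAndFilter(
  [
    { index: 0, relevance_score: -0.2 },
    { index: 1, relevance_score: -3 },
    { index: 2, relevance_score: -0.5 },
  ],
  0.3,
);
```

Here `kept` holds the results with index 0 and index 2, in that order.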

Fix B — RemoteLLM.expandQuery (already in the cherry-picked commits)

POSTs to <expand_api_url>/chat/completions with a system prompt eliciting lex:/vec:/hyde: line-prefixed output. Parsing and fallback mirror LocalLLM.expandQuery shape.
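
Parsing the line-prefixed output might look like the following sketch. It is not the project's `parseExpandResponse()`: the name `parseExpansionLines`, the `Expansion` shape, and the decision to return an empty array on unusable output (leaving the fallback to the caller) are assumptions.

```typescript
type ExpansionType = "lex" | "vec" | "hyde";

interface Expansion {
  type: ExpansionType;
  text: string;
}

// Hypothetical parser: keeps lines of the form "lex: ...", "vec: ...", "hyde: ..."
// and ignores anything else the model emits (greetings, blank lines, markdown).
function parseExpansionLines(output: string, includeLexical = true): Expansion[] {
  const expansions: Expansion[] = [];
  for (const line of output.split("\n")) {
    const match = /^\s*(lex|vec|hyde):\s*(.+?)\s*$/.exec(line);
    if (!match) continue;
    const type = match[1] as ExpansionType;
    if (type === "lex" && !includeLexical) continue;
    expansions.push({ type, text: match[2] ?? "" });
  }
  return expansions;
}
```

An empty result signals that the model did not follow the format, at which point a caller could fall back to the original query.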

Fix C — Pre-flight probe in vectorIndex()

When usesRemoteEmbedding === true, embeds a single token before starting the batch loop. Config mistakes (bad URL, wrong model name, auth failure) surface immediately:

✗ Remote embedding probe failed: connect ECONNREFUSED 127.0.0.1:8000
  Verify QMD_EMBED_API_URL and QMD_EMBED_API_MODEL, then retry.
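
A probe along these lines could be structured as below. This is a sketch under assumptions, not the code in `vectorIndex()`: the `EmbedFn` type and the function name `probeRemoteEmbedding` are hypothetical, and the empty-vector check is an extra guard not mentioned in the description. The message text follows the example output above.

```typescript
// Hypothetical shape: any function that embeds one string remotely.
type EmbedFn = (text: string) => Promise<number[]>;

// Sketch of a pre-flight check: embed one short input and turn any failure
// into a single actionable message before the batch loop starts.
async function probeRemoteEmbedding(embed: EmbedFn): Promise<void> {
  try {
    const vector = await embed("probe");
    if (vector.length === 0) throw new Error("empty embedding returned");
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(
      `Remote embedding probe failed: ${reason}\n` +
        "  Verify QMD_EMBED_API_URL and QMD_EMBED_API_MODEL, then retry.",
    );
  }
}
```

Running this once before indexing costs a single request and avoids discovering a bad URL or model name partway through a long batch.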

Other fixes in the Kaspre commit

- Added generateModelName? and rerankModelName? to the LLM interface (used by store.ts via optional chaining for default model fallbacks; their absence was a TS compile error)
- Added c.red to the terminal colour map (used by Fix C's error output)
- Doctor device-probe guard: check instanceof LlamaCpp before calling getDeviceInfo(); emits "unavailable" for non-local backends
- Updated rerank unit test to send realistic log-odds mock scores and assert sigmoid-normalised output
- Fixed store rerank-dedup spy from getDefaultLlamaCpp → getDefaultLLM
- Fixed doubled async () => syntax error in one store test header

Tests

52 / 52 test/remote-llm.test.ts (unit tests with mock HTTP server)
219 / 219 test/store.test.ts
test/remote-llm-integration.test.ts skipped unless VLLM_EMBED_URL/VLLM_EMBED_MODEL env vars are set

Files changed

| File | Change |
| --- | --- |
| `src/remote-llm.ts` | New: `RemoteLLM`, `HybridLLM`, helpers |
| `src/hybrid-llm.ts` | New: routing shim |
| `src/llm.ts` | Extended `LLM` interface; `getDefaultLLM`/`setDefaultLLM`; remote format helpers |
| `src/store.ts` | `getLlm()` returns `LLM`; remote chunking path; `chunkDocumentByApproxTokens` export |
| `src/collections.ts` | `ModelsConfig` remote API fields |
| `src/cli/qmd.ts` | `getStore()` wires `HybridLLM`; Fix C probe; doctor guard; colour map |
| `test/remote-llm.test.ts` | New unit tests |
| `test/remote-llm-integration.test.ts` | New live integration tests |
| `test/store.test.ts` | New remote-embedding test; spy/syntax fixes |
| `CHANGELOG.md` | Entry under `[Unreleased]` |

Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add intent? to LLM interface and ILLMSession expandQuery signature (store.ts passes { intent } but the interface didn't declare it, causing a tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after getStore() so content_vectors.model reflects the actual LLM in use (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that don't share a word with the original query, falls back gracefully on bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand, otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header fallback, lex/vec/hyde parsing, includeLexical=false filtering, fallback on bad output, query-term filtering, circuit breaker, HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM routing verified via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars, skipped when absent)

Merge PR tobi#517 and keep it compatible with the current main branch.

Constraint: Upstream main diverged after PR tobi#517, so a fast-forward merge was not possible
Rejected: Cherry-pick the PR commits directly | would still require the same compatibility fixes and lose merge context
Confidence: medium
Scope-risk: moderate
Directive: Keep RemoteLLM and HybridLLM aligned with the LLM tokenize/detokenize interface and verify Windows CLI wrappers separately from Unix shell scripts
Tested: npx tsc -p tsconfig.build.json; npx vitest run --reporter=verbose test/remote-llm.test.ts test/remote-llm-integration.test.ts
Not-tested: full vitest suite; npm run build wrapper script on Windows; live GitHub Actions

When the active embedding backend is remote, generateEmbeddings now uses character-space chunking instead of token-based preprocessing. This keeps qmd embed from initializing node-llama-cpp solely to tokenize input before calling a remote embedding API.

The change is scoped to indexing. Query-time expansion and reranking keep their existing routing rules, and a regression test now fails if remote embedding falls back to local tokenization during indexing.

Constraint: Remote embedding backends do not expose a tokenizer interface in QMD today
Rejected: Change HybridLLM tokenize() globally | would alter query-time behavior and broaden risk unnecessarily
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If remote token-aware chunking is added later, keep qmd embed free of mandatory local llama initialization
Tested: npx tsc -p tsconfig.build.json
Tested: npx vitest run test/store.test.ts -t "generateEmbeddings" --reporter=verbose
Tested: npx vitest run test/store.test.ts -t "Token chunking guardrails" --reporter=verbose
Not-tested: Full end-to-end qmd embed against a live remote embedding service after this code change

Fix A (remote-llm.ts): Apply sigmoid normalization to reranker scores. llama.cpp /v1/rerank emits log-odds (~-10 to +10), not probabilities. sigmoid(x) = 1/(1+e^-x) normalises to [0,1] without breaking rerankers that already output probabilities (sigmoid is near-linear for x in 0-1).

Fix B: Already present in commit f2fd64e (remote query expansion via OpenAI-compatible chat/completions with lex:/vec:/hyde: format).

Fix C (cli/qmd.ts): Pre-flight probe in vectorIndex() startup. When usesRemoteEmbedding is true, embed a single token before the batch loop so config mistakes (bad URL, wrong model, auth failure) surface immediately with a clear error message rather than mid-run.

Rebase fix 1 (cli/qmd.ts): Use getDefaultLLM().embedModelName for the model variable at the embed status line (line ~1938).

Rebase fix 2 (cli/qmd.ts): Doctor device-probe guard: check instanceof LlamaCpp before calling getDeviceInfo; emit unavailable message for non-local LLM backends.

Additional build fixes:
- llm.ts: Add generateModelName? and rerankModelName? to LLM interface (store.ts uses these via optional chaining for default model names).
- cli/qmd.ts: Add 'red' to the terminal colour map (used by Fix C error message).

Test fixes:
- test/remote-llm.test.ts: Update rerank test to send log-odds mock scores and assert sigmoid-normalised output (toBeCloseTo).
- test/store.test.ts: Fix rerank-dedup spy from getDefaultLlamaCpp to getDefaultLLM (store.ts now routes through getDefaultLLM).
- test/store.test.ts: Fix doubled async callback syntax on one test header (parse error from earlier conflict resolution).

Kaspre · 2026-06-02T15:22:45Z

Hi @rghamilton3 — opened tobi#705 consolidating the tobi#629 line onto current main with the documented fixes (sigmoid / expandQuery / pre-flight probe) plus a HybridLLM.rerank local fallback and oversized-rerank recovery, targeting upstream. Since your PR did the same tobi#629 + fixes rebase, flagging it in case you'd like to converge there.

rghamilton3 · 2026-06-02T15:55:59Z

Hi @Kaspre, thanks for the heads up. Since this was shamelessly stolen from your work anyway, I'll follow you on this :)

Jim Smith and others added 6 commits May 27, 2026 11:03

rghamilton3 closed this Jun 2, 2026

rghamilton3 wants to merge 6 commits into main from remote-llm-pr629
