feat: OpenAI-compatible remote embedding, reranking, and query expansion #1
rghamilton3 wants to merge 6 commits into
Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
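The commit message names a circuit breaker but does not show it. As a rough illustration of the idea only, a minimal breaker could look like the sketch below. The class name SimpleCircuitBreaker, the failure threshold, and the cooldown are assumptions made for this example, not the values or structure RemoteLLM actually uses.

```typescript
// Hypothetical sketch, not the RemoteLLM implementation.
// The breaker opens after `maxFailures` consecutive failures and rejects
// calls until `cooldownMs` has elapsed, then lets a trial call through.
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, useful for tests
  }

  // True while calls should be rejected without touching the network.
  get isOpen(): boolean {
    if (this.openedAt === null) return false;
    return this.now() - this.openedAt < this.cooldownMs;
  }

  recordSuccess(): void {
    // Any success closes the breaker and resets the counter.
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.openedAt = this.now();
    }
  }

  // Wraps an async request with the bookkeeping above.
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen) {
      throw new Error("circuit open: remote endpoint temporarily disabled");
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

Keeping one breaker instance per endpoint (embed, rerank, expand) would let a failing endpoint trip without disabling the others, which matches the "independent circuit breaker" wording used later in this PR.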
- Add intent? to LLM interface and ILLMSession expandQuery signature
(store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
getStore() so content_vectors.model reflects the actual LLM in use
(previously always stored DEFAULT_EMBED_MODEL_URI even with remote)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that don't share a word with the original query, falls back gracefully on bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand, otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header fallback, lex/vec/hyde parsing, includeLexical=false filtering, fallback on bad output, query-term filtering, circuit breaker, HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM routing verified via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars, skipped when absent)
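To make the parsing rules in this commit concrete, here is a small sketch of how lex:/vec:/hyde: lines might be parsed and filtered. It is not the PR's parseExpandResponse(): the function name parseExpandSketch, the Expansion type, the choice to apply the shared-word filter to every line type, and the fallback of returning the original query as a single vec entry are all assumptions made for illustration.

```typescript
// Hypothetical sketch of lex:/vec:/hyde: parsing; names and fallback are assumed.
type ExpansionType = "lex" | "vec" | "hyde";

interface Expansion {
  type: ExpansionType;
  text: string;
}

// Lowercased alphanumeric words of a string.
function wordsOf(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

function parseExpandSketch(raw: string, query: string, includeLexical = true): Expansion[] {
  const queryWords = wordsOf(query);
  const out: Expansion[] = [];

  for (const line of raw.split("\n")) {
    // Only lines with a recognised prefix are kept; model chatter is ignored.
    const m = /^\s*(lex|vec|hyde):\s*(.+?)\s*$/.exec(line);
    if (!m) continue;
    const type = m[1] as ExpansionType;
    const text = (m[2] ?? "").trim();
    if (text.length === 0) continue;
    if (type === "lex" && !includeLexical) continue;

    // Drop expansions that share no word with the original query.
    const sharesWord = wordsOf(text).some((w) => queryWords.indexOf(w) !== -1);
    if (!sharesWord) continue;

    out.push({ type, text });
  }

  // Assumed fallback: if nothing usable was parsed, search with the query itself.
  return out.length > 0 ? out : [{ type: "vec", text: query }];
}
```

For a response such as "lex: remote embedding server" followed by "hyde: Bananas are yellow." and the query "remote embedding", this sketch keeps the first line and drops the second, because the hyde text shares no word with the query.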
Merge PR tobi#517 and keep it compatible with the current main branch.

Constraint: Upstream main diverged after PR tobi#517, so a fast-forward merge was not possible
Rejected: Cherry-pick the PR commits directly | would still require the same compatibility fixes and lose merge context
Confidence: medium
Scope-risk: moderate
Directive: Keep RemoteLLM and HybridLLM aligned with the LLM tokenize/detokenize interface and verify Windows CLI wrappers separately from Unix shell scripts
Tested: npx tsc -p tsconfig.build.json; npx vitest run --reporter=verbose test/remote-llm.test.ts test/remote-llm-integration.test.ts
Not-tested: full vitest suite; npm run build wrapper script on Windows; live GitHub Actions
When the active embedding backend is remote, generateEmbeddings now uses character-space chunking instead of token-based preprocessing. This keeps qmd embed from initializing node-llama-cpp solely to tokenize input before calling a remote embedding API.

The change is scoped to indexing. Query-time expansion and reranking keep their existing routing rules, and a regression test now fails if remote embedding falls back to local tokenization during indexing.

Constraint: Remote embedding backends do not expose a tokenizer interface in QMD today
Rejected: Change HybridLLM tokenize() globally | would alter query-time behavior and broaden risk unnecessarily
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If remote token-aware chunking is added later, keep qmd embed free of mandatory local llama initialization
Tested: npx tsc -p tsconfig.build.json
Tested: npx vitest run test/store.test.ts -t "generateEmbeddings" --reporter=verbose
Tested: npx vitest run test/store.test.ts -t "Token chunking guardrails" --reporter=verbose
Not-tested: Full end-to-end qmd embed against a live remote embedding service after this code change
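For readers unfamiliar with character-space chunking, the sketch below shows the general shape of the technique using the ~3 chars/token heuristic quoted in the PR description. It is not the PR's chunkDocumentByApproxTokens(): the function name chunkByApproxTokensSketch, the CharChunk type, and the overlap parameter are assumptions, and the real implementation may also respect paragraph or sentence boundaries, which this sketch does not.

```typescript
// Hypothetical sketch: fixed-width character windows standing in for token windows.
const APPROX_CHARS_PER_TOKEN = 3; // heuristic quoted in the PR description

interface CharChunk {
  text: string;
  pos: number; // character offset of the chunk within the document
}

function chunkByApproxTokensSketch(text: string, maxTokens: number, overlapTokens = 0): CharChunk[] {
  const maxChars = Math.max(1, maxTokens * APPROX_CHARS_PER_TOKEN);
  // Clamp the overlap so the window always advances by at least one character.
  const overlapChars = Math.max(0, Math.min(overlapTokens * APPROX_CHARS_PER_TOKEN, maxChars - 1));
  const step = maxChars - overlapChars;

  const chunks: CharChunk[] = [];
  for (let pos = 0; pos < text.length; pos += step) {
    chunks.push({ text: text.slice(pos, pos + maxChars), pos });
    // Stop once the current window already reaches the end of the text.
    if (pos + maxChars >= text.length) break;
  }
  return chunks;
}
```

Because no tokenizer is consulted, a chunk's real token count can differ from the estimate, so a conservative maxTokens leaves headroom for the remote model's context limit.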
Fix A (remote-llm.ts): Apply sigmoid normalization to reranker scores. llama.cpp /v1/rerank emits log-odds (~-10 to +10), not probabilities. sigmoid(x) = 1/(1+e^-x) normalises to [0,1] without breaking rerankers that already output probabilities (sigmoid is near-linear for x in 0-1).

Fix B: Already present in commit f2fd64e (remote query expansion via OpenAI-compatible chat/completions with lex:/vec:/hyde: format).

Fix C (cli/qmd.ts): Pre-flight probe in vectorIndex() startup. When usesRemoteEmbedding is true, embed a single token before the batch loop so config mistakes (bad URL, wrong model, auth failure) surface immediately with a clear error message rather than mid-run.

Rebase fix 1 (cli/qmd.ts): Use getDefaultLLM().embedModelName for the model variable at the embed status line (line ~1938).

Rebase fix 2 (cli/qmd.ts): Doctor device-probe guard: check instanceof LlamaCpp before calling getDeviceInfo; emit unavailable message for non-local LLM backends.

Additional build fixes:
- llm.ts: Add generateModelName? and rerankModelName? to LLM interface (store.ts uses these via optional chaining for default model names).
- cli/qmd.ts: Add 'red' to the terminal colour map (used by Fix C error message).

Test fixes:
- test/remote-llm.test.ts: Update rerank test to send log-odds mock scores and assert sigmoid-normalised output (toBeCloseTo).
- test/store.test.ts: Fix rerank-dedup spy from getDefaultLlamaCpp to getDefaultLLM (store.ts now routes through getDefaultLLM).
- test/store.test.ts: Fix doubled async callback syntax on one test header (parse error from earlier conflict resolution).
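The control flow of Fix C's pre-flight probe can be sketched roughly as follows. This is an illustration under assumptions, not the code in cli/qmd.ts: the function name preflightProbeSketch, the EmbedFn signature, the probe text, and the error wording are all hypothetical. The embed function is passed in so the sketch stays independent of any particular HTTP client.

```typescript
// Hypothetical sketch of a pre-flight embedding probe; names and messages are assumed.
type EmbedFn = (text: string) => Promise<number[] | null>;

type ProbeResult =
  | { ok: true; dimensions: number }
  | { ok: false; message: string };

async function preflightProbeSketch(embed: EmbedFn): Promise<ProbeResult> {
  try {
    // Embed one tiny input before any batch work is scheduled.
    const vector = await embed("test");
    if (!vector || vector.length === 0) {
      return {
        ok: false,
        message: "Remote embedding probe returned no vector; check the API URL and model name.",
      };
    }
    return { ok: true, dimensions: vector.length };
  } catch (err) {
    // Bad URL, wrong model, and auth failures all end up here.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, message: `Remote embedding probe failed: ${reason}` };
  }
}
```

A caller would run this once at startup and abort with the returned message when ok is false, so a misconfigured endpoint fails before the batch loop begins rather than partway through it.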
Hi @rghamilton3, opened tobi#705 consolidating the tobi#629 line onto current
Hi @Kaspre Thanks for the heads up. Since this was shamelessly stolen from your work anyways I'll follow you on this :)
Summary
Cherry-picks the 5 commits from georgelichen's PR #629 (branch georgelichen:merge-pr-517-remote-llm) onto current main, then applies Kaspre's three documented fixes from the tobi#629 comment thread.

What's new
Remote LLM backends (RemoteLLM, HybridLLM)

- RemoteLLM: HTTP client implementing the LLM interface against any OpenAI-compatible API:
  - /v1/embeddings for embedding (with batch splitting, dimension validation, index-sorting)
  - /v1/rerank for reranking
  - /v1/chat/completions for query expansion (lex:/vec:/hyde: line format)
- HybridLLM: Routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp
- remoteConfigFromEnv(): Reads QMD_EMBED_API_URL, QMD_EMBED_API_KEY, QMD_EMBED_API_MODEL, QMD_RERANK_API_*, QMD_EXPAND_API_* (env vars override YAML models: block)

Configuration (YAML models: block or env vars)

LLM interface additions

- embedModelName, generateModelName?, rerankModelName? readonly properties
- usesRemoteEmbedding?: skips local tokenizer preprocessing when true
- embedBatch() method
- tokenize()/detokenize() on interface (stubs throw for remote-only backends)
- getDefaultLLM()/setDefaultLLM() generalized singleton (getDefaultLlamaCpp() deprecated but kept)

Remote-aware chunking
When usesRemoteEmbedding is true, chunkDocumentByApproxTokens() uses a character heuristic (~3 chars/token) instead of the local tokenizer.

Kaspre's fixes (PR tobi#629 comments)
Fix A: Sigmoid normalization in RemoteLLM.rerank

llama.cpp's /v1/rerank emits log-odds (~−10 to +10), not probabilities. Without normalization every score goes negative and --min-score 0.3 fires "No results found" for every query.

For rerankers that already emit 0–1 values, sigmoid is monotonic, so ranking order is unchanged, although the scores themselves are compressed into roughly 0.5–0.73.
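As a quick illustration of the normalization described in Fix A, the sketch below applies the standard logistic function to a list of scores. The helper name normalizeRerankScoresSketch is an assumption for this example; only the formula sigmoid(x) = 1/(1+e^-x) comes from the PR.

```typescript
// Standard logistic function: maps any real number into the open interval (0, 1).
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Hypothetical helper: normalise raw reranker scores (log-odds) to probabilities.
function normalizeRerankScoresSketch(scores: number[]): number[] {
  return scores.map(sigmoid);
}

// Log-odds of 0 correspond to a probability of exactly 0.5.
const neutral = sigmoid(0); // 0.5
```

Note what happens to inputs that are already probabilities: sigmoid(0.3) is about 0.574 and sigmoid(1) is about 0.731, so ordering is preserved but the values shift. A threshold such as --min-score 0.3 therefore behaves differently depending on whether the backend emitted log-odds or probabilities.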
Fix B: RemoteLLM.expandQuery (already in the cherry-picked commits)

POSTs to <expand_api_url>/chat/completions with a system prompt eliciting lex:/vec:/hyde: line-prefixed output. Parsing and fallback mirror the LocalLLM.expandQuery shape.

Fix C: Pre-flight probe in vectorIndex()

When usesRemoteEmbedding === true, embeds a single token before starting the batch loop. Config mistakes (bad URL, wrong model name, auth failure) surface immediately with a clear error message rather than mid-run.

Other fixes in the Kaspre commit
- Added generateModelName? and rerankModelName? to the LLM interface (used by store.ts optional-chaining for default model fallbacks; was a TS compile error)
- Added c.red to the terminal colour map (used by Fix C's error output)
- Doctor device-probe guard: checks instanceof LlamaCpp before calling getDeviceInfo(); emits "unavailable" for non-local backends
- Fixed the rerank-dedup spy in test/store.test.ts: getDefaultLlamaCpp → getDefaultLLM
- Fixed a doubled async () => syntax error in one store test header

Tests
- test/remote-llm.test.ts (unit tests with mock HTTP server)
- test/store.test.ts
- test/remote-llm-integration.test.ts: skipped unless VLLM_EMBED_URL / VLLM_EMBED_MODEL env vars are set

Files changed
- src/remote-llm.ts: RemoteLLM, HybridLLM, helpers
- src/hybrid-llm.ts
- src/llm.ts: LLM interface; getDefaultLLM/setDefaultLLM; remote format helpers
- src/store.ts: getLlm() returns LLM; remote chunking path; chunkDocumentByApproxTokens export
- src/collections.ts: ModelsConfig remote API fields
- src/cli/qmd.ts: getStore() wires HybridLLM; Fix C probe; doctor guard; colour map
- test/remote-llm.test.ts
- test/remote-llm-integration.test.ts
- test/store.test.ts
- CHANGELOG.md: [Unreleased]