feat(llm): OpenAI-compatible remote embedding, reranking & query expansion (RemoteLLM + HybridLLM)#705
Kaspre wants to merge 6 commits into
…nsion

Add a RemoteLLM backend that talks to any OpenAI-compatible HTTP API (vLLM, TEI, Ollama, llama.cpp --server, LiteLLM, OpenAI, ...), composed with the local LlamaCpp via a HybridLLM that routes each operation independently:

- RemoteLLM: /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint circuit breakers; bearer auth; char-based token approximation so chunking works without a local tokenizer.
- HybridLLM: embed/embedBatch/rerank/expandQuery -> remote, generate/tokenize/detokenize -> local, with per-operation local fallback.
- Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote, embedModelName, rerankModelName, generateModelName, usesRemoteEmbedding, supportsRerank/supportsExpand) plus getDefaultLLM/setDefaultLLM alongside the existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).
- Sigmoid normalization of log-odds rerank scores; RemoteLLM.expandQuery via chat completions; startup pre-flight embed probe; HybridLLM.rerank local fallback (symmetric with the expandQuery fallback).

Opt-in via models.*_api_url / QMD_*_API_* env vars; with nothing configured the local-only path is byte-for-byte unchanged.

Builds on @georgelichen's remote-LLM work in tobi#629.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
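To illustrate what "per-endpoint circuit breakers" could look like, here is a minimal sketch. The class name, failure threshold, cooldown, and endpoint keys are all assumptions made for illustration; the actual implementation in this PR may differ.

```typescript
// Hypothetical sketch: names, threshold, and cooldown are illustrative
// assumptions, not the PR's actual implementation.
type Endpoint = "embed" | "rerank" | "chat";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures; // assumed threshold
    this.cooldownMs = cooldownMs;   // assumed cooldown
    this.now = now;                 // injectable clock, handy for tests
  }

  /** True while the breaker is open and the cooldown has not elapsed. */
  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown over: close the breaker and allow a fresh attempt.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }
}

// One breaker per endpoint, so failures are tracked independently.
const breakers: Record<Endpoint, CircuitBreaker> = {
  embed: new CircuitBreaker(),
  rerank: new CircuitBreaker(),
  chat: new CircuitBreaker(),
};
```

Keeping one breaker per endpoint is what gives the independence: repeated embed failures open only the embed breaker, while rerank and chat requests keep flowing.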
…ncating When a remote /v1/rerank request is rejected as too large (HTTP 413 / "too large to process" / context length), RemoteLLM.rerank recursively bisects the batch and halve-truncates a single oversized document down to a 32-char floor, remapping response indices to the originals and re-sorting by score. Non-oversized errors still propagate so the circuit breaker / HybridLLM local fallback can react. Adapted from the rerank recovery in tobi#619 (@loopyd). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
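The recovery described in this commit can be sketched roughly as follows. The `Send` callback, the `isOversized` check, and the function names are hypothetical stand-ins; only the bisect, halve-truncate, 32-char floor, index remapping, and re-sort steps come from the commit message.

```typescript
// Illustrative sketch only: types, isOversized, and the injected `send`
// function are assumptions, not the PR's actual code.
interface Scored { index: number; score: number }
type Send = (query: string, docs: string[]) => Promise<Scored[]>;

const MIN_CHARS = 32; // truncation floor named in the commit message

// Assumed detection: match the rejection messages the commit message lists.
function isOversized(err: unknown): boolean {
  return err instanceof Error && /413|too large|context length/i.test(err.message);
}

async function rerankWithRecovery(
  send: Send,
  query: string,
  docs: string[],
  offset = 0, // position of docs[0] in the original batch
): Promise<Scored[]> {
  try {
    const results = await send(query, docs);
    // Remap batch-local indices back to the original positions.
    return results.map((r) => ({ index: r.index + offset, score: r.score }));
  } catch (err) {
    // Anything other than "too large" propagates to the caller.
    if (!isOversized(err)) throw err;
    if (docs.length > 1) {
      const mid = Math.ceil(docs.length / 2);
      const left = await rerankWithRecovery(send, query, docs.slice(0, mid), offset);
      const right = await rerankWithRecovery(send, query, docs.slice(mid), offset + mid);
      return [...left, ...right];
    }
    // Single oversized document: halve it until it fits or hits the floor.
    const doc = docs[0] ?? "";
    if (doc.length <= MIN_CHARS) throw err;
    const shorter = doc.slice(0, Math.max(MIN_CHARS, Math.floor(doc.length / 2)));
    return rerankWithRecovery(send, query, [shorter], offset);
  }
}

async function rerank(send: Send, query: string, docs: string[]): Promise<Scored[]> {
  const merged = await rerankWithRecovery(send, query, docs);
  return merged.sort((a, b) => b.score - a.score); // highest score first
}
```

Carrying an `offset` through the recursion is one simple way to keep every result tied to its position in the original batch, so the caller never sees sub-batch indices.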
- remoteConfigFromEnv: throw on a half-configured remote backend (embed_api_url set without embed_api_model, or vice-versa) instead of silently falling back to the local backend and skipping the remote pre-flight probe.
- RemoteLLM.rerank: normalize scores once over the full (possibly recovery-split) result set, applying sigmoid only when logit-range values are present (any score < 0 or > 1). Rerankers that already return [0,1] probabilities (Cohere/Voyage-style) pass through unchanged, avoiding distortion of the blend and --min-score filtering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
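A minimal sketch of the conditional normalization rule stated in this commit (apply the sigmoid only when some score falls outside [0, 1]). The function names are hypothetical and chosen for illustration.

```typescript
// Minimal sketch; function names are hypothetical.
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function normalizeRerankScores(scores: number[]): number[] {
  // Logit-range output is detected by any score falling outside [0, 1].
  const looksLikeLogits = scores.some((s) => s < 0 || s > 1);
  // Probabilities pass through unchanged; logits are squashed into (0, 1).
  return looksLikeLogits ? scores.map(sigmoid) : scores.slice();
}

// Usage: a log-odds reranker versus one that already returns probabilities.
const fromLogits = normalizeRerankScores([-4.2, 0, 3.1]);
const fromProbs = normalizeRerankScores([0.12, 0.87]);
console.log(fromLogits, fromProbs);
```

Because the sigmoid is monotonic, the ranking order is preserved. Running the check once over the full result set, rather than per sub-batch, avoids mixing raw and squashed scores when a recovery split happens to produce a sub-batch whose logits all fall inside [0, 1].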
Follow-up hardening has been pushed in Latest changes:
Validation run:
Maintainers: this should be ready for review again.
I really hope this PR gets merged soon. I truly need this feature to use qmd on machines |
Running this on a Mac Mini M4 gateway with a dedicated GPU server (AMD Strix Halo, 128GB) hosting Qwen3-Embedding-4B and Qwen3-Reranker-4B on OpenAI-compatible endpoints (llama.cpp server). The key use case for us: serving embedding and reranking to multiple machines on the network that are too weak to run meaningful models locally - thin clients, ARM64 SBCs, VPS nodes. A single GPU box providing high-quality retrieval for the whole fleet. Happy to test any of the PRs against our endpoints.
…#97)

* feat(spine): add remote LLM support via QMD PR #705 (OpenAI-compatible embedding)

Vendor a pre-built @tobilu/qmd 2.5.3 from Kaspre/qmd#feat/remote-llm-openai-compatible (tobi/qmd#705), which adds RemoteLLM/HybridLLM backends for routing embedding, reranking, and query expansion to any OpenAI-compatible API endpoint.

- spine/vendor/tobilu-qmd-2.5.3.tgz: built dist from the PR branch (dist/ is gitignored upstream so a github: dep would install a broken package)
- spine/package.json: switch from ^2.5.1 registry dep to file:./vendor tarball
- spine/.gitignore: allow vendor/*.tgz so the tarball is tracked by git
- spine/src/config.ts: add QmdModelsConfig + getQmdModelsConfig() to read [spine.qmd] from config.toml (embed_api_url/model, rerank, expand settings)
- spine/src/search.ts: pass models: getQmdModelsConfig() to createStore so remote embedding activates when configured; undefined preserves local-only path
- config.toml.example: document the new [spine.qmd] section (all commented out)

Remote embedding is fully opt-in: omitting [spine.qmd] leaves behaviour unchanged. Env vars QMD_EMBED_API_URL / QMD_EMBED_API_MODEL also work and take precedence over config.toml.

* docs(spine): address PR review findings on remote QMD config

- Add comment to QmdModelsConfig clarifying it's an intentional structural subset of QMD's internal ModelsConfig (not in public exports)
- Add comment to getQmdModelsConfig() confirming env vars take precedence via QMD's remoteConfigFromEnv(), and that QMD itself throws on partial embed config (url without model, or vice versa)
- Add vendor removal note in package.json for when @tobilu/qmd >=2.6.0 lands on npm

* fix(spine): remove invalid //vendor comment from package.json

bun install treats //vendor as a real package name and tries to resolve it from npm, breaking CI. Remove the comment entry; the intent is captured in git history.
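For readers wondering what "throws on partial embed config" means in practice, here is a hedged sketch. The function name, return shape, and error text are assumptions; only the two environment variable names and the both-or-neither rule come from the thread.

```typescript
// Hypothetical sketch of the "half-configured" check; the function name and
// return shape are assumptions. Only the two env var names come from the text.
interface EmbedRemoteConfig { url: string; model: string }

function embedConfigFromEnv(
  env: Record<string, string | undefined>,
): EmbedRemoteConfig | undefined {
  const url = env["QMD_EMBED_API_URL"]?.trim() || undefined;
  const model = env["QMD_EMBED_API_MODEL"]?.trim() || undefined;

  // Nothing configured: stay on the local-only path.
  if (!url && !model) return undefined;

  // Exactly one of the two is set: fail loudly instead of silently going local.
  if (!url || !model) {
    throw new Error(
      "Remote embedding needs both QMD_EMBED_API_URL and QMD_EMBED_API_MODEL",
    );
  }
  return { url, model };
}
```

Failing at configuration time means a typo in one variable shows up as an immediate error rather than as an unexpectedly slow local embedding run.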
Thank you (and all the others who worked on PRs related to this)
This is a strong PR overall. I really like the direction here: the RemoteLLM/HybridLLM split is thoughtful, the production hardening is substantial, and a lot of the follow-up work from #629 is integrated in a pragmatic way. Thanks for pushing this forward. I went through the diff pretty carefully and found three things that I think are worth fixing before merge:
None of these feel far from fixable, and I still think the overall PR is heading in a very good direction. I just wanted to flag these before merge so they do not turn into release-time or Windows-specific regressions.
Problem
QMD's BM25 + vector + query-expansion + LLM-rerank pipeline has a hard dependency on local GGUF models via node-llama-cpp. Devices with an inadequate GPU, or none at all, fall back to CPU embedding of 300M–600M-parameter models, which is unacceptably slow. node-llama-cpp also serialises inference through a single context, so concurrent workers queue and time out. Moving embedding/reranking/expansion to a dedicated server (vLLM, TEI, Ollama, llama.cpp, LiteLLM, OpenAI, …) removes both problems.
This builds directly on @georgelichen's #629, which contributed the core RemoteLLM/HybridLLM architecture. This PR rebases that scaffold onto current main and adds five refinements (A–E below) needed to make it production-correct against real OpenAI-compatible servers.

What this adds
A backend that talks to any OpenAI-compatible HTTP API, composed with the existing local backend so each operation can be routed independently. Fully opt-in: with no *_api_url configured, behaviour is byte-for-byte the current local-only path.

- RemoteLLM: /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint circuit breakers (an embed outage doesn't take down rerank/chat); bearer auth; char-based token approximation so chunking works without a local tokenizer.
- HybridLLM: routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp, with per-operation fallback.
- Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote, embedModelName, rerankModelName?, generateModelName?, usesRemoteEmbedding?, supportsRerank/supportsExpand) and getDefaultLLM/setDefaultLLM alongside the existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).

The five refinements on top of #629
A. Sigmoid normalization of log-odds rerank scores. /v1/rerank (e.g. bge-reranker-v2-m3) and most cross-encoders return log-odds (~−10…+10), not 0–1. QMD's score blend and the --min-score 0.3 default assume probabilities, so without σ(x) = 1/(1+e⁻ˣ) every blended score goes negative and every query returns "No results found". No-op for backends already emitting 0–1.

B. RemoteLLM.expandQuery. Query expansion via /v1/chat/completions, emitting the same lex:/vec:/hyde: line format as the local model; falls back to local on error.

C. Startup pre-flight embed probe. With usesRemoteEmbedding, embed one token at vectorIndex() startup so a bad URL / wrong model / auth failure surfaces immediately instead of silently falling back mid-batch.

D. HybridLLM.rerank local fallback. Symmetric with the expandQuery fallback already in #629: if the remote backend doesn't support rerank, fall back to local LlamaCpp.rerank rather than blindly POSTing /v1/rerank at an embed-only server. Lets you mix "remote embed + local rerank" freely.

E. Oversized-rerank recovery. When a remote /v1/rerank request is rejected as too large, RemoteLLM.rerank recursively bisects the batch and halve-truncates a single oversized document (down to a 32-char floor), remapping response indices to the originals and re-sorting by normalized score. Non-oversized errors still propagate to the circuit breaker / local fallback. Adapted from the recovery in @loopyd's #619.

Architecture
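As a rough illustration of the rerank branch of this architecture (use the remote backend when it supports rerank, otherwise or on error fall back to local), consider the sketch below. `RerankBackend` and `HybridRerankSketch` are hypothetical, cut-down stand-ins for the PR's LLM interface and HybridLLM, not their real definitions.

```typescript
// Illustrative only: a cut-down stand-in for the PR's LLM interface.
interface RerankBackend {
  supportsRerank: boolean;
  rerank(query: string, docs: string[]): Promise<number[]>;
}

class HybridRerankSketch {
  private readonly remote: RerankBackend | undefined;
  private readonly local: RerankBackend;

  constructor(remote: RerankBackend | undefined, local: RerankBackend) {
    this.remote = remote;
    this.local = local;
  }

  async rerank(query: string, docs: string[]): Promise<number[]> {
    // No remote configured, or an embed-only server: go straight to local.
    if (!this.remote || !this.remote.supportsRerank) {
      return this.local.rerank(query, docs);
    }
    try {
      return await this.remote.rerank(query, docs);
    } catch {
      // Remote failed: per-operation fallback to the local model.
      return this.local.rerank(query, docs);
    }
  }
}
```

The same shape would apply to expandQuery with a supportsExpand check, which is what makes combinations such as "remote embed + local rerank" possible without any special casing.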
flowchart TD
    Q["qmd: index / query"] --> H{"HybridLLM<br/>remote config present?"}
    H -->|no| L0["LlamaCpp only (unchanged default)"]
    H -->|yes| E["embed / embedBatch"] & RK["rerank"] & EX["expandQuery"] & G["generate"] & TK["tokenize / detokenize"]
    E --> R1["RemoteLLM → POST /v1/embeddings<br/>+ pre-flight 1-token probe ◆C"]
    RK --> RKQ{"remote supportsRerank?"}
    RKQ -->|yes| R2["RemoteLLM → POST /v1/rerank<br/>+ σ: log-odds→0‥1 ◆A<br/>+ oversized split/truncate recovery ◆E"]
    RKQ -->|no| L1["LlamaCpp local rerank ◆D"]
    EX --> EXQ{"remote supportsExpand?"}
    EXQ -->|yes| R3["RemoteLLM → POST /v1/chat/completions ◆B"]
    EXQ -->|no| L2["LlamaCpp local expansion"]
    G --> L3["LlamaCpp local"]
    TK --> L4["LlamaCpp local (char-approx if usesRemoteEmbedding)"]

Configuration
Any OpenAI-compatible server works; point each operation wherever you like. Example (~/.config/qmd/index.yml):

Each *_api_url is independent: set only embed_api_url and rerank/expand stay local (fix D); set all three to go fully remote. Leave them all unset and nothing changes.

Backward compatibility
No breaking changes. All new behaviour is opt-in via config; local-only setups are unaffected; getDefaultLlamaCpp/setDefaultLlamaCpp retained.

Testing
Build (node scripts/build.mjs) and tsc --noEmit are clean. Under the project's CI test mode (CI=true, matching upstream CI), this branch shows no new failures vs main: the deterministic failing set is identical on both (the only failures are pre-existing WSL/Linux Git-Bash-path and skill-bundling tests, unrelated to this change). The new test/remote-llm.test.ts suite passes in full: embed/rerank routing, per-endpoint circuit breakers + independence, sigmoid normalization, oversized-rerank split/truncate recovery, auth, timeouts, expandQuery parsing + fallback, and the HybridLLM rerank fallback, all driven by in-process HTTP servers, no mocks.

This configuration runs in production today: embedding + reranking against two llama.cpp server instances and query expansion against a cloud chat endpoint, which is where refinements A, B, C (and the need for E) came from.

Related work & credit
This PR aims to consolidate the OpenAI-compatible-backend effort into one production-ready change, building on and crediting the prior attempts:
- #629 (@georgelichen): the RemoteLLM/HybridLLM scaffold this PR is built on. Refinements A & D were discussed on the #629 / #575 threads.
- … llama-swap); refinement E is adapted from its rerank recovery. #619 uses a whole-backend provider switch (one base URL + server-side model aliases); this PR's HybridLLM generalizes that to independent per-operation endpoints with local fallback; the single-server case is just embed/rerank/expand_api_url all pointing at one server.
- … embed_api_url: https://api.openai.com/v1) without hardcoding a provider.

Follow-ups (not in this PR)
- Byte-aware batching (maxBatchBytes + chunk strategies + --max-batch-mb / --chunk-strategy flags), as in #619. This PR keeps the existing count-based batching (maxBatchSize); byte-aware batching is a natural follow-on.