Add OpenAI-compatible remote embedding and reranking #517
8e26a6c to f640303
Test results from a live vLLM deployment
Ran the full test suite on
Remote LLM tests (the tests added by this PR): ✅ 66/66 pass
Full suite: 699 pass, 48 fail
The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking), which is expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.
This feature works well in practice. The remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing. Happy to help test anything else if useful.
Full test suite —
Two fixes found while deploying on a live system
While integrating this branch into production, I hit two issues and fixed them. Hopefully, I figured out how to incorporate them correctly into the PR: 1.
Both fixes above have been pushed to the branch: jhsmith409@6596448
All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers: 699/747 tests passing; all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.
great work! would you consider adding support for remote query expansion as well?
Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.
Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern)
I'll tackle the first part (query expansion) and unit tests, but I'll leave the qmd models serve option for someone else to implement. Does that work for you, tobi?
Remote query expansion is now implemented. Here's what was added (commit f8c6030):
Changes
Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.
- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM
Related: tobi#489, tobi#427, tobi#446, tobi#511
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
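For readers unfamiliar with the circuit-breaker pattern mentioned in this commit, here is a minimal sketch of how such a guard around a remote call could look. It is not the code from src/remote-llm.ts: the class name, the `withBreaker` helper, the failure threshold, and the cooldown are all hypothetical choices made for illustration.

```typescript
// Hypothetical sketch of a circuit breaker; names, thresholds, and the
// injected clock are assumptions, not taken from the PR's implementation.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30_000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** True while the breaker is closed, or once the cooldown has elapsed (probe allowed). */
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  /** A successful call closes the breaker and clears the failure count. */
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  /** Opens (or re-opens) the breaker once the failure threshold is reached. */
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }
}

/** Wraps one remote call: fail fast while open, otherwise record the outcome. */
async function withBreaker<T>(breaker: CircuitBreaker, call: () => Promise<T>): Promise<T> {
  if (!breaker.canRequest()) {
    throw new Error("circuit open: remote endpoint temporarily disabled");
  }
  try {
    const result = await call();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

The point of the pattern is that an unreachable server costs a few timeouts rather than one per document during bulk indexing; after the cooldown a single probe request decides whether the breaker closes again.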
- Add intent? to LLM interface and ILLMSession expandQuery signature
(store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
getStore() so content_vectors.model reflects the actual LLM in use
(previously always stored DEFAULT_EMBED_MODEL_URI even with remote)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that don't share a word with the original query, falls back gracefully on bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand, otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header fallback, lex/vec/hyde parsing, includeLexical=false filtering, fallback on bad output, query-term filtering, circuit breaker, HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM routing verified via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars, skipped when absent)
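To make the parsing step concrete, the following is a rough sketch of what a function like parseExpandResponse() might do. It assumes the model emits one `lex:`, `vec:`, or `hyde:` prefixed entry per line, applies the shared-word filter to every entry type, and falls back to the original query when nothing usable survives. The line format, the `Expansion` type, and the fallback shape are all assumptions for illustration, not the PR's actual implementation.

```typescript
// Hypothetical sketch; the "type: text" line format, the Expansion shape,
// and the fallback behaviour are assumptions, not the PR's actual code.
type ExpandType = "lex" | "vec" | "hyde";

interface Expansion {
  type: ExpandType;
  text: string;
}

/** Lower-cased alphanumeric words of a string. */
function words(s: string): Set<string> {
  return new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0));
}

function parseExpandResponse(raw: string, query: string, includeLexical = true): Expansion[] {
  const queryWords = words(query);
  const out: Expansion[] = [];

  for (const line of raw.split("\n")) {
    // Accept lines such as "vec: how to run embeddings"; ignore everything else.
    const m = /^\s*(lex|vec|hyde)\s*:\s*(.+?)\s*$/i.exec(line);
    if (!m) continue;

    const type = m[1].toLowerCase() as ExpandType;
    if (type === "lex" && !includeLexical) continue;

    // Drop entries that share no word with the original query (off-topic output).
    const text = m[2];
    const sharesWord = Array.from(words(text)).some((w) => queryWords.has(w));
    if (sharesWord) out.push({ type, text });
  }

  if (out.length > 0) return out;

  // Assumed fallback: search with the original query rather than failing.
  const fallback: Expansion[] = [{ type: "vec", text: query }];
  if (includeLexical) fallback.unshift({ type: "lex", text: query });
  return fallback;
}
```

Filtering on shared words is a cheap guard against a chat model drifting off-topic; a production version would likely need smarter tokenization for non-ASCII queries.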
f8c6030 to f2fd64e
@tobi I added remote query expansion. Could you take another look at this PR please?
Hey @tobi, friendly bump on this. Query expansion is implemented per your earlier ask, with VCR-pattern unit tests + live integration tests. Branch is clean against main, all 66/66 RemoteLLM tests pass against live vLLM (the 48 LlamaCpp-path failures pre-date this PR and need a local GGUF that wasn't on the test machine). Skipped Anything else needed before merge?
@tobi heads-up: I just corrected the PR description, which was stale and was telling reviewers that query expansion stayed local. The branch in fact has all three remote paths fully implemented and independently configurable:
Each path falls back to local
PR is mergeable, branch is clean against main.
@jhsmith409 thank you thank you for all your work on this PR. I want it to land SO BAD so that CPU-only VMs running OpenClaw can use QMD. Plz @tobi :)
Summary
- Adds a RemoteLLM class that calls OpenAI-compatible HTTP endpoints for embedding (POST /v1/embeddings), reranking (POST /v1/rerank), and query expansion (POST /v1/chat/completions), with per-endpoint circuit breakers, dimension validation, batch splitting, auth headers, and configurable timeouts
- Adds a HybridLLM compositor that routes embed/rerank/expand to a remote server when configured, falling back to local LlamaCpp for any path that isn't (tokenization always stays local)
- Extends the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
- Config: env vars (QMD_EMBED_API_URL/QMD_EMBED_API_MODEL, QMD_RERANK_API_URL/QMD_RERANK_API_MODEL, QMD_EXPAND_API_URL/QMD_EXPAND_API_MODEL/QMD_EXPAND_API_KEY) or YAML fields under models: (embed_api_*, rerank_api_*, expand_api_*)
- fetch()
Motivation
Allows offloading embedding, reranking, and query expansion to a remote OpenAI-compatible server (e.g. vLLM, Ollama, LM Studio) while keeping QMD's local-first defaults intact. Useful when:
Each remote endpoint is independently configurable: operators can mix and match (e.g. remote embed + remote rerank + local expand, or all-remote, or any subset). Unconfigured paths fall back to the existing local LlamaCpp path with no behavior change.
Related: #489, #427, #446, #511
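The per-path routing described in this section can be pictured with a small sketch. The environment variable names come from this PR; the `resolveRouting` function, its types, and the rule that a path goes remote only when both its URL and its model are set are assumptions made for illustration, not the behaviour of HybridLLM or remoteConfigFromEnv() as implemented.

```typescript
// Hypothetical sketch of per-path routing. Env var names are from the PR;
// resolveRouting and the "URL and model both required" rule are assumptions.
type RemotePath = "embed" | "rerank" | "expand";
type Backend = "remote" | "local";
type Env = Record<string, string | undefined>;

const ENV_VARS: Record<RemotePath, { url: string; model: string }> = {
  embed: { url: "QMD_EMBED_API_URL", model: "QMD_EMBED_API_MODEL" },
  rerank: { url: "QMD_RERANK_API_URL", model: "QMD_RERANK_API_MODEL" },
  expand: { url: "QMD_EXPAND_API_URL", model: "QMD_EXPAND_API_MODEL" },
};

/** Decides, independently for each path, whether it is served remotely or locally. */
function resolveRouting(env: Env): Record<RemotePath | "tokenize", Backend> {
  const pick = (path: RemotePath): Backend =>
    env[ENV_VARS[path].url] && env[ENV_VARS[path].model] ? "remote" : "local";

  return {
    embed: pick("embed"),
    rerank: pick("rerank"),
    expand: pick("expand"),
    tokenize: "local", // tokenization always stays local
  };
}
```

With only the embed variables set, this yields remote embedding and local reranking and expansion, which matches the mix-and-match behaviour described above; with nothing set, every path stays local.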
What's now remote-capable
| Remote endpoint | Env vars | Local fallback |
|---|---|---|
| POST /v1/embeddings | QMD_EMBED_API_URL + QMD_EMBED_API_MODEL | LlamaCpp.embed() |
| POST /v1/embeddings (multi-input) | | LlamaCpp.embedBatch() |
| POST /v1/rerank | QMD_RERANK_API_URL + QMD_RERANK_API_MODEL | LlamaCpp.rerank() |
| POST /v1/chat/completions | QMD_EXPAND_API_URL + QMD_EXPAND_API_MODEL (+ optional QMD_EXPAND_API_KEY) | LlamaCpp.expandQuery() |
| | | LlamaCpp.tokenize() |
Files changed
| File | Change |
|---|---|
| src/remote-llm.ts | RemoteLLM class (embed + rerank + expand + per-endpoint circuit breakers) + remoteConfigFromEnv() |
| src/hybrid-llm.ts | HybridLLM routing compositor (per-path remote/local routing) |
| src/llm.ts | embedBatch/embedModelName to LLM interface, intent to expandQuery signature, isRemoteModel(), generalize singleton to LLM |
| src/store.ts | LlamaCpp type refs → LLM interface, graceful tokenize() fallback, derive content_vectors model label from getDefaultLLM().embedModelName |
| src/collections.ts | embed_api_*, rerank_api_*, expand_api_* to ModelsConfig |
| src/cli/qmd.ts | HybridLLM when any remote endpoint is configured |
| test/remote-llm.test.ts | |
| test/remote-llm-integration.test.ts | |
| CHANGELOG.md | |
| README.md | |
Test plan
- Unit tests (test/remote-llm.test.ts) covering embed, batch, auth, dimension validation, circuit breaker, rerank, expand (lex/vec/hyde parsing, includeLexical filtering, query-term filtering, fallback on bad model output, intent forwarding), HybridLLM routing (remote vs local per path), config env vars / YAML fields
- Integration tests (test/remote-llm-integration.test.ts) covering live embed, dimension consistency, normalization, semantic similarity, batch, rerank relevance, edge cases, end-to-end search simulation, and expand end-to-end (live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM remote-vs-local routing verified via LOCAL_SENTINEL)
- getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
- bun test: only the existing 48 LlamaCpp-path tests fail, all of which require a local GGUF model that isn't on the test machine; all 89 RemoteLLM tests pass
- Qwen/Qwen3-Embedding-0.6B (embed), qwen3-reranker-4b (rerank), and Qwen3.6-35B (expand)
🤖 Generated with Claude Code