Add OpenAI-compatible remote embedding and reranking #517

Open
jhsmith409 wants to merge 3 commits into tobi:main from jhsmith409:feature/remote-openai-embed

jhsmith409 commented Apr 6, 2026

Summary

  • Adds RemoteLLM class that calls OpenAI-compatible HTTP endpoints for embedding (POST /v1/embeddings), reranking (POST /v1/rerank), and query expansion (POST /v1/chat/completions), with per-endpoint circuit breakers, dimension validation, batch splitting, auth headers, and configurable timeouts
  • Adds HybridLLM compositor that routes embed/rerank/expand to a remote server when configured, falling back to local LlamaCpp for any path that isn't (tokenization always stays local)
  • Generalizes the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
  • Configured via env vars (QMD_EMBED_API_URL / QMD_EMBED_API_MODEL, QMD_RERANK_API_URL / QMD_RERANK_API_MODEL, QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY) or YAML fields under models: (embed_api_*, rerank_api_*, expand_api_*)
  • Skips nomic/Qwen3 text formatting prefixes for remote models (they handle their own prompt formatting)
  • Zero new dependencies — uses Node.js built-in fetch()
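The batch splitting and dimension validation mentioned in the summary could look roughly like the sketch below. This is not the PR's RemoteLLM code: `EmbedConfigSketch`, `embedBatchSketch`, `maxBatchSize`, and the injected `fetchImpl` are hypothetical names chosen for illustration, and the request/response shape assumed is the standard OpenAI `/v1/embeddings` one (`{ model, input }` in, `{ data: [{ index, embedding }] }` out).

```typescript
// Hypothetical sketch: batch splitting + dimension validation for an
// OpenAI-compatible embeddings endpoint. Names are illustrative only.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface EmbedConfigSketch {
  baseUrl: string; // assumed to already include the /v1 prefix
  model: string;
  apiKey?: string;
  maxBatchSize: number; // assumed knob; the PR's actual option name is not shown
}

interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

async function embedBatchSketch(
  texts: string[],
  cfg: EmbedConfigSketch,
  fetchImpl: FetchLike,
): Promise<number[][]> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (cfg.apiKey) headers["Authorization"] = `Bearer ${cfg.apiKey}`;

  const vectors: number[][] = [];
  let dim: number | undefined;

  // Send at most maxBatchSize inputs per request.
  for (let i = 0; i < texts.length; i += cfg.maxBatchSize) {
    const chunk = texts.slice(i, i + cfg.maxBatchSize);
    const res = await fetchImpl(`${cfg.baseUrl}/embeddings`, {
      method: "POST",
      headers,
      body: JSON.stringify({ model: cfg.model, input: chunk }),
    });
    if (!res.ok) throw new Error(`embeddings request failed: HTTP ${res.status}`);

    const body = (await res.json()) as EmbeddingsResponse;
    if (body.data.length !== chunk.length) {
      throw new Error(`expected ${chunk.length} embeddings, got ${body.data.length}`);
    }

    // Restore input order, then check every vector has the same dimension.
    const ordered = [...body.data].sort((a, b) => a.index - b.index);
    for (const item of ordered) {
      if (dim === undefined) dim = item.embedding.length;
      if (item.embedding.length !== dim) {
        throw new Error(`dimension mismatch: expected ${dim}, got ${item.embedding.length}`);
      }
      vectors.push(item.embedding);
    }
  }
  return vectors;
}
```

Injecting `fetchImpl` keeps the sketch testable without a server; on Node 18+ the built-in `fetch` should be usable as that argument.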

Motivation

Allows offloading embedding, reranking, and query expansion to a remote OpenAI-compatible server (e.g. vLLM, Ollama, LM Studio) while keeping QMD's local-first defaults intact. Useful when:

  • The indexing machine doesn't have a GPU
  • You want larger/better embedding or reranker models than what fits in local VRAM
  • You already run a shared GPU inference server and want QMD to use it for the LLM-bound steps

Each remote endpoint is independently configurable — operators can mix-and-match (e.g. remote embed + remote rerank + local expand, or all-remote, or any subset). Unconfigured paths fall back to the existing local LlamaCpp path with no behavior change.
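As a rough illustration of the per-path fallback described above, a routing compositor might be sketched as follows. `Backend` and `HybridRouterSketch` are hypothetical stand-ins, not the PR's `LLM` interface or `HybridLLM` class, and the method signatures are simplified assumptions.

```typescript
// Hypothetical sketch of per-path remote/local routing. The real LLM
// interface in src/llm.ts has different (richer) signatures.
interface Backend {
  embed(text: string): Promise<number[]>;
  rerank(query: string, docs: string[]): Promise<number[]>;
  expandQuery(query: string): Promise<string[]>;
}

class HybridRouterSketch implements Backend {
  private readonly local: Backend;
  private readonly remote: Partial<Backend>;

  // `remote` only contains the paths an operator has configured.
  constructor(local: Backend, remote: Partial<Backend> = {}) {
    this.local = local;
    this.remote = remote;
  }

  embed(text: string): Promise<number[]> {
    return this.remote.embed ? this.remote.embed(text) : this.local.embed(text);
  }

  rerank(query: string, docs: string[]): Promise<number[]> {
    return this.remote.rerank
      ? this.remote.rerank(query, docs)
      : this.local.rerank(query, docs);
  }

  expandQuery(query: string): Promise<string[]> {
    return this.remote.expandQuery
      ? this.remote.expandQuery(query)
      : this.local.expandQuery(query);
  }
}
```

With only `embed` supplied on the remote side, `rerank` and `expandQuery` keep going to the local backend, which mirrors the "remote embed + local expand" mix described in the text.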

Related: #489, #427, #446, #511

What's now remote-capable

| Path | Remote endpoint | Env vars | Local fallback |
| --- | --- | --- | --- |
| Embedding | POST /v1/embeddings | QMD_EMBED_API_URL + QMD_EMBED_API_MODEL | LlamaCpp.embed() |
| Batch embedding | POST /v1/embeddings (multi-input) | (same as embedding) | LlamaCpp.embedBatch() |
| Reranking | POST /v1/rerank | QMD_RERANK_API_URL + QMD_RERANK_API_MODEL | LlamaCpp.rerank() |
| Query expansion | POST /v1/chat/completions | QMD_EXPAND_API_URL + QMD_EXPAND_API_MODEL (+ optional QMD_EXPAND_API_KEY) | LlamaCpp.expandQuery() |
| Tokenization | n/a | n/a | always local (LlamaCpp.tokenize()) |
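A minimal sketch of how the environment variables in this table might be resolved into a per-path config. `sketchRemoteConfigFromEnv` and its types are hypothetical and are not the PR's `remoteConfigFromEnv()`. The sketch also assumes a path counts as remote only when both its URL and its model are set, which the table suggests but does not state outright.

```typescript
// Hypothetical sketch: resolve per-path remote config from env vars.
type EnvLike = Record<string, string | undefined>;

interface RemoteEndpoint {
  url: string;
  model: string;
  apiKey?: string;
}

interface RemoteConfigSketch {
  embed?: RemoteEndpoint;
  rerank?: RemoteEndpoint;
  expand?: RemoteEndpoint;
}

function endpointFromEnv(
  env: EnvLike,
  name: "EMBED" | "RERANK" | "EXPAND",
): RemoteEndpoint | undefined {
  const url = env[`QMD_${name}_API_URL`];
  const model = env[`QMD_${name}_API_MODEL`];
  // Assumption: both values are required for a path to be remote.
  if (!url || !model) return undefined;
  return { url, model };
}

function sketchRemoteConfigFromEnv(env: EnvLike): RemoteConfigSketch {
  const config: RemoteConfigSketch = {};

  const embed = endpointFromEnv(env, "EMBED");
  if (embed) config.embed = embed;

  const rerank = endpointFromEnv(env, "RERANK");
  if (rerank) config.rerank = rerank;

  const expand = endpointFromEnv(env, "EXPAND");
  if (expand) {
    // QMD_EXPAND_API_KEY is the only key variable named in the table.
    const apiKey = env["QMD_EXPAND_API_KEY"];
    config.expand = apiKey ? { ...expand, apiKey } : expand;
  }
  return config;
}
```

In a Node process this would be called as `sketchRemoteConfigFromEnv(process.env)`; any path left `undefined` would stay on the local LlamaCpp fallback.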

Files changed

| File | Change |
| --- | --- |
| src/remote-llm.ts | New: RemoteLLM class (embed + rerank + expand + per-endpoint circuit breakers) + remoteConfigFromEnv() |
| src/hybrid-llm.ts | New: HybridLLM routing compositor (per-path remote/local routing) |
| src/llm.ts | Add embedBatch/embedModelName to LLM interface, intent to expandQuery signature, isRemoteModel(), generalize singleton to LLM |
| src/store.ts | Change LlamaCpp type refs → LLM interface, graceful tokenize() fallback, derive content_vectors model label from getDefaultLLM().embedModelName |
| src/collections.ts | Add remote fields (embed_api_*, rerank_api_*, expand_api_*) to ModelsConfig |
| src/cli/qmd.ts | Auto-detect remote config, create HybridLLM when any remote endpoint is configured |
| test/remote-llm.test.ts | Unit tests (mock HTTP server, VCR-style): embed, rerank, expand, batching, auth, circuit breaker, HybridLLM routing, env-var config, intent forwarding |
| test/remote-llm-integration.test.ts | Integration tests against live vLLM (skipped when env absent): embed/rerank/expand end-to-end, including LOCAL_SENTINEL routing verification |
| CHANGELOG.md | Unreleased entry |
| README.md | Configuration docs + vLLM example |

Test plan

  • 52 unit tests (test/remote-llm.test.ts) covering embed, batch, auth, dimension validation, circuit breaker, rerank, expand (lex/vec/hyde parsing, includeLexical filtering, query-term filtering, fallback on bad model output, intent forwarding), HybridLLM routing (remote vs local per path), config env vars / YAML fields
  • 37 integration tests (test/remote-llm-integration.test.ts) covering live embed, dimension consistency, normalization, semantic similarity, batch, rerank relevance, edge cases, end-to-end search simulation, and expand end-to-end (live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM remote-vs-local routing verified via LOCAL_SENTINEL)
  • Local-only path verified: getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
  • Full test suite with bun test — only the existing 48 LlamaCpp-path tests fail, all of which require a local GGUF model that isn't on the test machine; all 89 RemoteLLM tests pass
  • Tested in production deployment against vLLM running Qwen/Qwen3-Embedding-0.6B (embed), qwen3-reranker-4b (rerank), and Qwen3.6-35B (expand)

🤖 Generated with Claude Code

@jhsmith409 jhsmith409 force-pushed the feature/remote-openai-embed branch from 8e26a6c to f640303 Compare April 6, 2026 14:23
@jhsmith409

jhsmith409 commented Apr 6, 2026

Author

Test results from a live vLLM deployment

Ran the full test suite on feature/remote-openai-embed against real vLLM servers (no mocks):

| Endpoint | Model |
| --- | --- |
| Embedding (http://192.168.x.x:x/v1) | Qwen/Qwen3-Embedding-0.6B |
| Reranking (http://192.168.x.x:x/v1) | qwen3-reranker-4b |

Remote LLM tests (the tests added by this PR): ✅ 66/66 pass

bun test v1.3.11 (af24e281)

test/remote-llm-integration.test.ts:
  Embedding dimension: 1024
  Rerank scores: {
    "cookies.md": 0.37974488735198975,
    "quantum.md": 0.35188794136047363,
    "space.md": 0.35012370347976685,
    "baking.md": 0.12711383402347565,
  }
  Similarity ranking:
    git.md: 0.7216
    typescript.md: 0.4591
    docker.md: 0.4050
    cooking.md: 0.3419
    gardening.md: 0.3270

 66 pass
 0 fail
 1214 expect() calls
Ran 66 tests across 2 files. [1149.00ms]

Full suite: 699 pass, 48 fail

The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking) — expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.


This feature works well in practice. The remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing — happy to help test anything else if useful.

@jhsmith409

Author

Full test suite — bun test

bun test v1.3.11 (af24e281)

 699 pass
 48 fail
 2720 expect() calls
Ran 747 tests across 20 files. [555.28s]

All 48 failures are in the local LlamaCpp path (token-based chunking, query expansion, local reranking, hybrid pipeline) — these require a local GGUF model to be downloaded, which isn't present on this machine. They are pre-existing failures unrelated to this PR.

Failures breakdown:

  • Token-based Chunking — node-llama-cpp compile/load timeout (no local model)
  • LlamaCpp Integration — expandQuery/rerank timeout (no local model)
  • LLM Session Management — withLLMSession timeout (no local model)
  • MCP Server > hybridQuery — LLM query expansion timeout (no local model)
  • search > with LLM query expansion — same
  • MCP HTTP Transport — depends on query expansion

Zero failures in: AST chunking, BM25, collections config, store paths, SDK, formatter, intent, multi-collection filter, RRF trace, structured search, store helpers, remote LLM — i.e., everything that doesn't require a local GGUF model passes cleanly. Existing tests are unaffected.

@jhsmith409

jhsmith409 commented Apr 6, 2026

Author

Two fixes found while deploying on a live system

While integrating this branch into production, I hit two issues and fixed them. I hope I've incorporated the fixes into the PR correctly:

1. intent missing from LLM interface / ILLMSession (src/llm.ts)

store.ts calls llm.expandQuery(query, { intent }) but the interface only declared { context?, includeLexical? }, so tsc failed to build:

src/store.ts(3191,50): error TS2353: Object literal may only specify known properties,
and 'intent' does not exist in type '{ context?: string | undefined; includeLexical?: boolean | undefined; }'

Fix — add intent? to both interface declarations:

-  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean }): Promise<Queryable[]>;
+  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean; intent?: string }): Promise<Queryable[]>;

(Same change needed on both ILLMSession at line ~172 and LLM at line ~359.)

2. vectorIndex logs and stores the default GGUF URI as the model label, even when using remote embedding (src/cli/qmd.ts)

vectorIndex has model: string = DEFAULT_EMBED_MODEL_URI as a default parameter and passes it straight to generateEmbeddings. When remote embedding is active, getStore() sets up a HybridLLM — but model still shows/stores the GGUF string, not Qwen/Qwen3-Embedding-0.6B.

Fix — derive the label from the actual configured LLM after getStore():

   const storeInstance = getStore();
   const db = storeInstance.db;

+  // Use the actual model name from the configured LLM (may be remote, not the default GGUF URI)
+  model = getDefaultLLM().embedModelName;
+
   if (force) {

After this fix, qmd embed -f correctly shows and stores Model: Qwen/Qwen3-Embedding-0.6B in content_vectors.model.

@jhsmith409

Author

Both fixes above have been pushed to the branch: jhsmith409@6596448

@jhsmith409

Author

All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers — 699/747 tests passing, all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.

@viniciushsantana

great work! would you consider adding support for remote query expansion as well?

@jhsmith409

Author

> great work! would you consider adding support for remote query expansion as well?

Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.

@tobi

tobi commented Apr 9, 2026

Owner

Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern)
And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

@jhsmith409

Author

> Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern)
> And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

I'll tackle the first part (query expansion) and the unit tests, but I'll leave the qmd models serve option for someone else to implement. Does that work for you, tobi?

@jhsmith409

Author

Remote query expansion is now implemented. Here's what was added (commit f8c6030):

Changes

src/remote-llm.ts

  • expandQuery() now calls /chat/completions when expandApiModel is configured (throws "expandApiModel not configured" otherwise — no behavior change for existing users)
  • supportsExpand getter for routing decisions
  • Independent circuit breaker for the expand endpoint
  • parseExpandResponse() helper: parses lex:/vec:/hyde: lines, filters out variants with no term overlap with the original query, falls back gracefully on bad model output
  • remoteConfigFromEnv() now reads QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
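The parsing behaviour described for parseExpandResponse() can be sketched roughly as below. This is an illustration, not the PR's implementation: `parseExpandSketch`, the `Queryable` shape, the word-splitting rule, and the exact fallback (returning the original query as `lex`/`vec`) are all assumptions.

```typescript
// Hypothetical sketch of lex:/vec:/hyde: parsing with term-overlap filtering.
type QueryType = "lex" | "vec" | "hyde";

interface Queryable {
  type: QueryType;
  text: string;
}

function parseExpandSketch(raw: string, query: string, includeLexical = true): Queryable[] {
  const queryTerms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  const sharesTerm = (text: string): boolean =>
    text.toLowerCase().split(/\W+/).some((word) => queryTerms.has(word));

  const parsed: Queryable[] = [];
  for (const line of raw.split("\n")) {
    // Accept only lines of the form "lex: ...", "vec: ..." or "hyde: ...".
    const match = /^\s*(lex|vec|hyde):\s*(.+?)\s*$/.exec(line);
    if (!match) continue;
    const type = match[1] as QueryType;
    const text = match[2] ?? "";
    if (type === "lex" && !includeLexical) continue;
    // Drop variants that share no word with the original query.
    if (!sharesTerm(text)) continue;
    parsed.push({ type, text });
  }
  if (parsed.length > 0) return parsed;

  // Assumed fallback when nothing usable was parsed: reuse the original query.
  const fallback: Queryable[] = [{ type: "vec", text: query }];
  if (includeLexical) fallback.unshift({ type: "lex", text: query });
  return fallback;
}
```

For example, with the query "remote embedding server", a model line such as "vec: how to bake bread" would be dropped because it shares no term with the query, while "lex: remote embedding" would be kept.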

src/hybrid-llm.ts

  • expandQuery() routes to remote when remote instanceof RemoteLLM && remote.supportsExpand, otherwise falls back to local LlamaCpp — no interface changes

test/remote-llm.test.ts (unit tests, mock HTTP server)

New expandQuery describe block covering:

  • supportsExpand flag behavior
  • /chat/completions payload shape (model, message roles, intent inclusion)
  • Auth header: expandApiKey → falls back to embedApiKey
  • lex/vec/hyde parsing, includeLexical: false filtering
  • Fallback Queryable[] when model output is unparseable
  • Query-term filtering (variants with no overlap are dropped)
  • Circuit breaker trips after 3 failures
  • HybridLLM routing: remote when expandApiModel set, local when not
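For readers unfamiliar with the pattern, a circuit breaker that trips after three consecutive failures (the behaviour the tests above check) might be sketched like this. `CircuitBreakerSketch`, the 30-second cooldown, and the reset-after-cooldown policy are assumptions for illustration; the PR's actual breaker may differ.

```typescript
// Hypothetical sketch of a per-endpoint circuit breaker.
class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True while calls should be rejected without hitting the endpoint.
  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Simplification: after the cooldown, close and restart counting.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) throw new Error("circuit open");
    try {
      const result = await fn();
      this.failures = 0; // any success resets the consecutive-failure count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Giving each endpoint (embed, rerank, expand) its own instance keeps a failing reranker from blocking embedding calls, which is presumably the point of the per-endpoint breakers.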

test/remote-llm-integration.test.ts (live server)

  • New VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars (tests skipped when absent)
  • All three types returned, includeLexical: false, intent incorporation
  • HybridLLM routing verified via LOCAL_SENTINEL sentinel value
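The payload shape and auth-header fallback covered by the tests above could be sketched as follows. `buildExpandRequestSketch`, the `expandApiUrl` field name, the system prompt text, and the way `intent` is folded into the user message are all hypothetical; only `expandApiModel`, `expandApiKey`, `embedApiKey`, and the `/chat/completions` path come from the description.

```typescript
// Hypothetical sketch: build a /chat/completions request for query expansion.
interface ExpandRequestConfig {
  expandApiUrl: string; // assumed field name, assumed to include /v1
  expandApiModel: string;
  expandApiKey?: string;
  embedApiKey?: string;
}

interface ExpandRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Placeholder prompt; the PR's real prompt is not shown in this thread.
const SYSTEM_PROMPT_SKETCH =
  "Rewrite the search query as lines prefixed with lex:, vec: or hyde:.";

function buildExpandRequestSketch(
  query: string,
  cfg: ExpandRequestConfig,
  intent?: string,
): ExpandRequest {
  // Prefer the expand-specific key, then fall back to the embed key.
  const key = cfg.expandApiKey ?? cfg.embedApiKey;
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (key) headers["Authorization"] = `Bearer ${key}`;

  const userContent = intent
    ? `Intent: ${intent}\nQuery: ${query}`
    : `Query: ${query}`;

  return {
    url: `${cfg.expandApiUrl}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: cfg.expandApiModel,
        messages: [
          { role: "system", content: SYSTEM_PROMPT_SKETCH },
          { role: "user", content: userContent },
        ],
      }),
    },
  };
}
```

Returning the request as plain data rather than sending it makes the payload easy to assert on, which is roughly what a mock-server or VCR-style test ends up checking.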

Test results

Unit tests: 52/52 pass
Integration tests (live vLLM): 37/37 pass
Full suite: 773 pass, 48 fail (same pre-existing LlamaCpp failures as before — no regressions)

Jim Smith and others added 3 commits April 12, 2026 18:26

Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add intent? to LLM interface and ILLMSession expandQuery signature
  (store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
  getStore() so content_vectors.model reflects the actual LLM in use
  (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is
  configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that
  don't share a word with the original query, falls back gracefully on
  bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand,
  otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL /
  QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header
  fallback, lex/vec/hyde parsing, includeLexical=false filtering,
  fallback on bad output, query-term filtering, circuit breaker,
  HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned,
  includeLexical=false, intent incorporation, HybridLLM routing verified
  via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL
  env vars, skipped when absent)

@jhsmith409

Author

@tobi I added remote query expansion. Could you take another look at this PR please?

@jhsmith409

Author

Hey @tobi, friendly bump on this. Query expansion is implemented per your earlier ask, with VCR-pattern unit tests + live integration tests. Branch is clean against main, all 66/66 RemoteLLM tests pass against live vLLM (the 48 LlamaCpp-path failures pre-date this PR and need a local GGUF that wasn't on the test machine).

Skipped qmd models serve per the discussion above — happy to leave that for a follow-up PR by someone else.

Anything else needed before merge?

@jhsmith409

Author

@tobi heads-up: I just corrected the PR description, which was stale and was telling reviewers that query expansion stayed local. The branch in fact has all three remote paths fully implemented and independently configurable:

  • Embedding: POST /v1/embeddings (QMD_EMBED_API_*)
  • Reranking: POST /v1/rerank (QMD_RERANK_API_*)
  • Query expansion: POST /v1/chat/completions (QMD_EXPAND_API_*), added in commit f2fd64e per your earlier ask, with VCR-pattern unit tests + live integration tests

Each path falls back to local LlamaCpp if its env vars / YAML fields aren't set, so existing local-only users see no behavior change. Operators can mix-and-match (e.g. remote embed + remote rerank + local expand). See the new "What's now remote-capable" table in the description for the matrix.

PR is mergeable, branch is clean against main. qmd models serve (your other suggestion) was deliberately left out of scope for a follow-up — happy to revisit if you'd like it bundled here.

@azogheb

azogheb commented Jun 8, 2026

@jhsmith409 thank you thank you for all your work on this PR. I want it to land SO BAD so that CPU only VMs running OpenClaw can use QMD. Plz @tobi :)
