Add OpenAI-compatible remote embedding and reranking by jhsmith409 · Pull Request #517 · tobi/qmd

jhsmith409 · 2026-04-06T14:19:13Z

Summary

- Adds RemoteLLM class that calls OpenAI-compatible HTTP endpoints for embedding (POST /v1/embeddings), reranking (POST /v1/rerank), and query expansion (POST /v1/chat/completions), with per-endpoint circuit breakers, dimension validation, batch splitting, auth headers, and configurable timeouts
- Adds HybridLLM compositor that routes embed/rerank/expand to a remote server when configured, falling back to local LlamaCpp for any path that isn't (tokenization always stays local)
- Generalizes the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
- Configured via env vars (QMD_EMBED_API_URL / QMD_EMBED_API_MODEL, QMD_RERANK_API_URL / QMD_RERANK_API_MODEL, QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY) or YAML fields under models: (embed_api_*, rerank_api_*, expand_api_*)
- Skips nomic/Qwen3 text formatting prefixes for remote models (they handle their own prompt formatting)
- Zero new dependencies: uses Node.js built-in fetch()
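
To make the batch splitting and dimension validation mentioned above more concrete, here is a minimal sketch. It is not the PR's code: `EmbedFetcher`, `splitIntoBatches`, `validateDimensions`, `embedInBatches`, and the default batch size of 32 are all hypothetical names and values chosen for illustration. The HTTP call is injected as a function so the logic can be exercised without a server.

```typescript
// Hypothetical sketch; none of these names are taken from src/remote-llm.ts.
// The HTTP request is injected so the batching logic stays testable offline.
type EmbedFetcher = (inputs: string[]) => Promise<number[][]>;

// Split an array into consecutive chunks of at most `size` items.
function splitIntoBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Check that every vector has the same length. Returns the dimension seen
// (or `expectedDim` unchanged when `vectors` is empty); throws on mismatch.
function validateDimensions(
  vectors: number[][],
  expectedDim?: number,
): number | undefined {
  let dim = expectedDim;
  for (const vec of vectors) {
    if (dim === undefined) dim = vec.length;
    if (vec.length !== dim) {
      throw new Error(`dimension mismatch: expected ${dim}, got ${vec.length}`);
    }
  }
  return dim;
}

// Embed all inputs one batch at a time, validating each response.
// Assumption: the default batch size of 32 is arbitrary.
async function embedInBatches(
  inputs: string[],
  fetchEmbeddings: EmbedFetcher,
  maxBatchSize = 32,
): Promise<number[][]> {
  const vectors: number[][] = [];
  let dim: number | undefined;
  for (const batch of splitIntoBatches(inputs, maxBatchSize)) {
    const result = await fetchEmbeddings(batch);
    if (result.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${result.length}`);
    }
    dim = validateDimensions(result, dim);
    vectors.push(...result);
  }
  return vectors;
}
```

Carrying the dimension across batches means a server that changes models mid-run would be caught on the first mismatching batch rather than after the vectors are stored.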

Motivation

Allows offloading embedding, reranking, and query expansion to a remote OpenAI-compatible server (e.g. vLLM, Ollama, LM Studio) while keeping QMD's local-first defaults intact. Useful when:

- The indexing machine doesn't have a GPU
- You want larger/better embedding or reranker models than what fits in local VRAM
- You already run a shared GPU inference server and want QMD to use it for the LLM-bound steps

Each remote endpoint is independently configurable — operators can mix-and-match (e.g. remote embed + remote rerank + local expand, or all-remote, or any subset). Unconfigured paths fall back to the existing local LlamaCpp path with no behavior change.

Related: #489, #427, #446, #511

What's now remote-capable

| Path | Remote endpoint | Env vars | Local fallback |
|---|---|---|---|
| Embedding | `POST /v1/embeddings` | `QMD_EMBED_API_URL` + `QMD_EMBED_API_MODEL` | `LlamaCpp.embed()` |
| Batch embedding | `POST /v1/embeddings` (multi-input) | (same as embedding) | `LlamaCpp.embedBatch()` |
| Reranking | `POST /v1/rerank` | `QMD_RERANK_API_URL` + `QMD_RERANK_API_MODEL` | `LlamaCpp.rerank()` |
| Query expansion | `POST /v1/chat/completions` | `QMD_EXPAND_API_URL` + `QMD_EXPAND_API_MODEL` (+ optional `QMD_EXPAND_API_KEY`) | `LlamaCpp.expandQuery()` |
| Tokenization | n/a | n/a | always local (`LlamaCpp.tokenize()`) |
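
The table can be read as a per-path lookup. The sketch below shows one way that lookup might work. It is illustrative only: `readRemoteConfig` and `routeFor` are hypothetical names (the PR's own function is `remoteConfigFromEnv()`, whose signature is not shown here), and the rule that a path is remote only when both its URL and model variables are set is an assumption.

```typescript
// Hypothetical sketch of env-driven routing; not the PR's remoteConfigFromEnv().
type RemotePath = "embed" | "rerank" | "expand";

interface EndpointConfig {
  url: string;
  model: string;
  apiKey?: string;
}

type RemoteConfig = Partial<Record<RemotePath, EndpointConfig>>;

// Build a per-path config from an environment-like object.
// Assumption: a path is remote only when BOTH its URL and MODEL are set.
function readRemoteConfig(env: Record<string, string | undefined>): RemoteConfig {
  const config: RemoteConfig = {};
  const paths: RemotePath[] = ["embed", "rerank", "expand"];
  for (const path of paths) {
    const prefix = `QMD_${path.toUpperCase()}_API_`;
    const url = env[`${prefix}URL`];
    const model = env[`${prefix}MODEL`];
    if (!url || !model) continue;
    // Only QMD_EXPAND_API_KEY is documented in this PR, so only that key is read.
    const apiKey = path === "expand" ? env["QMD_EXPAND_API_KEY"] : undefined;
    config[path] = apiKey ? { url, model, apiKey } : { url, model };
  }
  return config;
}

// Decide where a call goes. Tokenization never leaves the machine.
function routeFor(path: RemotePath | "tokenize", config: RemoteConfig): "remote" | "local" {
  if (path === "tokenize") return "local";
  return config[path] !== undefined ? "remote" : "local";
}
```

Passing the environment in as a plain object (for example `readRemoteConfig(process.env)`) keeps the function easy to test with a literal.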

Files changed

| File | Change |
|---|---|
| `src/remote-llm.ts` | New: `RemoteLLM` class (embed + rerank + expand + per-endpoint circuit breakers) + `remoteConfigFromEnv()` |
| `src/hybrid-llm.ts` | New: `HybridLLM` routing compositor (per-path remote/local routing) |
| `src/llm.ts` | Add `embedBatch`/`embedModelName` to `LLM` interface, `intent` to `expandQuery` signature, `isRemoteModel()`, generalize singleton to `LLM` |
| `src/store.ts` | Change `LlamaCpp` type refs → `LLM` interface, graceful `tokenize()` fallback, derive `content_vectors` model label from `getDefaultLLM().embedModelName` |
| `src/collections.ts` | Add remote fields (`embed_api_*`, `rerank_api_*`, `expand_api_*`) to `ModelsConfig` |
| `src/cli/qmd.ts` | Auto-detect remote config, create `HybridLLM` when any remote endpoint is configured |
| `test/remote-llm.test.ts` | Unit tests (mock HTTP server, VCR-style): embed, rerank, expand, batching, auth, circuit breaker, HybridLLM routing, env-var config, intent forwarding |
| `test/remote-llm-integration.test.ts` | Integration tests against live vLLM (skipped when env absent): embed/rerank/expand end-to-end, including LOCAL_SENTINEL routing verification |
| `CHANGELOG.md` | Unreleased entry |
| `README.md` | Configuration docs + vLLM example |

Test plan

- 52 unit tests (test/remote-llm.test.ts) covering embed, batch, auth, dimension validation, circuit breaker, rerank, expand (lex/vec/hyde parsing, includeLexical filtering, query-term filtering, fallback on bad model output, intent forwarding), HybridLLM routing (remote vs local per path), config env vars / YAML fields
- 37 integration tests (test/remote-llm-integration.test.ts) covering live embed, dimension consistency, normalization, semantic similarity, batch, rerank relevance, edge cases, end-to-end search simulation, and expand end-to-end (live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM remote-vs-local routing verified via LOCAL_SENTINEL)
- Local-only path verified: getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
- Full test suite with bun test: only the existing 48 LlamaCpp-path tests fail, all of which require a local GGUF model that isn't on the test machine; all 89 RemoteLLM tests pass
- Tested in production deployment against vLLM running Qwen/Qwen3-Embedding-0.6B (embed), qwen3-reranker-4b (rerank), and Qwen3.6-35B (expand)

🤖 Generated with Claude Code

jhsmith409 · 2026-04-06T16:48:44Z

Test results from a live vLLM deployment

Ran the full test suite on feature/remote-openai-embed against real vLLM servers (no mocks):

| Endpoint | Model |
|---|---|
| Embedding (`http://192.168.x.x:x/v1`) | `Qwen/Qwen3-Embedding-0.6B` |
| Reranking (`http://192.168.x.x:x/v1`) | `qwen3-reranker-4b` |

Remote LLM tests (the tests added by this PR): ✅ 66/66 pass

bun test v1.3.11 (af24e281)

test/remote-llm-integration.test.ts:
  Embedding dimension: 1024
  Rerank scores: {
    "cookies.md": 0.37974488735198975,
    "quantum.md": 0.35188794136047363,
    "space.md": 0.35012370347976685,
    "baking.md": 0.12711383402347565,
  }
  Similarity ranking:
    git.md: 0.7216
    typescript.md: 0.4591
    docker.md: 0.4050
    cooking.md: 0.3419
    gardening.md: 0.3270

 66 pass
 0 fail
 1214 expect() calls
Ran 66 tests across 2 files. [1149.00ms]

Full suite: 699 pass, 48 fail

The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking) — expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.

This feature works well in practice. The remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing — happy to help test anything else if useful.

jhsmith409 · 2026-04-06T16:51:45Z

Full test suite — `bun test`

bun test v1.3.11 (af24e281)

 699 pass
 48 fail
 2720 expect() calls
Ran 747 tests across 20 files. [555.28s]

All 48 failures are in the local LlamaCpp path (token-based chunking, query expansion, local reranking, hybrid pipeline) — these require a local GGUF model to be downloaded, which isn't present on this machine. They are pre-existing failures unrelated to this PR.

Failures breakdown:

- Token-based Chunking: node-llama-cpp compile/load timeout (no local model)
- LlamaCpp Integration: expandQuery/rerank timeout (no local model)
- LLM Session Management: withLLMSession timeout (no local model)
- MCP Server > hybridQuery: LLM query expansion timeout (no local model)
- search > with LLM query expansion: same
- MCP HTTP Transport: depends on query expansion

Zero failures in: AST chunking, BM25, collections config, store paths, SDK, formatter, intent, multi-collection filter, RRF trace, structured search, store helpers, remote LLM — i.e., everything that doesn't require a local GGUF model passes cleanly. Existing tests are unaffected.

jhsmith409 · 2026-04-06T17:36:38Z

Two fixes found while deploying on a live system

While integrating this branch into production, I hit two issues and fixed them. Hopefully, I figured out how to incorporate them correctly into the PR:

1. `intent` missing from `LLM` interface / `ILLMSession` (`src/llm.ts`)

store.ts calls llm.expandQuery(query, { intent }) but the interface only declared { context?, includeLexical? }, so tsc failed to build:

src/store.ts(3191,50): error TS2353: Object literal may only specify known properties,
and 'intent' does not exist in type '{ context?: string | undefined; includeLexical?: boolean | undefined; }'

Fix — add intent? to both interface declarations:

-  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean }): Promise<Queryable[]>;
+  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean; intent?: string }): Promise<Queryable[]>;

(Same change needed on both ILLMSession at line ~172 and LLM at line ~359.)

2. `vectorIndex` logs and stores the default GGUF URI as the model label, even when using remote embedding (`src/cli/qmd.ts`)

vectorIndex has model: string = DEFAULT_EMBED_MODEL_URI as a default parameter and passes it straight to generateEmbeddings. When remote embedding is active, getStore() sets up a HybridLLM — but model still shows/stores the GGUF string, not Qwen/Qwen3-Embedding-0.6B.

Fix — derive the label from the actual configured LLM after getStore():

   const storeInstance = getStore();
   const db = storeInstance.db;

+  // Use the actual model name from the configured LLM (may be remote, not the default GGUF URI)
+  model = getDefaultLLM().embedModelName;
+
   if (force) {

After this fix, qmd embed -f correctly shows and stores Model: Qwen/Qwen3-Embedding-0.6B in content_vectors.model.

jhsmith409 · 2026-04-06T17:41:21Z

Both fixes above have been pushed to the branch: jhsmith409@6596448

jhsmith409 · 2026-04-08T15:47:35Z

All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers — 699/747 tests passing, all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.

viniciushsantana · 2026-04-08T23:41:30Z

great work! would you consider adding support for remote query expansion as well?

jhsmith409 · 2026-04-09T00:06:14Z

> great work! would you consider adding support for remote query expansion as well?

Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.

tobi · 2026-04-09T01:23:20Z

Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern)
And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

jhsmith409 · 2026-04-09T12:25:05Z

> Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern)
> And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

I'll tackle the first part (query expansion) and the unit tests, but I'll leave the qmd models serve option for someone else to implement. Does that work for you, tobi?

jhsmith409 · 2026-04-12T22:22:15Z

Remote query expansion is now implemented. Here's what was added (commit f8c6030):

Changes

`src/remote-llm.ts`

- expandQuery() now calls /chat/completions when expandApiModel is configured (throws "expandApiModel not configured" otherwise; no behavior change for existing users)
- supportsExpand getter for routing decisions
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() helper: parses lex:/vec:/hyde: lines, filters out variants with no term overlap with the original query, falls back gracefully on bad model output
- remoteConfigFromEnv() now reads QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
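
The parsing behavior described for parseExpandResponse() could look roughly like the sketch below. This is an illustration, not the helper from the PR: the `Queryable` shape, the `parseExpandSketch` and `termsOf` names, the word-splitting rule, and the exact fallback (returning the original query as a lex and a vec variant) are all assumptions.

```typescript
// Hypothetical sketch; the real Queryable type and fallback may differ.
type QueryType = "lex" | "vec" | "hyde";

interface Queryable {
  type: QueryType;
  text: string;
}

// Lowercase a string and split it into alphanumeric terms.
function termsOf(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term.length > 0);
}

function parseExpandSketch(
  raw: string,
  query: string,
  includeLexical = true,
): Queryable[] {
  const queryTerms = new Set(termsOf(query));
  const results: Queryable[] = [];

  for (const line of raw.split(/\r?\n/)) {
    // Accept only lines of the form "lex: ...", "vec: ...", or "hyde: ...".
    const match = /^\s*(lex|vec|hyde):\s*(.+)$/.exec(line);
    if (match === null) continue;
    const type = match[1] as QueryType;
    const text = match[2].trim();

    if (type === "lex" && !includeLexical) continue;
    // Drop variants that share no term with the original query.
    if (!termsOf(text).some((term) => queryTerms.has(term))) continue;

    results.push({ type, text });
  }

  if (results.length > 0) return results;

  // Assumed fallback: reuse the original query when nothing usable was parsed.
  const fallback: Queryable[] = [{ type: "vec", text: query }];
  if (includeLexical) fallback.unshift({ type: "lex", text: query });
  return fallback;
}
```

For the query `remote embedding`, a model response containing `hyde: Bananas are rich in potassium.` would have that line dropped by the overlap filter, while `lex: remote embedding server` would be kept.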

`src/hybrid-llm.ts`

expandQuery() routes to remote when remote instanceof RemoteLLM && remote.supportsExpand, otherwise falls back to local LlamaCpp — no interface changes
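
A stripped-down view of that routing rule is sketched below. The interfaces and the `HybridRouterSketch` class are hypothetical stand-ins, and the sketch uses a structural check on `supportsExpand` rather than the `instanceof RemoteLLM` test the PR describes, so it should be read as the shape of the decision rather than the actual implementation.

```typescript
// Hypothetical stand-ins for the local and remote backends.
interface ExpandBackend {
  expandQuery(query: string): Promise<string[]>;
}

interface RemoteBackend extends ExpandBackend {
  readonly supportsExpand: boolean;
}

class HybridRouterSketch implements ExpandBackend {
  private readonly local: ExpandBackend;
  private readonly remote: RemoteBackend | undefined;

  constructor(local: ExpandBackend, remote?: RemoteBackend) {
    this.local = local;
    this.remote = remote;
  }

  // Report which backend an expandQuery() call would use.
  expandRoute(): "remote" | "local" {
    return this.remote !== undefined && this.remote.supportsExpand
      ? "remote"
      : "local";
  }

  // Delegate to the remote backend only when it advertises expand support;
  // otherwise use the local backend, so callers see one unchanged interface.
  expandQuery(query: string): Promise<string[]> {
    if (this.remote !== undefined && this.remote.supportsExpand) {
      return this.remote.expandQuery(query);
    }
    return this.local.expandQuery(query);
  }
}
```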

`test/remote-llm.test.ts` (unit tests, mock HTTP server)

New expandQuery describe block covering:

- supportsExpand flag behavior
- /chat/completions payload shape (model, message roles, intent inclusion)
- Auth header: expandApiKey → falls back to embedApiKey
- lex/vec/hyde parsing, includeLexical: false filtering
- Fallback Queryable[] when model output is unparseable
- Query-term filtering (variants with no overlap are dropped)
- Circuit breaker trips after 3 failures
- HybridLLM routing: remote when expandApiModel set, local when not
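
For readers unfamiliar with the pattern, a circuit breaker that trips after 3 failures can be sketched as follows. Only the threshold of 3 comes from the test list above; the class name, the cooldown period, the half-open retry behavior, and the use of consecutive failures are assumptions made for illustration and may not match the PR.

```typescript
// Hypothetical sketch of a per-endpoint circuit breaker.
class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  // Assumption: the 30 second cooldown is arbitrary; only the threshold of 3
  // is mentioned in the PR.
  constructor(threshold = 3, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // True when a request may be attempted. While the breaker is open, requests
  // are refused until the cooldown has elapsed, after which a trial is allowed.
  canRequest(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure increments the count; reaching the threshold opens the breaker.
  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

// One breaker per endpoint, so a failing rerank server does not block embedding.
const breakersSketch = {
  embed: new CircuitBreakerSketch(),
  rerank: new CircuitBreakerSketch(),
  expand: new CircuitBreakerSketch(),
};
```

A caller would check `canRequest()` before each HTTP call and report the outcome with `recordSuccess()` or `recordFailure()`.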

`test/remote-llm-integration.test.ts` (live server)

- New VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars (tests skipped when absent)
- All three types returned, includeLexical: false, intent incorporation
- HybridLLM routing verified via LOCAL_SENTINEL sentinel value

Test results

- Unit tests: 52/52 pass
- Integration tests (live vLLM): 37/37 pass
- Full suite: 773 pass, 48 fail (same pre-existing LlamaCpp failures as before, no regressions)

Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add intent? to LLM interface and ILLMSession expandQuery signature (store.ts passes { intent } but interface didn't declare it, causing a tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after getStore() so content_vectors.model reflects the actual LLM in use (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that don't share a word with the original query, falls back gracefully on bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand, otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header fallback, lex/vec/hyde parsing, includeLexical=false filtering, fallback on bad output, query-term filtering, circuit breaker, HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM routing verified via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars, skipped when absent)

jhsmith409 · 2026-04-21T12:26:01Z

@tobi I added remote query expansion. Could you take another look at this PR please?

jhsmith409 · 2026-04-26T15:20:42Z

Hey @tobi, friendly bump on this. Query expansion is implemented per your earlier ask, with VCR-pattern unit tests + live integration tests. Branch is clean against main, all 66/66 RemoteLLM tests pass against live vLLM (the 48 LlamaCpp-path failures pre-date this PR and need a local GGUF that wasn't on the test machine).

Skipped qmd models serve per the discussion above — happy to leave that for a follow-up PR by someone else.

Anything else needed before merge?

jhsmith409 · 2026-04-26T22:30:39Z

@tobi heads-up: I just corrected the PR description, which was stale and was telling reviewers that query expansion stayed local. The branch in fact has all three remote paths fully implemented and independently configurable:

- Embedding → POST /v1/embeddings (QMD_EMBED_API_*)
- Reranking → POST /v1/rerank (QMD_RERANK_API_*)
- Query expansion → POST /v1/chat/completions (QMD_EXPAND_API_*), added in commit f2fd64e per your earlier ask, with VCR-pattern unit tests + live integration tests

Each path falls back to local LlamaCpp if its env vars / YAML fields aren't set, so existing local-only users see no behavior change. Operators can mix-and-match (e.g. remote embed + remote rerank + local expand). See the new "What's now remote-capable" table in the description for the matrix.

PR is mergeable, branch is clean against main. qmd models serve (your other suggestion) was deliberately left out of scope for a follow-up — happy to revisit if you'd like it bundled here.

azogheb · 2026-06-08T22:23:48Z

@jhsmith409 thank you thank you for all your work on this PR. I want it to land SO BAD so that CPU only VMs running OpenClaw can use QMD. Plz @tobi :)

jhsmith409 force-pushed the feature/remote-openai-embed branch from 8e26a6c to f640303 Compare April 6, 2026 14:23

Jim Smith and others added 3 commits April 12, 2026 18:26

jhsmith409 force-pushed the feature/remote-openai-embed branch from f8c6030 to f2fd64e Compare April 12, 2026 22:27

lukeboyett mentioned this pull request Apr 15, 2026

feat(llm): add remote embedding/reranking via OpenAI-compatible endpoints #575

Closed

This was referenced May 5, 2026

Avoid local llama startup during remote embedding georgelichen/qmd#1

Open

Add remote embedding, reranking, and query expansion support #629

Open

This was referenced Jun 10, 2026

feat: remote endpoint support for QMD (embed, expand, rerank, generate) unithejerk/qmd#1

Closed

feat: remote endpoint support for QMD (embed, expand, rerank, generate) #720

Open


4 participants
