feat(llm): OpenAI-compatible remote embedding, reranking & query expansion (RemoteLLM + HybridLLM) by Kaspre · Pull Request #705 · tobi/qmd

Kaspre · 2026-06-02T15:22:12Z

Problem

QMD's BM25 + vector + query-expansion + LLM-rerank pipeline has a hard dependency on local GGUF models via node-llama-cpp. Devices with an inadequate GPU, or none at all, fall back to CPU embedding with 300M–0.6B-parameter models, which is unacceptably slow. node-llama-cpp also serialises inference through a single context, so concurrent workers queue and time out. Moving embedding/reranking/expansion to a dedicated server (vLLM, TEI, Ollama, llama.cpp, LiteLLM, OpenAI, …) removes both problems.

This builds directly on @georgelichen's #629, which contributed the core RemoteLLM/HybridLLM architecture. This PR rebases that scaffold onto current main and adds five refinements (A–E below) needed to make it production-correct against real OpenAI-compatible servers.

What this adds

A backend that talks to any OpenAI-compatible HTTP API, composed with the existing local backend so each operation can be routed independently. Fully opt-in — with no *_api_url configured, behaviour is byte-for-byte the current local-only path.

RemoteLLM — /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint circuit breakers (an embed outage doesn't take down rerank/chat); bearer auth; char-based token approximation so chunking works without a local tokenizer.
HybridLLM — routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp, with per-operation fallback.
Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote, embedModelName, rerankModelName?, generateModelName?, usesRemoteEmbedding?, supportsRerank/supportsExpand) and getDefaultLLM/setDefaultLLM alongside the existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).
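
To illustrate why per-endpoint breakers keep an embed outage from affecting rerank or chat, here is a minimal sketch. The class name, failure threshold, and cooldown are assumptions made for illustration and are not taken from the PR's code.

```typescript
// Hypothetical sketch: names and thresholds are illustrative, not QMD's actual API.
type Endpoint = "embed" | "rerank" | "chat";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures; // assumed threshold
    this.cooldownMs = cooldownMs;   // assumed cooldown
    this.now = now;                 // injectable clock, handy for tests
  }

  /** True while the breaker is open and the cooldown has not elapsed. */
  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close again so the next request acts as a trial.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }
}

// One breaker per endpoint, so failures are counted independently.
const breakers: Record<Endpoint, CircuitBreaker> = {
  embed: new CircuitBreaker(),
  rerank: new CircuitBreaker(),
  chat: new CircuitBreaker(),
};
```

With this layout, three consecutive failures recorded on `breakers.embed` open only that breaker; `breakers.rerank` and `breakers.chat` keep accepting requests.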

The five refinements on top of #629

A — Sigmoid rerank normalization. llama.cpp's /v1/rerank (e.g. bge-reranker-v2-m3) and most cross-encoders return log-odds (~−10…+10), not 0–1. QMD's score blend and the --min-score 0.3 default assume probabilities, so without σ(x)=1/(1+e⁻ˣ) every blended score goes negative and every query returns "No results found". No-op for backends already emitting 0–1.
B — RemoteLLM.expandQuery. Query expansion via /v1/chat/completions, emitting the same lex:/vec:/hyde: line format as the local model; falls back to local on error.
C — Pre-flight embed probe. When usesRemoteEmbedding, embed one token at vectorIndex() startup so a bad URL / wrong model / auth failure surfaces immediately instead of silently falling back mid-batch.
D — HybridLLM.rerank local fallback. Symmetric with the expandQuery fallback already in #629: if the remote backend doesn't support rerank, fall back to local LlamaCpp.rerank rather than blindly POSTing /v1/rerank at an embed-only server. Lets you mix "remote embed + local rerank" freely.
E — Rerank oversized-payload recovery. When the server rejects a rerank request as too large (HTTP 413 / "too large to process" / context-length), RemoteLLM.rerank recursively bisects the batch and halve-truncates a single oversized document (down to a 32-char floor), remapping response indices to the originals and re-sorting by normalized score. Non-oversized errors still propagate to the circuit breaker / local fallback. Adapted from the recovery in @loopyd's #619.
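
Refinement A can be sketched in a few lines. The function names below are hypothetical; the rule of applying the sigmoid only when some score falls outside [0, 1] follows the behaviour described in this PR's later commit, and the rest is a minimal illustration rather than the PR's actual code.

```typescript
/** Logistic function: maps a log-odds value to a probability in (0, 1). */
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

/**
 * Normalize rerank scores before blending. If every score is already inside
 * [0, 1] the set is returned unchanged; if any score falls outside that range
 * the whole set is treated as log-odds and passed through the sigmoid.
 */
function normalizeRerankScores(scores: number[]): number[] {
  const looksLikeLogits = scores.some((s) => s < 0 || s > 1);
  return looksLikeLogits ? scores.map(sigmoid) : scores.slice();
}
```

For example, a raw score of 0 maps to 0.5, a strongly negative log-odds score maps close to 0 instead of dragging the blended score below zero, and a set such as `[0.2, 0.9]` passes through untouched.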

Architecture

flowchart TD
    Q["qmd: index / query"] --> H{"HybridLLM<br/>remote config present?"}
    H -->|no| L0["LlamaCpp only — unchanged default"]
    H -->|yes| E["embed / embedBatch"] & RK["rerank"] & EX["expandQuery"] & G["generate"] & TK["tokenize / detokenize"]

    E --> R1["RemoteLLM → POST /v1/embeddings<br/>＋ pre-flight 1-token probe ◆C"]
    RK --> RKQ{"remote supportsRerank?"}
    RKQ -->|yes| R2["RemoteLLM → POST /v1/rerank<br/>＋ σ: log-odds→0‥1 ◆A<br/>＋ oversized split/truncate recovery ◆E"]
    RKQ -->|no| L1["LlamaCpp local rerank ◆D"]
    EX --> EXQ{"remote supportsExpand?"}
    EXQ -->|yes| R3["RemoteLLM → POST /v1/chat/completions ◆B"]
    EXQ -->|no| L2["LlamaCpp local expansion"]
    G --> L3["LlamaCpp local"]
    TK --> L4["LlamaCpp local (char-approx if usesRemoteEmbedding)"]
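
The rerank branch of the flowchart can be sketched as follows. The interface and class names are simplified stand-ins (QMD's real LLM interface is wider), so treat this as an illustration of the routing decision under those assumptions, not the PR's implementation.

```typescript
// Illustrative types only; QMD's real LLM interface is wider than this.
interface RerankResult {
  index: number;
  score: number;
}

interface RerankBackend {
  supportsRerank: boolean;
  rerank(query: string, docs: string[]): Promise<RerankResult[]>;
}

class HybridRerankSketch {
  private readonly remote: RerankBackend | null;
  private readonly local: RerankBackend;

  constructor(remote: RerankBackend | null, local: RerankBackend) {
    this.remote = remote;
    this.local = local;
  }

  async rerank(query: string, docs: string[]): Promise<RerankResult[]> {
    // No remote configured, or the remote is embed-only: go straight to local.
    if (this.remote === null || !this.remote.supportsRerank) {
      return this.local.rerank(query, docs);
    }
    try {
      return await this.remote.rerank(query, docs);
    } catch {
      // Remote failed (outage, open breaker, ...): fall back to local.
      return this.local.rerank(query, docs);
    }
  }
}
```

The same shape would apply to expandQuery with `supportsExpand`; generate and tokenize never consult the remote backend in this design.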

Configuration

Any OpenAI-compatible server works — point each operation wherever you like. Example (~/.config/qmd/index.yml):

models:
  embed_api_url:   http://embed-host:8081/v1   # vLLM / TEI / llama.cpp / OpenAI / Ollama
  embed_api_model: nomic-embed-text
  # rerank/expand are optional and independent — omit to keep them local:
  rerank_api_url:   http://rerank-host:8082/v1
  rerank_api_model: bge-reranker-v2-m3
  expand_api_url:   https://your-chat-endpoint/v1
  expand_api_model: <chat-model>
  # *_api_key optional (bearer)

Each *_api_url is independent: set only embed_api_url and rerank/expand stay local (fix D); set all three to go fully remote. Leave them all unset and nothing changes.

Backward compatibility

No breaking changes. All new behaviour is opt-in via config; local-only setups are unaffected; getDefaultLlamaCpp/setDefaultLlamaCpp retained.

Testing

Build (node scripts/build.mjs) and tsc --noEmit are clean. Under the project's CI test mode (CI=true, matching upstream CI), this branch shows no new failures vs main — the deterministic failing set is identical on both (the only failures are pre-existing WSL/Linux Git-Bash-path and skill-bundling tests, unrelated to this change). The new test/remote-llm.test.ts suite passes in full: embed/rerank routing, per-endpoint circuit breakers + independence, sigmoid normalization, oversized-rerank split/truncate recovery, auth, timeouts, expandQuery parsing + fallback, and the HybridLLM rerank fallback — all driven by in-process HTTP servers, no mocks.

This configuration runs in production today: embedding + reranking against two llama.cpp server instances and query expansion against a cloud chat endpoint — which is where refinements A, B, C (and the need for E) came from.

Related work & credit

This PR aims to consolidate the OpenAI-compatible-backend effort into one production-ready change, building on and crediting the prior attempts:

@georgelichen (#629): the RemoteLLM/HybridLLM scaffold this PR is built on. Refinements A & D were discussed on the #629 / #575 threads.
@loopyd (#619): an independent OpenAI-compatible backend (tested against llama-swap); refinement E is adapted from its rerank recovery. #619 uses a whole-backend provider switch (one base URL + server-side model aliases); this PR's HybridLLM generalizes that to independent per-operation endpoints with local fallback. The single-server case is just embed/rerank/expand_api_url all pointing at one server.
@jonesj38 (#689): an OpenAI-specific backend; this PR's generic path covers the OpenAI case purely by configuration (embed_api_url: https://api.openai.com/v1) without hardcoding a provider.

Follow-ups (not in this PR)

Byte-based embed batching (maxBatchBytes + chunk strategies + --max-batch-mb / --chunk-strategy flags), as in #619. This PR keeps the existing count-based batching (maxBatchSize); byte-aware batching is a natural follow-on.

@georgelichen

…nsion

Add a RemoteLLM backend that talks to any OpenAI-compatible HTTP API (vLLM, TEI, Ollama, llama.cpp --server, LiteLLM, OpenAI, ...), composed with the local LlamaCpp via a HybridLLM that routes each operation independently:

- RemoteLLM: /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint circuit breakers; bearer auth; char-based token approximation so chunking works without a local tokenizer.
- HybridLLM: embed/embedBatch/rerank/expandQuery -> remote, generate/tokenize/detokenize -> local, with per-operation local fallback.
- Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote, embedModelName, rerankModelName, generateModelName, usesRemoteEmbedding, supportsRerank/supportsExpand) plus getDefaultLLM/setDefaultLLM alongside the existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).
- Sigmoid normalization of log-odds rerank scores; RemoteLLM.expandQuery via chat completions; startup pre-flight embed probe; HybridLLM.rerank local fallback (symmetric with the expandQuery fallback).

Opt-in via models.*_api_url / QMD_*_API_* env vars; with nothing configured the local-only path is byte-for-byte unchanged.

Builds on @georgelichen's remote-LLM work in tobi#629.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@loopyd

…ncating

When a remote /v1/rerank request is rejected as too large (HTTP 413 / "too large to process" / context length), RemoteLLM.rerank recursively bisects the batch and halve-truncates a single oversized document down to a 32-char floor, remapping response indices to the originals and re-sorting by score. Non-oversized errors still propagate so the circuit breaker / HybridLLM local fallback can react.

Adapted from the rerank recovery in tobi#619 (@loopyd).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
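
The recovery described in this commit can be sketched as a small recursive function. Everything named here (`SendRerank`, `isOversizedError`, `rerankWithRecovery`, `rerankSorted`) is hypothetical, the error-matching patterns are assumptions based on the wording above, and scores are assumed to be normalized already. The 32-char floor is the one stated in the commit message.

```typescript
interface Scored {
  index: number;
  score: number;
}

// Assumed transport hook: sends one batch, returns scores indexed within that batch.
type SendRerank = (docs: string[]) => Promise<Scored[]>;

const MIN_DOC_CHARS = 32; // truncation floor from the commit message

/** Assumed detection rule: match HTTP 413 or "too large" style messages. */
function isOversizedError(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err);
  return /\b413\b|too large to process|context[ -]length/i.test(msg);
}

async function rerankWithRecovery(docs: string[], send: SendRerank, offset = 0): Promise<Scored[]> {
  try {
    const results = await send(docs);
    // Remap batch-local indices back to positions in the original list.
    return results.map((r) => ({ index: r.index + offset, score: r.score }));
  } catch (err) {
    if (!isOversizedError(err)) throw err; // let breaker / fallback handle it
    if (docs.length > 1) {
      // Bisect the batch and recurse on each half.
      const mid = Math.ceil(docs.length / 2);
      const left = await rerankWithRecovery(docs.slice(0, mid), send, offset);
      const right = await rerankWithRecovery(docs.slice(mid), send, offset + mid);
      return [...left, ...right];
    }
    const doc = docs[0];
    // A single document at or below the floor cannot be shrunk further.
    if (doc === undefined || doc.length <= MIN_DOC_CHARS) throw err;
    const half = Math.max(MIN_DOC_CHARS, Math.floor(doc.length / 2));
    return rerankWithRecovery([doc.slice(0, half)], send, offset);
  }
}

/** Run the recovery, then sort by score, highest first. */
async function rerankSorted(docs: string[], send: SendRerank): Promise<Scored[]> {
  const results = await rerankWithRecovery(docs, send);
  return results.sort((a, b) => b.score - a.score);
}
```

Each recursive step either shrinks the batch or shortens the single remaining document, so the recursion terminates: it ends in a successful request or rethrows the oversized error once the document is at the floor.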

- remoteConfigFromEnv: throw on a half-configured remote backend (embed_api_url set without embed_api_model, or vice-versa) instead of silently falling back to the local backend and skipping the remote pre-flight probe.
- RemoteLLM.rerank: normalize scores once over the full (possibly recovery-split) result set, applying sigmoid only when logit-range values are present (any score < 0 or > 1). Rerankers that already return [0,1] probabilities (Cohere/Voyage-style) pass through unchanged, avoiding distortion of the blend and --min-score filtering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
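
The half-configured check in the first bullet might look roughly like this. `resolveEmbedEndpoint` and `RemoteEndpoint` are hypothetical names chosen for this sketch; the PR's own function is remoteConfigFromEnv, whose exact signature is not shown here.

```typescript
interface RemoteEndpoint {
  url: string;
  model: string;
}

/**
 * Resolve the remote embed endpoint from two optional settings.
 * Returns null when neither is set (stay local) and throws when only one is,
 * rather than silently falling back to the local backend.
 */
function resolveEmbedEndpoint(url: string | undefined, model: string | undefined): RemoteEndpoint | null {
  if (!url && !model) return null;
  if (!url || !model) {
    throw new Error("Remote embedding is half-configured: set both embed_api_url and embed_api_model");
  }
  return { url, model };
}
```

Failing fast here matters because a silent local fallback would also skip the pre-flight probe, so the misconfiguration would only show up later as unexpectedly slow local embedding.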

Kaspre · 2026-06-03T21:44:34Z

Follow-up hardening has been pushed in db5a32e.

Latest changes:

Preserve the model identity that actually produced rerank scores when writing rerank cache entries, so local fallback scores are not cached under a remote reranker key.
Reject SDK store.embed({ model }) overrides when remote embedding is configured and the requested model differs from the configured remote embed model.
Thread the per-store LLM through SDK searchVector() query embedding so remote SDK stores do not accidentally fall back to the global/default local LLM.
Added focused regressions for rerank fallback cache identity, remote embed model override handling, and SDK searchVector() per-store LLM routing.
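
The rerank cache fix in the first item can be illustrated with a small sketch. The cache shape, the key format, and the names `RerankOutcome` and `RerankCacheSketch` are assumptions for illustration; the point is only that the key is derived from the model that actually produced the scores.

```typescript
// Hypothetical shapes for illustration; QMD's cache layer is not shown in the PR text.
interface RerankOutcome {
  scores: number[];
  producedBy: string; // model that actually ran, e.g. the local model after a fallback
}

class RerankCacheSketch {
  private readonly entries = new Map<string, number[]>();

  private key(model: string, query: string): string {
    return JSON.stringify([model, query]);
  }

  /** Store under the producing model, not the model that was requested. */
  put(query: string, outcome: RerankOutcome): void {
    this.entries.set(this.key(outcome.producedBy, query), outcome.scores);
  }

  get(model: string, query: string): number[] | undefined {
    return this.entries.get(this.key(model, query));
  }
}
```

If the remote reranker is down and the local model answers, the entry lands under the local model's key, so a later lookup for the remote reranker misses instead of returning scores from a different model.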

Validation run:

CI=true node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/store.test.ts test/sdk.test.ts passed: 289 passed, 19 skipped.
test/remote-llm.test.ts passed outside the sandbox: 51 passed. The sandboxed run hit the known local listen EPERM restriction while starting its in-process HTTP server.
pnpm build passed.
node ./node_modules/typescript/bin/tsc -p tsconfig.build.json --noEmit passed.
git diff --check passed.
A multi-lane read-only review pass was run; it found the searchVector() per-store LLM issue above, which is now fixed, and the follow-up review came back clean.

Maintainers: this should be ready for review again.

p0pfan · 2026-06-06T16:17:38Z

I really hope this PR gets merged soon. I truly need this feature to use qmd on machines without a GPU.

udo76 · 2026-06-06T19:13:12Z

Running this on a Mac Mini M4 gateway with a dedicated GPU server (AMD Strix Halo, 128GB) hosting Qwen3-Embedding-4B and Qwen3-Reranker-4B on OpenAI-compatible endpoints (llama.cpp server).
Currently stuck with local 4B GGUF as workaround - works but ties up M4 resources.

The key use case for us: serving embedding and reranking to multiple machines on the network that are too weak to run meaningful models locally - thin clients, ARM64 SBCs, VPS nodes. A single GPU box providing high-quality retrieval for the whole fleet.
PR #705's HybridLLM per-operation fallback design matches this perfectly.

Happy to test any of the PRs against our endpoints.

…#97)

* feat(spine): add remote LLM support via QMD PR #705 (OpenAI-compatible embedding)

Vendor a pre-built @tobilu/qmd 2.5.3 from Kaspre/qmd#feat/remote-llm-openai-compatible (tobi/qmd#705), which adds RemoteLLM/HybridLLM backends for routing embedding, reranking, and query expansion to any OpenAI-compatible API endpoint.

- spine/vendor/tobilu-qmd-2.5.3.tgz: built dist from the PR branch (dist/ is gitignored upstream so a github: dep would install a broken package)
- spine/package.json: switch from ^2.5.1 registry dep to file:./vendor tarball
- spine/.gitignore: allow vendor/*.tgz so the tarball is tracked by git
- spine/src/config.ts: add QmdModelsConfig + getQmdModelsConfig() to read [spine.qmd] from config.toml (embed_api_url/model, rerank, expand settings)
- spine/src/search.ts: pass models: getQmdModelsConfig() to createStore so remote embedding activates when configured; undefined preserves local-only path
- config.toml.example: document the new [spine.qmd] section (all commented out)

Remote embedding is fully opt-in: omitting [spine.qmd] leaves behaviour unchanged. Env vars QMD_EMBED_API_URL / QMD_EMBED_API_MODEL also work and take precedence over config.toml.

* docs(spine): address PR review findings on remote QMD config

- Add comment to QmdModelsConfig clarifying it's an intentional structural subset of QMD's internal ModelsConfig (not in public exports)
- Add comment to getQmdModelsConfig() confirming env vars take precedence via QMD's remoteConfigFromEnv(), and that QMD itself throws on partial embed config (url without model, or vice versa)
- Add vendor removal note in package.json for when @tobilu/qmd >=2.6.0 lands on npm

* fix(spine): remove invalid //vendor comment from package.json

bun install treats //vendor as a real package name and tries to resolve it from npm, breaking CI. Remove the comment entry; the intent is captured in git history.

tobi · 2026-06-08T16:47:55Z

Thank you (and all the others who worked on PRs related to this)

georgelichen · 2026-06-13T23:27:21Z

This is a strong PR overall. I really like the direction here: the RemoteLLM/HybridLLM split is thoughtful, the production hardening is substantial, and a lot of the follow-up work from #629 is integrated in a pragmatic way. Thanks for pushing this forward.

I went through the diff pretty carefully and found three things that I think are worth fixing before merge:

1. I think the publish workflow will fail to authenticate to npm as written.
   In .github/workflows/publish.yml, npm publish --provenance --access public no longer passes NODE_AUTH_TOKEN. actions/setup-node writes the registry config, but it still expects the token to be present in the environment. Without that, the release job looks correct but will fail at publish time.
2. I think the new launcher mis-detects installed packages on Windows.
   In bin/qmd, the installed-package heuristic uses pkgDir.split("/") to look for a node_modules segment. On Windows paths that never matches, so an installed package can be mistaken for a source checkout and incorrectly routed into source mode again. That seems to reintroduce the same class of Windows install/runtime issues this launcher rewrite is trying to fix.
3. I think --index path normalization is still incomplete on Windows.
   In src/cli/qmd.ts and src/collections.ts, the “treat this like a path and normalize it into an index name” branch only runs when the argument contains /. On Windows, --index C:\tmp\demo skips that path-normalization path, which can send the DB path and config path to unexpected locations. I reproduced that locally with Node path resolution.
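
One possible direction for the two Windows findings is to treat both separators as path delimiters. The helper names below are hypothetical and the sketch is not taken from bin/qmd or src/cli/qmd.ts; it only shows the separator-agnostic checks the review is asking for.

```typescript
/** True when any path segment is "node_modules", for POSIX or Windows separators. */
function isInsideNodeModules(pkgDir: string): boolean {
  return pkgDir.split(/[\\/]+/).some((segment) => segment === "node_modules");
}

/**
 * True when an --index argument looks like a filesystem path rather than a
 * bare index name: it contains a slash or backslash, or starts with a drive letter.
 */
function looksLikePath(arg: string): boolean {
  return /[\\/]/.test(arg) || /^[A-Za-z]:/.test(arg);
}
```

With these checks, `C:\tmp\demo` is classified as a path and a package installed under `C:\...\node_modules\...` is recognised as installed, while a bare name such as `demo` still takes the index-name branch.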

None of these feel far from fixable, and I still think the overall PR is heading in a very good direction. I just wanted to flag these before merge so they do not turn into release-time or Windows-specific regressions.

Kaspre and others added 3 commits June 2, 2026 11:20

This was referenced Jun 2, 2026

Add remote embedding, reranking, and query expansion support #629 (Open)

feat(Add OpenAI-compatible backend support) #619 (Open)

Feat/OpenAI embeddings #689 (Closed)

feat: OpenAI-compatible remote embedding, reranking, and query expansion rghamilton3/qmd#1 (Closed)

Kaspre added 3 commits June 2, 2026 11:56

fix(remote): align hybrid backend wiring (6c07938)

fix(remote): harden expansion cache and breaker (46539d8)

fix(remote): preserve producing LLM identity (db5a32e)

rghamilton3 mentioned this pull request Jun 7, 2026

feat(spine): remote LLM embedding via QMD PR #705 (OpenAI-compatible) rghamilton3/lattice#97 (Merged, 4 tasks)

This was referenced Jun 10, 2026

feat: remote endpoint support for QMD (embed, expand, rerank, generate) unithejerk/qmd#1 (Closed)

feat: remote endpoint support for QMD (embed, expand, rerank, generate) #720 (Open)

Kaspre wants to merge 6 commits into tobi:main from Kaspre:feat/remote-llm-openai-compatible

5 participants
