feat(llm): OpenAI-compatible remote embedding, reranking & query expansion (RemoteLLM + HybridLLM)#705

Open
Kaspre wants to merge 6 commits into tobi:main from Kaspre:feat/remote-llm-openai-compatible
Kaspre commented Jun 2, 2026

Problem

QMD's BM25 + vector + query-expansion + LLM-rerank pipeline has a hard dependency on local GGUF models via node-llama-cpp. Devices with an inadequate GPU, or none at all, fall back to CPU embedding with 300M–600M-parameter models, which is unacceptably slow. node-llama-cpp also serialises inference through a single context, so concurrent workers queue and time out. Moving embedding/reranking/expansion to a dedicated server (vLLM, TEI, Ollama, llama.cpp, LiteLLM, OpenAI, …) removes both problems.

This builds directly on @georgelichen's #629, which contributed the core RemoteLLM/HybridLLM architecture. This PR rebases that scaffold onto current main and adds five refinements (A–E below) needed to make it production-correct against real OpenAI-compatible servers.

What this adds

A backend that talks to any OpenAI-compatible HTTP API, composed with the existing local backend so each operation can be routed independently. Fully opt-in — with no *_api_url configured, behaviour is byte-for-byte the current local-only path.

  • RemoteLLM: /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint circuit breakers (an embed outage doesn't take down rerank/chat); bearer auth; char-based token approximation so chunking works without a local tokenizer.
  • HybridLLM — routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp, with per-operation fallback.
  • Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote, embedModelName, rerankModelName?, generateModelName?, usesRemoteEmbedding?, supportsRerank/supportsExpand) and getDefaultLLM/setDefaultLLM alongside the existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).
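
To illustrate the per-endpoint circuit breakers mentioned above, here is a minimal sketch. The class name, thresholds, and the `guarded` helper are hypothetical and are not taken from this PR's code; the only point is that each endpoint owns its own breaker, so repeated embed failures cannot open the rerank or chat breaker.

```typescript
// Hypothetical sketch: names and thresholds are illustrative, not the PR's implementation.
type Endpoint = "embed" | "rerank" | "chat";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30_000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** True while the breaker is open and the cooldown has not yet elapsed. */
  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and allow a fresh attempt.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }
}

// One breaker per endpoint, so failures on one never open another.
const breakers: Record<Endpoint, CircuitBreaker> = {
  embed: new CircuitBreaker(),
  rerank: new CircuitBreaker(),
  chat: new CircuitBreaker(),
};

/** Runs `call` unless the endpoint's breaker is open, and records the outcome. */
async function guarded<T>(endpoint: Endpoint, call: () => Promise<T>): Promise<T> {
  const breaker = breakers[endpoint];
  if (breaker.isOpen()) throw new Error(`${endpoint} circuit open`);
  try {
    const result = await call();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

In a hybrid setup, a caller could treat the "circuit open" error like any other remote failure and fall back to the local backend for that operation.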

The five refinements on top of #629

  • A — Sigmoid rerank normalization. llama.cpp's /v1/rerank (e.g. bge-reranker-v2-m3) and most cross-encoders return log-odds (~−10…+10), not 0–1. QMD's score blend and the --min-score 0.3 default assume probabilities, so without σ(x)=1/(1+e⁻ˣ) every blended score goes negative and every query returns "No results found". No-op for backends already emitting 0–1.
  • B — RemoteLLM.expandQuery. Query expansion via /v1/chat/completions, emitting the same lex:/vec:/hyde: line format as the local model; falls back to local on error.
  • C — Pre-flight embed probe. When usesRemoteEmbedding, embed one token at vectorIndex() startup so a bad URL / wrong model / auth failure surfaces immediately instead of silently falling back mid-batch.
  • D — HybridLLM.rerank local fallback. Symmetric with the expandQuery fallback already in #629: if the remote backend doesn't support rerank, fall back to local LlamaCpp.rerank rather than blindly POSTing /v1/rerank at an embed-only server. Lets you mix "remote embed + local rerank" freely.
  • E — Rerank oversized-payload recovery. When the server rejects a rerank request as too large (HTTP 413 / "too large to process" / context-length), RemoteLLM.rerank recursively bisects the batch and halve-truncates a single oversized document (down to a 32-char floor), remapping response indices to the originals and re-sorting by normalized score. Non-oversized errors still propagate to the circuit breaker / local fallback. Adapted from the recovery in @loopyd's #619.
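
The recovery in refinement E can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `RerankCall`, `isOversized`, and `rerankWithRecovery` are hypothetical names, and the oversized check (a `status` of 413 or a matching message) is a simplification of whatever the real client inspects.

```typescript
// Hypothetical sketch: types, names, and the error check are illustrative assumptions.
interface Scored { index: number; score: number }
type RerankCall = (query: string, docs: string[]) => Promise<Scored[]>;

const MIN_CHARS = 32; // truncation floor mentioned in the description

/** Assumed shape of an "oversized payload" rejection. */
function isOversized(err: unknown): boolean {
  const e = err as { status?: number; message?: string } | null | undefined;
  return e?.status === 413 || /too large|context length/i.test(e?.message ?? "");
}

async function rerankWithRecovery(
  call: RerankCall,
  query: string,
  docs: string[],
  offset = 0, // position of docs[0] in the caller's original list
): Promise<Scored[]> {
  try {
    const scored = await call(query, docs);
    // Remap indices from this sub-batch back to the original list.
    return scored.map((s) => ({ index: s.index + offset, score: s.score }));
  } catch (err) {
    if (!isOversized(err)) throw err; // other errors propagate unchanged
    if (docs.length > 1) {
      // Bisect the batch and recover each half independently.
      const mid = Math.floor(docs.length / 2);
      const left = await rerankWithRecovery(call, query, docs.slice(0, mid), offset);
      const right = await rerankWithRecovery(call, query, docs.slice(mid), offset + mid);
      return [...left, ...right];
    }
    // A single document is still too large: halve it, down to the floor.
    const doc = docs[0] ?? "";
    if (doc.length <= MIN_CHARS) throw err;
    const half = Math.max(MIN_CHARS, Math.floor(doc.length / 2));
    return rerankWithRecovery(call, query, [doc.slice(0, half)], offset);
  }
}

async function rerank(call: RerankCall, query: string, docs: string[]): Promise<Scored[]> {
  const results = await rerankWithRecovery(call, query, docs);
  return results.sort((a, b) => b.score - a.score); // best first
}
```

Note that a document recovered by truncation is scored on its truncated text only, so its score is an approximation of what the full text would have received.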

Architecture

flowchart TD
    Q["qmd: index / query"] --> H{"HybridLLM<br/>remote config present?"}
    H -->|no| L0["LlamaCpp only — unchanged default"]
    H -->|yes| E["embed / embedBatch"] & RK["rerank"] & EX["expandQuery"] & G["generate"] & TK["tokenize / detokenize"]

    E --> R1["RemoteLLM → POST /v1/embeddings<br/>+ pre-flight 1-token probe ◆C"]
    RK --> RKQ{"remote supportsRerank?"}
    RKQ -->|yes| R2["RemoteLLM → POST /v1/rerank<br/>+ σ: log-odds→0‥1 ◆A<br/>+ oversized split/truncate recovery ◆E"]
    RKQ -->|no| L1["LlamaCpp local rerank ◆D"]
    EX --> EXQ{"remote supportsExpand?"}
    EXQ -->|yes| R3["RemoteLLM → POST /v1/chat/completions ◆B"]
    EXQ -->|no| L2["LlamaCpp local expansion"]
    G --> L3["LlamaCpp local"]
    TK --> L4["LlamaCpp local (char-approx if usesRemoteEmbedding)"]

Configuration

Any OpenAI-compatible server works — point each operation wherever you like. Example (~/.config/qmd/index.yml):

models:
  embed_api_url:   http://embed-host:8081/v1   # vLLM / TEI / llama.cpp / OpenAI / Ollama
  embed_api_model: nomic-embed-text
  # rerank/expand are optional and independent — omit to keep them local:
  rerank_api_url:   http://rerank-host:8082/v1
  rerank_api_model: bge-reranker-v2-m3
  expand_api_url:   https://your-chat-endpoint/v1
  expand_api_model: <chat-model>
  # *_api_key optional (bearer)

Each *_api_url is independent: set only embed_api_url and rerank/expand stay local (fix D); set all three to go fully remote. Leave them all unset and nothing changes.
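
As a rough illustration of that independence, the sketch below derives a per-operation route from the `models` block. The field names follow the YAML above; `ModelsConfig`, `resolveRoutes`, and the route labels are hypothetical and may not match the PR's internals. The URL/model pairing check mirrors the half-configured-embed behaviour described in a later commit on this branch.

```typescript
// Hypothetical sketch: field names follow the YAML above; everything else is assumed.
interface ModelsConfig {
  embed_api_url?: string;
  embed_api_model?: string;
  rerank_api_url?: string;
  rerank_api_model?: string;
  expand_api_url?: string;
  expand_api_model?: string;
}

type Operation = "embed" | "rerank" | "expand";
type Route = "remote" | "local";

function resolveRoutes(models: ModelsConfig = {}): Record<Operation, Route> {
  // Reject a half-configured embed backend rather than silently going local.
  const hasUrl = Boolean(models.embed_api_url);
  const hasModel = Boolean(models.embed_api_model);
  if (hasUrl !== hasModel) {
    throw new Error("embed_api_url and embed_api_model must be set together");
  }

  // Each operation is routed on its own URL; unset means local.
  const route = (url?: string): Route => (url ? "remote" : "local");
  return {
    embed: route(models.embed_api_url),
    rerank: route(models.rerank_api_url),
    expand: route(models.expand_api_url),
  };
}

// Only embedding is configured, so rerank and expand stay local.
const routes = resolveRoutes({
  embed_api_url: "http://embed-host:8081/v1",
  embed_api_model: "nomic-embed-text",
});
console.log(routes); // { embed: 'remote', rerank: 'local', expand: 'local' }
```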

Backward compatibility

No breaking changes. All new behaviour is opt-in via config; local-only setups are unaffected; getDefaultLlamaCpp/setDefaultLlamaCpp retained.

Testing

Build (node scripts/build.mjs) and tsc --noEmit are clean. Under the project's CI test mode (CI=true, matching upstream CI), this branch shows no new failures vs main — the deterministic failing set is identical on both (the only failures are pre-existing WSL/Linux Git-Bash-path and skill-bundling tests, unrelated to this change). The new test/remote-llm.test.ts suite passes in full: embed/rerank routing, per-endpoint circuit breakers + independence, sigmoid normalization, oversized-rerank split/truncate recovery, auth, timeouts, expandQuery parsing + fallback, and the HybridLLM rerank fallback — all driven by in-process HTTP servers, no mocks.

This configuration runs in production today: embedding + reranking against two llama.cpp server instances and query expansion against a cloud chat endpoint — which is where refinements A, B, C (and the need for E) came from.

Related work & credit

This PR aims to consolidate the OpenAI-compatible-backend effort into one production-ready change, building on and crediting the prior attempts:

Follow-ups (not in this PR)

  • Byte-based embed batching (maxBatchBytes + chunk strategies + --max-batch-mb / --chunk-strategy flags), as in #619. This PR keeps the existing count-based batching (maxBatchSize); byte-aware batching is a natural follow-on.

Kaspre and others added 3 commits June 2, 2026 11:20
…nsion

Add a RemoteLLM backend that talks to any OpenAI-compatible HTTP API (vLLM,
TEI, Ollama, llama.cpp --server, LiteLLM, OpenAI, ...), composed with the local
LlamaCpp via a HybridLLM that routes each operation independently:

- RemoteLLM: /v1/embeddings, /v1/rerank, /v1/chat/completions; per-endpoint
  circuit breakers; bearer auth; char-based token approximation so chunking
  works without a local tokenizer.
- HybridLLM: embed/embedBatch/rerank/expandQuery -> remote,
  generate/tokenize/detokenize -> local, with per-operation local fallback.
- Widened LLM interface (embedBatch, tokenize/detokenize/countTokens, isRemote,
  embedModelName, rerankModelName, generateModelName, usesRemoteEmbedding,
  supportsRerank/supportsExpand) plus getDefaultLLM/setDefaultLLM alongside the
  existing getDefaultLlamaCpp/setDefaultLlamaCpp (no breaking change).
- Sigmoid normalization of log-odds rerank scores; RemoteLLM.expandQuery via
  chat completions; startup pre-flight embed probe; HybridLLM.rerank local
  fallback (symmetric with the expandQuery fallback).

Opt-in via models.*_api_url / QMD_*_API_* env vars; with nothing configured the
local-only path is byte-for-byte unchanged.

Builds on @georgelichen's remote-LLM work in tobi#629.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ncating

When a remote /v1/rerank request is rejected as too large (HTTP 413 / "too
large to process" / context length), RemoteLLM.rerank recursively bisects the
batch and halve-truncates a single oversized document down to a 32-char floor,
remapping response indices to the originals and re-sorting by score.
Non-oversized errors still propagate so the circuit breaker / HybridLLM local
fallback can react.

Adapted from the rerank recovery in tobi#619 (@loopyd).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- remoteConfigFromEnv: throw on a half-configured remote backend (embed_api_url
  set without embed_api_model, or vice-versa) instead of silently falling back
  to the local backend and skipping the remote pre-flight probe.
- RemoteLLM.rerank: normalize scores once over the full (possibly
  recovery-split) result set, applying sigmoid only when logit-range values are
  present (any score < 0 or > 1). Rerankers that already return [0,1]
  probabilities (Cohere/Voyage-style) pass through unchanged, avoiding
  distortion of the blend and --min-score filtering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
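
The conditional normalization described in that commit message could look roughly like this. It is a sketch, not the PR's code: the function name is hypothetical, and the only rule taken from the commit message is "apply sigmoid when any score is below 0 or above 1".

```typescript
// Hypothetical sketch of the rule in the commit message above.
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function normalizeRerankScores(scores: number[]): number[] {
  // Any value outside [0, 1] suggests the server returned logits (log-odds).
  const looksLikeLogits = scores.some((s) => s < 0 || s > 1);
  // Decide once for the whole result set so all scores share one scale.
  return looksLikeLogits ? scores.map(sigmoid) : scores.slice();
}

console.log(normalizeRerankScores([0.2, 0.9])); // [ 0.2, 0.9 ] (already probabilities)
console.log(normalizeRerankScores([-4, 0, 6])); // roughly [ 0.018, 0.5, 0.998 ]
```

Deciding once over the full set matters after a recovery split: normalizing each sub-batch separately could leave some scores as logits and others as probabilities, which would make the final sort meaningless.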

Kaspre commented Jun 3, 2026

Author

Follow-up hardening has been pushed in db5a32e.

Latest changes:

  • Preserve the model identity that actually produced rerank scores when writing rerank cache entries, so local fallback scores are not cached under a remote reranker key.
  • Reject SDK store.embed({ model }) overrides when remote embedding is configured and the requested model differs from the configured remote embed model.
  • Thread the per-store LLM through SDK searchVector() query embedding so remote SDK stores do not accidentally fall back to the global/default local LLM.
  • Added focused regressions for rerank fallback cache identity, remote embed model override handling, and SDK searchVector() per-store LLM routing.

Validation run:

  • CI=true node ./node_modules/vitest/vitest.mjs run --reporter=verbose --testTimeout 60000 test/store.test.ts test/sdk.test.ts passed: 289 passed, 19 skipped.
  • test/remote-llm.test.ts passed outside the sandbox: 51 passed. The sandboxed run hit the known local listen EPERM restriction while starting its in-process HTTP server.
  • pnpm build passed.
  • node ./node_modules/typescript/bin/tsc -p tsconfig.build.json --noEmit passed.
  • git diff --check passed.
  • A multi-lane read-only review pass was run; it found the searchVector() per-store LLM issue above, which is now fixed, and the follow-up review came back clean.

Maintainers: this should be ready for review again.

p0pfan commented Jun 6, 2026

I really hope this PR gets merged soon. I truly need this feature to use qmd on machines without a GPU.

udo76 commented Jun 6, 2026

Running this on a Mac Mini M4 gateway with a dedicated GPU server (AMD Strix Halo, 128GB) hosting Qwen3-Embedding-4B and Qwen3-Reranker-4B on OpenAI-compatible endpoints (llama.cpp server).
Currently stuck with a local 4B GGUF as a workaround: it works, but it ties up the M4's resources.

The key use case for us: serving embedding and reranking to multiple machines on the network that are too weak to run meaningful models locally - thin clients, ARM64 SBCs, VPS nodes. A single GPU box providing high-quality retrieval for the whole fleet.
PR #705's HybridLLM per-operation fallback design matches this perfectly.

Happy to test any of the PRs against our endpoints.

rghamilton3 added a commit to rghamilton3/lattice that referenced this pull request Jun 8, 2026
…#97)

* feat(spine): add remote LLM support via QMD PR #705 (OpenAI-compatible embedding)

Vendor a pre-built @tobilu/qmd 2.5.3 from Kaspre/qmd#feat/remote-llm-openai-compatible
(tobi/qmd#705), which adds RemoteLLM/HybridLLM backends for routing embedding,
reranking, and query expansion to any OpenAI-compatible API endpoint.

- spine/vendor/tobilu-qmd-2.5.3.tgz: built dist from the PR branch (dist/ is
  gitignored upstream so a github: dep would install a broken package)
- spine/package.json: switch from ^2.5.1 registry dep to file:./vendor tarball
- spine/.gitignore: allow vendor/*.tgz so the tarball is tracked by git
- spine/src/config.ts: add QmdModelsConfig + getQmdModelsConfig() to read
  [spine.qmd] from config.toml (embed_api_url/model, rerank, expand settings)
- spine/src/search.ts: pass models: getQmdModelsConfig() to createStore so
  remote embedding activates when configured; undefined preserves local-only path
- config.toml.example: document the new [spine.qmd] section (all commented out)

Remote embedding is fully opt-in: omitting [spine.qmd] leaves behaviour
unchanged. Env vars QMD_EMBED_API_URL / QMD_EMBED_API_MODEL also work and
take precedence over config.toml.

* docs(spine): address PR review findings on remote QMD config

- Add comment to QmdModelsConfig clarifying it's an intentional structural
  subset of QMD's internal ModelsConfig (not in public exports)
- Add comment to getQmdModelsConfig() confirming env vars take precedence
  via QMD's remoteConfigFromEnv(), and that QMD itself throws on partial
  embed config (url without model, or vice versa)
- Add vendor removal note in package.json for when @tobilu/qmd >=2.6.0
  lands on npm

* fix(spine): remove invalid //vendor comment from package.json

bun install treats //vendor as a real package name and tries to resolve
it from npm, breaking CI. Remove the comment entry; the intent is
captured in git history.

tobi commented Jun 8, 2026

Owner

Thank you (and all the others who worked on PRs related to this)

@georgelichen

This is a strong PR overall. I really like the direction here: the RemoteLLM/HybridLLM split is thoughtful, the production hardening is substantial, and a lot of the follow-up work from #629 is integrated in a pragmatic way. Thanks for pushing this forward.

I went through the diff pretty carefully and found three things that I think are worth fixing before merge:

  1. I think the publish workflow will fail to authenticate to npm as written.
    In .github/workflows/publish.yml, npm publish --provenance --access public no longer passes NODE_AUTH_TOKEN. actions/setup-node writes the registry config, but it still expects the token to be present in the environment. Without that, the release job looks correct but will fail at publish time.

  2. I think the new launcher mis-detects installed packages on Windows.
    In bin/qmd, the installed-package heuristic uses pkgDir.split("/") to look for a node_modules segment. On Windows paths that never matches, so an installed package can be mistaken for a source checkout and incorrectly routed into source mode again. That seems to reintroduce the same class of Windows install/runtime issues this launcher rewrite is trying to fix.

  3. I think --index path normalization is still incomplete on Windows.
    In src/cli/qmd.ts and src/collections.ts, the “treat this like a path and normalize it into an index name” branch only runs when the argument contains /. On Windows, --index C:\tmp\demo skips that normalization branch, which can send the DB path and config path to unexpected locations. I reproduced that locally with Node path resolution.

None of these feel far from fixable, and I still think the overall PR is heading in a very good direction. I just wanted to flag these before merge so they do not turn into release-time or Windows-specific regressions.
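
For points 2 and 3, a separator-agnostic check might look like the sketch below. Both helper names are hypothetical and the real launcher and CLI code may be structured differently; the idea is only to treat `\` as a separator alongside `/`.

```typescript
// Hypothetical helpers; the actual bin/qmd and src/cli/qmd.ts logic may differ.

/** True when any path segment is "node_modules", for either separator style. */
function isInsideNodeModules(pkgDir: string): boolean {
  return pkgDir.split(/[\\/]/).includes("node_modules");
}

/** True when an --index argument should be treated as a path, not a bare name. */
function looksLikePath(indexArg: string): boolean {
  return indexArg.includes("/") || indexArg.includes("\\");
}

console.log(isInsideNodeModules("C:\\Users\\me\\node_modules\\@tobilu\\qmd")); // true
console.log(isInsideNodeModules("/usr/lib/node_modules/@tobilu/qmd")); // true
console.log(looksLikePath("C:\\tmp\\demo")); // true
console.log(looksLikePath("demo")); // false
```

One caveat: on POSIX systems a backslash is a legal filename character, so it may be safer to apply the backslash branch only when `process.platform === "win32"`.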

5 participants