feat: OpenAI-compatible remote embedding, reranking, and query expansion #1

Closed
rghamilton3 wants to merge 6 commits into main from remote-llm-pr629

@rghamilton3

Owner

Summary

Cherry-picks the 5 commits from georgelichen's PR #629 (branch georgelichen:merge-pr-517-remote-llm) onto current main, then applies Kaspre's three documented fixes from the tobi#629 comment thread.

What's new

Remote LLM backends (RemoteLLM, HybridLLM)

  • RemoteLLM — HTTP client implementing the LLM interface against any OpenAI-compatible API:
    • /v1/embeddings for embedding (with batch splitting, dimension validation, index-sorting)
    • /v1/rerank for reranking
    • /v1/chat/completions for query expansion (lex:/vec:/hyde: line format)
    • Circuit breaker pattern (closed → open → half-open) per endpoint
  • HybridLLM — Routes embed/embedBatch/rerank/expandQuery to remote, generate/tokenize/detokenize to local LlamaCpp
  • remoteConfigFromEnv() — Reads QMD_EMBED_API_URL, QMD_EMBED_API_KEY, QMD_EMBED_API_MODEL, QMD_RERANK_API_*, QMD_EXPAND_API_* (env vars override YAML models: block)
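
The per-endpoint circuit breaker mentioned above could look roughly like the following. This is a minimal sketch, not the code in src/remote-llm.ts: the class name, method names, failure threshold, cooldown, and injectable clock are all assumptions made for illustration.

```typescript
// Hypothetical three-state circuit breaker (closed -> open -> half-open).
// All names and default values are assumptions, not taken from the PR.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if a request may be sent right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow a trial request
    }
    return this.state !== "open";
  }

  // Any success (including a half-open trial) closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed half-open trial, or too many failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// Wraps one request with the breaker's bookkeeping.
async function guarded<T>(breaker: CircuitBreaker, request: () => Promise<T>): Promise<T> {
  if (!breaker.canRequest()) throw new Error("circuit open: request skipped");
  try {
    const result = await request();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

Keeping one breaker instance per endpoint (embeddings, rerank, chat completions) means a failing reranker would not block embedding requests.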

Configuration (YAML models: block or env vars)

models:
  embed_api_url: http://vllm-host:8000/v1
  embed_api_model: text-embedding-3-small
  embed_api_key: sk-...          # optional
  rerank_api_url: http://vllm-host:8001/v1
  rerank_api_model: bge-reranker-v2-m3
  expand_api_url: http://vllm-host:8002/v1
  expand_api_model: Qwen3-4B
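
To illustrate the precedence rule (env vars override the YAML models: block), here is a small sketch for the embedding fields only. The function name and the choice to treat empty env values as unset are assumptions; the PR's remoteConfigFromEnv() may behave differently.

```typescript
// Field names mirror the YAML keys above; the function is a hypothetical
// stand-in, not the PR's remoteConfigFromEnv().
interface EmbedApiConfig {
  embed_api_url?: string;
  embed_api_model?: string;
  embed_api_key?: string;
}

type Env = Record<string, string | undefined>;

function resolveEmbedConfig(yaml: EmbedApiConfig, env: Env): EmbedApiConfig {
  // Assumption: an empty env value counts as unset and falls back to YAML.
  const pick = (envValue: string | undefined, yamlValue: string | undefined) =>
    envValue !== undefined && envValue !== "" ? envValue : yamlValue;

  return {
    embed_api_url: pick(env["QMD_EMBED_API_URL"], yaml.embed_api_url),
    embed_api_model: pick(env["QMD_EMBED_API_MODEL"], yaml.embed_api_model),
    embed_api_key: pick(env["QMD_EMBED_API_KEY"], yaml.embed_api_key),
  };
}
```

In a CLI this would typically be called with the parsed YAML block and process.env, so a one-off `QMD_EMBED_API_MODEL=...` on the command line wins over the checked-in config.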

LLM interface additions

  • embedModelName, generateModelName?, rerankModelName? readonly properties
  • usesRemoteEmbedding? — skips local tokenizer preprocessing when true
  • embedBatch() method
  • tokenize()/detokenize() on interface (stubs throw for remote-only backends)
  • getDefaultLLM() / setDefaultLLM() generalized singleton (getDefaultLlamaCpp() deprecated but kept)
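
As a rough picture of how these additions fit together, the sketch below declares a subset of the interface and a remote-only backend whose tokenizer stubs throw. Member names come from the list above; every signature, the class name, and the helper function are assumptions for illustration.

```typescript
// Subset of the LLM interface described above. Member names come from the PR
// description; every signature here is an assumption.
interface LLMSketch {
  readonly embedModelName: string;
  readonly generateModelName?: string;
  readonly rerankModelName?: string;
  readonly usesRemoteEmbedding?: boolean;
  embedBatch(texts: string[]): Promise<number[][]>;
  tokenize(text: string): Promise<number[]>;
  detokenize(tokens: number[]): Promise<string>;
}

// Hypothetical remote-only backend: embedding is delegated to a supplied
// function, while tokenizer methods throw because no local model is loaded.
class RemoteOnlySketch implements LLMSketch {
  readonly usesRemoteEmbedding = true;
  readonly embedModelName: string;
  private readonly embedFn: (texts: string[]) => Promise<number[][]>;

  constructor(embedModelName: string, embedFn: (texts: string[]) => Promise<number[][]>) {
    this.embedModelName = embedModelName;
    this.embedFn = embedFn;
  }

  embedBatch(texts: string[]): Promise<number[][]> {
    return this.embedFn(texts);
  }

  tokenize(_text: string): Promise<number[]> {
    throw new Error("tokenize() is not available on a remote-only backend");
  }

  detokenize(_tokens: number[]): Promise<string> {
    throw new Error("detokenize() is not available on a remote-only backend");
  }
}

// Callers can branch on the flag instead of probing for a tokenizer.
function needsLocalTokenizer(llm: LLMSketch): boolean {
  return llm.usesRemoteEmbedding !== true;
}
```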

Remote-aware chunking

When usesRemoteEmbedding is true, chunkDocumentByApproxTokens() sizes chunks with a character heuristic (~3 chars/token) instead of the local tokenizer, so qmd embed does not have to initialize node-llama-cpp just to tokenize input.
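
A character-budget chunker along these lines might look like the sketch below. It is not the PR's chunkDocumentByApproxTokens(): the function name, signature, and whitespace-preferring break rule are assumptions, and the real implementation may handle overlap or smarter break points.

```typescript
// Hypothetical chunker: converts a token budget into a character budget
// using a fixed chars-per-token ratio, then slices the text.
function approxTokenChunks(text: string, maxTokens: number, charsPerToken = 3): string[] {
  if (maxTokens <= 0 || charsPerToken <= 0) {
    throw new RangeError("maxTokens and charsPerToken must be positive");
  }
  const maxChars = Math.max(1, Math.floor(maxTokens * charsPerToken));
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Prefer to break just after the last space inside the window.
      const lastSpace = text.lastIndexOf(" ", end - 1);
      if (lastSpace > start) end = lastSpace + 1;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```

Because the chunks are plain slices, concatenating them reproduces the input, and no chunk exceeds maxTokens * charsPerToken characters.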

Kaspre's fixes (PR tobi#629 comments)

Fix A — Sigmoid normalization in RemoteLLM.rerank

llama.cpp's /v1/rerank emits log-odds (~−10 to +10), not probabilities. Without normalization every score goes negative and --min-score 0.3 fires "No results found" for every query.

const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));
score: sigmoid(r.relevance_score)   // normalises log-odds → [0, 1]

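For rerankers that already emit 0–1 values, ranking order is preserved because sigmoid is monotonic, but this is not a strict no-op: scores are compressed into roughly 0.50–0.73, so a threshold such as --min-score 0.3 no longer filters anything for those backends.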
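
A self-contained version of the normalization step, for reference. The relevance_score field is taken from the snippet above; the index field, the type names, and the sort step are assumptions about the response shape rather than confirmed details of RemoteLLM.rerank.

```typescript
// Assumed response item shape: `relevance_score` appears in the PR snippet,
// `index` is an assumption for illustration.
interface RawRerankResult {
  index: number;
  relevance_score: number;
}

interface NormalizedResult {
  index: number;
  score: number;
}

// Logistic function: maps any real-valued log-odds into the open interval (0, 1).
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function normalizeRerankScores(results: RawRerankResult[]): NormalizedResult[] {
  return results
    .map((r) => ({ index: r.index, score: sigmoid(r.relevance_score) }))
    .sort((a, b) => b.score - a.score); // highest relevance first
}

const ranked = normalizeRerankScores([
  { index: 0, relevance_score: -4.2 },
  { index: 1, relevance_score: 0 },
  { index: 2, relevance_score: 3.1 },
]);
console.log(ranked.map((r) => r.index)); // [2, 1, 0]
```

A log-odds score of 0 maps to exactly 0.5, so a --min-score threshold below 0.5 keeps every document the reranker considers at least even odds.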

Fix B — RemoteLLM.expandQuery (already in the cherry-picked commits)

POSTs to <expand_api_url>/chat/completions with a system prompt that elicits lex:/vec:/hyde: line-prefixed output. Parsing and fallback mirror the shape of LocalLLM.expandQuery.
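
Parsing the line-prefixed output could be sketched as follows. The type names, the function name, and the fallback (returning the original query as a single vec entry) are assumptions; the PR's parseExpandResponse() also filters terms against the original query, which is omitted here.

```typescript
// Hypothetical parser for lex:/vec:/hyde: line-prefixed model output.
type ExpansionType = "lex" | "vec" | "hyde";

interface Expansion {
  type: ExpansionType;
  text: string;
}

function parseExpansionLines(output: string, originalQuery: string): Expansion[] {
  const parsed: Expansion[] = [];
  for (const line of output.split(/\r?\n/)) {
    // Accept only lines that start with one of the three known prefixes.
    const match = /^\s*(lex|vec|hyde):(.*)$/.exec(line);
    if (!match) continue;
    const text = (match[2] ?? "").trim();
    if (text === "") continue; // prefix with no content
    parsed.push({ type: match[1] as ExpansionType, text });
  }
  // Assumed fallback: if nothing usable was parsed, search with the original query.
  return parsed.length > 0 ? parsed : [{ type: "vec", text: originalQuery }];
}
```

Lines without a recognized prefix (chatty preambles, markdown headings) are dropped rather than treated as errors, which keeps a partially well-formed response usable.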

Fix C — Pre-flight probe in vectorIndex()

When usesRemoteEmbedding === true, embeds a single token before starting the batch loop. Config mistakes (bad URL, wrong model name, auth failure) surface immediately:

✗ Remote embedding probe failed: connect ECONNREFUSED 127.0.0.1:8000
  Verify QMD_EMBED_API_URL and QMD_EMBED_API_MODEL, then retry.
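
The probe itself can be small. The sketch below assumes an embed function that resolves to a vector or null; that signature, both function names, and the "empty embedding" case are assumptions, while the message text follows the sample output above.

```typescript
// Assumed shape of the embedding call used for the probe.
type EmbedFn = (text: string) => Promise<number[] | null>;

// Builds the two-line error message shown above from any thrown value.
function formatProbeFailure(err: unknown): string {
  const reason = err instanceof Error ? err.message : String(err);
  return (
    `✗ Remote embedding probe failed: ${reason}\n` +
    "  Verify QMD_EMBED_API_URL and QMD_EMBED_API_MODEL, then retry."
  );
}

// Embeds one short input before the batch loop starts.
// Resolves to null on success, or to a printable error message on failure.
async function probeRemoteEmbedding(embed: EmbedFn): Promise<string | null> {
  try {
    const vector = await embed("probe");
    if (!vector || vector.length === 0) {
      return formatProbeFailure(new Error("empty embedding returned"));
    }
    return null;
  } catch (err) {
    return formatProbeFailure(err);
  }
}
```

A caller would run the probe once, print the message and exit if it is non-null, and only then begin the batch loop.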

Other fixes in the Kaspre commit

  • Added generateModelName? and rerankModelName? to the LLM interface (used by store.ts optional-chaining for default model fallbacks — was a TS compile error)
  • Added c.red to the terminal colour map (used by Fix C's error output)
  • Doctor device-probe guard: check instanceof LlamaCpp before calling getDeviceInfo() — emits "unavailable" for non-local backends
  • Updated rerank unit test to send realistic log-odds mock scores and assert sigmoid-normalised output
  • Fixed store rerank-dedup spy from getDefaultLlamaCpp to getDefaultLLM
  • Fixed doubled async () => syntax error in one store test header

Tests

  • 52 / 52 test/remote-llm.test.ts (unit tests with mock HTTP server)
  • 219 / 219 test/store.test.ts
  • test/remote-llm-integration.test.ts skipped unless VLLM_EMBED_URL/VLLM_EMBED_MODEL env vars are set

Files changed

| File | Change |
| --- | --- |
| src/remote-llm.ts | New: RemoteLLM, HybridLLM, helpers |
| src/hybrid-llm.ts | New: routing shim |
| src/llm.ts | Extended LLM interface; getDefaultLLM/setDefaultLLM; remote format helpers |
| src/store.ts | getLlm() returns LLM; remote chunking path; chunkDocumentByApproxTokens export |
| src/collections.ts | ModelsConfig remote API fields |
| src/cli/qmd.ts | getStore() wires HybridLLM; Fix C probe; doctor guard; colour map |
| test/remote-llm.test.ts | New unit tests |
| test/remote-llm-integration.test.ts | New live integration tests |
| test/store.test.ts | New remote-embedding test; spy/syntax fixes |
| CHANGELOG.md | Entry under [Unreleased] |

Jim Smith and others added 6 commits May 27, 2026 11:03
Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add intent? to LLM interface and ILLMSession expandQuery signature
  (store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
  getStore() so content_vectors.model reflects the actual LLM in use
  (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is
  configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that
  don't share a word with the original query, falls back gracefully on
  bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand,
  otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL /
  QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header
  fallback, lex/vec/hyde parsing, includeLexical=false filtering,
  fallback on bad output, query-term filtering, circuit breaker,
  HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned,
  includeLexical=false, intent incorporation, HybridLLM routing verified
  via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL
  env vars, skipped when absent)

Merge PR tobi#517 and keep it compatible with the current main branch.

Constraint: Upstream main diverged after PR tobi#517, so a fast-forward merge was not possible
Rejected: Cherry-pick the PR commits directly | would still require the same compatibility fixes and lose merge context
Confidence: medium
Scope-risk: moderate
Directive: Keep RemoteLLM and HybridLLM aligned with the LLM tokenize/detokenize interface and verify Windows CLI wrappers separately from Unix shell scripts
Tested: npx tsc -p tsconfig.build.json; npx vitest run --reporter=verbose test/remote-llm.test.ts test/remote-llm-integration.test.ts
Not-tested: full vitest suite; npm run build wrapper script on Windows; live GitHub Actions

When the active embedding backend is remote, generateEmbeddings now uses
character-space chunking instead of token-based preprocessing. This keeps
qmd embed from initializing node-llama-cpp solely to tokenize input before
calling a remote embedding API.

The change is scoped to indexing. Query-time expansion and reranking keep
their existing routing rules, and a regression test now fails if remote
embedding falls back to local tokenization during indexing.

Constraint: Remote embedding backends do not expose a tokenizer interface in QMD today
Rejected: Change HybridLLM tokenize() globally | would alter query-time behavior and broaden risk unnecessarily
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If remote token-aware chunking is added later, keep qmd embed free of mandatory local llama initialization
Tested: npx tsc -p tsconfig.build.json
Tested: npx vitest run test/store.test.ts -t "generateEmbeddings" --reporter=verbose
Tested: npx vitest run test/store.test.ts -t "Token chunking guardrails" --reporter=verbose
Not-tested: Full end-to-end qmd embed against a live remote embedding service after this code change

Fix A (remote-llm.ts): Apply sigmoid normalization to reranker scores.
  llama.cpp /v1/rerank emits log-odds (~-10 to +10), not probabilities.
  sigmoid(x) = 1/(1+e^-x) normalises to (0,1) and preserves the ranking
  order of rerankers that already output probabilities, although it
  compresses their 0-1 scores into roughly 0.50-0.73.

Fix B: Already present in commit f2fd64e (remote query expansion via
  OpenAI-compatible chat/completions with lex:/vec:/hyde: format).

Fix C (cli/qmd.ts): Pre-flight probe in vectorIndex() startup.
  When usesRemoteEmbedding is true, embed a single token before the batch
  loop so config mistakes (bad URL, wrong model, auth failure) surface
  immediately with a clear error message rather than mid-run.

Rebase fix 1 (cli/qmd.ts): Use getDefaultLLM().embedModelName for the
  model variable at the embed status line (line ~1938).

Rebase fix 2 (cli/qmd.ts): Doctor device-probe guard — check instanceof
  LlamaCpp before calling getDeviceInfo; emit unavailable message for
  non-local LLM backends.

Additional build fixes:
  - llm.ts: Add generateModelName? and rerankModelName? to LLM interface
    (store.ts uses these via optional chaining for default model names).
  - cli/qmd.ts: Add 'red' to the terminal colour map (used by Fix C error
    message).

Test fixes:
  - test/remote-llm.test.ts: Update rerank test to send log-odds mock
    scores and assert sigmoid-normalised output (toBeCloseTo).
  - test/store.test.ts: Fix rerank-dedup spy from getDefaultLlamaCpp to
    getDefaultLLM (store.ts now routes through getDefaultLLM).
  - test/store.test.ts: Fix doubled async callback syntax on one test
    header (parse error from earlier conflict resolution).
@Kaspre

Kaspre commented Jun 2, 2026


Hi @rghamilton3 — opened tobi#705 consolidating the tobi#629 line onto current main with the documented fixes (sigmoid / expandQuery / pre-flight probe) plus a HybridLLM.rerank local fallback and oversized-rerank recovery, targeting upstream. Since your PR did the same tobi#629 + fixes rebase, flagging it in case you'd like to converge there.

@rghamilton3

Owner Author

Hi @Kaspre Thanks for the heads up. Since this was shamelessly stolen from your work anyways I'll follow you on this :)

rghamilton3 closed this Jun 2, 2026
2 participants