feat(Add OpenAI-compatible backend support) by loopyd · Pull Request #619 · tobi/qmd

loopyd · 2026-05-02T02:43:47Z

Problem

QMD currently assumes a local llama.cpp-style setup for generation, embeddings, and reranking, and that assumption is baked in. This blocks user customization of llama.cpp (for example, using TheTom's turboquant fork) and integrates poorly with existing homelab servers, because QMD tries to run everything locally on the machine.

That makes it harder to use QMD with a local OpenAI-compatible server such as llama-swap, which is what I tested this PR with. Even when the server already exposed the same models through /v1/chat/completions, /v1/embeddings, and /v1/rerank, QMD could not use them. Now it can!

Solution

This PR adds an OpenAI-compatible backend alongside the existing llama.cpp one, letting QMD talk to a local compatible server instead of requiring direct local model access. That means you can now run QMD on your laptop while your homelab does the tensor crunching!

It also makes the CLI and store paths respect configured model aliases, so users can route QMD generation, embedding, and reranking through named server-side models (e.g. qmd-generate, qmd-embed, and qmd-rerank, which is how I set up my server for testing this PR).

What's Changed?

Added config support for:
- llm.provider
- llm.baseUrl
- llm.apiKey
Updated QMD to select the OpenAI-compatible backend when it is configured.
Routed query expansion, embedding, and reranking through the configured remote model aliases.
Updated vector search and rerank paths to use the configured embed and rerank model names instead of hardcoded defaults.
Updated qmd embed so it uses the configured embedding alias instead of forcing the built-in default model name.
Updated help and status output to show the active configured models and backend more clearly.
Added rerank handling that recovers from oversized rerank requests by splitting batches and truncating oversized single documents when needed.
Added and updated tests for the new backend behavior.
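
The rerank recovery mentioned above (split oversized batches, then truncate a single document that is still too large) can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `rerankWithRecovery`, `RerankFn`, `TooLargeError`, and `minChars` are hypothetical names, and the real implementation may detect an oversized request differently (for example from an HTTP status code).

```typescript
// Hypothetical error thrown when the server rejects a rerank request as too large.
class TooLargeError extends Error {}

// Hypothetical signature: returns one score per document, in input order.
type RerankFn = (query: string, docs: string[]) => Promise<number[]>;

async function rerankWithRecovery(
  rerank: RerankFn,
  query: string,
  docs: string[],
  minChars = 64, // assumption: stop truncating below this length
): Promise<number[]> {
  if (docs.length === 0) return [];
  try {
    return await rerank(query, docs);
  } catch (err) {
    if (!(err instanceof TooLargeError)) throw err;
    if (docs.length > 1) {
      // Split the batch in half and retry each half; input order is preserved.
      const mid = Math.ceil(docs.length / 2);
      const left = await rerankWithRecovery(rerank, query, docs.slice(0, mid), minChars);
      const right = await rerankWithRecovery(rerank, query, docs.slice(mid), minChars);
      return [...left, ...right];
    }
    // A single document is still too large: halve it and retry.
    const doc = docs[0];
    if (doc.length <= minChars) throw err; // give up instead of looping forever
    // Note: slice() counts UTF-16 code units, so a production version
    // should avoid cutting a surrogate pair in half.
    const shorter = doc.slice(0, Math.floor(doc.length / 2));
    return rerankWithRecovery(rerank, query, [shorter], minChars);
  }
}
```

Halving keeps the number of retries logarithmic in the batch size and document length, at the cost of scoring a truncated document on its prefix only.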

Testing

Automated

Run the focused LLM test for the new rerank recovery path:

npx vitest run test/llm.test.ts -t "recovers from oversized rerank requests by splitting and truncating"

Run the existing adjacent OpenAI-compatible rerank mapping test:

npx vitest run test/llm.test.ts -t "rerank maps remote indices back to source files"

Manual

You can test this with llama-swap or any server that exposes OpenAI-compatible chat, embedding, and rerank endpoints.

Option A: Use llama-swap

Start or prepare a local OpenAI-compatible server.
Expose three model IDs, for example:
- qmd-generate
- qmd-embed
- qmd-rerank
Make sure the server exposes these routes:
- POST /v1/chat/completions
- POST /v1/embeddings
- POST /v1/rerank
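
Before pointing QMD at the server, it can help to probe the three routes directly. The sketch below is an illustration, not part of QMD: `buildProbeRequests`, `joinUrl`, and `probe` are hypothetical helpers. The chat and embeddings bodies follow the OpenAI API; `/v1/rerank` is not an OpenAI route, so the `query`/`documents` body is an assumption based on the shape llama.cpp's server accepts, and the `Authorization: Bearer` header is likewise assumed.

```typescript
type Route = "/v1/chat/completions" | "/v1/embeddings" | "/v1/rerank";

interface ProbeRequest {
  route: Route;
  body: Record<string, unknown>;
}

// Build one tiny request per route, using the model IDs exposed by the server.
function buildProbeRequests(models: { generate: string; embed: string; rerank: string }): ProbeRequest[] {
  return [
    {
      route: "/v1/chat/completions",
      body: { model: models.generate, messages: [{ role: "user", content: "ping" }], max_tokens: 1 },
    },
    { route: "/v1/embeddings", body: { model: models.embed, input: ["ping"] } },
    // Assumed rerank shape: a query plus candidate documents.
    { route: "/v1/rerank", body: { model: models.rerank, query: "ping", documents: ["ping", "pong"] } },
  ];
}

// Accept a base URL with or without a trailing /v1 and avoid doubling the prefix.
function joinUrl(baseUrl: string, route: Route): string {
  return baseUrl.replace(/\/+$/, "").replace(/\/v1$/, "") + route;
}

// Send one probe and return the HTTP status (uses the global fetch in Node 18+).
async function probe(baseUrl: string, apiKey: string, req: ProbeRequest): Promise<number> {
  const res = await fetch(joinUrl(baseUrl, req.route), {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(req.body),
  });
  return res.status;
}
```

A non-200 status from one of the probes suggests that the matching route or model ID is missing on the server side.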

Option B: Roll your own compatible server

Use any local server that follows the same OpenAI-compatible route layout.
Configure one model for generation, one for embeddings, and one for reranking.
Point QMD at that server with the config below.

QMD config example

Create or update your QMD config:

models:
  generate: qmd-generate
  embed: qmd-embed
  rerank: qmd-rerank

llm:
  provider: openai-compatible
  baseUrl: http://127.0.0.1:8080/v1
  apiKey: your-local-api-key
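
How a config like this might translate into a backend choice is sketched below. This is an assumption-laden illustration rather than QMD's actual code: `QmdConfig`, `Backend`, and `resolveBackend` are hypothetical names, and requiring all three model aliases is a choice made for the sketch (the PR may fall back to defaults instead).

```typescript
// Hypothetical shape mirroring the keys in the config example above.
interface QmdConfig {
  models?: { generate?: string; embed?: string; rerank?: string };
  llm?: { provider?: string; baseUrl?: string; apiKey?: string };
}

type Backend =
  | { kind: "local" }
  | {
      kind: "openai-compatible";
      baseUrl: string;
      apiKey?: string;
      models: { generate: string; embed: string; rerank: string };
    };

function resolveBackend(config: QmdConfig): Backend {
  const llm = config.llm;
  // Without an explicit provider, keep the existing local llama.cpp behavior.
  if (!llm || llm.provider !== "openai-compatible") return { kind: "local" };
  if (!llm.baseUrl) {
    throw new Error("llm.baseUrl is required when llm.provider is openai-compatible");
  }
  const { generate, embed, rerank } = config.models ?? {};
  // Assumption of this sketch: all three aliases must be set explicitly.
  if (!generate || !embed || !rerank) {
    throw new Error("models.generate, models.embed, and models.rerank must all be set");
  }
  return {
    kind: "openai-compatible",
    baseUrl: llm.baseUrl,
    apiKey: llm.apiKey,
    models: { generate, embed, rerank },
  };
}
```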

Verify the flow

Check help and status:

qmd --index my-index --help
qmd --index my-index status

Build embeddings:

qmd --index my-index embed

Run a query:

qmd --index my-index query "How do I unpack EMI archives?" -n 3 --json

Confirm the server receives requests for:
- chat completions
- embeddings
- rerank
If your reranker has tighter request limits, verify the query still succeeds and that rerank requests continue after the first oversized split when needed.

droiter · 2026-05-29T15:22:40Z

adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
Model: embeddinggemma-300m-qat-Q8_0

Embedding error: Error: OpenAI-compatible request failed (500 Internal Server Error): {"error":{"code":500,"message":"[json.exception.parse_error.101] parse error at line 1, column 53965: syntax error while parsing value - invalid string: surrogate U+DC00..U+DFFF must follow U+D800..U+DBFF; last read: '"title: Session: 2026-04-10 08:00:34 UTC | text: \udcca'","type":"server_error"}}
at OpenAICompatibleLLM.requestJson (file:///home/adfasd/qmd/dist/llm.js:1111:19)
at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
at async OpenAICompatibleLLM.embedBatch (file:///home/adfasd/qmd/dist/llm.js:1141:29)
at async LLMSession.withOperation (file:///home/adfasd/qmd/dist/llm.js:1449:20)
at async withLLMSessionForLlm.maxDuration (file:///home/adfasd/qmd/dist/store.js:1109:40)
at async withLLMSessionForLlm (file:///home/adfasd/qmd/dist/llm.js:1512:16)
at async generateEmbeddings (file:///home/adfasd/qmd/dist/store.js:1037:20)
at async vectorIndex (file:///home/adfasd/qmd/dist/cli/qmd.js:1487:20)
at async file:///home/adfasd/qmd/dist/cli/qmd.js:2736:17
█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% 0/1523 32 err 7.7 KB/s ETA 9m 2s ⚠ Error rate too high (1523/32) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1523 1523 err 346.6 KB/s ETA 3s ⚠ Error rate too high (1587/1523) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1587 1587 err 346.9 KB/s ETA 3s ⚠ Error rate too high (1651/1587) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1651 1651 err 347.1 KB/s ETA 3s ⚠ Error rate too high (2075/1651) — aborting embedding
█████████████████████████████░ 96% 0/2075 2075 err 432.5 KB/s ETA 0s ⚠ Error rate too high (2172/2075) — aborting embedding
██████████████████████████████ 100%

✓ Done! Embedded 0 chunks from 290 documents in 9s
⚠ 2172 chunks failed
By contrast, the local embedding model works properly:

adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
Model: hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf

QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.
███████░░░░░░░░░░░░░░░░░░░░░░░ 23% 544/1780 3.6 KB/s ETA 15m 2s
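
The server-side parse error in the first run complains about `\udcca`, a low surrogate with no preceding high surrogate. One plausible cause, not confirmed in this thread, is that chunking or truncation cut a surrogate pair in half; `JSON.stringify` then escapes the lone code unit, and the server's JSON parser rejects it. A minimal sketch of a client-side guard, assuming it would run on each chunk before the request body is built (`replaceLoneSurrogates` is a hypothetical helper, not a function in QMD):

```typescript
// Matches a high surrogate not followed by a low one,
// or a low surrogate not preceded by a high one.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Replace each unpaired surrogate with U+FFFD so the text serializes to JSON
// that strict parsers accept. Valid surrogate pairs are left untouched.
function replaceLoneSurrogates(text: string): string {
  return text.replace(LONE_SURROGATE, "\uFFFD");
}
```

On runtimes that support it (Node 20 and later), the built-in `String.prototype.toWellFormed()` performs the same replacement.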

Kaspre · 2026-06-02T15:22:42Z

Hi @loopyd — opened #705 consolidating the OpenAI-compatible-backend effort. It builds on #629's per-operation HybridLLM design (independent embed/rerank/expand endpoints + local fallback), which generalizes your single-base-URL provider switch — the single-llama-swap-server case becomes all three *_api_url pointing at one server. I adapted your oversized-rerank split/truncate recovery (credited in the PR). Thanks for the llama-swap testing and that recovery idea.

feat: support openai-compatible llama-swap backends

d6c66e9

loopyd changed the title ~~Add OpenAI-compatible llama-swap backend support~~ Add OpenAI-compatible backend support May 2, 2026

loopyd mentioned this pull request May 2, 2026

Support OpenAI-compatible backends for generation, embeddings, and reranking #620

Open

loopyd changed the title ~~Add OpenAI-compatible backend support~~ feat(Add OpenAI-compatible backend support) May 2, 2026

Shrub24 mentioned this pull request May 29, 2026

Feat/OpenAI embeddings #689

Closed

Kaspre mentioned this pull request Jun 2, 2026

feat(llm): OpenAI-compatible remote embedding, reranking & query expansion (RemoteLLM + HybridLLM) #705

Open

loopyd wants to merge 1 commit into tobi:main from loopyd:feat/openai-compatible-llamaswap

3 participants
