feat(Add OpenAI-compatible backend support)#619

Open
loopyd wants to merge 1 commit into tobi:main from loopyd:feat/openai-compatible-llamaswap

loopyd commented May 2, 2026


Problem

QMD currently hardcodes a local llama.cpp-style setup for generation, embeddings, and reranking. That rules out user customization of llama.cpp (for example, using TheTom's turboquant fork) and integrates poorly with existing homelab servers, because QMD tries to run everything locally on the machine.

That makes it harder to use QMD with a local OpenAI-compatible server such as llama-swap, which is what I tested this PR with. Even when the server already exposes the same models through /v1/chat/completions, /v1/embeddings, and /v1/rerank, QMD has no way to use them. With this PR, it does.

Solution

This PR adds an OpenAI-compatible backend alongside the existing llama.cpp one, letting QMD talk to a local compatible server instead of requiring direct local model access. You can now run QMD on your laptop while your homelab does the tensor crunching!

It also makes the CLI and store paths respect configured model aliases, so users can route QMD generation, embedding, and reranking through named server-side models (e.g. qmd-generate, qmd-embed, and qmd-rerank, the names I set up on my server to test this PR).

What's Changed?

  • Added config support for:
    • llm.provider
    • llm.baseUrl
    • llm.apiKey
  • Updated QMD to select the OpenAI-compatible backend when it is configured.
  • Routed query expansion, embedding, and reranking through the configured remote model aliases.
  • Updated vector search and rerank paths to use the configured embed and rerank model names instead of hardcoded defaults.
  • Updated qmd embed so it uses the configured embedding alias instead of forcing the built-in default model name.
  • Updated help and status output to show the active configured models and backend more clearly.
  • Added rerank handling that recovers from oversized rerank requests by splitting batches and truncating oversized single documents when needed.
  • Added and updated tests for the new backend behavior.
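
The rerank recovery described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `rerankWithRecovery` name, the `RerankFn` signature, the use of HTTP 413 to detect an oversized request, and the character-based truncation budget are all assumptions made for the example.

```typescript
// Hypothetical signature: returns one score per input document, in input order.
type RerankFn = (query: string, docs: string[]) => Promise<number[]>;

// Assumption: an oversized request surfaces as an error carrying status 413.
function isTooLarge(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 413
  );
}

async function rerankWithRecovery(
  rerank: RerankFn,
  query: string,
  docs: string[],
  maxChars: number, // assumed per-document character budget
): Promise<number[]> {
  if (docs.length === 0) return [];
  try {
    return await rerank(query, docs);
  } catch (err) {
    if (!isTooLarge(err)) throw err;
    if (docs.length === 1) {
      const doc = docs[0] ?? "";
      // A single document that already fits the budget cannot shrink further.
      if (doc.length <= maxChars) throw err;
      return rerankWithRecovery(rerank, query, [doc.slice(0, maxChars)], maxChars);
    }
    // Split the batch in half and retry each half, preserving input order.
    const mid = Math.ceil(docs.length / 2);
    const left = await rerankWithRecovery(rerank, query, docs.slice(0, mid), maxChars);
    const right = await rerankWithRecovery(rerank, query, docs.slice(mid), maxChars);
    return left.concat(right);
  }
}
```

Scores from separately reranked halves are only comparable if the reranker returns absolute relevance scores rather than scores normalized within a batch, which is worth checking for the model in use.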

Testing

Automated

  1. Run the focused LLM test for the new rerank recovery path:
npx vitest run test/llm.test.ts -t "recovers from oversized rerank requests by splitting and truncating"
  2. Run the existing adjacent OpenAI-compatible rerank mapping test:
npx vitest run test/llm.test.ts -t "rerank maps remote indices back to source files"
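
For context on what the second test exercises, here is a rough sketch of mapping remote rerank indices back to source files. It assumes the server answers with a `results` array of `{ index, relevance_score }` entries, which is the shape llama.cpp-style rerank endpoints commonly use; the `Candidate` type and the `mapRerankResults` name are hypothetical and not taken from the QMD source.

```typescript
// Assumed response entry shape; verify against your server.
interface RerankResult {
  index: number; // position of the document in the request's `documents` array
  relevance_score: number;
}

// Hypothetical candidate record: the file a chunk came from plus its text.
interface Candidate {
  file: string;
  text: string;
}

function mapRerankResults(
  candidates: Candidate[],
  results: RerankResult[],
): { file: string; score: number }[] {
  const mapped: { file: string; score: number }[] = [];
  for (const r of results) {
    const candidate = candidates[r.index];
    // Skip indices that do not correspond to a document we sent.
    if (candidate === undefined) continue;
    mapped.push({ file: candidate.file, score: r.relevance_score });
  }
  // Highest score first.
  mapped.sort((a, b) => b.score - a.score);
  return mapped;
}
```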

Manual

You can test this with llama-swap or any server that exposes OpenAI-compatible chat, embedding, and rerank endpoints.

Option A: Use llama-swap

  1. Start or prepare a local OpenAI-compatible server.
  2. Expose three model IDs, for example:
    • qmd-generate
    • qmd-embed
    • qmd-rerank
  3. Make sure the server exposes these routes:
    • POST /v1/chat/completions
    • POST /v1/embeddings
    • POST /v1/rerank

Option B: Roll your own compatible server

  1. Use any local server that follows the same OpenAI-compatible route layout.
  2. Configure one model for generation, one for embeddings, and one for reranking.
  3. Point QMD at that server with the config below.

QMD config example

Create or update your QMD config:

models:
  generate: qmd-generate
  embed: qmd-embed
  rerank: qmd-rerank

llm:
  provider: openai-compatible
  baseUrl: http://127.0.0.1:8080/v1
  apiKey: your-local-api-key
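
To make the mapping from this config to HTTP requests concrete, here is a small sketch of how a client might build requests for the three routes. The interfaces and function names are hypothetical, and the rerank payload (`query` plus `documents`) follows the convention used by llama.cpp-style servers, so treat it as an assumption rather than QMD's actual wire format.

```typescript
// Hypothetical shapes mirroring the config keys shown above.
interface LlmConfig {
  provider: string;
  baseUrl: string;
  apiKey?: string;
}
interface ModelAliases {
  generate: string;
  embed: string;
  rerank: string;
}
interface PreparedRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildRequest(llm: LlmConfig, path: string, payload: unknown): PreparedRequest {
  // baseUrl already ends in /v1, so paths are relative to it.
  const base = llm.baseUrl.replace(/\/+$/, "");
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (llm.apiKey) headers["Authorization"] = "Bearer " + llm.apiKey;
  return { url: base + path, headers, body: JSON.stringify(payload) };
}

function chatRequest(llm: LlmConfig, models: ModelAliases, prompt: string): PreparedRequest {
  return buildRequest(llm, "/chat/completions", {
    model: models.generate,
    messages: [{ role: "user", content: prompt }],
  });
}

function embedRequest(llm: LlmConfig, models: ModelAliases, input: string[]): PreparedRequest {
  return buildRequest(llm, "/embeddings", { model: models.embed, input });
}

function rerankRequest(
  llm: LlmConfig,
  models: ModelAliases,
  query: string,
  documents: string[],
): PreparedRequest {
  return buildRequest(llm, "/rerank", { model: models.rerank, query, documents });
}
```

Each `PreparedRequest` can be passed to `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })`.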

Verify the flow

  1. Check help and status:
qmd --index my-index --help
qmd --index my-index status
  2. Build embeddings:
qmd --index my-index embed
  3. Run a query:
qmd --index my-index query "How do I unpack EMI archives?" -n 3 --json
  4. Confirm the server receives requests for:
    • chat completions
    • embeddings
    • rerank
  5. If your reranker has tighter request limits, verify that the query still succeeds and that rerank requests continue after the first oversized request is split.

@loopyd loopyd changed the title Add OpenAI-compatible llama-swap backend support Add OpenAI-compatible backend support May 2, 2026
@loopyd loopyd changed the title Add OpenAI-compatible backend support feat(Add OpenAI-compatible backend support) May 2, 2026
droiter commented May 29, 2026


adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
Model: embeddinggemma-300m-qat-Q8_0

Embedding error: Error: OpenAI-compatible request failed (500 Internal Server Error): {"error":{"code":500,"message":"[json.exception.parse_error.101] parse error at line 1, column 53965: syntax error while parsing value - invalid string: surrogate U+DC00..U+DFFF must follow U+D800..U+DBFF; last read: '"title: Session: 2026-04-10 08:00:34 UTC | text: \udcca'","type":"server_error"}}
at OpenAICompatibleLLM.requestJson (file:///home/adfasd/qmd/dist/llm.js:1111:19)
at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
at async OpenAICompatibleLLM.embedBatch (file:///home/adfasd/qmd/dist/llm.js:1141:29)
at async LLMSession.withOperation (file:///home/adfasd/qmd/dist/llm.js:1449:20)
at async withLLMSessionForLlm.maxDuration (file:///home/adfasd/qmd/dist/store.js:1109:40)
at async withLLMSessionForLlm (file:///home/adfasd/qmd/dist/llm.js:1512:16)
at async generateEmbeddings (file:///home/adfasd/qmd/dist/store.js:1037:20)
at async vectorIndex (file:///home/adfasd/qmd/dist/cli/qmd.js:1487:20)
at async file:///home/adfasd/qmd/dist/cli/qmd.js:2736:17
█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% 0/1523 32 err 7.7 KB/s ETA 9m 2s ⚠ Error rate too high (1523/32) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1523 1523 err 346.6 KB/s ETA 3s ⚠ Error rate too high (1587/1523) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1587 1587 err 346.9 KB/s ETA 3s ⚠ Error rate too high (1651/1587) — aborting embedding
███████████████████████░░░░░░░ 77% 0/1651 1651 err 347.1 KB/s ETA 3s ⚠ Error rate too high (2075/1651) — aborting embedding
█████████████████████████████░ 96% 0/2075 2075 err 432.5 KB/s ETA 0s ⚠ Error rate too high (2172/2075) — aborting embedding
██████████████████████████████ 100%

✓ Done! Embedded 0 chunks from 290 documents in 9s
⚠ 2172 chunks failed
The local embedding model, by contrast, works properly:

adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
Model: hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf

QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.
███████░░░░░░░░░░░░░░░░░░░░░░░ 23% 544/1780 3.6 KB/s ETA 15m 2s
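
The 500 error in the first log above is the server's JSON parser rejecting a lone low surrogate (`\udcca`) in the request body. A string can end up like this when text is cut between the two halves of a surrogate pair; for instance, U+1F4CA is encoded as `\ud83d\udcca`, so a chunk boundary falling inside it would leave `\udcca` on its own. Whether that is what happened here is a guess. One possible client-side mitigation, sketched below with a hypothetical helper name, is to replace unpaired surrogates with U+FFFD before serializing the request.

```typescript
// Replace unpaired UTF-16 surrogates with U+FFFD so that the JSON body
// contains only well-formed Unicode. Hypothetical helper, not part of QMD.
function replaceLoneSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    if (isHigh) {
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.slice(i, i + 2); // valid pair: keep both code units
        i++;
      } else {
        out += "\ufffd"; // high surrogate with no low surrogate after it
      }
    } else if (isLow) {
      out += "\ufffd"; // low surrogate with no high surrogate before it
    } else {
      out += input.charAt(i);
    }
  }
  return out;
}
```

On runtimes that support it, the built-in `String.prototype.toWellFormed()` performs the same replacement.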

Kaspre commented Jun 2, 2026


Hi @loopyd — opened #705 consolidating the OpenAI-compatible-backend effort. It builds on #629's per-operation HybridLLM design (independent embed/rerank/expand endpoints + local fallback), which generalizes your single-base-URL provider switch — the single-llama-swap-server case becomes all three *_api_url pointing at one server. I adapted your oversized-rerank split/truncate recovery (credited in the PR). Thanks for the llama-swap testing and that recovery idea.
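
As a rough illustration of the per-operation idea in the comment above, the sketch below resolves each operation to either a remote endpoint or the local backend. The key names `embed_api_url`, `rerank_api_url`, and `expand_api_url` are inferred from the `*_api_url` pattern mentioned in the comment, and the function is hypothetical; the actual design in #705 may differ.

```typescript
type Operation = "embed" | "rerank" | "expand";

// Assumed config keys, one optional URL per operation.
interface EndpointConfig {
  embed_api_url?: string;
  rerank_api_url?: string;
  expand_api_url?: string;
}

type Backend = { kind: "remote"; url: string } | { kind: "local" };

function resolveBackend(cfg: EndpointConfig, op: Operation): Backend {
  const urls: Record<Operation, string | undefined> = {
    embed: cfg.embed_api_url,
    rerank: cfg.rerank_api_url,
    expand: cfg.expand_api_url,
  };
  const url = urls[op];
  // No URL configured for this operation: fall back to the local backend.
  return url ? { kind: "remote", url } : { kind: "local" };
}
```

Under this sketch, the single llama-swap server case amounts to setting all three URLs to the same base address.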
