feat(Add OpenAI-compatible backend support) #619
adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
Embedding error: Error: OpenAI-compatible request failed (500 Internal Server Error): {"error":{"code":500,"message":"[json.exception.parse_error.101] parse error at line 1, column 53965: syntax error while parsing value - invalid string: surrogate U+DC00..U+DFFF must follow U+D800..U+DBFF; last read: '"title: Session: 2026-04-10 08:00:34 UTC | text: \udcca'","type":"server_error"}}
✓ Done! Embedded 0 chunks from 290 documents in 9s
adfasd@xLow:~/.config/qmd$ qmd embed -c xyoutuber
QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.
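The error text above suggests the server's JSON parser rejected an unpaired low surrogate (`\udcca`) in the chunk text. One possible client-side mitigation, sketched here as an assumption rather than as what QMD or this PR does, is to replace unpaired surrogates with U+FFFD before the request body is serialized. The function name `replaceLoneSurrogates` is hypothetical.

```typescript
// Hypothetical helper (not part of QMD): replace unpaired UTF-16 surrogates
// with U+FFFD so strict JSON parsers accept the serialized text.
function replaceLoneSurrogates(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: keep it only when a low surrogate follows.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += text.slice(i, i + 2);
        i++;
      } else {
        out += "\ufffd";
      }
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      out += "\ufffd";
    } else {
      out += text.charAt(i);
    }
  }
  return out;
}
```

Valid surrogate pairs (for example emoji) pass through unchanged; only the unpaired halves are replaced.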
Hi @loopyd, opened #705 consolidating the OpenAI-compatible-backend effort. It builds on #629's per-operation
Problem
QMD currently assumes a local `llama.cpp`-style setup for generation, embeddings, and reranking, and that assumption is baked in. This prevents users from customizing llama.cpp (for example, using TheTom's turboquant fork) and does not integrate well with existing homelab servers, because QMD tries to run everything locally on the machine.

That makes it harder to use QMD with a local OpenAI-compatible server setup such as `llama-swap`, which is what I tested this PR with. Even when the server already exposes the same models through `/v1/chat/completions`, `/v1/embeddings`, and `/v1/rerank`, QMD has no way to use them. Now it can!

Solution
This PR adds an OpenAI-compatible backend alongside the existing llama.cpp one, so QMD can talk to a local compatible server instead of requiring direct local model access. That means you can run QMD on your laptop while your homelab does the tensor crunching!
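As a rough illustration of what talking to such a server involves, here is a minimal sketch of an embeddings call. Only the key names `provider`, `baseUrl`, and `apiKey` come from this PR; the `LlmConfig` type, the function names, the assumption that `baseUrl` excludes the `/v1` prefix, and the bearer-token header are illustrative and not taken from QMD's code. The request and response shapes follow the common OpenAI-style `/v1/embeddings` convention.

```typescript
// Hypothetical config shape: key names are from the PR, everything else is assumed.
interface LlmConfig {
  provider: string;
  baseUrl: string; // assumed to exclude the /v1 prefix
  apiKey?: string;
}

interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Build the URL and fetch options for POST /v1/embeddings.
function buildEmbeddingsRequest(cfg: LlmConfig, model: string, input: string[]) {
  const url = cfg.baseUrl.replace(/\/+$/, "") + "/v1/embeddings";
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (cfg.apiKey) headers["Authorization"] = `Bearer ${cfg.apiKey}`;
  return { url, init: { method: "POST", headers, body: JSON.stringify({ model, input }) } };
}

// Return embeddings in input order, using the per-item index in the response.
function parseEmbeddings(res: EmbeddingsResponse): number[][] {
  return [...res.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

// Send the request and surface non-2xx responses as errors.
async function embed(cfg: LlmConfig, model: string, input: string[]): Promise<number[][]> {
  const { url, init } = buildEmbeddingsRequest(cfg, model, input);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`request failed (${res.status} ${res.statusText}): ${await res.text()}`);
  }
  return parseEmbeddings((await res.json()) as EmbeddingsResponse);
}
```

Keeping request construction and response parsing separate from the network call makes both easy to test without a running server.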
It also makes the CLI and store paths respect configured model aliases, so users can route QMD generation, embedding, and reranking through named server-side models (for example `qmd-generate`, `qmd-embed`, and `qmd-rerank`, which is how I set up my server to test this PR).

What's Changed?
- `llm.provider`
- `llm.baseUrl`
- `llm.apiKey`
- `qmd embed` now uses the configured embedding alias instead of forcing the built-in default model name.

Testing
Automated
- `npx vitest run test/llm.test.ts -t "recovers from oversized rerank requests by splitting and truncating"`
- `npx vitest run test/llm.test.ts -t "rerank maps remote indices back to source files"`

Manual
You can test this with `llama-swap` or any server that exposes OpenAI-compatible chat, embedding, and rerank endpoints.

Option A: Use llama-swap
Model aliases:
- `qmd-generate`
- `qmd-embed`
- `qmd-rerank`

Endpoints:
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `POST /v1/rerank`

Option B: Roll your own compatible server
QMD config example
Create or update your QMD config:
Verify the flow
`qmd --index my-index query "How do I unpack EMI archives?" -n 3 --json`

Confirm the server receives requests for:
If your reranker has tighter request limits, verify the query still succeeds and that rerank requests continue after the first oversized split when needed.
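The behaviour to look for can be pictured with the following sketch, which is not the PR's implementation. It assumes the caller supplies a `rerank` function that returns batch-relative indices and that an oversized request surfaces as an error with `status === 413`; the names `rerankWithSplit`, `isTooLarge`, and `maxChars` are all hypothetical.

```typescript
interface RerankHit { index: number; score: number }
type RerankFn = (query: string, documents: string[]) => Promise<RerankHit[]>;

// Assumption: an oversized request is reported as an error carrying status 413.
function isTooLarge(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 413;
}

// Rerank docs; on an oversized request, split the batch in half and retry.
// A single document that is still too large is truncated to maxChars once.
async function rerankWithSplit(
  rerank: RerankFn,
  query: string,
  docs: string[],
  offset = 0,
  maxChars = 2000,
): Promise<RerankHit[]> {
  if (docs.length === 0) return [];
  try {
    const hits = await rerank(query, docs);
    // Remote indices are relative to this batch; shift them back to
    // positions in the caller's original document list.
    return hits.map((h) => ({ index: h.index + offset, score: h.score }));
  } catch (err) {
    if (!isTooLarge(err)) throw err;
    if (docs.length === 1) {
      const only = docs[0] ?? "";
      if (only.length <= maxChars) throw err; // already truncated, give up
      return rerankWithSplit(rerank, query, [only.slice(0, maxChars)], offset, maxChars);
    }
    const mid = Math.ceil(docs.length / 2);
    const left = await rerankWithSplit(rerank, query, docs.slice(0, mid), offset, maxChars);
    const right = await rerankWithSplit(rerank, query, docs.slice(mid), offset + mid, maxChars);
    // Merge both halves, best score first.
    return [...left, ...right].sort((a, b) => b.score - a.score);
  }
}
```

Because each half carries its own `offset`, the indices in the merged result always refer to the original list, which is what lets results be mapped back to source files after a split.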