fix(mlx): route vision-language models to the mlx-vlm backend #10274

Merged
mudler merged 1 commit into master from fix/mlx-vlm-routing
Jun 12, 2026
@localai-bot (Collaborator)

What

Fixes #10269.

Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the image-text-to-text pipeline tag on HuggingFace. The mlx importer hardcoded backend: "mlx" for every mlx-community/* model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The chat template was never applied and the model produced degenerate, looping output that echoed the prompt:

exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works exactly: MLX inside Local

The same checkpoint served through mlx_vlm.server replies correctly, confirming the weights/runtime are fine and the bug is in prompt/template handling on the wrong code path.

Change

  • core/gallery/importers/mlx.go: detect the image-text-to-text pipeline tag and route those models to the mlx-vlm backend, which applies the processor-aware chat template. An explicit backend: preference still wins.
  • backend/python/mlx/backend.py: defensive backstop — warn loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.
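
The decision order described in the two bullets above can be sketched as follows. This is only an illustration, not the merged code: the real importer is written in Go (core/gallery/importers/mlx.go), and the function names `choose_backend` and `warn_if_no_chat_template`, the logger name, and the assumption that the loaded tokenizer exposes a `chat_template` attribute are all hypothetical choices made for this sketch.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("mlx-routing-sketch")

# Pipeline tag that vision-language checkpoints declare on HuggingFace.
VLM_PIPELINE_TAG = "image-text-to-text"


def choose_backend(pipeline_tag: Optional[str],
                   preferred_backend: Optional[str] = None) -> str:
    """Pick a backend name for an mlx-community model (illustrative only)."""
    # An explicit backend preference always wins, even for a VLM.
    if preferred_backend:
        return preferred_backend
    # Vision-language models go to mlx-vlm, which applies the
    # processor-aware chat template.
    if pipeline_tag == VLM_PIPELINE_TAG:
        return "mlx-vlm"
    # Everything else stays on the text-only mlx backend.
    return "mlx"


def warn_if_no_chat_template(tokenizer: object) -> bool:
    """Log a warning and return True when no chat template is present.

    Assumes the tokenizer object exposes a `chat_template` attribute;
    a missing or empty attribute is treated as "no template".
    """
    if not getattr(tokenizer, "chat_template", None):
        logger.warning(
            "loaded model has no chat template; if this is a "
            "vision-language model it may need the mlx-vlm backend"
        )
        return True
    return False
```

Under these assumptions, `choose_backend("image-text-to-text")` returns `"mlx-vlm"`, while `choose_backend("image-text-to-text", "mlx")` returns `"mlx"` because the explicit preference is checked first. Checking the preference before the tag keeps the auto-routing from overriding a user who deliberately pins a backend.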

Tests

  • New specs in core/gallery/importers/mlx_test.go cover: VLM auto-routes to mlx-vlm, text-only models stay on mlx, and an explicit backend: mlx preference is honored even for a VLM. Written test-first (red → green); full importer suite (308 specs) passes; lint clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.

Fixes #10269

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler merged commit a7a7bd6 into master Jun 12, 2026
64 checks passed
@mudler deleted the fix/mlx-vlm-routing branch June 12, 2026 21:12
Development

Successfully merging this pull request may close these issues.

mlx backend: degenerate looping output with gemma-4 E4B (chat template apparently not applied)
