fix(mlx): route vision-language models to the mlx-vlm backend by localai-bot · Pull Request #10274 · mudler/LocalAI

localai-bot · 2026-06-12T20:11:17Z

What

Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the image-text-to-text pipeline tag on HuggingFace. The mlx importer hardcoded backend: "mlx" for every mlx-community/* model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The chat template was never applied and the model produced degenerate, looping output that echoed the prompt:

exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works exactly: MLX inside Local

The same checkpoint served through mlx_vlm.server replies correctly, confirming the weights/runtime are fine and the bug is in prompt/template handling on the wrong code path.

Change

core/gallery/importers/mlx.go: detect the image-text-to-text pipeline tag and route those models to the mlx-vlm backend, which applies the processor-aware chat template. An explicit backend: preference still wins.
backend/python/mlx/backend.py: defensive backstop — warn loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.
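
As a rough illustration of the two changes above, the decision logic can be sketched in Python. This is not the code from the PR: the importer itself is written in Go, and the names `choose_backend`, `warn_if_no_chat_template`, and `VLM_PIPELINE_TAG` are hypothetical. The sketch also assumes the loaded tokenizer exposes a `chat_template` attribute, as Hugging Face tokenizers do.

```python
import logging
from typing import Optional

log = logging.getLogger("mlx-backend-sketch")

# Pipeline tag that vision-language checkpoints declare on HuggingFace.
VLM_PIPELINE_TAG = "image-text-to-text"


def choose_backend(pipeline_tag: Optional[str], preferred: Optional[str] = None) -> str:
    """Pick a backend name for an mlx-community checkpoint (hypothetical helper)."""
    # An explicit backend preference always wins, even for a VLM.
    if preferred:
        return preferred
    # Vision-language checkpoints go to mlx-vlm, which applies the
    # processor-aware chat template.
    if pipeline_tag == VLM_PIPELINE_TAG:
        return "mlx-vlm"
    # Everything else stays on the text-only mlx backend.
    return "mlx"


def warn_if_no_chat_template(tokenizer: object, model_name: str) -> bool:
    """Log a warning and return True when no chat template is available.

    Assumption: the tokenizer carries a `chat_template` attribute; a missing
    or empty value means prompts would reach the model untemplated.
    """
    if getattr(tokenizer, "chat_template", None):
        return False
    log.warning(
        "model %s has no chat template; output may loop or echo the prompt "
        "(a vision-language model may have been routed to the text-only backend)",
        model_name,
    )
    return True
```

Under these assumptions, `choose_backend("image-text-to-text")` selects `mlx-vlm`, while `choose_backend("image-text-to-text", preferred="mlx")` keeps `mlx`, mirroring the rule that an explicit preference still wins.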

Tests

New specs in core/gallery/importers/mlx_test.go cover: VLM auto-routes to mlx-vlm, text-only models stay on mlx, and an explicit backend: mlx preference is honored even for a VLM. Written test-first (red → green); full importer suite (308 specs) passes; lint clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx importer hardcoded backend "mlx" for every mlx-community model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The template was never applied and the model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those models to mlx-vlm, which applies the processor-aware chat template. An explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.

Fixes #10269

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler merged commit a7a7bd6 into master Jun 12, 2026
64 checks passed

mudler deleted the fix/mlx-vlm-routing branch June 12, 2026 21:12

localai-bot mentioned this pull request Jun 12, 2026

mlx backend: degenerate looping output with gemma-4 E4B (chat template apparently not applied) #10269

Closed

BrewTestBot mentioned this pull request Jun 13, 2026

localai 4.4.3 Homebrew/homebrew-core#287865

Merged

mudler merged 1 commit into master from fix/mlx-vlm-routing
