fix(mlx): route vision-language models to the mlx-vlm backend #10274
Merged
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx importer hardcoded backend "mlx" for every mlx-community model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The template was never applied and the model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those models to mlx-vlm, which applies the processor-aware chat template. An explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.

Fixes #10269
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
What
Fixes #10269.
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the image-text-to-text pipeline tag on HuggingFace. The mlx importer hardcoded backend: "mlx" for every mlx-community/* model, so these VLMs were served by the text-only mlx-lm backend, whose tokenizer does not carry the processor chat template. The chat template was never applied, and the model produced degenerate, looping output that echoed the prompt.
The same checkpoint served through mlx_vlm.server replies correctly, confirming that the weights and runtime are fine and that the bug is in prompt/template handling on the wrong code path.
Change
core/gallery/importers/mlx.go: detect the image-text-to-text pipeline tag and route those models to the mlx-vlm backend, which applies the processor-aware chat template. An explicit backend: preference still wins.
backend/python/mlx/backend.py: as a defensive backstop, warn loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.
Tests
core/gallery/importers/mlx_test.go covers three cases: a VLM auto-routes to mlx-vlm, text-only models stay on mlx, and an explicit backend: mlx preference is honored even for a VLM. Written test-first (red → green); full importer suite (308 specs) passes; lint clean.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
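
For readers skimming the description, the routing rule and the three tested cases can be sketched roughly as follows. This is a minimal, standalone illustration, not the code in core/gallery/importers/mlx.go: the modelInfo type, the pickBackend function, and the text-only model name are hypothetical, and the real importer's types and signatures may differ.

```go
package main

import "fmt"

// modelInfo is a hypothetical stand-in for the HuggingFace metadata
// that the importer inspects; the real importer uses its own types.
type modelInfo struct {
	ID          string
	PipelineTag string
}

// vlmPipelineTag is the pipeline tag that vision-language checkpoints declare.
const vlmPipelineTag = "image-text-to-text"

// pickBackend (hypothetical name) applies the rule from the description:
// an explicit preference wins, VLMs go to mlx-vlm, everything else stays on mlx.
func pickBackend(m modelInfo, preferred string) string {
	if preferred != "" {
		return preferred
	}
	if m.PipelineTag == vlmPipelineTag {
		return "mlx-vlm"
	}
	return "mlx"
}

func main() {
	vlm := modelInfo{ID: "mlx-community/gemma-4-E4B-it-qat-4bit", PipelineTag: "image-text-to-text"}
	// Hypothetical text-only model, used only for illustration.
	text := modelInfo{ID: "mlx-community/example-text-model", PipelineTag: "text-generation"}

	fmt.Println(pickBackend(vlm, ""))    // VLM auto-routes: mlx-vlm
	fmt.Println(pickBackend(text, ""))   // text-only stays: mlx
	fmt.Println(pickBackend(vlm, "mlx")) // explicit preference honored: mlx
}
```

Checking the explicit preference first keeps the auto-detection strictly a default, so a user who deliberately pins a backend is never overridden by the pipeline tag.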