mlx backend: degenerate looping output with gemma-4 E4B (chat template apparently not applied) #10269

Description

@enneo-service

LocalAI version: 4.4.1 (Homebrew bottle)
Environment: macOS (Apple Silicon, M1 Max), darwin/arm64
Backend: mlx (darwin image installs fine)

What happened

Serving mlx-community/gemma-4-E4B-it-qat-4bit through the mlx backend produces degenerate, looping output that echoes the prompt. The chat template does not appear to be applied.

Model config:

name: gemma-qat-mlx
backend: mlx
parameters:
  model: mlx-community/gemma-4-E4B-it-qat-4bit

Request:

curl http://127.0.0.1:1240/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"gemma-qat-mlx","messages":[{"role":"user","content":"Reply with exactly: MLX inside LocalAI works"}],"max_tokens":30}'

Response content:

 exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works exactly: MLX inside Local

The response is a fragment of the prompt, echoed repeatedly until max_tokens is reached. This happens on every request.
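
For anyone who wants to turn this into a regression check, here is a minimal sketch that flags this kind of output. It assumes the failure always looks like the sample above: one unit repeated back to back, with the last copy cut off by max_tokens. The helper name `repeated_unit` and the `min_len` threshold are hypothetical and not part of LocalAI or mlx.

```python
from typing import Optional


def repeated_unit(text: str, min_len: int = 8) -> Optional[str]:
    """Return the shortest prefix (at least min_len characters) whose
    back-to-back repetition reproduces text, allowing a truncated final
    copy. Return None if no such prefix exists."""
    s = text.strip()
    # The unit must fit at least twice, so it can be at most half of s.
    for size in range(min_len, len(s) // 2 + 1):
        unit = s[:size]
        # Build enough copies to cover s, then compare the covering prefix.
        copies = unit * (len(s) // size + 1)
        if copies[:len(s)] == s:
            return unit
    return None


# Prompt and response content copied from this report.
prompt = "Reply with exactly: MLX inside LocalAI works"
response = (" exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works"
            " exactly: MLX inside Local")

unit = repeated_unit(response)
# The output is degenerate if it loops AND the looped unit comes from the prompt.
is_prompt_echo = unit is not None and unit.strip() in prompt
print(repr(unit), is_prompt_echo)
```

A correct reply such as "MLX inside LocalAI works" contains no repeated unit, so the helper returns None for it.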

Expected

The same checkpoint served by mlx_vlm.server (mlx-vlm 0.6.2) on the same machine replies correctly ("MLX inside LocalAI works"), so the weights and the MLX runtime are fine. The difference points at prompt/chat-template handling in the LocalAI mlx backend. gemma-4 E4B is a vision-language architecture, so the backend may be applying a plain-LM template or none at all.
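
To illustrate the suspected difference, here is a rough sketch of what the model might receive with and without a chat template. It assumes the turn markers used by earlier Gemma instruction-tuned releases (`<start_of_turn>`, `<end_of_turn>`). The actual template for this checkpoint ships with its tokenizer configuration and may differ, so treat `render_gemma_style` as a hypothetical stand-in, not as what either server does.

```python
from typing import Dict, List


def render_gemma_style(messages: List[Dict[str, str]]) -> str:
    """Render chat messages using ASSUMED Gemma-style turn markers.

    This only illustrates the idea; the real template should be read
    from the checkpoint's tokenizer configuration.
    """
    parts = ["<bos>"]
    for message in messages:
        # Earlier Gemma templates name the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Generation prompt: tells the model that it is now its turn to answer.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user",
             "content": "Reply with exactly: MLX inside LocalAI works"}]

# What an untemplated backend would plausibly send: the bare user text.
raw_prompt = messages[-1]["content"]
# What a templated backend would send under the assumption above.
templated_prompt = render_gemma_style(messages)

print(repr(raw_prompt))
print(repr(templated_prompt))
```

Without turn markers the model sees plain text to continue rather than a user turn to answer, which would be consistent with the echo above. Logging the final prompt string in the backend and comparing it with the output of the tokenizer's own `apply_chat_template` would confirm or rule this out.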

Secondary observation

Cold load through the backend took ~82 s for this model; mlx_vlm.server loads the same checkpoint from the same HF cache in ~10 s. Worth a look once the output issue is addressed.

Happy to provide debug logs or test patches on this machine.
