LocalAI version: 4.4.1 (Homebrew bottle)
Environment: macOS (Apple Silicon, M1 Max), darwin/arm64
Backend: mlx (darwin image installs fine)
What happened
Serving mlx-community/gemma-4-E4B-it-qat-4bit through the mlx backend produces degenerate, looping output that echoes the prompt; the chat template does not appear to be applied.
Model config:
name: gemma-qat-mlx
backend: mlx
parameters:
  model: mlx-community/gemma-4-E4B-it-qat-4bit
Request:
curl http://127.0.0.1:1240/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"gemma-qat-mlx","messages":[{"role":"user","content":"Reply with exactly: MLX inside LocalAI works"}],"max_tokens":30}'
Response content:
exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works exactly: MLX inside Local
The prompt fragment is echoed over and over until max_tokens is reached, on every request.
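For anyone triaging similar reports, the symptom above can be checked mechanically. This is only a rough sketch, not part of LocalAI or mlx: the helper names `smallest_period` and `looks_like_prompt_echo` and the `min_repeats` threshold are made up for illustration. It treats a response as a prompt echo when it is one unit repeated at least twice and that unit appears verbatim in the prompt.

```python
def smallest_period(text: str) -> int:
    """Return the smallest p such that text[i] == text[i + p] for every valid i."""
    for p in range(1, len(text) + 1):
        if all(text[i] == text[i + p] for i in range(len(text) - p)):
            return p
    return len(text)  # only reached for the empty string


def looks_like_prompt_echo(prompt: str, response: str, min_repeats: float = 2.0) -> bool:
    """Heuristic: the response is a repeated unit, and that unit is a substring of the prompt.

    The last repetition may be cut off by max_tokens, so the repeat count is fractional.
    """
    response = response.strip()
    if not response:
        return False
    period = smallest_period(response)
    repeats = len(response) / period
    unit = response[:period].strip()
    return repeats >= min_repeats and unit in prompt


# Strings copied from the request and response in this report.
prompt = "Reply with exactly: MLX inside LocalAI works"
bad = ("exactly: MLX inside LocalAI works exactly: MLX inside LocalAI works "
       "exactly: MLX inside Local")
print(looks_like_prompt_echo(prompt, bad))
print(looks_like_prompt_echo(prompt, "MLX inside LocalAI works"))
```

The check is deliberately strict (exact character-level repetition), so it will miss loops that drift slightly between repetitions.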
Expected
The same checkpoint served by mlx_vlm.server (mlx-vlm 0.6.2) on the same machine replies correctly ("MLX inside LocalAI works"), so the weights and the MLX runtime are fine. The difference points at prompt/chat-template handling in the LocalAI mlx backend: gemma-4 E4B is a vision-language architecture, so the backend may be applying a plain-LM template or none at all.
Secondary observation
Cold load through the backend took ~82 s for this model; mlx_vlm.server loads the same checkpoint from the same HF cache in ~10 s. Worth a look once the output issue is addressed.
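To make the load-time comparison repeatable, something like the following could be used. It is a sketch under assumptions: the helper names `build_payload`, `post_chat`, and `cold_vs_warm` are hypothetical, and it only uses the OpenAI-compatible endpoint already shown in the request above. The first request after a restart includes model load plus generation, so its duration is an upper bound on load time rather than an exact figure.

```python
import json
import time
import urllib.request


def build_payload(model: str, prompt: str, max_tokens: int = 30) -> dict:
    """Build the same chat-completions body as the curl request in this report."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> dict:
    """POST one chat completion and return the decoded JSON response."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def cold_vs_warm(send) -> tuple:
    """Call send() twice and return (first_duration, second_duration) in seconds."""
    durations = []
    for _ in range(2):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations[0], durations[1]
```

Right after restarting LocalAI, `cold_vs_warm(lambda: post_chat("http://127.0.0.1:1240", "gemma-qat-mlx", "Reply with exactly: MLX inside LocalAI works"))` would return the cold and warm request times; the difference between the two approximates the load cost.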
Happy to provide debug logs or test patches on this machine.