metal : optimize MoE for large batches #13388


Merged · 1 commit merged into master on May 9, 2025
Conversation

ggerganov (Member)
Utilize #12850 to improve mat-mat MUL_MAT_ID performance:

  • Map src1 [n_embd, n_expert_used, n_tokens] -> hsrc1 [n_embd, n_tokens, n_expert]
  • Perform regular mat-mat multiplication src0 x hsrc1 with dynamic neh11(expert_id)
  • Unmap the result back to dst
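The three steps above can be sketched in plain Python. This is an illustrative model of the gather → per-expert dense multiply → scatter pattern, not the Metal implementation; the function and variable names (`mul_mat_id`, `experts`, `buckets`) are hypothetical, while `src1`, `hsrc1`, `dst`, and the dynamic per-expert row count `neh11(expert_id)` follow the description above.

```python
def matmul(A, B):
    # Dense mat-mat multiply: A is [m][k], B is [k][n] -> [m][n].
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def mul_mat_id(experts, src1, ids):
    # experts[e]: [n_embd][n_out] weight matrix for expert e (hypothetical layout)
    # src1:       [n_rows][n_embd] activation rows, one per (token, expert slot)
    # ids[i]:     expert index selected for row i
    n_rows = len(src1)

    # 1) Map: gather src1 rows into contiguous per-expert buckets (src1 -> hsrc1).
    buckets = {}
    for row, e in enumerate(ids):
        buckets.setdefault(e, []).append(row)

    dst = [None] * n_rows
    for e, rows in buckets.items():
        # 2) One regular mat-mat multiply per expert; the row count of
        #    hsrc1 varies per expert (the dynamic neh11(expert_id)).
        hsrc1 = [src1[r] for r in rows]
        out = matmul(hsrc1, experts[e])
        # 3) Unmap: scatter the results back to the original row order in dst.
        for r, o in zip(rows, out):
            dst[r] = o
    return dst
```

The payoff is that each expert performs one dense mat-mat multiply over all of its assigned rows at once, instead of many small mat-vec multiplies, which is what makes large batches fast.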
./scripts/compare-commits.sh master gg/metal-mm-id-opt -m models/qwen3-30b-a3b/ggml-model-f16.gguf -m models/qwen3-30b-a3b/ggml-model-q8_0.gguf -m models/qwen3-30b-a3b/ggml-model-q4_0-pure.gguf -m models/mixtral-8x7b-32k-fast/ggml-model-q4_0.gguf -m models/nomic-embed-text-v2-moe/ggml-model-f16.gguf -fa 1 -p 512 -n 0 -t 1
| Model | Test | t/s master | t/s gg/metal-mm-id-opt | Speedup |
| --- | --- | ---: | ---: | ---: |
| llama 8x7B Q4_0 | pp512 | 295.20 | 651.44 | 2.21 |
| nomic-bert-moe 475M F16 | pp512 | 13083.98 | 24008.05 | 1.83 |
| qwen3moe 30B.A3B F16 | pp512 | 344.46 | 1400.07 | 4.06 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 759.53 | 1359.49 | 1.79 |
| qwen3moe 30B.A3B Q8_0 | pp512 | 707.46 | 1350.71 | 1.91 |

@github-actions github-actions bot added the "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" (https://en.wikipedia.org/wiki/Metal_(API)) labels on May 8, 2025
@ggerganov ggerganov merged commit 611aa91 into master May 9, 2025
53 checks passed
@ggerganov ggerganov deleted the gg/metal-mm-id-opt branch May 9, 2025 12:15
LostRuins pushed a commit to LostRuins/koboldcpp that referenced this pull request May 9, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...