[Model][Quantization] Add GGUF support for MiniMax-M2.1#36965

Draft
JoursBleu wants to merge 1 commit into vllm-project:main from JoursBleu:feat/gguf-minimax-m2

Conversation


JoursBleu commented on Mar 13, 2026

Purpose

Add GGUF loading support for MiniMax-M2.1 (456B MoE, 45.9B active, 256 experts, 8 active per token).

This enables serving GGUF-quantized MiniMax-M2.1 checkpoints (e.g. unsloth/MiniMax-M2.1-GGUF) with vLLM.

vllm/model_executor/model_loader/gguf_loader.py

  • Add MoE expert weight mapping for MiniMax-M2 (fused ffn_gate_exps/ffn_down_exps/ffn_up_exps → per-expert w1/w2/w3), following the same pattern as DeepSeek2.
  • Map `exp_probs_b.bias` → `block_sparse_moe.e_score_correction_bias`.
  • Register sideload params regex (\.block_sparse_moe\.experts\.(gate_up_proj|down_proj)) so merged expert weights are excluded from the unmapped parameter check.
  • Add multi-shard GGUF support: _get_all_gguf_files() auto-discovers shard files from a single shard path (e.g. *-00001-of-00005.gguf → all 5 shards), with dynamic shard index padding width detection. Multi-shard iteration is used for weight loading, weight type mapping, and extra tensor name collection.

vllm/model_executor/model_loader/weight_utils.py

  • Add gguf_quant_weights_iterator_multi() for iterating over quantized weights across multiple GGUF shard files. Like the single-file version, it yields all weight types before weight data to avoid issues with packed layers that have different quant types.
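The "all weight types before weight data" ordering can be illustrated with a minimal sketch. The real iterator reads tensors via the `gguf` package's reader; here a `FakeTensor` stand-in replaces it so the two-pass structure is visible on its own. The function name mirrors the PR, but the body and signature are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a GGUF reader tensor (illustration only)."""
    name: str
    quant_type: str
    data: list


def gguf_quant_weights_iterator_multi(
    shards: Iterable[List[FakeTensor]],
    name_map: Dict[str, str],
) -> Iterator[Tuple[str, object]]:
    shards = list(shards)  # we traverse the shards twice
    # Pass 1: emit every weight's quant type across ALL shards first, so
    # packed layers with mixed quant types are fully described before any
    # weight data is consumed.
    for shard in shards:
        for t in shard:
            if t.name in name_map:
                hf_name = name_map[t.name]
                yield hf_name.replace("weight", "qweight_type"), t.quant_type
    # Pass 2: only then emit the weight data itself.
    for shard in shards:
        for t in shard:
            if t.name in name_map:
                yield name_map[t.name], t.data
```

The key property is the ordering guarantee: every `qweight_type` entry is yielded before any weight tensor, regardless of which shard it lives in.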

vllm/model_executor/layers/quantization/gguf.py

  • Add override_quantization_method() so --quantization gguf overrides HF config quantization (e.g. fp8).
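The override hook can be sketched as below. `override_quantization_method` is the hook named in the PR; the class body here is an assumed minimal version showing the intended precedence (an explicit `--quantization gguf` wins over the HF config's declared method), not the actual vLLM code.

```python
from typing import Any, Optional


class GGUFConfig:
    """Sketch of the relevant hook on a GGUF quantization config class."""

    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: Any, user_quant: Optional[str]
    ) -> Optional[str]:
        # If the user explicitly requested GGUF, take precedence over
        # whatever the HF config declares (e.g. fp8). Returning None means
        # "no override"; the HF-declared method is used as-is.
        if user_quant == "gguf":
            return "gguf"
        return None
```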

vllm/config/model.py

  • Add "gguf" to the quantization override whitelist.

vllm/model_executor/models/minimax_m2.py

  • Enable quantization for the embedding and LM head layers.

Requires a corresponding transformers PR: huggingface/transformers#44526

Recreated from #36444 which was accidentally closed due to a bad force-push.

Test Plan

python -m vllm.entrypoints.openai.api_server \
  --model MiniMax-M2.1-Q8_0-00001-of-00005.gguf \
  --tokenizer MiniMaxAI/MiniMax-M2.1 \
  --quantization gguf \
  --tensor-parallel-size 8 \
  --max-model-len 4096 \
  --dtype float16 \
  --port 8000

Test Result

Verified end-to-end on two GPU platforms. Model loads and serves correctly.

8×AMD W7900D (48GB each)

| Benchmark | GGUF Q8_0 | Official BF16 |
|---|---|---|
| GSM8K 8-shot | 91.5% | 92.0% |
| MMLU 5-shot | 85.66% | 86.2% |

AMD MI350X (288GB each)

| Benchmark | TP | GGUF Q8_0 | Official BF16 |
|---|---|---|---|
| GSM8K 8-shot | 8 | 91.7% | 92.0% |
| GSM8K 8-shot | 4 | 92.2% | 92.0% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update.

gemini-code-assist (bot) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces GGUF quantization support, including registering GGUF as a quantization method, adding an override mechanism for user-specified GGUF quantization, and implementing logic to handle sharded GGUF models by discovering all shards and iterating over them for weight loading and type mapping. It also includes specific weight mapping for minimax_m2 models and enables quantization for embedding and LM head layers in minimax_m2. A review comment highlights that the sharded GGUF file discovery logic (_get_all_gguf_files) might be brittle due to hardcoded shard index padding and suggests a more robust regex-based approach for dynamic padding.

JoursBleu force-pushed the feat/gguf-minimax-m2 branch from c7bada3 to d022b3d on March 13, 2026.