[Model][Quantization] Add GGUF support for MiniMax-M2.1#36965

Draft
JoursBleu wants to merge 1 commit into vllm-project:main from JoursBleu:feat/gguf-minimax-m2

Conversation


JoursBleu commented on Mar 13, 2026

Purpose

Add GGUF loading support for MiniMax-M2.1 (456B MoE, 45.9B active, 256 experts, 8 active per token).

This enables serving GGUF-quantized MiniMax-M2.1 checkpoints (e.g. unsloth/MiniMax-M2.1-GGUF) with vLLM.

vllm/model_executor/model_loader/gguf_loader.py

  • Add MoE expert weight mapping for MiniMax-M2 (fused ffn_gate_exps/ffn_down_exps/ffn_up_exps → per-expert w1/w2/w3), following the same pattern as DeepSeek2.
  • Map `exp_probs_b.bias` → `block_sparse_moe.e_score_correction_bias`.
  • Register sideload params regex (\.block_sparse_moe\.experts\.(gate_up_proj|down_proj)) so merged expert weights are excluded from the unmapped parameter check.
  • Add multi-shard GGUF support: _get_all_gguf_files() auto-discovers shard files from a single shard path (e.g. *-00001-of-00005.gguf → all 5 shards), with dynamic shard index padding width detection. Multi-shard iteration is used for weight loading, weight type mapping, and extra tensor name collection.

vllm/model_executor/model_loader/weight_utils.py

  • Add gguf_quant_weights_iterator_multi() for iterating over quantized weights across multiple GGUF shard files. Like the single-file version, it yields all weight types before weight data to avoid issues with packed layers that have different quant types.
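The "all weight types before weight data" ordering can be illustrated with a minimal sketch. The real iterator reads tensors via the `gguf` package's reader; here a `FakeTensor` stand-in replaces it so the two-pass structure is visible on its own. The function name mirrors the PR, but the body and signature are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a GGUF reader tensor (illustration only)."""
    name: str
    quant_type: str
    data: list


def gguf_quant_weights_iterator_multi(
    shards: Iterable[List[FakeTensor]],
    name_map: Dict[str, str],
) -> Iterator[Tuple[str, object]]:
    shards = list(shards)  # we traverse the shards twice
    # Pass 1: emit every weight's quant type across ALL shards first, so
    # packed layers with mixed quant types are fully described before any
    # weight data is consumed.
    for shard in shards:
        for t in shard:
            if t.name in name_map:
                hf_name = name_map[t.name]
                yield hf_name.replace("weight", "qweight_type"), t.quant_type
    # Pass 2: only then emit the weight data itself.
    for shard in shards:
        for t in shard:
            if t.name in name_map:
                yield name_map[t.name], t.data
```

The key property is the ordering guarantee: every `qweight_type` entry is yielded before any weight tensor, regardless of which shard it lives in.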

vllm/model_executor/layers/quantization/gguf.py

  • Add override_quantization_method() so --quantization gguf overrides HF config quantization (e.g. fp8).
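The override hook can be sketched as below. `override_quantization_method` is the hook named in the PR; the class body here is an assumed minimal version showing the intended precedence (an explicit `--quantization gguf` wins over the HF config's declared method), not the actual vLLM code.

```python
from typing import Any, Optional


class GGUFConfig:
    """Sketch of the relevant hook on a GGUF quantization config class."""

    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: Any, user_quant: Optional[str]
    ) -> Optional[str]:
        # If the user explicitly requested GGUF, take precedence over
        # whatever the HF config declares (e.g. fp8). Returning None means
        # "no override"; the HF-declared method is used as-is.
        if user_quant == "gguf":
            return "gguf"
        return None
```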

vllm/config/model.py

  • Add "gguf" to the quantization override whitelist.

vllm/model_executor/models/minimax_m2.py

  • Enable quantization for the embedding and LM head layers.

Requires a corresponding transformers PR: huggingface/transformers#44526

Recreated from #36444 which was accidentally closed due to a bad force-push.

Test Plan

python -m vllm.entrypoints.openai.api_server \
  --model MiniMax-M2.1-Q8_0-00001-of-00005.gguf \
  --tokenizer MiniMaxAI/MiniMax-M2.1 \
  --quantization gguf \
  --tensor-parallel-size 8 \
  --max-model-len 4096 \
  --dtype float16 \
  --port 8000

Test Result

Verified end-to-end on two GPU platforms. Model loads and serves correctly.

8×AMD W7900D (48GB each)

| Benchmark | GGUF Q8_0 | Official BF16 |
|---|---|---|
| GSM8K 8-shot | 91.5% | 92.0% |
| MMLU 5-shot | 85.66% | 86.2% |

AMD MI350X (288GB each)

| Benchmark | TP | GGUF Q8_0 | Official BF16 |
|---|---|---|---|
| GSM8K 8-shot | 8 | 91.7% | 92.0% |
| GSM8K 8-shot | 4 | 92.2% | 92.0% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update.

gemini-code-assist (bot) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces GGUF quantization support, including registering GGUF as a quantization method, adding an override mechanism for user-specified GGUF quantization, and implementing logic to handle sharded GGUF models by discovering all shards and iterating over them for weight loading and type mapping. It also includes specific weight mapping for minimax_m2 models and enables quantization for embedding and LM head layers in minimax_m2. A review comment highlights that the sharded GGUF file discovery logic (_get_all_gguf_files) might be brittle due to hardcoded shard index padding and suggests a more robust regex-based approach for dynamic padding.

JoursBleu force-pushed the feat/gguf-minimax-m2 branch from c7bada3 to d022b3d on March 13, 2026.