[Model][Quantization] Add GGUF support for MiniMax-M2.1 #36965
Draft
JoursBleu wants to merge 1 commit into vllm-project:main
Conversation
Contributor
Code Review
This pull request introduces GGUF quantization support, including registering GGUF as a quantization method, adding an override mechanism for user-specified GGUF quantization, and implementing logic to handle sharded GGUF models by discovering all shards and iterating over them for weight loading and type mapping. It also includes specific weight mapping for minimax_m2 models and enables quantization for embedding and LM head layers in minimax_m2. A review comment highlights that the sharded GGUF file discovery logic (_get_all_gguf_files) might be brittle due to hardcoded shard index padding and suggests a more robust regex-based approach for dynamic padding.
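The regex-based shard discovery the review suggests could look like the following sketch. This is an illustration of the idea (derive the index padding width from the filename itself rather than hardcoding it), not the PR's actual code; the `discover_shards` helper and its behavior for non-sharded files are assumptions.

```python
import re
from pathlib import Path

# Shard names follow the "model-00001-of-00005.gguf" convention.
_SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<idx>\d+)-of-(?P<total>\d+)\.gguf$")

def discover_shards(first_shard: str) -> list[str]:
    """Given any one shard path, return all shard paths in order.

    The padding width is taken from the shard index in the given
    filename, so 3-digit and 5-digit schemes both work unchanged.
    """
    path = Path(first_shard)
    m = _SHARD_RE.match(path.name)
    if m is None:
        # Not a sharded checkpoint: treat it as a single file.
        return [first_shard]
    width = len(m.group("idx"))
    total = int(m.group("total"))
    stem = m.group("stem")
    return [
        str(path.with_name(f"{stem}-{i:0{width}d}-of-{total:0{width}d}.gguf"))
        for i in range(1, total + 1)
    ]
```

A caller holding `model-00001-of-00005.gguf` would get back all five shard paths without any assumption about the padding width.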
Force-pushed from c7bada3 to d022b3d
Purpose
Add GGUF loading support for MiniMax-M2.1 (456B MoE, 45.9B active, 256 experts, 8 active per token).
This enables serving GGUF-quantized MiniMax-M2.1 checkpoints (e.g. unsloth/MiniMax-M2.1-GGUF) with vLLM.
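Loading a sharded GGUF checkpoint means streaming tensors from every shard in a type-first order: all quant weight types are yielded before any weight data, so packed layers whose sub-weights use different quant types are fully typed up front. A minimal sketch of such a two-phase iterator follows; the `read_tensors` callable (standing in for a real GGUF reader) and the `_qweight`/`_qweight_type` name suffixes are illustrative assumptions, not vLLM's actual interfaces.

```python
from collections.abc import Callable, Iterable, Iterator

def quant_weights_iterator_multi(
    shard_paths: list[str],
    read_tensors: Callable[[str], Iterable[tuple[str, str, object]]],
) -> Iterator[tuple[str, object]]:
    """Yield (param_name, value) pairs across all GGUF shards.

    read_tensors(path) is assumed to yield (name, quant_type, data)
    triples for one shard file.
    """
    # Phase 1: quant types for every tensor in every shard, so the
    # consumer knows each packed layer's full type layout before any
    # weight data arrives.
    for shard in shard_paths:
        for name, qtype, _data in read_tensors(shard):
            yield f"{name}_qweight_type", qtype
    # Phase 2: the actual weight data, shard by shard.
    for shard in shard_paths:
        for name, _qtype, data in read_tensors(shard):
            yield f"{name}_qweight", data
```

Iterating the shard list twice trades a second pass over the file metadata for not having to buffer any tensor data while the types are being collected.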
- `vllm/model_executor/model_loader/gguf_loader.py`:
  - Add a minimax_m2 weight mapping (`ffn_gate_exps`/`ffn_down_exps`/`ffn_up_exps` → per-expert `w1`/`w2`/`w3`), following the same pattern as DeepSeek2.
  - Map `exp_probs_b.bias` → `block_sparse_moe.e_score_correction_bias`.
  - Add a regex (`\.block_sparse_moe\.experts\.(gate_up_proj|down_proj)`) so merged expert weights are excluded from the unmapped parameter check.
  - `_get_all_gguf_files()` auto-discovers shard files from a single shard path (e.g. `*-00001-of-00005.gguf` → all 5 shards), with dynamic shard index padding width detection. Multi-shard iteration is used for weight loading, weight type mapping, and extra tensor name collection.
- `vllm/model_executor/model_loader/weight_utils.py`: add `gguf_quant_weights_iterator_multi()` for iterating over quantized weights across multiple GGUF shard files. Like the single-file version, it yields all weight types before weight data to avoid issues with packed layers that have different quant types.
- `vllm/model_executor/layers/quantization/gguf.py`: implement `override_quantization_method()` so `--quantization gguf` overrides the HF config quantization (e.g. fp8).
- `vllm/config/model.py`: add `"gguf"` to the quantization override whitelist.
- `vllm/model_executor/models/minimax_m2.py`: `embed_tokens` and `lm_head` now pass `quant_config` instead of `None`, consistent with the Qwen2/Qwen3 MoE GGUF fixes ([Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models #30307, Fix GGUF loader for Qwen3 MoE. #22785).

Requires a corresponding `transformers` PR: huggingface/transformers#44526

Test Plan
Test Result
Verified end-to-end on two GPU platforms; the model loads and serves correctly.
- 8× AMD W7900D (48 GB each)
- AMD MI350X (288 GB each)
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.