
Add support for MiniMax's MiniMax-Text-01 #35831


Merged
103 commits merged into huggingface:main on Jun 4, 2025

Conversation

@geetu040 (Contributor) commented Jan 22, 2025

What does this PR do?

Fixes #35710

This PR adds MiniMaxAI's MiniMax-Text-01 model to Hugging Face Transformers.

  • MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token.
  • To better unlock the long context capabilities of the model, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE).
  • MiniMax-Text-01's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference.
  • On various academic benchmarks, MiniMax-Text-01 also demonstrates top-tier performance.

Relevant Links

CC: @MiniMax-AI-Dev

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1

Change Log

  • Tokenizer: It uses the existing GPT2Tokenizer
  • Config: Matches the MixtralConfig with a few additional parameters:
    • residual_post_norm, attn_type_list
    • layernorm_attention_alpha, layernorm_lightning_attention_alpha, layernorm_mlp_alpha
    • layernorm_attention_beta, layernorm_lightning_attention_beta, layernorm_mlp_beta
  • Weight Conversion Script: No script needed, original weights can be loaded directly into the new architecture
  • Model: MiniMax-Text-01 architecture matches and uses most of the Mixtral architecture, with a few changes in
    • DecoderLayer (see the sketch after this list)
      • hidden_states can be used as the residual connection either before or after layernorm is applied
      • a weighted sum is used in the residual connection
      • selection between softmax and lightning attention is based on layer_idx
    • LightningAttention
      • initially used in TransNormerLLM, upgraded in Lightning Attention-2, and adopted in MiniMax-01
      • every 8th decoder layer uses softmax attention; the remaining layers use lightning attention, which has not previously been implemented in transformers
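
As a rough illustration of the DecoderLayer changes above, here is a minimal sketch of the residual handling and per-layer attention selection. Class and argument names are made up for illustration; this is not the PR's actual modeling code.

from torch import nn

# Illustrative only: names, defaults, and the Identity stand-ins are assumptions.
class ToyHybridDecoderLayer(nn.Module):
    def __init__(self, hidden_size, layer_idx, attn_type_list, alpha=1.0, beta=1.0):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        # per-layer choice: 1 -> softmax attention, 0 -> lightning (linear) attention
        if attn_type_list[layer_idx] == 1:
            self.attn = nn.Identity()  # stand-in for a softmax-attention module
        else:
            self.attn = nn.Identity()  # stand-in for a lightning-attention module
        self.alpha, self.beta = alpha, beta  # residual weighting factors

    def forward(self, hidden_states):
        # the residual can be taken before the layernorm (as here) or after it,
        # depending on residual_post_norm
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_out = self.attn(hidden_states)
        # weighted sum instead of a plain `residual + attn_out`
        return self.alpha * residual + self.beta * attn_out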

To summarize the above, the main area of review is the LightningAttention implementation; a minimal sketch of the recurrence it builds on follows.
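
For reviewers unfamiliar with the idea, this is the textbook recurrent form of decayed linear attention (per-head decay, stateful outer-product accumulation), not the PR's actual block-wise implementation; the function and argument names are made up.

import torch

def decayed_linear_attention(q, k, v, decay):
    """q, k, v: (batch, heads, seq_len, head_dim); decay: (heads,) per-head rate in (0, 1)."""
    b, h, n, d = q.shape
    state = torch.zeros(b, h, d, d, dtype=q.dtype, device=q.device)  # running k^T v state
    outputs = []
    for t in range(n):
        qt, kt, vt = q[:, :, t], k[:, :, t], v[:, :, t]  # each (b, h, d)
        # decay the old state, then accumulate the outer product k_t^T v_t
        state = decay.view(1, h, 1, 1) * state + kt.unsqueeze(-1) * vt.unsqueeze(-2)
        # o_t = q_t @ state  (the query reads the accumulated state)
        outputs.append(torch.einsum("bhd,bhde->bhe", qt, state))
    return torch.stack(outputs, dim=2)  # (b, h, n, d)

A fixed-size state like this, rather than a growing KV cache, is presumably what the caching and .generate() items in the TODO list below refer to for the lightning layers.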

TODOs

  • Update Documentation
  • Update Tests
  • Import Statements and Auto Modeling
  • Implement Model
    • Implement End-to-End Architecture
    • Implement LightningAttention
      • Work with available code
      • Refactor, Clean and Optimize
      • Implement Decays
      • Implement Caching
      • Implement attention_mask
      • Support .generate() method
  • Fix CI/CD tests
  • MiniMax-Tiny for slow integration tests
  • Model Card (only the config and README file need to be updated)

@Rocketknight1 (Member) commented

Hi @geetu040, this looks quite good! You can ping us whenever it's ready for review. Also, code quality issues in the CI can be fixed with pip install -e .[quality] in the transformers directory, followed by make fixup.

@ArthurZucker (Collaborator) commented Jan 27, 2025

Eager to see this! 🔥 Feel free to ping @Cyrilvallez and me for a review!

@ArthurZucker (Collaborator) commented

cc @Cyrilvallez can you have a look!

@ArthurZucker (Collaborator) commented

Do you want me to commit directly to help get this merged?

@qscqesze commented

I tested the MiniMax model by using this code.

The prompt is "Hello! How are you?"

from transformers import MiniMaxForCausalLM

# MODEL_PATH and quantization_config are defined elsewhere in the test script
quantized_model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

The output is garbled, for example:

この広告はメリカ就很.pankajoudhia成这样brakk a Поједина and aプライバerjakan회견 grandissimumb。这是因为 a,カウンセ a

However, if I switch to:

from transformers import AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

then the output is correct, such as:

I'm doing well, thank you for asking! How about you?

It seems like there's something wrong when using MiniMaxForCausalLM.from_pretrained.

@geetu040 (Contributor, Author) commented

@qscqesze Thanks for testing this out!

This could be happening because of a minor difference in how the config attributes are named.

Can you try doing this with the config:

from transformers import AutoConfig, MiniMaxConfig, MiniMaxForCausalLM

# MODEL_PATH and quantization_config are assumed to be defined as in your script
config = AutoConfig.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

config = config.to_dict()
config['full_attn_alpha_factor'] = config['layernorm_full_attention_alpha']
config['full_attn_beta_factor'] = config['layernorm_full_attention_beta']
config['linear_attn_alpha_factor'] = config['layernorm_linear_attention_alpha']
config['linear_attn_beta_factor'] = config['layernorm_linear_attention_beta']
config['mlp_alpha_factor'] = config['layernorm_mlp_alpha']
config['mlp_beta_factor'] = config['layernorm_mlp_beta']

config = MiniMaxConfig(**config)

quantized_model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

Also, don't use the code from the latest commit of this branch; I am fixing some bugs here. Please use the code from the previous commit, 9c5397d.

@qscqesze commented

(quoting @geetu040's config suggestion above)

Thanks for your code—everything works fine on my side now.

@geetu040 (Contributor, Author) commented

@ArthurZucker please review the PR; if everything looks fine, we can move on to handling the checkpoints on Hugging Face.

@qscqesze commented May 27, 2025

Hello. I'm curious—why do we need to change attn_type_list to layer_type_list? I checked the config file at config.json, and it actually doesn't contain a layer_type_list configuration.

When I use the latest code, I get the following error:

You are using a model of type minimax_text_01 to instantiate a model of type MiniMaxText01. This is not supported for all configurations of models and can yield errors.
You are using a model of type mixtral to instantiate a model of type MiniMaxText01. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/home/qingjun/vllm_test/huggingface_test.py", line 62, in <module>
    quantized_model = MiniMaxForCausalLM.from_pretrained(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/modeling_utils.py", line 314, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/modeling_utils.py", line 4622, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 883, in __init__
    self.model = MiniMaxModel(config)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 657, in __init__
    [MiniMaxDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 500, in __init__
    self.layer_type = config.layer_type_list[layer_idx]
                      ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

However, if I modify it like this:

config['layer_type_list'] = config['attn_type_list']

It works fine.
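
Put together, the workaround looks roughly like this at the current state of the branch (MODEL_PATH is a placeholder for the local checkpoint path, and the remapped key follows the discussion above; this is a sketch, not an official recipe):

from transformers import AutoConfig, MiniMaxConfig, MiniMaxForCausalLM

MODEL_PATH = "path/to/MiniMax-Text-01"  # placeholder

config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True).to_dict()
# the original checkpoint ships attn_type_list; this branch expects layer_type_list
config["layer_type_list"] = config["attn_type_list"]

model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    config=MiniMaxConfig(**config),
    torch_dtype="bfloat16",
    device_map="auto",
)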

@geetu040 (Contributor, Author) commented

@qscqesze yes, some of the config attributes are named differently here in transformers, as suggested in #35831 (comment) and #35831 (comment).

We renamed these for better clarity and to avoid confusion.

Are you fine with the changes, or would you prefer to keep the original names or suggest better ones?

CC: @ArthurZucker @Cyrilvallez (could we rename them back?)

@qscqesze commented

I understand the reasoning behind renaming these attributes for better clarity, and I appreciate the effort to make things more consistent. However, one drawback is that users who download the config files directly from Hugging Face might encounter issues if they’re unaware of the renaming—since the current setup requires them to manually remap the keys to make the model work. This could easily lead to confusion, especially for users who aren’t closely following the codebase changes.

I’d love to hear what other reviewers think about this as well.

@geetu040 (Contributor, Author) commented

@qscqesze We actually handle this by creating another repository on Hugging Face, in this case something like MiniMaxAI/MiniMax-Text-01-hf, which works with the implementation in transformers.

@ArthurZucker (Collaborator) commented

@qscqesze when you use trust_remote_code=True, it is not expected to work with the code of this PR.
We will open PRs to the repo to update the remote code!

The reason we need to rename is to adhere to the standard: layer_type is now shared across models!

self.layer_types = layer_types
This means we can enable better support for different layer types!
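
As a rough illustration of what the shared attribute buys (hypothetical values and helper, not the actual transformers code), generic infrastructure such as cache allocation can branch on the per-layer type instead of on model-specific flags:

# Hypothetical example: MiniMax-style pattern where every 8th layer is full (softmax) attention.
num_hidden_layers = 32
layer_types = [
    "full_attention" if (i + 1) % 8 == 0 else "linear_attention"
    for i in range(num_hidden_layers)
]

def needs_growing_kv_cache(layer_type: str) -> bool:
    # full-attention layers keep a growing KV cache; linear layers keep a fixed-size state
    return layer_type == "full_attention"

print(sum(needs_growing_kv_cache(t) for t in layer_types))  # 4 of the 32 layers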

@ArthurZucker (Collaborator) left a review comment

My last comment then let's merge! cc @LysandreJik to merge once this is fixed

@qscqesze commented Jun 3, 2025

@ArthurZucker @geetu040 Thank you both!

That makes perfect sense regarding the need for a separate Hugging Face repo and the renaming to align with the new layer_type standards.

@geetu040 (Contributor, Author) commented Jun 3, 2025

My last comment then let's merge! cc @LysandreJik to merge once this is fixed

Hi @LysandreJik, this PR is ready to merge!

The tiny checkpoint used for integration tests should also be moved from geetu040/MiniMax-tiny to hf-internal-testing/MiniMax-tiny.

@LysandreJik (Member) commented

Hey @geetu040, I mirrored your tiny checkpoint here: https://huggingface.co/hf-internal-testing/MiniMax-tiny/tree/main

I'm merging the PR!

Thanks for your contribution 🤗

@LysandreJik merged commit 55736ee into huggingface:main on Jun 4, 2025
18 checks passed
bvantuan pushed a commit to bvantuan/transformers that referenced this pull request Jun 12, 2025
* end-to-end architecture

* lightning-attn: refactor, clean, optimize

* put minimax_text_01 in other files

* use latest __init__ standards and auto-generate modular

* support attention_mask for lightning-attn

* Revert "use latest __init__ standards and auto-generate modular"

This reverts commit d8d3c40.

* fix modular conversion

* pass both attention masks instead of tuple

* formatting

* Updated Dynamic Cache

* created MiniMaxText01Cache

* fix hardcoded slope_rate

* update attn_type_list in config

* fix lightning when use_cache=False

* copy tests from mixtral

* (checkpoint) all tests pass for normal attention

* fix all unittests

* fix import sorting

* fix consistency and formatting tests

* fix config

* update tests, since changes in main

* fix seq_len error

* create dummy docs

* fix checkpoint

* add checkpoint in config docstring

* run modular_conversion

* update docs

* fix checkpoint path and update tests

* fix ruff

* remove repeated expected_slice

* update docs

* rename "minimax-text-01" to "minimax"

* inherit config from mixtral

* remove from docs in other languages

* undo files that should be untouched

* move minimax to end in conversation docs

* use MiniMaxForCausalLM as it is

* ruff fixes

* run modular

* fix docstring example in causallm

* refactor attention loop and decay factors

* refactor config in modular

* run modular

* refactor cache

* rename static_cache to linear_cache

* make positional embeddings necessary

* remove unnecessary layernorms declarations

* fix import in tests

* refactor attention in next tokens

* remove outdated code

* formatting and modular

* update tests

* rename layernorm alpha/beta factors

* register decay factors as buffers

* remove unused declarations of decay factors

* update config for alpha/beta factors

* run modular

* remove head_dim in tests

* remove minimax from fx.py

* remove stuff that is not really needed

* update __init__

* update qkv torch.split

Co-authored-by: Cyril Vallez <[email protected]>

* fix qkv torch.split

* quality fixes

* remove mistakenly added dummy

* purge unused ModelTester code

* fix-copies

* run fix-copies

* fix head_dim

* write cache formatting tests

* remove postnorm

* avoid contiguous in attention current states

* update expected_slice

* add generation test for integration

* fix dtype in generation test

* update authors

* update with changes in main

* update gradient checkpointing and minor fixes

* fix mutable attn_type_list

* rename: attn_type -> layer_type

* update for layer_types

* update integration tests

* update checkpoint

* clean overview in docs

---------

Co-authored-by: Shakib-IO <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Comment on lines +30 to +31

[MiniMaxAI/MiniMax-Text-01-hf](https://huggingface.co/MiniMaxAI/MiniMax-Text-01-hf)
@hadipash commented Jun 27, 2025

When will this repo become available?

@geetu040 (Contributor, Author) commented Jun 27, 2025

@sriting and @qscqesze from the MiniMax team are planning to release them soon.

@geetu040 (Contributor, Author) commented

I've also opened a PR on the hf model repo that might be helpful:
https://huggingface.co/MiniMaxAI/MiniMax-Text-01/discussions/39

@geetu040 (Contributor, Author) commented

@hadipash you should now be able to see the repo at MiniMaxAI/MiniMax-Text-01-hf

Development

Successfully merging this pull request may close these issues:

  • Add support for MiniMax-Text-01 and MiniMax-VL-01 from MiniMaxAI