
Add support for MiniMax's MiniMax-Text-01 #35831


Merged
103 commits merged into huggingface:main on Jun 4, 2025

Conversation

@geetu040 (Contributor) commented Jan 22, 2025

What does this PR do?

Fixes #35710

This PR adds MiniMaxAI's MiniMax-Text-01 model to Hugging Face Transformers.

  • MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token.
  • To better unlock the long context capabilities of the model, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE).
  • MiniMax-Text-01's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference.
  • On various academic benchmarks, MiniMax-Text-01 also demonstrates top-tier performance.

Relevant Links

CC: @MiniMax-AI-Dev

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1

Change Log

  • Tokenizer: It uses the existing GPT2Tokenizer
  • Config: Matches the MixtralConfig with a few additional parameters:
    • residual_post_norm, attn_type_list
    • layernorm_attention_alpha, layernorm_lightning_attention_alpha, layernorm_mlp_alpha
    • layernorm_attention_beta, layernorm_lightning_attention_beta, layernorm_mlp_beta
  • Weight Conversion Script: No script needed, original weights can be loaded directly into the new architecture
  • Model: MiniMax-Text-01 architecture matches and uses most of the Mixtral architecture, with a few changes in
    • DecoderLayer (see the sketch after this list)
      • hidden_states can be used as the residual connection either before or after layernorm is applied
      • a weighted sum is used in the residual connection
      • selection between softmax and lightning attention is based on layer_idx
    • LightningAttention
      • initially used in TransNormerLLM, upgraded in Lightning Attention-2, and adopted in MiniMax-01
      • every 8th decoder layer uses softmax attention; the remaining layers use lightning attention, which has not previously been implemented in transformers
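
As a rough illustration of the DecoderLayer changes above, here is a minimal sketch of the residual handling and per-layer attention selection. Class and argument names are made up for illustration; this is not the PR's actual modeling code.

from torch import nn

# Illustrative only: names, defaults, and the Identity stand-ins are assumptions.
class ToyHybridDecoderLayer(nn.Module):
    def __init__(self, hidden_size, layer_idx, attn_type_list, alpha=1.0, beta=1.0):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        # per-layer choice: 1 -> softmax attention, 0 -> lightning (linear) attention
        if attn_type_list[layer_idx] == 1:
            self.attn = nn.Identity()  # stand-in for a softmax-attention module
        else:
            self.attn = nn.Identity()  # stand-in for a lightning-attention module
        self.alpha, self.beta = alpha, beta  # residual weighting factors

    def forward(self, hidden_states):
        # the residual can be taken before the layernorm (as here) or after it,
        # depending on residual_post_norm
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_out = self.attn(hidden_states)
        # weighted sum instead of a plain `residual + attn_out`
        return self.alpha * residual + self.beta * attn_out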

To summarize the above, the main area of review is the LightningAttention implementation; a minimal sketch of the recurrence it builds on follows.
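
For reviewers unfamiliar with the idea, this is the textbook recurrent form of decayed linear attention (per-head decay, stateful outer-product accumulation), not the PR's actual block-wise implementation; the function and argument names are made up.

import torch

def decayed_linear_attention(q, k, v, decay):
    """q, k, v: (batch, heads, seq_len, head_dim); decay: (heads,) per-head rate in (0, 1)."""
    b, h, n, d = q.shape
    state = torch.zeros(b, h, d, d, dtype=q.dtype, device=q.device)  # running k^T v state
    outputs = []
    for t in range(n):
        qt, kt, vt = q[:, :, t], k[:, :, t], v[:, :, t]  # each (b, h, d)
        # decay the old state, then accumulate the outer product k_t^T v_t
        state = decay.view(1, h, 1, 1) * state + kt.unsqueeze(-1) * vt.unsqueeze(-2)
        # o_t = q_t @ state  (the query reads the accumulated state)
        outputs.append(torch.einsum("bhd,bhde->bhe", qt, state))
    return torch.stack(outputs, dim=2)  # (b, h, n, d)

A fixed-size state like this, rather than a growing KV cache, is presumably what the caching and .generate() items in the TODO list below refer to for the lightning layers.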

TODOs

  • Update Documentation
  • Update Tests
  • Import Statements and Auto Modeling
  • Implement Model
    • Implement End-to-End Architecture
    • Implement LightningAttention
      • Work with available code
      • Refactor, Clean and Optimize
      • Implement Decays
      • Implement Caching
      • Implement attention_mask
      • Support .generate() method
  • Fix CI/CD tests
  • MiniMax-Tiny for slow integration tests
  • Model Card (only the config and README file need to be updated)

@Rocketknight1 (Member) commented

Hi @geetu040, this looks quite good! You can ping us whenever it's ready for review. Also, code quality issues in the CI can be fixed with pip install -e .[quality] in the transformers directory, followed by make fixup.

@ArthurZucker (Collaborator) commented Jan 27, 2025

Eager to see this! 🔥 Feel free to ping @Cyrilvallez and me for a review!

@ArthurZucker (Collaborator) commented

cc @Cyrilvallez can you have a look!

@ArthurZucker (Collaborator) commented

Do you want me to commit directly to help get this merged?

@qscqesze commented

I tested the MiniMax model by using this code.

The prompt is "Hello! How are you?"

from transformers import MiniMaxForCausalLM

# MODEL_PATH and quantization_config are defined elsewhere in the test script
quantized_model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

The output is garbled, for example:

この広告はメリカ就很.pankajoudhia成这样brakk a Поједина and aプライバerjakan회견 grandissimumb。这是因为 a,カウンセ a

However, if I switch to:

from transformers import AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

then the output is correct, such as:

I'm doing well, thank you for asking! How about you?

It seems like there's something wrong when using MiniMaxForCausalLM.from_pretrained.

@geetu040 (Contributor, Author) commented

@qscqesze Thanks for testing this out!

This could be happening because of a minor difference in how the config attributes are named.

Can you try doing this with the config:

from transformers import AutoConfig, MiniMaxConfig, MiniMaxForCausalLM

# MODEL_PATH and quantization_config are assumed to be defined as in your script
config = AutoConfig.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

config = config.to_dict()
config['full_attn_alpha_factor'] = config['layernorm_full_attention_alpha']
config['full_attn_beta_factor'] = config['layernorm_full_attention_beta']
config['linear_attn_alpha_factor'] = config['layernorm_linear_attention_alpha']
config['linear_attn_beta_factor'] = config['layernorm_linear_attention_beta']
config['mlp_alpha_factor'] = config['layernorm_mlp_alpha']
config['mlp_beta_factor'] = config['layernorm_mlp_beta']

config = MiniMaxConfig(**config)

quantized_model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

Also, don't use the code from the latest commit of this branch; I am fixing some bugs here. Please use the code from the previous commit, 9c5397d.

@qscqesze commented

(quoting @geetu040's config suggestion above)

Thanks for your code—everything works fine on my side now.

@geetu040 (Contributor, Author) commented

@ArthurZucker please review the PR; if everything looks fine, we can move on to handling the checkpoints on Hugging Face.

@qscqesze commented May 27, 2025

Hello. I'm curious—why do we need to change attn_type_list to layer_type_list? I checked the config file at config.json, and it actually doesn't contain a layer_type_list configuration.

When I use the latest code, I get the following error:

You are using a model of type minimax_text_01 to instantiate a model of type MiniMaxText01. This is not supported for all configurations of models and can yield errors.
You are using a model of type mixtral to instantiate a model of type MiniMaxText01. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/home/qingjun/vllm_test/huggingface_test.py", line 62, in <module>
    quantized_model = MiniMaxForCausalLM.from_pretrained(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/modeling_utils.py", line 314, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/modeling_utils.py", line 4622, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 883, in __init__
    self.model = MiniMaxModel(config)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 657, in __init__
    [MiniMaxDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingjun/transformers/src/transformers/models/minimax/modeling_minimax.py", line 500, in __init__
    self.layer_type = config.layer_type_list[layer_idx]
                      ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

However, if I modify it like this:

config['layer_type_list'] = config['attn_type_list']

It works fine.
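
Put together, the workaround looks roughly like this at the current state of the branch (MODEL_PATH is a placeholder for the local checkpoint path, and the remapped key follows the discussion above; this is a sketch, not an official recipe):

from transformers import AutoConfig, MiniMaxConfig, MiniMaxForCausalLM

MODEL_PATH = "path/to/MiniMax-Text-01"  # placeholder

config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True).to_dict()
# the original checkpoint ships attn_type_list; this branch expects layer_type_list
config["layer_type_list"] = config["attn_type_list"]

model = MiniMaxForCausalLM.from_pretrained(
    MODEL_PATH,
    config=MiniMaxConfig(**config),
    torch_dtype="bfloat16",
    device_map="auto",
)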

@geetu040 (Contributor, Author) commented

@qscqesze yes, some of the config attributes are named differently here in transformers, as suggested in #35831 (comment) and #35831 (comment).

We renamed these for better clarity and to avoid confusion.

Are you fine with the changes, or would you prefer to keep the original names or suggest better ones?

CC: @ArthurZucker @Cyrilvallez (could we rename them back?)

@qscqesze commented

I understand the reasoning behind renaming these attributes for better clarity, and I appreciate the effort to make things more consistent. However, one drawback is that users who download the config files directly from Hugging Face might encounter issues if they’re unaware of the renaming—since the current setup requires them to manually remap the keys to make the model work. This could easily lead to confusion, especially for users who aren’t closely following the codebase changes.

I’d love to hear what other reviewers think about this as well.

@geetu040 (Contributor, Author) commented

@qscqesze We actually handle this by creating another repository on Hugging Face, in this case something like MiniMaxAI/MiniMax-Text-01-hf, which works with the implementation in transformers.

@ArthurZucker (Collaborator) commented

@qscqesze when you use trust_remote_code=True, it is not expected to work with the code of this PR.
We will open PRs to the repo to update the remote code!

The reason we need to rename is to adhere to the standard: layer_type is now shared across models!

self.layer_types = layer_types
This means we can enable better support for different layer types!
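
As a rough illustration of what the shared attribute buys (hypothetical values and helper, not the actual transformers code), generic infrastructure such as cache allocation can branch on the per-layer type instead of on model-specific flags:

# Hypothetical example: MiniMax-style pattern where every 8th layer is full (softmax) attention.
num_hidden_layers = 32
layer_types = [
    "full_attention" if (i + 1) % 8 == 0 else "linear_attention"
    for i in range(num_hidden_layers)
]

def needs_growing_kv_cache(layer_type: str) -> bool:
    # full-attention layers keep a growing KV cache; linear layers keep a fixed-size state
    return layer_type == "full_attention"

print(sum(needs_growing_kv_cache(t) for t in layer_types))  # 4 of the 32 layers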

@ArthurZucker (Collaborator) left a review comment

My last comment then let's merge! cc @LysandreJik to merge once this is fixed

@qscqesze commented Jun 3, 2025

@ArthurZucker @geetu040 Thank you both!

That makes perfect sense regarding the need for a separate Hugging Face repo and the renaming to align with the new layer_type standards.

@geetu040 (Contributor, Author) commented Jun 3, 2025

My last comment then let's merge! cc @LysandreJik to merge once this is fixed

Hi @LysandreJik, this PR is ready to merge!

The tiny checkpoint used for integration tests should also be moved from geetu040/MiniMax-tiny to hf-internal-testing/MiniMax-tiny.

@LysandreJik (Member) commented

Hey @geetu040, I mirrored your tiny checkpoint here: https://huggingface.co/hf-internal-testing/MiniMax-tiny/tree/main

I'm merging the PR!

Thanks for your contribution 🤗

@LysandreJik merged commit 55736ee into huggingface:main on Jun 4, 2025
18 checks passed
bvantuan pushed a commit to bvantuan/transformers that referenced this pull request Jun 12, 2025
* end-to-end architecture

* lightning-attn: refactor, clean, optimize

* put minimax_text_01 in other files

* use latest __init__ standards and auto-generate modular

* support attention_mask for lightning-attn

* Revert "use latest __init__ standards and auto-generate modular"

This reverts commit d8d3c40.

* fix modular conversion

* pass both attention masks instead of tuple

* formatting

* Updated Dynamic Cache

* created MiniMaxText01Cache

* fix hardcoded slope_rate

* update attn_type_list in config

* fix lightning when use_cache=False

* copy tests from mixtral

* (checkpoint) all tests pass for normal attention

* fix all unittests

* fix import sorting

* fix consistency and formatting tests

* fix config

* update tests, since changes in main

* fix seq_len error

* create dummy docs

* fix checkpoint

* add checkpoint in config docstring

* run modular_conversion

* update docs

* fix checkpoint path and update tests

* fix ruff

* remove repeated expected_slice

* update docs

* rename "minimax-text-01" to "minimax"

* inherit config from mixtral

* remove from docs in other languages

* undo files that should be untouched

* move minimax to end in conversation docs

* use MiniMaxForCausalLM as it is

* ruff fixes

* run modular

* fix docstring example in causallm

* refactor attention loop and decay factors

* refactor config in modular

* run modular

* refactor cache

* rename static_cache to linear_cache

* make positional embeddings necessary

* remove unnecessary layernorms declarations

* fix import in tests

* refactor attention in next tokens

* remove outdated code

* formatting and modular

* update tests

* rename layernorm alpha/beta factors

* register decay factors as buffers

* remove unused declarations of decay factors

* update config for alpha/beta factors

* run modular

* remove head_dim in tests

* remove minimax from fx.py

* remove stuff that is not really needed

* update __init__

* update qkv torch.split

Co-authored-by: Cyril Vallez <[email protected]>

* fix qkv torch.split

* quality fixes

* remove mistakenly added dummy

* purge unused ModelTester code

* fix-copies

* run fix-copies

* fix head_dim

* write cache formatting tests

* remove postnorm

* avoid contiguous in attention current states

* update expected_slice

* add generation test for integration

* fix dtype in generation test

* update authors

* update with changes in main

* update gradient checkpointing and minor fixes

* fix mutable attn_type_list

* rename: attn_type -> layer_type

* update for layer_types

* update integration tests

* update checkpoint

* clean overview in docs

---------

Co-authored-by: Shakib-IO <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Comment on lines +30 to +31

[MiniMaxAI/MiniMax-Text-01-hf](https://huggingface.co/MiniMaxAI/MiniMax-Text-01-hf)
@hadipash commented Jun 27, 2025

When will this repo become available?

@geetu040 (Contributor, Author) commented Jun 27, 2025

@sriting and @qscqesze from the MiniMax team are planning to release them soon.

@geetu040 (Contributor, Author) commented

I've also opened a PR on the hf model repo that might be helpful:
https://huggingface.co/MiniMaxAI/MiniMax-Text-01/discussions/39

@geetu040 (Contributor, Author) commented

@hadipash you should now be able to see the repo at MiniMaxAI/MiniMax-Text-01-hf

Development

Successfully merging this pull request may close these issues:

  • Add support for MiniMax-Text-01 and MiniMax-VL-01 from MiniMaxAI