
Granite Four #13550


Draft: wants to merge 89 commits into master

Conversation

gabe-l-hart (Contributor) commented May 14, 2025

Description

This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:

Additionally, this PR replaces some work done on other PRs / branches:

Outstanding Questions

Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:

  • This PR contains several changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
  • Is there a more efficient way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right.
  • There are still some numerical differences between the attention outputs when running Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine whether this is due to changes in the attention implementation (i.e. "working as expected") or a bug somewhere.
  • The use of dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
  • The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
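The dynamic_cast question above can be illustrated with a minimal sketch. The class names here are hypothetical stand-ins (the real hierarchy lives in llama-kv-cache / llama-graph); the point is only the dispatch pattern, where a getter accepts the generic memory type and probes for either the plain cache or the matching half of a hybrid cache:

```cpp
#include <cassert>

// Hypothetical stand-ins for the real cache hierarchy.
struct memory_base { virtual ~memory_base() = default; };
struct unified_cache : memory_base { int n_kv = 0; };
struct recurrent_cache : memory_base { int n_rs = 0; };
struct hybrid_cache : memory_base {
    unified_cache   attn;
    recurrent_cache recr;
};

// Fetch the unified view regardless of whether the memory is a plain
// unified cache or the attention half of a hybrid cache.
unified_cache * get_unified_cache(memory_base * mem) {
    if (auto * u = dynamic_cast<unified_cache *>(mem)) {
        return u;
    }
    if (auto * h = dynamic_cast<hybrid_cache *>(mem)) {
        return &h->attn;
    }
    return nullptr;
}
```

A corresponding get_recurrent_cache would mirror this with recurrent_cache / the recurrent half. The cost concern in the bullet above is the RTTI lookup per graph build, which is almost certainly dwarfed by the tensor math.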

Testing

To test out this branch, I've been using the following models:

Details

This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side

  • Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
    • This includes one small tweak to gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
    • This also adds the new HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?
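The duplicate-tolerance rule described above is a Python change in gguf_writer.py; the check itself is language-agnostic, so it is sketched here in C++ (the document's core language) purely for illustration. A repeated add succeeds only when both the stored value and its type match exactly:

```cpp
#include <map>
#include <string>
#include <utility>

// Sketch only: the real change is in gguf-py's gguf_writer.py.
// key -> (value, type), with both compared on a duplicate add.
using kv_store = std::map<std::string, std::pair<std::string, std::string>>;

bool add_key_value(kv_store & kv, const std::string & key,
                   const std::string & value, const std::string & type) {
    auto it = kv.find(key);
    if (it != kv.end()) {
        // Duplicate key: accept only an exact match in value AND type.
        return it->second == std::make_pair(value, type);
    }
    kv.emplace(key, std::make_pair(value, type));
    return true;
}
```

This lets a hybrid converter call the hparam-setting logic of multiple parent classes without the parents needing to guard every write.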

c++ side

  • Add a new public API function llama_model_is_hybrid akin to llama_model_is_recurrent
    • I also split up both this function and llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
  • Add hparams.recurrent_layer_arr and support parsing it
    • The current implementation pre-allocates it as a fixed-length array which doesn't feel quite right.
  • Add an optional layer id to hparams.n_embd_k_s / hparams.n_embd_v_s
    • This is done because for hybrid models, the values may be different by layer.
    • I plumbed through as many usages of these methods as I could find to properly pass the layer index, but there are some places where it's not available which default to layer 0. This should be fine since none of those places interact with the hybrid caching.
  • Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
  • Model name/param/arch plumbing for bamba and granitemoeshared in llama-arch.* (the boring part!)
  • (possibly breaking) Add hparams as an additional argument to the llama_model.create_memory method
    • This is done so the hparams can be given to the cache construction and used to determine which layers are recurrent for hybrid cache creation
  • In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches.
  • Add support for instantiating the hybrid cache in llama-model.cpp
  • Add model support for bamba and granitemoehybrid in llama-model
    • Most of this is "business as usual," but that breaks down when trying to avoid code duplication for the hybrid architecture
    • To avoid code duplication, I hoisted build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
    • I tried an alternative route using diamond inheritance, but this would have required some kind of "don't actually initialize the graph" switch in the parent model builders' constructors to avoid trying to build the parent model graphs during initialization of the hybrid class.
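Several of the bullets above (the fixed-length recurrent_layer_arr, the recurrent_layer(il) predicate, and the per-layer n_embd_k_s / n_embd_v_s) fit together as in the following sketch. The names follow the PR, but the bodies and constants are illustrative only:

```cpp
#include <array>
#include <cstdint>

// Hypothetical cap, mirroring the fixed-length pre-allocation that the
// PR flags as "doesn't feel quite right".
constexpr std::size_t LLAMA_MAX_LAYERS_SKETCH = 512;

struct hparams_sketch {
    uint32_t n_layer = 0;

    // One flag per layer, pre-allocated to the max layer count.
    std::array<bool, LLAMA_MAX_LAYERS_SKETCH> recurrent_layer_arr = {};

    bool recurrent_layer(uint32_t il) const {
        return recurrent_layer_arr[il];
    }

    // Per-layer K-state size: in a hybrid model, recurrent layers report a
    // nonzero SSM state size while attention layers report 0. The optional
    // layer id defaults to 0 for call sites that don't have one.
    // (The value 128 is made up for the sketch.)
    uint32_t n_embd_k_s(uint32_t il = 0) const {
        return recurrent_layer(il) ? 128 : 0;
    }
};
```

The open question above is whether a std::vector or bitset keyed off n_layer would be cleaner than the max-size array, given that n_layer is only known once a valid architecture is loaded.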

* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
llm_graph_input_rs as the first input
- Add a corresponding overload of build_rs w/
llm_graph_input_rs_hybrid_recurrent as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
llm_graph_input_attn_kv_hybrid_recurrent as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.
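The "type of input pinned to the type of memory" idea above can be sketched as an overload set: each input struct carries a pointer to its matching memory state, so plain overload resolution does the dispatch. All types below are simplified, hypothetical stand-ins:

```cpp
#include <string>

// Each input type is pinned to one kind of memory state.
struct recurrent_state        { };
struct hybrid_recurrent_state { recurrent_state recr; };

struct llm_graph_input_rs_sketch                  { recurrent_state * state; };
struct llm_graph_input_rs_hybrid_recurrent_sketch { hybrid_recurrent_state * state; };

// Overload for the pure recurrent case: reads the state directly.
std::string build_rs(llm_graph_input_rs_sketch * inp) {
    (void) inp->state;
    return "recurrent";
}

// Overload for the hybrid case: reads the recurrent half of the hybrid
// state. This duplication is the drawback noted above.
std::string build_rs(llm_graph_input_rs_hybrid_recurrent_sketch * inp) {
    (void) inp->state->recr;
    return "hybrid";
}
```

build_attn follows the same pattern with llm_graph_input_attn_kv_unified vs llm_graph_input_attn_kv_hybrid_recurrent; the bodies differ only in how they reach the underlying cache.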

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
…ostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
This is borrowed and adapted from the original implementation
ggml-org#10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…_t> for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture `if` statement inside the mamba
implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
… methods

This will allow these layer-builder methods to be used from other build
structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…n hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This is helpful for hybrid models that want to do gguf param setting by
calling multiple parent classes without needing to make those parent
classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This re-uses the Bamba code paths heavily and simply adds the missing parts
for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
… impl to use mixins

The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from `self->` everywhere, but this is still a
bit cleaner than making those methods static I think.
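The mixin shape described above, which shares the layer-builder methods without diamond inheritance of llm_graph_context, might look roughly like the following CRTP sketch. All names are illustrative; the "self->" the message mentions shows up as the cast back to the derived class:

```cpp
// Stand-in for llm_graph_context: non-trivial state that must be
// initialized exactly once (why diamond inheritance of it is a problem).
struct graph_context_sketch {
    int n_init = 0;
    graph_context_sketch() { ++n_init; }
};

// CRTP mixins hold no context of their own; they reach the shared context
// through the derived class, so the context base is inherited once.
template <typename Derived>
struct mamba_layer_mixin {
    int build_mamba_layer() {
        Derived * self = static_cast<Derived *>(this);
        return self->n_init; // would build the mamba layer on self's graph
    }
};

template <typename Derived>
struct granite_attn_mixin {
    int build_attention_layer() {
        Derived * self = static_cast<Derived *>(this);
        return self->n_init; // would build the attention layer on self's graph
    }
};

// The hybrid builder inherits the context exactly once, plus both mixins.
struct hybrid_builder :
    graph_context_sketch,
    mamba_layer_mixin<hybrid_builder>,
    granite_attn_mixin<hybrid_builder> {};
```

With virtual (diamond) inheritance instead, graph_context_sketch would need a "don't actually build" switch to keep each parent constructor from building its own graph; the mixin route avoids that at the cost of the self-> indirection.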

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…r builders

This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of `build_rs` /
`build_attn`. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Labels: Apple Metal, examples, ggml, python, server, testing