Model: Granite MoE shared #13269
base: master
Conversation
Commits (Branch: GraniteMoEShared, Signed-off-by: Gabe Goodhart <[email protected]>):
- The hparam and architecture plumbing should be correct, but the implementation of the shared experts seems to still be broken.
- I had misread that the shared experts take the inputs _before_ the standard MoE layer and was feeding the output of the MoE to the shared experts.
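To make the corrected data flow concrete, here is a minimal, self-contained sketch (plain C++ with placeholder math, not the actual llama.cpp/ggml code; the SwiGLU form of the shared expert and all shapes are assumptions): both the routed MoE and the shared expert read the same pre-FFN hidden state, and their outputs are summed.

```cpp
// Hypothetical sketch of the corrected shared-expert data flow:
// the shared expert consumes the FFN input x (not the routed MoE's output),
// and its result is added to the routed experts' output.
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Stand-in for the routed mixture-of-experts block (details omitted).
Vec routed_moe(const Vec &x) { return x; /* placeholder */ }

// Shared expert sketched as a SwiGLU FFN: down( silu(gate(x)) * up(x) ).
// Projections are elementwise placeholders; real shapes come from n_ff_shexp.
Vec shared_expert(const Vec &x) {
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        float gate = x[i];                             // gate projection (placeholder)
        float up   = x[i];                             // up projection (placeholder)
        float act  = gate / (1.0f + std::exp(-gate));  // SiLU activation
        out[i] = act * up;                             // down projection omitted
    }
    return out;
}

Vec moe_layer(const Vec &x) {
    Vec routed = routed_moe(x);     // routed experts see x ...
    Vec shared = shared_expert(x);  // ... and so does the shared expert
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = routed[i] + shared[i];  // the two branches are summed
    }
    return y;
}

int main() {
    Vec x = {0.1f, -0.2f, 0.3f};
    Vec y = moe_layer(x);
    return y.empty() ? 1 : 0;
}
```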
// For Granite MoE Shared
if (arch == LLM_ARCH_GRANITE_MOE_SHARED) {
    layer.ffn_gate_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_SHEXP, "weight", i), {n_embd, hparams.n_ff_shexp}, 0);
I think you can simply check hparams.n_ff_shexp > 0 and then load these tensors. The value of hparams.n_ff_shexp will be 0 if it's not written in the GGUF. This way, you can remove about two-thirds of the code in this PR, making it much simpler.
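For illustration, the pattern being suggested might look roughly like this (a sketch with placeholder types and a hypothetical loader function, not the actual llama.cpp API): the presence of shared-expert tensors is inferred from the hparam rather than from a dedicated architecture enum.

```cpp
// Sketch of hparam-driven loading (placeholder stand-ins for llama.cpp's
// tensor/loader machinery): shared-expert tensors are created only when
// the GGUF provides a non-zero n_ff_shexp, so no new arch check is needed.
#include <cstdint>
#include <cstdio>
#include <string>

struct Tensor { std::string name; };

Tensor *load_weight(const std::string &name) {
    // A real loader would map the named tensor from the GGUF file.
    return new Tensor{name};
}

struct HParams { uint32_t n_ff_shexp = 0; };  // stays 0 if the key is absent from the GGUF

struct Layer {
    Tensor *ffn_gate_shexp = nullptr;
    Tensor *ffn_up_shexp   = nullptr;
    Tensor *ffn_down_shexp = nullptr;
};

void load_layer(Layer &layer, const HParams &hparams) {
    // Presence of shared experts is inferred from the hparam,
    // so no separate GRANITE_MOE_SHARED branch is required.
    if (hparams.n_ff_shexp > 0) {
        layer.ffn_gate_shexp = load_weight("ffn_gate_shexp.weight");
        layer.ffn_up_shexp   = load_weight("ffn_up_shexp.weight");
        layer.ffn_down_shexp = load_weight("ffn_down_shexp.weight");
    }
}

int main() {
    Layer layer;
    HParams hp;
    hp.n_ff_shexp = 512;  // hypothetical value read from the GGUF
    load_layer(layer, hp);
    std::printf("shared experts loaded: %s\n", layer.ffn_gate_shexp ? "yes" : "no");
    return 0;
}
```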
Ah, yes, this would be much simpler! All of the Granite architectures are highly related, each a slight tweak on another architecture. In transformers, the approach has always been to keep the architectures isolated and use its modular approach for shared code. I know that policy is less strict here, but I wasn't sure how much to lean into collapsing all of these architectures into a single architecture string versus keeping them separate and sharing code underneath. Is there a strong preference here going forward?
IMO in this particular case, the shared expert is a common thing and we are quite sure that whatever we're adding can be reused later by another model.
But if you want even more isolated code, it's better to duplicate the whole builder function instead of adding a new if branch. However, I don't quite like that approach (reason above).
CC @ggerganov too, WDYT?
Makes sense. My personal preference is to keep a 1:1 mapping with the architecture names in huggingface and then have maximal code reuse on the backend. That is a very weakly held preference, though, so I'm happy to take whatever direction is best from your collective maintainers' POV.
@gabe-l-hart Not sure if you have a near deadline to merge this PR(?)
We never aimed for a 1:1 mapping between the HF model_type and the llama.cpp arch name from the beginning, so it's probably fine to consider this new arch as a variant of GRANITE_MOE with n_ff_shexp > 0. I'm not sure in which case the 1:1 mapping would be useful?
Anyway, on second thought, I have no strong opinion on whether this needs to be a dedicated arch name or not; you can probably keep it this way if you like. But maybe we should move the cgraph of the granite family to a dedicated build_* function. My main concern is that we're adding things on top of build_llama, which makes it not very easy to read.
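For illustration, the kind of split being suggested could look roughly like this (a structural sketch only, with hypothetical function names and signatures; the real llama.cpp graph builders take very different arguments): the granite family gets its own builder instead of growing more branches inside the llama one.

```cpp
// Structural sketch (hypothetical names, not the llama.cpp API): each
// architecture family dispatches to its own graph-builder function rather
// than accumulating arch-specific branches inside a single build_llama.
#include <stdexcept>

enum class Arch { LLAMA, GRANITE, GRANITE_MOE, GRANITE_MOE_SHARED };

struct Graph {};  // placeholder for the compute graph

Graph build_llama_graph() { return {}; /* plain llama cgraph */ }

Graph build_granite_graph(bool moe, bool shared_experts) {
    Graph g;
    // ... pieces common to the granite family (e.g. extra scaling) ...
    if (moe) {
        // ... routed experts ...
        if (shared_experts) {
            // ... shared expert on the same FFN input, summed with routed output ...
        }
    }
    return g;
}

Graph build_graph(Arch arch) {
    switch (arch) {
        case Arch::LLAMA:              return build_llama_graph();
        case Arch::GRANITE:            return build_granite_graph(false, false);
        case Arch::GRANITE_MOE:        return build_granite_graph(true,  false);
        case Arch::GRANITE_MOE_SHARED: return build_granite_graph(true,  true);
    }
    throw std::runtime_error("unknown arch");
}

int main() { build_graph(Arch::GRANITE_MOE_SHARED); return 0; }
```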
That makes total sense. I juggle a lot of different frameworks, so the opinion of keeping the architecture names aligned is mostly just for mental clarity on my part (and anyone else comparing across formats/engines). I like the idea of a dedicated build function so we don't keep clogging up the llama builder.
Now that the full Granite 4 architecture is public, I can state that this shared MoE architecture is really just a building block for the full Granite 4, and we don't plan to release models with this as a standalone architecture. Given that, I'd be totally happy not merging this at all and bringing in the shared expert as part of the granitemoehybrid architecture.
EXAONE = auto()
GRANITE = auto()
GRANITE_MOE = auto()
GRANITE_MOE_SHARED = auto()
I don't think a new arch is needed here. Some other archs also have experts/shared experts, and they are controlled via n_ff_shexp.
Description
This PR adds support for the GraniteMoEShared architecture, matching the implementation in transformers. The model is an iteration on top of GraniteMoE and adds a shared expert to each MoE layer.
NOTE: There is not a public model with this architecture for testing yet, but it is a key building block for the just-released Granite 4 architecture.