Model: Granite MoE shared #13269
base: master
Conversation
Commits (Branch: GraniteMoEShared, Signed-off-by: Gabe Goodhart <[email protected]>):
- The hparam and architecture plumbing should be correct, but the implementation of the shared experts seems to still be broken.
- I had misread that the shared experts take the inputs _before_ the standard MoE layer and was feeding the output of the MoE to the shared experts.
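To make the corrected data flow concrete, here is a minimal, self-contained sketch (plain C++ with placeholder math, not the actual llama.cpp/ggml code; the SwiGLU form of the shared expert and all shapes are assumptions): both the routed MoE and the shared expert read the same pre-FFN hidden state, and their outputs are summed.

```cpp
// Hypothetical sketch of the corrected shared-expert data flow:
// the shared expert consumes the FFN input x (not the routed MoE's output),
// and its result is added to the routed experts' output.
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Stand-in for the routed mixture-of-experts block (details omitted).
Vec routed_moe(const Vec &x) { return x; /* placeholder */ }

// Shared expert sketched as a SwiGLU FFN: down( silu(gate(x)) * up(x) ).
// Projections are elementwise placeholders; real shapes come from n_ff_shexp.
Vec shared_expert(const Vec &x) {
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        float gate = x[i];                             // gate projection (placeholder)
        float up   = x[i];                             // up projection (placeholder)
        float act  = gate / (1.0f + std::exp(-gate));  // SiLU activation
        out[i] = act * up;                             // down projection omitted
    }
    return out;
}

Vec moe_layer(const Vec &x) {
    Vec routed = routed_moe(x);     // routed experts see x ...
    Vec shared = shared_expert(x);  // ... and so does the shared expert
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = routed[i] + shared[i];  // the two branches are summed
    }
    return y;
}

int main() {
    Vec x = {0.1f, -0.2f, 0.3f};
    Vec y = moe_layer(x);
    return y.empty() ? 1 : 0;
}
```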
// For Granite MoE Shared
if (arch == LLM_ARCH_GRANITE_MOE_SHARED) {
    layer.ffn_gate_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_SHEXP, "weight", i), {n_embd, hparams.n_ff_shexp}, 0);
I think you can simply check hparams.n_ff_shexp > 0 and then load these tensors. The value of hparams.n_ff_shexp will be 0 if it's not written in the GGUF. This way, you can remove about two-thirds of the code in this PR, making it much simpler.
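For illustration, the pattern being suggested might look roughly like this (a sketch with placeholder types and a hypothetical loader function, not the actual llama.cpp API): the presence of shared-expert tensors is inferred from the hparam rather than from a dedicated architecture enum.

```cpp
// Sketch of hparam-driven loading (placeholder stand-ins for llama.cpp's
// tensor/loader machinery): shared-expert tensors are created only when
// the GGUF provides a non-zero n_ff_shexp, so no new arch check is needed.
#include <cstdint>
#include <cstdio>
#include <string>

struct Tensor { std::string name; };

Tensor *load_weight(const std::string &name) {
    // A real loader would map the named tensor from the GGUF file.
    return new Tensor{name};
}

struct HParams { uint32_t n_ff_shexp = 0; };  // stays 0 if the key is absent from the GGUF

struct Layer {
    Tensor *ffn_gate_shexp = nullptr;
    Tensor *ffn_up_shexp   = nullptr;
    Tensor *ffn_down_shexp = nullptr;
};

void load_layer(Layer &layer, const HParams &hparams) {
    // Presence of shared experts is inferred from the hparam,
    // so no separate GRANITE_MOE_SHARED branch is required.
    if (hparams.n_ff_shexp > 0) {
        layer.ffn_gate_shexp = load_weight("ffn_gate_shexp.weight");
        layer.ffn_up_shexp   = load_weight("ffn_up_shexp.weight");
        layer.ffn_down_shexp = load_weight("ffn_down_shexp.weight");
    }
}

int main() {
    Layer layer;
    HParams hp;
    hp.n_ff_shexp = 512;  // hypothetical value read from the GGUF
    load_layer(layer, hp);
    std::printf("shared experts loaded: %s\n", layer.ffn_gate_shexp ? "yes" : "no");
    return 0;
}
```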
Ah, yes, this would be much simpler! All of the Granite architectures are highly related, each a slight tweak on another architecture. In transformers, the approach has always been to keep the architectures isolated and use its modular approach for shared code. I know that policy is less strict here, but I wasn't sure how much to lean into collapsing all of these architectures into a single architecture string versus keeping them separate and sharing code underneath. Is there a strong preference here going forward?
IMO in this particular case, the shared expert is a common thing and we are quite sure that whatever we're adding can be reused later by another model.
But if you want even more isolated code, it's better to duplicate the whole builder function instead of adding a new if branch. However, I don't quite like that approach (reason above).
CC @ggerganov too, WDYT?
Makes sense. My personal preference is to keep a 1:1 mapping with the architecture names in huggingface and then have maximal code reuse on the backend. That is a very weakly held preference, though, so I'm happy to take whatever direction is best from your collective maintainers' POV.
@gabe-l-hart Not sure if you have a near deadline to merge this PR(?)
We never aimed for a 1:1 mapping between the HF model_type and the llama.cpp arch name from the beginning, so it's probably fine to consider this new arch as a variant of GRANITE_MOE with n_ff_shexp > 0. I'm not sure in which case the 1:1 mapping would be useful?
Anyway, on second thought, I have no strong opinion on whether this needs to be a dedicated arch name or not; you can probably keep it this way if you like. But maybe we should move the cgraph of the granite family to a dedicated build_* function. My main concern is that we're adding things on top of build_llama, which makes it not very easy to read.
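For illustration, the kind of split being suggested could look roughly like this (a structural sketch only, with hypothetical function names and signatures; the real llama.cpp graph builders take very different arguments): the granite family gets its own builder instead of growing more branches inside the llama one.

```cpp
// Structural sketch (hypothetical names, not the llama.cpp API): each
// architecture family dispatches to its own graph-builder function rather
// than accumulating arch-specific branches inside a single build_llama.
#include <stdexcept>

enum class Arch { LLAMA, GRANITE, GRANITE_MOE, GRANITE_MOE_SHARED };

struct Graph {};  // placeholder for the compute graph

Graph build_llama_graph() { return {}; /* plain llama cgraph */ }

Graph build_granite_graph(bool moe, bool shared_experts) {
    Graph g;
    // ... pieces common to the granite family (e.g. extra scaling) ...
    if (moe) {
        // ... routed experts ...
        if (shared_experts) {
            // ... shared expert on the same FFN input, summed with routed output ...
        }
    }
    return g;
}

Graph build_graph(Arch arch) {
    switch (arch) {
        case Arch::LLAMA:              return build_llama_graph();
        case Arch::GRANITE:            return build_granite_graph(false, false);
        case Arch::GRANITE_MOE:        return build_granite_graph(true,  false);
        case Arch::GRANITE_MOE_SHARED: return build_granite_graph(true,  true);
    }
    throw std::runtime_error("unknown arch");
}

int main() { build_graph(Arch::GRANITE_MOE_SHARED); return 0; }
```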
That makes total sense. I juggle a lot of different frameworks, so the opinion of keeping the architecture names aligned is mostly just for mental clarity on my part (and anyone else comparing across formats/engines). I like the idea of a dedicated build function so we don't keep clogging up the llama builder.
Now that the full Granite 4 architecture is public, I can state that this shared MoE architecture is really just a building block for the full Granite 4, and we don't plan to release models with this as a standalone architecture. Given that, I'd be totally happy not merging this at all and bringing in the shared expert as part of the granitemoehybrid architecture.
EXAONE = auto()
GRANITE = auto()
GRANITE_MOE = auto()
GRANITE_MOE_SHARED = auto()
I don't think a new arch is needed here. Some other archs also have experts/shared experts, and they are controlled via n_ff_shexp.
Description
This PR adds support for the GraniteMoEShared architecture, matching the implementation in transformers. The model is an iteration on top of GraniteMoE and adds a shared expert to each MoE layer.
NOTE: There is not a public model with this architecture for testing yet, but it is a key building block for the just-released Granite 4 architecture.