Mamba2 Architecture — How it works and where can we improve it? #193184
Hi @Kahabk, This is a great conversation starter! Mamba2 and SSD are definitely the "hot topics" in architecture right now. Here is a simplified take on the points you raised:
Think of it this way: Transformers have "Photographic Memory" (they can look back at every token), while Mamba2 has a "Smart Notebook" (it compresses everything it has seen into a fixed-size summary). As we go to 100B+, the summary in that notebook has to be extremely accurate. If the "notebook" gets too crowded, the model might start losing fine details. We haven't seen a 100B pure Mamba yet because we're still figuring out how to keep that summary from getting "blurry."
Attention is the electric motor (perfect for precision but eats up battery/memory). Mamba is the gas engine (extremely efficient for long distances).
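To put rough numbers on the memory side of this analogy, here is a back-of-the-envelope sketch. It is only arithmetic under stated assumptions, not a benchmark: the model shape (`n_layers`, `n_heads`, `head_dim`, `d_state`) is hypothetical, 2 bytes per element assumes fp16/bf16, and the SSM estimate counts only the recurrent state (it ignores the small convolution buffer and all weights).

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (the factor 2) are kept for every past token in every layer,
    # so this grows linearly with seq_len.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, n_heads, head_dim, d_state, bytes_per_elem=2):
    # One (head_dim x d_state) state matrix per head per layer.
    # seq_len does not appear: the "notebook" has a fixed size.
    return n_layers * n_heads * head_dim * d_state * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM, D_STATE = 32, 32, 128, 128

for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(seq_len, N_LAYERS, N_HEADS, HEAD_DIM)
    ssm = ssm_state_bytes(N_LAYERS, N_HEADS, HEAD_DIM, D_STATE)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:.0f} MiB, "
          f"SSM state {ssm / 2**20:.0f} MiB")
```

The flip side is the "blurry notebook" issue: the state stays the same size no matter how much text has to fit into it.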
I hope this helps! Good luck with your research and exploring this further ))
I think you’re asking the right questions, especially around scaling and the role of the state.

On scaling, I agree with the point about d_state being the real bottleneck. Unlike transformers, where scaling is mostly about adding parameters and compute, Mamba2 shifts the problem toward representation capacity. If the state doesn’t scale appropriately, you end up compressing too aggressively, which leads to information loss. That “blurry notebook” analogy fits really well here. So the 100B+ question isn’t just about feasibility; it’s about whether we can scale the state without breaking efficiency.

On hybrid models, I don’t see them as defeating the purpose. The goal was never to completely eliminate attention, but to reduce its cost while preserving quality. Architectures like Jamba show that keeping a small amount of attention for precise retrieval while using Mamba for long-range efficiency is actually a practical compromise.

For long context, it
This is a great set of questions. You’re basically touching the exact fault lines where Mamba2 is still “promising but not fully settled.” I’ll try to break it down point by point.

1. Scaling: does SSD actually hold up?

Short answer: we don’t fully know yet at frontier scale, but structurally it should. Mamba2’s SSD (State Space Duality) gives you:

So in theory:

But the real question is not complexity; it’s optimization stability at scale. What’s still unclear:

My take:

2. Hybrid models (Mamba + Attention)

This is actually not a contradiction; it’s probably the endgame architecture. Why hybrids make sense:

They solve different problems. So models like Jamba aren’t “defeating the purpose”; they’re:

Think of it as:

Pure Mamba might struggle with:

3. Long context (100k+ tokens)

This is where Mamba2 should dominate, but benchmarks are still limited. Theoretical advantage:

Practical concerns:

Unlike attention, Mamba doesn’t “look back”; it summarizes. So the real question becomes:

We need more:

🛠️ 4. Fine-tuning (you’re absolutely right)

This is currently a weak spot. Why LoRA works well for transformers:

Mamba:

So:

This is a big research opportunity.

5. State initialization

Underrated point. Right now:

Potential improvements:

This is very similar to RNN-era problems, but more subtle.

6. MoE inside Mamba

This is honestly one of the most interesting directions. Why it fits well:

Challenges:

But conceptually:

Big picture

Mamba2 is not just “a transformer replacement”; it’s a shift in paradigm:

So the tradeoff is:

| Strength | Weakness |
| -- | -- |
| Linear scaling | Less precise retrieval |
| Memory efficient | Harder to interpret |
| Streaming-friendly | Less mature ecosystem |

My honest take

But we’re still early, especially on:

Really good questions btw. This is exactly the kind of discussion the Mamba ecosystem needs right now.
Discussion Type
Question
Discussion Content
Hey everyone 👋
I've been going through the Mamba2 paper and playing around with it
for a while now, and I genuinely find the architecture really
interesting. Wanted to start an open conversation about it because
I feel like there's a lot to unpack here.
So Mamba2 brought in this State Space Duality (SSD) idea which kind
of bridges the gap between SSMs and attention — which is honestly
a clever move. But I've been sitting with a few questions that I
can't fully answer on my own.
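
For anyone newer to this, here is a minimal NumPy sketch of what the "duality" means, as I understand it. It is my own toy illustration, not code from the paper: a single-head SSM with a scalar decay per step is evaluated once as a recurrence (linear in sequence length) and once as a masked, attention-like matrix product (quadratic). The function names, shapes, and random test data are all assumptions made for the example.

```python
import numpy as np


def ssm_recurrent(a, B, C, X):
    """h_t = a_t * h_{t-1} + outer(B_t, x_t);  y_t = C_t @ h_t."""
    T, N = B.shape
    P = X.shape[1]
    h = np.zeros((N, P))           # fixed-size state, independent of T
    Y = np.zeros((T, P))
    for t in range(T):
        h = a[t] * h + np.outer(B[t], X[t])
        Y[t] = C[t] @ h
    return Y


def ssm_attention_like(a, B, C, X):
    """Same map written as (decay_mask * (C @ B.T)) @ X."""
    log_cum = np.cumsum(np.log(a))
    # decay[t, s] = a_{s+1} * ... * a_t for s <= t, and 0 above the diagonal.
    decay = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    scores = C @ B.T               # plays the role of Q @ K.T
    return (decay * scores) @ X


rng = np.random.default_rng(0)
T, N, P = 16, 4, 3                 # sequence length, state size, head dim
a = rng.uniform(0.5, 1.0, size=T)  # per-step decay in (0, 1]
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
X = rng.normal(size=(T, P))

Y_rec = ssm_recurrent(a, B, C, X)
Y_att = ssm_attention_like(a, B, C, X)
print("max abs difference:", np.abs(Y_rec - Y_att).max())
```

Both functions compute the same outputs up to floating-point error; the difference is only in how the work is organized.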
Like — how far can this actually scale? Transformers have been
pushed to hundreds of billions of parameters and we know their
breaking points. But with Mamba2, are we confident the SSD layer
holds up at that scale too?
Also the hybrid approach (mixing Mamba2 with attention layers) seems
to be gaining traction. Jamba does this. But does mixing defeat the
whole point of moving away from attention in the first place?
Curious what people think.
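
To make the hybrid idea concrete, here is a tiny sketch of how such a layer stack could be laid out. The helper name and the 1-in-8 ratio are placeholders I picked for illustration, not a claim about Jamba's actual configuration.

```python
def hybrid_layer_schedule(n_layers, attention_every):
    """Return layer types with one attention layer per `attention_every` layers."""
    if attention_every < 1:
        raise ValueError("attention_every must be >= 1")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]


# Hypothetical 16-layer stack with one attention layer in every block of 8.
schedule = hybrid_layer_schedule(n_layers=16, attention_every=8)
attention_fraction = schedule.count("attention") / len(schedule)
print(schedule)
print(f"attention layers: {attention_fraction:.1%}")
```

With a layout like this, only the few attention layers pay a cost that grows with context length, which is why I'm unsure whether hybrids really "defeat the point".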
And long context — this is where Mamba2 should theoretically shine
over transformers. Has anyone actually stress tested it at 100k+
tokens in a real task? Would love to see real numbers not just
theoretical complexity arguments.
A few things I personally think could be improved:
- Fine-tuning: [...] but nobody really has a solid recipe for Mamba2 yet.
- State initialization: [...] I doubt they're optimal for every task.
- MoE inside Mamba: [...] layers. Feels like a natural next step.
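
On the fine-tuning point, this is roughly what a LoRA-style adapter looks like on a single frozen linear projection. It is a generic NumPy sketch, not a Mamba2 recipe: the class name, rank, and alpha are assumptions, and which Mamba2 weights are worth adapting is exactly the open question.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


rng = np.random.default_rng(1)
W = rng.normal(size=(128, 64))     # hypothetical projection: d_in=64, d_out=128
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 64))

# Because B starts at zero, the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ W.T))
print("trainable params:", layer.A.size + layer.B.size, "vs frozen:", W.size)
```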
Anyway I'm not an expert here, just someone genuinely curious and
trying to learn. If you've done any work on this or have strong
opinions — please jump in. Would love a real conversation about this.