Mamba2 Architecture — How it works and where can we improve it? #193184

Kahabk · 2026-04-20T13:08:13Z

Discussion Type

Question

Discussion Content

Hey everyone 👋

I've been going through the Mamba2 paper and playing around with it
for a while now, and I genuinely find the architecture really
interesting. Wanted to start an open conversation about it because
I feel like there's a lot to unpack here.

So Mamba2 brought in this State Space Duality (SSD) idea which kind
of bridges the gap between SSMs and attention — which is honestly
a clever move. But I've been sitting with a few questions that I
can't fully answer on my own.

Like — how far can this actually scale? Transformers have been
pushed to hundreds of billions of parameters and we know their
breaking points. But with Mamba2, are we confident the SSD layer
holds up at that scale too?

Also the hybrid approach (mixing Mamba2 with attention layers) seems
to be gaining traction. Jamba does this. But does mixing defeat the
whole point of moving away from attention in the first place?
Curious what people think.

And long context — this is where Mamba2 should theoretically shine
over transformers. Has anyone actually stress-tested it at 100k+
tokens on a real task? Would love to see real numbers, not just
theoretical complexity arguments.

A few things I personally think could be improved:

The fine-tuning story is still weak. LoRA works great for transformers, but nobody really has a solid recipe for Mamba2 yet.
State initialization feels underexplored. The defaults work, but I doubt they're optimal for every task.
I would love to see someone seriously try MoE routing inside Mamba2 layers. Feels like a natural next step.

Anyway I'm not an expert here, just someone genuinely curious and
trying to learn. If you've done any work on this or have strong
opinions — please jump in. Would love a real conversation about this.

abbosaliboev · 2026-04-20T17:53:08Z

Hi @Kahabk,

This is a great conversation starter! Mamba2 and SSD are definitely the "hot topics" in architecture right now. Here is a simplified take on the points you raised:

Can it scale to 100B+?
The main challenge with scaling Mamba2 isn't the number of parameters, but the "Notebook" (Hidden State) capacity.

Think of it this way: Transformers have "Photographic Memory" (they look at everything), while Mamba2 has a "Smart Notebook" (it summarizes everything).

As we go to 100B+, the summary in that notebook has to be extremely accurate. If the "notebook" gets too crowded, the model might start losing fine details. We haven't seen a 100B pure Mamba yet, likely in part because we're still figuring out how to keep that summary from getting "blurry."

Is the Hybrid approach "cheating"?
Actually, it’s just smart engineering—like a Hybrid Car.

Attention is the electric motor (perfect for precision but eats up battery/memory).

Mamba is the gas engine (extremely efficient for long distances).

By mixing them, models like Jamba aim for the best of both worlds: they keep a few attention layers to recall specific facts precisely, while the Mamba layers make long inputs faster and cheaper to process than in a pure Transformer. It’s not moving away from the point; it’s making it practical for real-world use.

100k+ Token Stress Tests
In practice, Mamba2’s big advantage is memory efficiency: its state stays a fixed size, so it won’t run your GPU out of memory the way a Transformer’s growing KV cache can at 100k tokens. However, it can suffer from something like the "Lost in the Middle" problem. It’s like reading a massive book in one sitting: you’ll remember the beginning and the end clearly, but the details on page 450 might get a bit fuzzy.

I hope this helps! Good luck with your research and exploring this further ))

1 reply

Kahabk Apr 21, 2026
Author

I really appreciated your analysis, and the “smart notebook vs. photographic memory” example made things much clearer. Your scaling argument seems valid: Mamba2 depends heavily on how efficiently it can compress context into its state, so a “blurry notebook” could become a real problem at scale. I also agree that the hybrid approach isn’t cheating, just efficient engineering. Your point about long context is interesting too, especially that the model does not run out of memory at 100k tokens.

Kahabk · 2026-04-21T16:06:49Z

Kahabk Apr 21, 2026
Author

That’s an excellent breakdown, especially the emphasis that the true limiting factor here is d_state scaling, not just parameter count. It underscores how Mamba2 shifts the goal from more computation to better representation; unless the state scales too, the 100B+ parameter question is still very much open.

As for hybrids, I like that idea as well. They don’t necessarily solve the core issue, but they are a good middle ground to pursue. I’m also glad you brought up the long-context limitations; it makes sense that a bounded state would hurt context retention in practice.

On fine-tuning, I’m currently building a package that enables LoRA-style tuning by targeting the projection layers surrounding the Mamba2 SSM core, since a recipe built for transformer layers wouldn’t carry over directly.

Abhishek-Coder-01 · 2026-04-21T18:39:40Z

I think you’re asking the right questions—especially around scaling and the role of the state.

On scaling, I agree with the point about d_state being the real bottleneck. Unlike transformers, where scaling is mostly about adding parameters and compute, Mamba2 shifts the problem toward representation capacity. If the state doesn’t scale appropriately, you end up compressing too aggressively, which leads to information loss. That “blurry notebook” analogy fits really well here. So the 100B+ question isn’t just about feasibility—it’s about whether we can scale the state without breaking efficiency.

On hybrid models, I don’t see them as defeating the purpose. The goal was never to completely eliminate attention, but to reduce its cost while preserving quality. Architectures like Jamba show that keeping a small amount of attention for precise retrieval while using Mamba for long-range efficiency is actually a practical compromise.

For long context, it

1 reply

dev-801000 Apr 21, 2026

Good

AbhinavPabbaraju · 2026-04-21T23:05:22Z

This is a great set of questions — you’re basically touching the exact fault lines where Mamba2 is still “promising but not fully settled.”

I’ll try to break it down point by point.

1. Scaling — does SSD actually hold up?

Short answer: we don’t fully know yet at frontier scale, but structurally it should.

Mamba2’s SSD (State Space Duality) gives you:

Linear time complexity in sequence length
Constant-size state at inference (no KV cache that grows with sequence length, unlike transformers)

So in theory:

Transformers → O(n²) attention bottleneck  
Mamba2 → O(n) selective scan
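
To make the O(n) vs O(n²) contrast concrete, here is a toy sketch of the duality idea in plain Python. It is not the Mamba2 kernel: the scalar state, the per-step decay `a[t]`, and the function names `ssm_recurrent` and `ssm_quadratic` are all assumptions chosen for illustration. Both functions compute the same outputs; one carries a running state, the other sums over a causal n x n weight matrix, much like masked attention.

```python
import math

def ssm_recurrent(a, b, c, x):
    """Linear-time form: carry one running state h through the sequence."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t      # state update
        ys.append(c_t * h)           # readout
    return ys

def ssm_quadratic(a, b, c, x):
    """Quadratic-time form: y[t] = sum over s <= t of M[t][s] * x[s]."""
    ys = []
    for t in range(len(x)):
        y_t = 0.0
        for s in range(t + 1):                      # causal mask: only s <= t
            decay = math.prod(a[s + 1 : t + 1])     # a[s+1] * ... * a[t], 1 if empty
            y_t += c[t] * decay * b[s] * x[s]       # M[t][s] * x[s]
        ys.append(y_t)
    return ys

# Arbitrary toy values, not taken from any real model.
a = [0.9, 0.8, 0.95, 0.7]
b = [1.0, 0.5, -1.0, 2.0]
c = [0.3, 1.0, 0.5, -0.2]
x = [1.0, 2.0, 3.0, 4.0]

y_rec = ssm_recurrent(a, b, c, x)
y_quad = ssm_quadratic(a, b, c, x)
print(all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(y_rec, y_quad)))
```

The recurrent form does a fixed amount of work per token, while the quadratic form touches every (t, s) pair. Which form is faster on real hardware depends on chunking and matmul efficiency, which this sketch does not model.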

But the real question is not complexity — it’s optimization stability at scale.

What’s still unclear:

Do very deep Mamba stacks train as stably as transformers?
Does the implicit state accumulation degrade over long horizons?
How does gradient flow behave compared to attention’s direct connections?

My take:
Mamba2 likely scales efficiently, but this is not yet proven at GPT-4 scale.

2. Hybrid models (Mamba + Attention)

This is actually not a contradiction — it’s probably the endgame architecture.

Why hybrids make sense:

Attention = content-based retrieval (random access)
Mamba = compressed sequential memory (streaming)

They solve different problems.

So models like Jamba aren’t “defeating the purpose” — they’re:

Using Mamba for efficiency + long context  
Using Attention for precise lookup

Think of it as:

Mamba = RAM compression
Attention = indexed lookup

Pure Mamba might struggle with:

Exact token recall
Sparse dependencies
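
As a rough illustration of the hybrid layout discussed in this section, the sketch below builds a layer schedule that interleaves one attention layer among Mamba layers. The function name `hybrid_schedule` and the `attn_every` parameter are hypothetical, and the ratio used here is an arbitrary example, not a claim about how Jamba or any other model is actually configured.

```python
def hybrid_schedule(n_layers, attn_every):
    """Return layer types with one attention layer per `attn_every` layers."""
    if attn_every < 1:
        raise ValueError("attn_every must be at least 1")
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba"
        for i in range(n_layers)
    ]

# Example: 8 layers with every 4th layer being attention (arbitrary choice).
sched = hybrid_schedule(8, 4)
print(sched)
print(sched.count("attention"), "attention layers out of", len(sched))
```

The fewer attention layers the schedule contains, the smaller the KV cache, at the cost of fewer places where exact lookup can happen.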

3. Long context (100k+ tokens)

This is where Mamba2 should dominate — but benchmarks are still limited.

Theoretical advantage:

No KV cache
No quadratic blowup
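
A back-of-envelope sketch of that memory argument, assuming 2-byte (fp16/bf16) elements. Every dimension below is a made-up example config, not the shape of any published model, and the SSM estimate ignores the small convolution state.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for every layer and every cached token (batch size 1)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, n_heads, head_dim, d_state, bytes_per_elem=2):
    """One (n_heads x head_dim x d_state) state per layer, independent of seq_len."""
    return n_layers * n_heads * head_dim * d_state * bytes_per_elem

# Hypothetical configs, chosen only to show how the two quantities scale.
for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, n_heads=64, head_dim=64, d_state=128)
    print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:10.1f} MiB, SSM state {ssm / 2**20:6.1f} MiB")
```

The KV cache grows linearly with the number of tokens, while the SSM state is the same size at every context length. That fixed size is also exactly why the state has to compress.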

Practical concerns:

State drift over long sequences
Information compression loss
Difficulty in retrieving specific past tokens

Unlike attention, Mamba doesn’t “look back” — it summarizes.

So the real question becomes:

How much information survives compression?
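
One way to build intuition for that question is a toy scalar recurrence, h_t = a * h_{t-1} + x_t, with a single impulse in the input. This is a deliberate simplification: real Mamba2 uses input-dependent decay and a much larger state, so the model can learn to hold on to important content. The function name `final_state_after_impulse` and the fixed decay `a` are assumptions for illustration.

```python
def final_state_after_impulse(a, seq_len, position):
    """Run h_t = a * h_{t-1} + x_t where x is 1.0 at `position` and 0.0 elsewhere."""
    h = 0.0
    for t in range(seq_len):
        x_t = 1.0 if t == position else 0.0
        h = a * h + x_t
    return h

# How much of a token seen at the start survives to the end of the sequence?
for a in (0.9, 0.99, 0.999):
    survived = final_state_after_impulse(a, seq_len=1000, position=0)
    print(f"decay a={a}: contribution of token 0 after 1000 steps = {survived:.3e}")
```

In this toy, a token's contribution shrinks as a raised to the number of steps since it was seen, so anything with a noticeably below 1 forgets distant tokens quickly. Retrieval-style benchmarks test whether a trained model avoids this in practice.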

We need more:

Retrieval-style benchmarks
Long-horizon reasoning tasks
Not just perplexity

4. Fine-tuning (you’re absolutely right)

This is currently a weak spot.

Why LoRA works well for transformers:

Clear weight decomposition (Q/K/V/FFN)

Mamba:

Has implicit state dynamics
Less modular parameter structure

So:

No standardized LoRA-style adapters yet
Fine-tuning often requires full-model updates

This is a big research opportunity
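
One route mentioned earlier in the thread is to adapt only the linear projections around the SSM core. Below is a minimal, dependency-free sketch of a LoRA-style wrapper for one such projection. The class name `LoRALinear` is hypothetical, and which Mamba2 projections to wrap is an assumption, not an established recipe.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen base weight, d_out x d_in
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]    # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))  # B @ (A @ x)
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# Toy projection: 16 inputs -> 8 outputs, adapted with rank 2.
rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(8)]
x = [rng.uniform(-1, 1) for _ in range(16)]
layer = LoRALinear(W, r=2, alpha=4)
print(layer.trainable_params(), "trainable vs", 8 * 16, "frozen")
```

Because `B` starts at zero, the wrapped layer initially matches the frozen projection exactly, so training starts from the pretrained behavior. Whether adapting only projections is enough for Mamba2 is the open question raised above.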

5. State initialization

Underrated point.

Right now:

Most implementations use simple/default initialization
But the initial hidden state directly affects sequence dynamics

Potential improvements:

Learned initial states
Task-conditioned initialization
Reset/gating strategies for long sequences

This is very similar to RNN-era problems — but more subtle.
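
A toy sketch of why the initial state matters, again using a scalar recurrence with a fixed decay as a stand-in for the real layer. Treating `h0` as a value that could be learned or set per task is the assumption here; the function name `scan` and all numbers are illustrative only.

```python
def scan(a, xs, h0=0.0):
    """Run h_t = a * h_{t-1} + x_t from initial state h0 and return every state."""
    h, states = h0, []
    for x_t in xs:
        h = a * h + x_t
        states.append(h)
    return states

a = 0.9
xs = [1.0, 0.0, 2.0, -1.0, 0.5]

zero_init = scan(a, xs)               # default: start from an empty state
learned_init = scan(a, xs, h0=3.0)    # hypothetical learned or task-specific start

# The gap between the two runs is h0 * a**(t + 1): large early, fading later.
gaps = [l - z for l, z in zip(learned_init, zero_init)]
print([round(g, 4) for g in gaps])
```

In this toy, the initial state mostly shapes the first stretch of the sequence and then decays away, which suggests initialization matters most for short inputs or right after a state reset.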

6. MoE inside Mamba

This is honestly one of the most interesting directions.

Why it fits well:

Mamba already processes tokens sequentially
MoE could add conditional computation without quadratic cost

Challenges:

Routing without attention signals
Maintaining stability in state updates

But conceptually:

Selective State + Selective Experts = very powerful
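
For reference, a standard top-k MoE router scores experts from the token's own hidden state, usually with a linear layer, so the router itself does not depend on attention. The sketch below shows that routing step for a single scalar "token". The expert functions, the logits, and the names `top_k_route` and `moe_forward` are hypothetical, and how such a block would be combined with Mamba2 state updates is left open.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy experts and router scores; in a real model both would be learned.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.0, 2.0, 1.0, -1.0]

route = top_k_route(logits, k=2)
out = moe_forward(3.0, logits, experts, k=2)
print(route, out)
```

Only the selected experts run for each token, which is where the conditional-compute savings come from.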

Big picture

Mamba2 is not just “a transformer replacement” — it’s a shift in paradigm:

Transformers → explicit pairwise interactions  
Mamba → implicit compressed state evolution

So the tradeoff is exact recall at quadratic cost versus linear cost with lossy compression.

My honest take

Short term → Hybrid models win
Mid term → Better training + adapters make Mamba competitive
Long term → Pure SSM-based models could replace attention in many domains

But we’re still early — especially on:

Scaling laws
Fine-tuning recipes
Real-world long-context benchmarks

Really good questions btw — this is exactly the kind of discussion the Mamba ecosystem needs right now.

Replies: 5 comments 3 replies
