Each of the 18 decoder layers:

- RMSNorm + GQA Attention (12 Q heads, 4 KV heads, head_dim=64; sketched below)
  - Mu-Guided Q/K/V bias from the previous layer
  - QK RMSNorm + RoPE (theta=10000)
  - Residual connection
- RMSNorm + Token-Routed MLP (4 SwiGLU experts, 512d each; sketched below)
  - Sort-and-split dispatch (bmm, fullgraph safe)
  - Zipf-balanced deterministic routing
  - Shared Lexical Expert (dense SwiGLU, all tokens)
  - Residual connection
- Mu-Guidance (applied after the MLP; sketched below)
  - `mu = clamp(mu_param + mu_proj(h), -2, 2)`
  - Flows to the next layer's attention
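A minimal sketch of the attention sub-block in PyTorch (2.4+ for `nn.RMSNorm`). The GQA shapes, QK RMSNorm, and RoPE theta follow the list above; the class name, `mu_to_qkv`, and the additive form of the mu bias are assumptions, not the framework's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_tables(seq_len, head_dim, theta=10000.0):
    # Standard RoPE frequency tables (theta=10000, per the list above).
    inv_freq = 1.0 / theta ** (torch.arange(0, head_dim, 2).float() / head_dim)
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (T, head_dim // 2)
    return freqs.cos(), freqs.sin()


def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); rotate channel pairs.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class MuGuidedGQAttention(nn.Module):
    def __init__(self, d_model=768, n_q=12, n_kv=4, head_dim=64):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q, n_kv, head_dim
        self.q_proj = nn.Linear(d_model, n_q * head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv * head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv * head_dim, bias=False)
        self.o_proj = nn.Linear(n_q * head_dim, d_model, bias=False)
        # Assumed: mu from the previous layer enters as an additive Q/K/V bias.
        self.mu_to_qkv = nn.Linear(d_model, (n_q + 2 * n_kv) * head_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim)  # QK RMSNorm, applied per head
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x, mu, cos, sin):
        B, T, _ = x.shape
        bq, bk, bv = self.mu_to_qkv(mu).split(
            [self.n_q * self.hd, self.n_kv * self.hd, self.n_kv * self.hd], dim=-1)
        q = (self.q_proj(x) + bq).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = (self.k_proj(x) + bk).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = (self.v_proj(x) + bv).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # GQA: each of the 4 KV heads serves n_q // n_kv = 3 query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```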
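A sketch of the token-routed MLP with sort-and-split dispatch. The expert shapes, SwiGLU activation, bmm dispatch, and shared lexical expert follow the list; the capacity handling and summation of routed and shared outputs are assumptions:

```python
class TokenRoutedMLP(nn.Module):
    def __init__(self, d_model=768, d_expert=512, n_experts=4, vocab_size=32000):
        super().__init__()
        self.n_experts = n_experts
        # Experts stored as stacked weights so each matmul is a single bmm.
        self.w_gate = nn.Parameter(torch.empty(n_experts, d_model, d_expert).normal_(std=0.02))
        self.w_up = nn.Parameter(torch.empty(n_experts, d_model, d_expert).normal_(std=0.02))
        self.w_down = nn.Parameter(torch.empty(n_experts, d_expert, d_model).normal_(std=0.02))
        # Shared lexical expert: a dense SwiGLU applied to every token
        # (combining by summation is an assumption).
        self.shared_gate = nn.Linear(d_model, d_expert, bias=False)
        self.shared_up = nn.Linear(d_model, d_expert, bias=False)
        self.shared_down = nn.Linear(d_expert, d_model, bias=False)
        # Deterministic vocab-id -> expert table; random placeholder here,
        # built offline by Zipf bin-packing (see the routing sketch below).
        self.register_buffer("route", torch.randint(0, n_experts, (vocab_size,)))

    def forward(self, h, token_ids):
        B, T, D = h.shape
        flat = h.reshape(B * T, D)
        expert = self.route[token_ids.reshape(-1)]
        # Sort-and-split: order tokens by expert id, then view as equal chunks.
        # This sketch assumes each expert receives exactly (B*T) // n_experts
        # tokens; a real implementation would pad or drop to a fixed capacity
        # so shapes stay static under torch.compile(fullgraph=True).
        order = torch.argsort(expert, stable=True)
        cap = (B * T) // self.n_experts
        grouped = flat[order].view(self.n_experts, cap, D)
        routed = torch.bmm(F.silu(torch.bmm(grouped, self.w_gate))
                           * torch.bmm(grouped, self.w_up), self.w_down)
        undone = torch.empty_like(flat)
        undone[order] = routed.reshape(B * T, D)  # undo the sort
        shared = self.shared_down(F.silu(self.shared_gate(flat)) * self.shared_up(flat))
        return (undone + shared).view(B, T, D)
```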
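The routing table itself can be built offline. A minimal greedy bin-packing sketch, assuming BPE ids are roughly frequency-ordered and follow a Zipf law; the exact balancing scheme is not specified above:

```python
def zipf_binpack_route(vocab_size=32000, n_experts=4, alpha=1.0):
    # Expected token frequency under a Zipf law: rank r gets weight 1 / r^alpha.
    weight = 1.0 / torch.arange(1, vocab_size + 1, dtype=torch.float64) ** alpha
    load = [0.0] * n_experts
    route = torch.empty(vocab_size, dtype=torch.long)
    # Greedy bin-packing: assign ids in decreasing weight to the lightest bin,
    # so each expert ends up with roughly equal expected traffic.
    for tok in range(vocab_size):
        e = min(range(n_experts), key=lambda i: load[i])
        route[tok] = e
        load[e] += weight[tok].item()
    return route


route = zipf_binpack_route()  # plug into TokenRoutedMLP.route above
```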
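Finally, the mu-guidance head is small; this sketch restates the clamp formula from the list, with `mu_param` as a learned per-channel offset and `mu_proj` as a linear map (both shapes are assumptions):

```python
class MuGuidance(nn.Module):
    # Emits the mu signal consumed by the *next* layer's attention bias.
    def __init__(self, d_model=768):
        super().__init__()
        self.mu_param = nn.Parameter(torch.zeros(d_model))  # assumed shape
        self.mu_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h):
        # mu = clamp(mu_param + mu_proj(h), -2, 2), as in the list above.
        return torch.clamp(self.mu_param + self.mu_proj(h), -2.0, 2.0)
```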
| Component | Value |
|---|---|
| Hidden size | 768 |
| Layers | 18 |
| Attention | GQA (12 Q heads / 4 KV heads) |
| MLP | Token-Routed (4 experts) |
| Expert size | 512 |
| Shared expert | Yes |
| Routing | Zipf bin-packing |
| Mu-Guidance | Yes |
| Vocab | 32k BPE |
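The table maps naturally onto a single config object; a hypothetical sketch whose field names are invented, not the framework's actual API:

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    hidden_size: int = 768
    n_layers: int = 18
    n_q_heads: int = 12          # GQA: 12 query heads
    n_kv_heads: int = 4          # GQA: 4 KV heads
    head_dim: int = 64
    n_experts: int = 4           # token-routed MLP
    expert_size: int = 512
    shared_expert: bool = True
    routing: str = "zipf_binpack"
    mu_guidance: bool = True
    vocab_size: int = 32000      # 32k BPE
```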
The framework also supports:
| Architecture | Type | Use Case |
|---|---|---|
| Dense SwiGLU | Standard MLP | Baseline comparison |
| Mixtral MoE | Learned router | MoE baseline |
| Mamba | SSM (O(N)) | Long sequences |
| RetNet | Retention | Efficient inference |
| RWKV | Linear attention | Low memory |
