```python
# mu flows between layers: layer N's mu guides layer N+1's attention
# mu_init: learnable parameter so layer 0 also gets guidance
q = q + mu_to_q(mu_prev)  # mu biases Q projection
k = k + mu_to_k(mu_prev)  # mu biases K projection
v = v + mu_to_v(mu_prev)  # mu biases V projection
```
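The biasing step above can be sketched end-to-end. This is a minimal NumPy illustration, not the library's implementation: `W_mu_q/k/v` stand in for the learnable `mu_to_q/k/v` projections, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mu, seq = 8, 4, 3

# Stand-ins for the learnable mu_to_q/k/v projections (assumed shapes).
W_mu_q = rng.normal(size=(d_mu, d_model)) * 0.02
W_mu_k = rng.normal(size=(d_mu, d_model)) * 0.02
W_mu_v = rng.normal(size=(d_mu, d_model)) * 0.02

# mu_init plays the role of the learnable parameter that guides layer 0,
# which has no previous layer's mu. Zeros here just for illustration.
mu_init = np.zeros(d_mu)

def mu_guided_qkv(x, Wq, Wk, Wv, mu_prev):
    """Project x to Q/K/V, then add the mu-derived bias to each projection."""
    q = x @ Wq + mu_prev @ W_mu_q  # mu biases Q projection
    k = x @ Wk + mu_prev @ W_mu_k  # mu biases K projection
    v = x @ Wv + mu_prev @ W_mu_v  # mu biases V projection
    return q, k, v

x = rng.normal(size=(seq, d_model))
Wq = Wk = Wv = np.eye(d_model)

q0, k0, v0 = mu_guided_qkv(x, Wq, Wk, Wv, mu_init)  # layer 0 uses mu_init
mu_prev = rng.normal(size=d_mu)                     # e.g. layer 0's output mu
q1, k1, v1 = mu_guided_qkv(x, Wq, Wk, Wv, mu_prev)  # layer 1 is mu-guided
```

With a zero `mu_init` the layer-0 projections reduce to the plain ones, while any nonzero `mu_prev` shifts all three projections of the next layer.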
3. Zipf-balanced Routing
```python
# Problem: token_id % num_experts concentrates frequent tokens
# Solution: sort by frequency, distribute round-robin
sorted_by_freq = vocab.argsort(by=frequency, descending=True)
for i in range(vocab_size):
    expert_assignment[sorted_by_freq[i]] = i % num_experts
# Result: each expert gets an equal mix of frequent and rare tokens
```
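The round-robin pass can be run on a toy vocabulary to see the balance it produces. The frequencies and expert count below are assumed for illustration and are not the library's defaults:

```python
from collections import Counter

# Toy Zipf-like vocabulary: token id -> frequency (assumed illustration).
freq = {0: 1000, 1: 500, 2: 333, 3: 250, 4: 200, 5: 166, 6: 142, 7: 125}
num_experts = 4

# Sort token ids by descending frequency, then deal them out round-robin.
sorted_by_freq = sorted(freq, key=freq.get, reverse=True)
expert_assignment = {}
for i, tok in enumerate(sorted_by_freq):
    expert_assignment[tok] = i % num_experts

# Every expert ends up with the same number of tokens, and each expert's
# slice spans the frequency spectrum (one frequent, one rare, etc.).
counts = Counter(expert_assignment.values())
```

Contrast with `token_id % num_experts`, which would pin whichever ids happen to be frequent onto the same few experts.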
Cluster Parallelism
```python
from complexity.parallel.cluster import ClusterConfig, ClusterModel

# Auto-configures TP × PP × DP based on model size and GPU count
config = ClusterConfig(tp_size=8, pp_size=1, dp_size=2)
model = ClusterModel(model, config)
```
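The "auto-configure" idea reduces to a divisibility constraint: `tp_size * pp_size * dp_size` must equal the world size, so DP can be derived from the GPU count once TP and PP are chosen. A minimal sketch, using a stand-in dataclass rather than the library's `ClusterConfig`:

```python
from dataclasses import dataclass

@dataclass
class ClusterConfigSketch:
    """Stand-in for ClusterConfig; the real class may differ."""
    tp_size: int
    pp_size: int
    dp_size: int

def auto_config(world_size: int, tp_size: int = 1, pp_size: int = 1) -> ClusterConfigSketch:
    """Derive dp_size so that tp * pp * dp covers all GPUs exactly."""
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return ClusterConfigSketch(tp_size, pp_size, world_size // model_parallel)

cfg = auto_config(world_size=16, tp_size=8, pp_size=1)  # matches the snippet above
```

With 16 GPUs and TP=8, PP=1, this yields DP=2, the same layout as the example config.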
| GPUs | Config | Effective Batch | Use Case |
|------|--------|-----------------|----------|
| 1    | DP=1   | 64              | Dev/test |
| 4    | DP=4   | 256             | Ablation |
| 8    | DP=8   | 512             | 400M training |
| 16   | DP=16  | 1,024           | 1B training |
| 64   | DP=64  | 4,096           | 7B training |
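The "Effective Batch" column is the data-parallel degree times a per-GPU batch of 64 (inferred from the 1-GPU row; an assumption, since the per-GPU batch size is not stated explicitly):

```python
per_gpu_batch = 64  # inferred from the DP=1 row

# Reproduce the Effective Batch column: effective = per_gpu_batch * DP.
effective = {dp: per_gpu_batch * dp for dp in (1, 4, 8, 16, 64)}
```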
Features
| Module | Description |
|--------|-------------|
| Attention | GQA/MHA/MQA with Mu-Guided KQV, QK Norm, RoPE |
| MLP | Token-Routed with Zipf-balanced routing, Fused gate+up |
| Mu-Guidance | Cross-layer contextual mu, learnable mu_init |
| Optimizers | AdamW, Muon (Newton-Schulz), muP scaling |
| Schedulers | Cosine, WSD (LLaMA 3), Linear, Constant |
| Parallel | FSDP v2, Tensor Parallel, Pipeline Parallel, 3D Cluster |