This issue tracks DeepSeek-V4 training support in Megatron Core.
DeepSeek-V4 extends DeepSeek-V3/V3.2 with hybrid compressed attention, mHC, updated routing, Muon-based training recipes, FP4 QAT, and million-token context training support.
References:
Last updated: 2026-06-10
Model Architecture
- Status: Partial Support
- Requested Features:
- Megatron-LM configs for DeepSeek-V4-Flash and DeepSeek-V4-Pro training
- CSA/HCA layer schedule wired into the model spec
- Hash routing support for the initial MoE layers
- mHC and MTP support
- Overall Status:
Hybrid CSA/HCA Attention
Hash routing MoE and ClampedSwiGLU
mHC Integration
- Status: Merged in Dev
- Implementation PRs:
- Tracking Issues:
Sequence and Context Support
Packed Sequence
- Status: In Progress (ETA: May 29)
- Related PRs:
- Requested Features:
- Packed sequence support for DSv4 Hybrid Attention
- End-to-end packed sequence support
- CUDA Graphs support for DSv4 THD
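
For context, a packed (THD) layout concatenates variable-length sequences along the token dimension and records their boundaries as cumulative sequence lengths. The sketch below shows only that bookkeeping. The helper names (`build_cu_seqlens`, `token_to_sequence`, `may_attend`) are hypothetical and are not Megatron Core APIs, and the DSv4 hybrid attention path may need additional per-sequence metadata that is not modeled here.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative boundaries [0, l0, l0 + l1, ...] for packed sequences."""
    if any(n <= 0 for n in seq_lens):
        raise ValueError("sequence lengths must be positive")
    return [0, *accumulate(seq_lens)]


def token_to_sequence(cu_seqlens):
    """Map each packed token position to the index of the sequence it belongs to."""
    ids = []
    for seq_idx in range(len(cu_seqlens) - 1):
        length = cu_seqlens[seq_idx + 1] - cu_seqlens[seq_idx]
        ids.extend([seq_idx] * length)
    return ids


def may_attend(q_pos, k_pos, seq_ids):
    """Causal attention restricted to tokens of the same packed sequence."""
    return seq_ids[q_pos] == seq_ids[k_pos] and k_pos <= q_pos


cu = build_cu_seqlens([3, 2, 4])
seq_ids = token_to_sequence(cu)
```

With lengths 3, 2, and 4, the first token of the second sequence (position 3) must not attend to position 2, which belongs to the first sequence; `may_attend(3, 2, seq_ids)` encodes exactly that constraint.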
Long-Context Training
- Status: In Progress (ETA: Jun 19)
- Related PRs:
- Requested Features:
- Context parallel support for DSv4 Hybrid Attention
- 64K to 1M context training curriculum validation
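
One common way to balance causal-attention work across context-parallel ranks is to split the sequence into `2 * cp_size` chunks and give each rank one chunk from the front and its mirror chunk from the back. The sketch below illustrates that assignment. Whether the DSv4 hybrid attention path uses the same scheme is an assumption, and `cp_chunk_indices` is a hypothetical helper, not an existing API.

```python
def cp_chunk_indices(seq_len, cp_size, cp_rank):
    """Return the token positions owned by one context-parallel rank.

    The sequence is cut into 2 * cp_size equal chunks. Rank r takes chunk r
    and chunk (2 * cp_size - 1 - r), so early (cheap) and late (expensive)
    causal positions are paired on every rank.
    """
    if not 0 <= cp_rank < cp_size:
        raise ValueError("cp_rank must be in [0, cp_size)")
    if seq_len % (2 * cp_size) != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // (2 * cp_size)
    front = cp_rank
    back = 2 * cp_size - 1 - cp_rank
    return (
        list(range(front * chunk, (front + 1) * chunk))
        + list(range(back * chunk, (back + 1) * chunk))
    )


# Example: 16 tokens split across 2 context-parallel ranks.
rank0 = cp_chunk_indices(16, 2, 0)
rank1 = cp_chunk_indices(16, 2, 1)
```

Every rank holds the same number of tokens, and together the ranks cover each position exactly once.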
Optimizer and Training Recipe
Muon Optimizer
- Status: In Progress
- Related PR:
- Requested Features:
- DeepSeek-V4 Muon/AdamW training recipe with Emerging Optimizers
FP4 QAT Recipe
- Status: Partial Support (needs DeepSeek-V4-specific validation)
- Requested Features:
- MXFP4 QAT for routed expert weights
- Simulated FP4-to-FP8 training path with FP32 master weights
- FP4 QK path for CSA indexer
- BF16 index-score path for top-k selector
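
As a rough illustration of what simulated MXFP4 quantization involves, the sketch below fake-quantizes a flat list of weights: each block of 32 values shares one power-of-two scale, and each value is snapped to the nearest E2M1 magnitude. It assumes the block size and element grid of the OCP Microscaling format and uses simplified rounding (ties go to the smaller magnitude). It is not the Megatron Core or Transformer Engine implementation, and the function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements sharing one scale (assumed MX block size)
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def fake_quant_block(block):
    """Quantize then dequantize one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        # Saturate to the largest representable magnitude before rounding.
        mag = min(abs(v) / scale, E2M1_GRID[-1])
        # Nearest grid point; ties resolve to the smaller magnitude in this sketch.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out


def fake_quant_mxfp4(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quant_block(values[start:start + BLOCK_SIZE]))
    return out
```

In a QAT setup the higher-precision master weights would remain the values being updated, with a function like this applied only in the forward pass; that wiring is outside the scope of the sketch.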
Performance Optimization
mHC Fusion Kernels
DSv4 Hybrid Attention Fusion Kernels
FP8 Indexer
- Status: In Progress
- Requested Features:
- FP8 indexer kernels
- Integration in MCore
- Related PR:
FP8 parameter + Muon + MXFP8
mHC support with EP Overlapping
- Status: Planning
- Related PR:
Bug Fixes
Megatron Bridge Examples
- Status: Needs implementation
- Requested Features:
- DeepSeek-V4-Flash proxy pretraining recipe
- DeepSeek-V4-Pro config/provider support
- Hugging Face ↔ Megatron checkpoint conversion coverage