DeepSeek-V4 training support #4468

This issue tracks DeepSeek-V4 training support in Megatron Core.

DeepSeek-V4 extends DeepSeek-V3/V3.2 with hybrid compressed attention (CSA/HCA), Manifold-Constrained Hyper-Connections (mHC), updated MoE routing, Muon-based training recipes, FP4 QAT, and million-token context training support.

References:

* Technical report: [https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek\_V4.pdf](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)  
* Hugging Face implementation PR: [https://github.com/huggingface/transformers/pull/45616](https://github.com/huggingface/transformers/pull/45616)  
* TileKernels mHC reference: [https://github.com/deepseek-ai/TileKernels/tree/main](https://github.com/deepseek-ai/TileKernels/tree/main)  
* Emerging Optimizers Muon reference: [https://github.com/NVIDIA-NeMo/Emerging-Optimizers](https://github.com/NVIDIA-NeMo/Emerging-Optimizers)

Last updated: 2026-06-10

## **Model Architecture**

* **Status:** Partial Support
* **Requested Features:**  
  * Megatron-LM configs for DeepSeek-V4-Flash and DeepSeek-V4-Pro training  
  * CSA/HCA layer schedule wired into the model spec  
  * Hash routing support for the initial MoE layers
  * mHC and MTP support
* **Overall Status:**
  - [x] Hybrid CSA/HCA Attention
  - [x] Hash routing MoE and ClampedSwiGLU
  - [x] mHC and MTP support

### **Hybrid CSA/HCA Attention**

* **Status:** Merged in Dev
* **Implementation PRs:**
  * \#4458
  * https://github.com/NVIDIA/Megatron-LM/pull/4894
* **Requested Features:**  
  * Compressed Sparse Attention (CSA)  
  * Heavily Compressed Attention (HCA)  
  * Integration with Megatron attention/module specs  
* **Related:** \#4252

### **Hash routing MoE and ClampedSwiGLU**

* **Status:** Merged in Dev
* **Implementation PR:** \#4481
* **Requested Features:**  
  * Hash routing MoE
  * ClampedSwiGLU
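
To make the two requested features concrete, here is a minimal pure-Python sketch. It is not the Megatron Core implementation from \#4481. It assumes that hash routing picks experts deterministically from the input token ID (no learned gate) and that ClampedSwiGLU bounds the gate and up pre-activations before the usual `silu(gate) * up` product. The function names, the multiplicative hash constant, the clamp convention, and the `limit` parameter are all hypothetical; the technical report and the PR define the actual behavior.

```python
import math


def silu(x):
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamped_swiglu(gate, up, limit):
    """Hypothetical clamped SwiGLU on scalar pre-activations."""
    g = min(gate, limit)              # assumption: gate is clamped from above only
    u = max(-limit, min(up, limit))   # assumption: up is clamped symmetrically
    return silu(g) * u


def hash_route(token_id, num_experts, top_k=1):
    """Hypothetical hash router: a fixed hash of the token ID selects top_k experts."""
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in [1, num_experts]")
    # Illustrative 32-bit multiplicative hash; the constant is not from the source.
    start = (token_id * 2654435761) % (2 ** 32) % num_experts
    # Consecutive expert indices keep the top_k choices distinct.
    return [(start + i) % num_experts for i in range(top_k)]
```

Because the expert choice depends only on the token ID, a router of this kind has no trainable parameters and needs no load-balancing loss for the layers that use it.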

### **mHC Integration**

* **Status:** Merged in Dev
* **Implementation PRs:**
  * https://github.com/NVIDIA/Megatron-LM/pull/2943
  * https://github.com/NVIDIA/Megatron-LM/pull/4190
  * https://github.com/NVIDIA/Megatron-LM/pull/4518
* **Tracking Issues:**
  * \#2890
  * \#2919

## Sequence and Context Support

### **Packed Sequence**

* **Status:** In Progress (ETA: May 29)
* **Related PRs:**
  * https://github.com/NVIDIA/Megatron-LM/pull/4816
  * https://github.com/NVIDIA/Megatron-LM/pull/4832
  * https://github.com/NVIDIA/Megatron-LM/pull/4359
  * https://github.com/NVIDIA/Megatron-LM/pull/5258
  * https://github.com/NVIDIA/Megatron-LM/pull/5011
* **Requested Features:**  
  * Packed sequence support for DSv4 Hybrid Attention
  * End-to-end packed sequence support
  * CUDA Graphs support for DSv4 THD
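
For readers unfamiliar with the THD layout referenced above, the following pure-Python sketch shows the general idea: variable-length sequences are concatenated along one token axis without padding, and a `cu_seqlens` prefix-sum array records where each sequence starts and ends. The helper names are hypothetical and the attention rule shown is plain causal attention, not the DSv4 hybrid CSA/HCA pattern.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate sequences and return (tokens, cu_seqlens)."""
    tokens = [tok for seq in seqs for tok in seq]
    # cu_seqlens[i] is the offset of sequence i; the last entry is the total length.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in seqs))
    return tokens, cu_seqlens


def segment_ids(cu_seqlens):
    """Map every packed token position to the index of its source sequence."""
    ids = []
    for i in range(len(cu_seqlens) - 1):
        ids.extend([i] * (cu_seqlens[i + 1] - cu_seqlens[i]))
    return ids


def can_attend(q_pos, k_pos, seg):
    """Causal attention restricted to tokens of the same packed sequence."""
    return seg[q_pos] == seg[k_pos] and k_pos <= q_pos
```

For example, packing `[[1, 2, 3], [4, 5], [6]]` yields `cu_seqlens = [0, 3, 5, 6]`, and token 3 (the first token of the second sequence) cannot attend to token 2 even though it precedes it in the packed buffer.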

### **Long-Context Training**

* **Status:** In Progress (ETA: Jun 19)
* **Related PRs:**
  * https://github.com/NVIDIA/Megatron-LM/pull/5087 
* **Requested Features:**  
  * Context parallel support for DSv4 Hybrid Attention
  * 64K to 1M context training curriculum validation

## **Optimizer and Training Recipe**

### Muon Optimizer
* **Status:** In Progress
* **Related PRs:**
  * \#4523
  * https://github.com/NVIDIA/Megatron-LM/pull/4771
  * https://github.com/NVIDIA/Megatron-LM/pull/4509
  * https://github.com/NVIDIA/Megatron-LM/pull/4987
* **Requested Features:**  
  * DeepSeek-V4 Muon/AdamW training recipe with Emerging Optimizers
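
As background for the recipe above: Muon replaces the raw momentum update of each 2-D weight matrix with an approximately orthogonalized version, computed by a few Newton-Schulz iterations. The sketch below is a dependency-free illustration, not the Emerging Optimizers or Megatron code. The default quintic coefficients are the ones commonly used in public Muon implementations; the DeepSeek-V4 recipe may use different coefficients or step counts, so treat them as an assumption.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximately orthogonalize matrix g (push its singular values toward 1)."""
    a, b, c = coeffs
    # Normalize by the Frobenius norm so every singular value starts at or below 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))        # A = X X^T
        xxt2 = matmul(xxt, xxt)              # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(xxt, xxt2)]    # b*A + c*A^2
        px = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)]
             for r1, r2 in zip(x, px)]           # X <- a*X + (b*A + c*A^2) X
    return x
```

With the default coefficients the singular values land in a band around 1 rather than converging exactly; passing `coeffs=(1.5, -0.5, 0.0)` with more steps gives the classical cubic iteration, which converges to an orthogonal matrix. In a Muon/AdamW split, parameters that are not 2-D matrices (for example embeddings and norm weights) are typically left to AdamW.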

### **FP4 QAT Recipe**

* **Status:** Partially supported; needs DeepSeek-V4-specific validation  
* **Requested Features:**  
  * MXFP4 QAT for routed expert weights  
  * Simulated FP4-to-FP8 training path with FP32 master weights  
  * FP4 QK path for CSA indexer  
  * BF16 index-score path for top-k selector
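
To illustrate what MXFP4 QAT of a weight tensor involves, here is a minimal fake-quantization sketch in pure Python. It follows the general MX convention of 32-element blocks that share a power-of-two scale, with each element rounded to the nearest E2M1 magnitude. It is not the Megatron Core or Transformer Engine code path: the function names are hypothetical, tie-breaking is simplified, and the FP8 compute path, FP32 master weights, and gradient handling (typically a straight-through estimator) are omitted.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2       # exponent of the largest E2M1 binade (4.0 to 6.0)
BLOCK_SIZE = 32     # elements sharing one scale


def fake_quant_block(block):
    """Quantize then dequantize one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared scale: align the block maximum with the top E2M1 binade.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])   # saturate at 6.0
        # Nearest representable magnitude; ties go to the smaller value here,
        # which may differ from hardware rounding modes.
        q = min(E2M1_VALUES, key=lambda r: abs(r - mag))
        out.append(math.copysign(q * scale, v))
    return out


def fake_quant_mxfp4(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quant_block(weights[i:i + BLOCK_SIZE]))
    return out
```

In a QAT forward pass, a function like `fake_quant_mxfp4` would be applied to the routed expert weights while the optimizer keeps updating the higher-precision master copy.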

## Performance Optimization

### mHC Fusion Kernels
* **Status:** Merged in Dev
* **Related PRs:**
  * https://github.com/NVIDIA/Megatron-LM/pull/3828
  * https://github.com/NVIDIA/Megatron-LM/pull/4624
  * [TransformerEngine] https://github.com/NVIDIA/TransformerEngine/pull/2790
* **Reference:** TileLang kernels from DeepSeek: [https://github.com/deepseek-ai/TileKernels/tree/main](https://github.com/deepseek-ai/TileKernels/tree/main)

### DSv4 Hybrid Attention Fusion Kernels
* **Status:** Merged in Dev
* The fusion kernels are released in cudnn-frontend 1.24.0: https://github.com/NVIDIA/cudnn-frontend
* **Related PR:** 
  * https://github.com/NVIDIA/Megatron-LM/pull/4894

### FP8 Indexer
* **Status:** In Progress
* **Requested Features:**  
  * FP8 indexer kernels
  * Integration in MCore
* **Related PR:** 

### FP8 parameter + Muon + MXFP8
* **Status:** In Progress
* **Related PR:**  https://github.com/NVIDIA/Megatron-LM/pull/4987

### mHC support with EP Overlapping
* **Status:** Planning
* **Related PR:**

## **Bug Fixes**
* **Related PR:**
  * https://github.com/NVIDIA/Megatron-LM/pull/5018

## **Megatron Bridge Examples**

* **Status:** Needs implementation  
* **Requested Features:**  
  * DeepSeek-V4-Flash proxy pretraining recipe  
  * DeepSeek-V4-Pro config/provider support  
  * Hugging Face ↔ Megatron checkpoint conversion coverage

