DeepSeek-V4 training support #4468

@sbhavani

This issue tracks DeepSeek-V4 training support in Megatron Core.

DeepSeek-V4 extends DeepSeek-V3/V3.2 with hybrid compressed attention, mHC, updated routing, Muon-based training recipes, FP4 QAT, and million-token context training support.

References:

Last updated: 2026-06-10

Model Architecture

  • Status: Partial Support
  • Requested Features:
    • Megatron-LM configs for DeepSeek-V4-Flash and DeepSeek-V4-Pro training
    • CSA/HCA layer schedule wired into the model spec
    • Hash routing support for the initial MoE layers
    • mHC and MTP support
  • Overall Status:
    • Hybrid CSA/HCA Attention
    • Hash routing MoE and ClampedSwiGLU
    • mHC and MTP support
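For readers unfamiliar with hash routing, the idea can be sketched as follows: instead of a learned router, each token is assigned to experts by a fixed hash of its token ID, so the assignment is deterministic and needs no router parameters. This is a generic illustration only. The function names (`hash_route`, `_mix32`), the multiplicative hash, and the "consecutive experts" rule for top-k are all assumptions made for the example and are not taken from DeepSeek-V4 or Megatron Core.

```python
def _mix32(x, seed):
    """Hypothetical 32-bit integer mixer (multiplicative hash plus xor-shift)."""
    x = (x ^ seed) & 0xFFFFFFFF
    x = (x * 2654435761) & 0xFFFFFFFF
    x ^= x >> 16
    return x


def hash_route(token_ids, num_experts, top_k=1, seed=0):
    """Return, for each token ID, a list of `top_k` distinct expert indices.

    The first expert is chosen by hashing the token ID; the remaining
    experts are the next indices modulo `num_experts` (an assumption made
    here only to guarantee distinct experts).
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    routes = []
    for tok in token_ids:
        start = _mix32(tok, seed) % num_experts
        routes.append([(start + j) % num_experts for j in range(top_k)])
    return routes


# The same token ID always maps to the same experts.
routes = hash_route([5, 17, 5], num_experts=8, top_k=2)
```

Because the mapping depends only on the token ID and the seed, it is identical across ranks and training steps, which is what makes it possible to run such a layer without a learned gate.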

Hybrid CSA/HCA Attention

Hash routing MoE and ClampedSwiGLU

mHC Integration

Sequence and Context Support

Packed Sequence

Long-Context Training

Optimizer and Training Recipe

Muon Optimizer

FP4 QAT Recipe

  • Status: Partially supported; needs DeepSeek-V4-specific validation
  • Requested Features:
    • MXFP4 QAT for routed expert weights
    • Simulated FP4-to-FP8 training path with FP32 master weights
    • FP4 QK path for CSA indexer
    • BF16 index-score path for top-k selector

Performance Optimization

mHC Fusion Kernels

DSv4 Hybrid Attention Fusion Kernels

FP8 Indexer

  • Status: In Progress
  • Requested Features:
    • FP8 indexer kernels
    • Integration in MCore
  • Related PR:

FP8 parameter + Muon + MXFP8

mHC support with EP Overlapping

  • Status: Planning
  • Related PR:

Bug Fixes

Megatron Bridge Examples

  • Status: Needs implementation
  • Requested Features:
    • DeepSeek-V4-Flash proxy pretraining recipe
    • DeepSeek-V4-Pro config/provider support
    • Hugging Face ↔ Megatron checkpoint conversion coverage
