
NVFP4 MoE Training Status #1962

@syed-ahmed


This issue keeps centralized tracking of NVFP4 training for the DeepSeek-V3 and LLAMA 4 models in torchtitan. We will keep this status updated.

Recipes

(Recipe status shown as an image in the original issue.)

Kernels

  • NVFP4 GEMM

    • Currently supported through cuBLAS in torch.nn.functional.scaled_mm
    • A Triton or CuTe DSL kernel would enable composability with features like Symmetric Memory. A CuTe DSL kernel is available in the current release.
  • NVFP4 Grouped GEMM

  • NVFP4 GEMM/Grouped GEMM variants for Blackwell Ultra

    • CuTe DSL kernel will be available in a later release.
    • cuBLAS support planned for future CUDA release.
  • Random Hadamard Transform

    • Currently lacking a native implementation. An implementation is available in Transformer Engine (TE), but we need to consider composability and maintenance.
  • Quantize with Stochastic Rounding

    • Currently lacking a native implementation. An implementation is available in Transformer Engine (TE), but we need to consider composability and maintenance.
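
As background for the Random Hadamard Transform item above, here is a minimal NumPy sketch of what the transform computes: random sign flips followed by an orthonormal Hadamard rotation. Function names and the seeding scheme are illustrative; this is not the TE kernel.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x, seed=0):
    """Apply a Random Hadamard Transform along the last axis of x.

    Multiplies by diag(s) for random signs s, then by H / sqrt(n), so the
    overall transform is orthonormal (norm-preserving).
    """
    n = x.shape[-1]
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)
    return (x * s) @ H.T
```

Because the transform is orthonormal, it preserves norms while spreading outlier values across the block, which is the property that makes it useful before low-precision quantization.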
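
Likewise, stochastic rounding can be sketched in NumPy. The grid below lists the standard FP4 (E2M1) representable values used for NVFP4 elements; the helper is illustrative, not the TE implementation, and omits the per-block scale factors a real quantizer would apply first.

```python
import numpy as np

# FP4 (E2M1) representable values, as used for NVFP4 elements
GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round(x, rng):
    """Round each value of x to one of its two neighboring GRID points,
    rounding up with probability equal to the fractional position inside
    the interval, so the quantization is unbiased: E[q] == x."""
    x = np.clip(x, GRID[0], GRID[-1])
    hi = np.clip(np.searchsorted(GRID, x), 1, len(GRID) - 1)
    lo = hi - 1
    g_lo, g_hi = GRID[lo], GRID[hi]
    p_up = (x - g_lo) / (g_hi - g_lo)   # distance-weighted probability
    up = rng.random(x.shape) < p_up
    return np.where(up, g_hi, g_lo)
```

The unbiasedness (E[q] == x) is the reason stochastic rounding is attractive for low-precision training: rounding errors average out across steps instead of accumulating as a systematic bias.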

Execution Plan

Test Plan

  • Functionality
    • E2E Convergence runs
      • LLAMA 4
      • DeepSeek-V3
  • Performance
    • Microbenchmarks for cuBLAS vs CuTe DSL GEMM/Grouped GEMM kernels
    • E2E Performance benchmarks
      • LLAMA 4
      • DeepSeek-V3
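
For the microbenchmark comparison, a minimal timing harness might look like the following. The dense NumPy matmul is only a stand-in for the NVFP4 kernels under test; real GPU measurements would additionally need device synchronization (or CUDA events) around the timed region.

```python
import time
import numpy as np

def benchmark(fn, warmup=5, iters=20):
    """Time a kernel-like callable: warm up, then return mean seconds/iter.
    (For GPU kernels, synchronize the device before reading the clock.)"""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-in workload in place of the NVFP4 GEMM under test
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
t = benchmark(lambda: a @ b)
```

Running the same harness over both the cuBLAS and CuTe DSL paths at matched shapes gives an apples-to-apples mean-latency comparison before moving to E2E benchmarks.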
