
NVFP4 MoE Training Status #1962

@syed-ahmed


This issue keeps centralized tracking of NVFP4 training for the DeepSeek-V3 and LLAMA 4 models in torchtitan. We will keep this status updated.

Recipes

(Recipe status shown as an image in the original issue.)

Kernels

  • NVFP4 GEMM

    • Currently supported through cuBLAS in torch.nn.functional.scaled_mm
    • A Triton or CuTe DSL kernel would enable composability with features like Symmetric Memory. A CuTe DSL kernel is available in the current release.
  • NVFP4 Grouped GEMM

  • NVFP4 GEMM/Grouped GEMM variants for Blackwell Ultra

    • CuTe DSL kernel will be available in a later release.
    • cuBLAS support planned for future CUDA release.
  • Random Hadamard Transform

    • Currently lacking a native implementation. An implementation is available in Transformer Engine (TE), but we need to consider composability and maintenance.
  • Quantize with Stochastic Rounding

    • Currently lacking a native implementation. An implementation is available in Transformer Engine (TE), but we need to consider composability and maintenance.
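
As background for the Random Hadamard Transform item above, here is a minimal NumPy sketch of what the transform computes: random sign flips followed by an orthonormal Hadamard rotation. Function names and the seeding scheme are illustrative; this is not the TE kernel.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x, seed=0):
    """Apply a Random Hadamard Transform along the last axis of x.

    Multiplies by diag(s) for random signs s, then by H / sqrt(n), so the
    overall transform is orthonormal (norm-preserving).
    """
    n = x.shape[-1]
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)
    return (x * s) @ H.T
```

Because the transform is orthonormal, it preserves norms while spreading outlier values across the block, which is the property that makes it useful before low-precision quantization.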
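
Likewise, stochastic rounding can be sketched in NumPy. The grid below lists the standard FP4 (E2M1) representable values used for NVFP4 elements; the helper is illustrative, not the TE implementation, and omits the per-block scale factors a real quantizer would apply first.

```python
import numpy as np

# FP4 (E2M1) representable values, as used for NVFP4 elements
GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round(x, rng):
    """Round each value of x to one of its two neighboring GRID points,
    rounding up with probability equal to the fractional position inside
    the interval, so the quantization is unbiased: E[q] == x."""
    x = np.clip(x, GRID[0], GRID[-1])
    hi = np.clip(np.searchsorted(GRID, x), 1, len(GRID) - 1)
    lo = hi - 1
    g_lo, g_hi = GRID[lo], GRID[hi]
    p_up = (x - g_lo) / (g_hi - g_lo)   # distance-weighted probability
    up = rng.random(x.shape) < p_up
    return np.where(up, g_hi, g_lo)
```

The unbiasedness (E[q] == x) is the reason stochastic rounding is attractive for low-precision training: rounding errors average out across steps instead of accumulating as a systematic bias.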

Execution Plan

Test Plan

  • Functionality
    • E2E Convergence runs
      • LLAMA 4
      • DeepSeek-V3
  • Performance
    • Microbenchmarks for cuBLAS vs CuTe DSL GEMM/Grouped GEMM kernels
    • E2E Performance benchmarks
      • LLAMA 4
      • DeepSeek-V3
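
For the microbenchmark comparison, a minimal timing harness might look like the following. The dense NumPy matmul is only a stand-in for the NVFP4 kernels under test; real GPU measurements would additionally need device synchronization (or CUDA events) around the timed region.

```python
import time
import numpy as np

def benchmark(fn, warmup=5, iters=20):
    """Time a kernel-like callable: warm up, then return mean seconds/iter.
    (For GPU kernels, synchronize the device before reading the clock.)"""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-in workload in place of the NVFP4 GEMM under test
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
t = benchmark(lambda: a @ b)
```

Running the same harness over both the cuBLAS and CuTe DSL paths at matched shapes gives an apples-to-apples mean-latency comparison before moving to E2E benchmarks.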
