Description
This issue centralizes tracking of NVFP4 training for the DeepSeek-V3 and LLAMA 4 models in torchtitan. We will keep the status here updated.
Recipes
- Current pretraining recipe (https://arxiv.org/pdf/2509.25149)
Kernels
- NVFP4 GEMM
  - Currently supported through cuBLAS in torch.nn.functional.scaled_mm.
  - A Triton or CuTe DSL kernel will enable composability with features like Symmetric Memory. A CuTe DSL kernel is available in the current release.
- NVFP4 Grouped GEMM
  - Currently lacking integration into PyTorch. An earlier attempt was made but was not satisfactory: [Draft][CUDA] Upgrade torch._scaled_grouped_mm to SM100+ pytorch#156806
    - NVFP4 grouped gemm (via torch.nn.functional.scaled_grouped_mm)
      - NVFP4 grouped gemm support via. FBGEMM kernels pytorch#166308
  - A CuTe DSL kernel is available in the current release. A Triton or CuTe DSL kernel will enable composability with features like Symmetric Memory.
  - BF16 GroupedGemm CuTe DSL integration into Inductor: [Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel pytorch#165036, Harden CuTeDSL Inductor Path pytorch#165785
- NVFP4 GEMM/Grouped GEMM variants for Blackwell Ultra
  - A CuTe DSL kernel will be available in a later release.
  - cuBLAS support is planned for a future CUDA release.
- Random Hadamard Transform
  - Currently lacking a native implementation. An implementation is available in TE, but we need to consider composability and maintenance.
- Quantize with Stochastic Rounding
  - Currently lacking a native implementation. An implementation is available in TE, but we need to consider composability and maintenance.
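For readers unfamiliar with the last two items, here is a minimal pure-Python sketch of what a random Hadamard transform and stochastic rounding onto the FP4 E2M1 grid compute. It is a reference for the math only, not the TE implementation or a proposed PyTorch API: the function names are hypothetical, the block scale is kept as a plain Python float rather than being encoded the way NVFP4 stores it, and the 16-element block in the demo is an illustrative choice.

```python
import math
import random

# Non-negative magnitudes representable in FP4 E2M1, the element type of NVFP4.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def random_hadamard_transform(x, signs):
    """Return H (signs * x) / sqrt(n), with H the Sylvester Hadamard matrix."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = [xi * si for xi, si in zip(x, signs)]  # random sign flip
    h = 1
    while h < n:  # in-place fast Walsh-Hadamard butterflies
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in y]


def stochastic_round_e2m1(value, rng):
    """Round to one of the two neighbouring grid points, unbiased in expectation."""
    sign = -1.0 if value < 0 else 1.0
    mag = min(abs(value), E2M1_GRID[-1])  # saturate at the largest magnitude
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)  # closer to hi means more likely hi
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached: mag always lies inside the grid range


def quantize_block(block, rng):
    """Scale a block so its max magnitude maps to 6.0, then round stochastically."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = [stochastic_round_e2m1(v / scale, rng) for v in block]
    return codes, scale  # dequantize with code * scale


# Demo on one 16-element block (block size chosen for illustration).
rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(16)]
signs = [rng.choice((-1.0, 1.0)) for _ in block]
rotated = random_hadamard_transform(block, signs)
codes, scale = quantize_block(rotated, rng)
print(codes, scale)
```

Because the normalized Hadamard matrix is its own inverse, applying the transform again with all-ones signs and then re-applying the sign vector recovers the input, which is a convenient correctness check for any native kernel.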
Execution Plan
- Need RFC for CuTe DSL NVFP4 GEMM/Grouped GEMM in torch.nn.functional.scaled_mm in PyTorch: CuTe DSL NVFP4 GEMM/Grouped GEMM kernels pytorch#166611
- TorchAO Execution Plan: NVFP4 Training Tracker ao#3293
Test Plan
- Functionality
  - E2E Convergence runs
    - LLAMA 4
    - DeepSeek-V3
- Performance
  - Microbenchmarks for cuBLAS vs CuTe DSL GEMM/Grouped GEMM kernels
  - E2E Performance benchmarks
    - LLAMA 4
    - DeepSeek-V3
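One possible shape for the microbenchmark comparison is sketched below. Everything in it is an assumption for illustration: `benchmark`, `compare`, and `make_inputs` are hypothetical helpers, and the two pure-Python matmuls are stand-ins for the cuBLAS-backed and CuTe DSL-backed kernels. A real GPU run would also need to synchronize the device (for example with `torch.cuda.synchronize()`) or use CUDA events, since wall-clock timing of asynchronous launches is otherwise misleading.

```python
import random
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Return the median wall-clock time of fn(*args) in microseconds."""
    for _ in range(warmup):  # discard cold-start iterations
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def compare(candidates, shapes, make_inputs):
    """Time every candidate on every shape: {shape: {name: median_us}}."""
    results = {}
    for shape in shapes:
        inputs = make_inputs(shape)  # same inputs for all candidates
        results[shape] = {
            name: benchmark(fn, *inputs) for name, fn in candidates.items()
        }
    return results


# Stand-in kernels (hypothetical): two equivalent pure-Python matmuls.
def matmul_zip(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul_indexed(a, b):
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def make_inputs(shape):
    """Build random (M x K) and (K x N) matrices for an (M, K, N) shape."""
    m, k, n = shape
    rng = random.Random(0)
    a = [[rng.random() for _ in range(k)] for _ in range(m)]
    b = [[rng.random() for _ in range(n)] for _ in range(k)]
    return a, b


shapes = [(8, 16, 8), (16, 32, 16)]
candidates = {"baseline": matmul_zip, "candidate": matmul_indexed}
results = compare(candidates, shapes, make_inputs)
for shape, timings in results.items():
    print(shape, timings)
```

Reporting the median over several iterations after a warmup keeps one-off stalls from dominating the comparison; the same table layout extends to grouped GEMM by making each shape a list of per-group sizes.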
Status
In Progress