# CUTLASS 3.9.2

_CUTLASS 3.9.2 - May 2025_
**This repository fast-follows the NVIDIA CUTLASS repository, adding SYCL support for Intel GPUs.**
The CUDA support is unmodified from upstream and can be used interchangeably.

the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
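
The im2col view makes the relationship concrete: gathering every input patch into a row of a matrix turns the convolution into a single matrix product, while implicit GEMM forms those patches on the fly instead of materializing them. A NumPy sketch of the idea (conceptual only, not CUTLASS code):

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct 2-D cross-correlation: x is (H, W), w is (R, S), no padding, stride 1."""
    H, W = x.shape
    R, S = w.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + R, j:j + S] * w)
    return out

def conv2d_as_gemm(x, w):
    """Same convolution expressed as a GEMM: gather input patches (im2col),
    then multiply by the flattened filter. An *implicit* GEMM forms these
    patch rows on the fly rather than building the (P*Q, R*S) matrix."""
    H, W = x.shape
    R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    patches = np.stack([x[i:i + R, j:j + S].ravel()
                        for i in range(P) for j in range(Q)])  # (P*Q, R*S)
    return (patches @ w.ravel()).reshape(P, Q)
```

Both paths produce the same result; the GEMM formulation is what lets one optimized matrix-multiply pipeline serve convolutions as well.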

See the [Quick Start Guide](./media/docs/cpp/quickstart.md) to get started quickly.

See the [functionality docs](./media/docs/cpp/functionality.md) for a more comprehensive
list of kernel-level features, data types, and instructions supported by CUTLASS on each GPU
architecture.
- [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
- [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](./examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
- [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
- [Grouped GEMM with nvfp4 datatype](./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu).
- [Sparse Blockscaled GEMM with mxfp8 input datatype and BF16 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu).
- [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu).
* Set of unit tests that demonstrate the usage of both [sparse](./test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](./test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
- [Blockscaled Sparse GEMM with NVFP4 input data type](./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu)
- [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu)
* Set of unit tests that demonstrate the usage of [sparse](./test/unit/gemm/device/sm100_sparse_tensorop_gemm) and [blockscaled sparse](./test/unit/gemm/device/sm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM.
* A new Multi-head Latent Attention (MLA) CUTLASS [example](./examples/77_blackwell_fmha/) for the SM100 Blackwell architecture covers the flashMLA-like weight-absorbed decoding use-case.
* A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS [example](./examples/77_blackwell_fmha/) to show how the five backward-pass MMAs can be fused into a single kernel for high performance.
* A new [distributed GEMM example](./examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu) for the SM100 Blackwell architecture.
* Enhancements and new support for block-wise and group-wise GEMM on the Hopper and Blackwell architectures:
  - Enhancement of [blockwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for the Hopper architecture.
  - Enhancement of [groupwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for the Hopper architecture.
  - Support for [grouped GEMM with blockwise and groupwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for the Hopper architecture.
  - Support for [group-wise GEMM](./tools/profiler/src/blockwise_gemm_operation_profiler.cu) in the CUTLASS profiler.
  - Support for [blockwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for the Blackwell architecture.
  - Support for [groupwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for the Blackwell architecture.
  - Support for [grouped GEMM with blockwise](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for the Blackwell architecture.
* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
  - Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
  - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
  - Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
  - A more detailed introduction and examples can be found in [profiler.md](./media/docs/cpp/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
* Support `void` as the D element in sm100 kernel epilogues.
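
Several of the entries above are block-scaled narrow-precision GEMMs: formats such as NVFP4 and MXFP8 store low-precision values together with one shared scale factor per small block of elements along K. The numerics can be sketched in NumPy (a conceptual model only, not CUTLASS's implementation; the block size and quantization levels below are illustrative):

```python
import numpy as np

BLOCK = 4  # elements sharing one scale factor along K (illustrative; real formats use e.g. 16 or 32)

def quantize_blockscaled(a, levels=7):
    """Per-block symmetric quantization along the last (K) axis:
    each BLOCK-wide group stores small integers plus one float scale."""
    m, k = a.shape
    blocks = a.reshape(m, k // BLOCK, BLOCK)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / levels
    scales[scales == 0] = 1.0          # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales)      # integers in [-levels, levels]
    return q, scales

def blockscaled_gemm(qa, sa, qb, sb):
    """GEMM on quantized operands (A is (M, K), B is (N, K)): dequantize each
    block with its scale factor, then accumulate in full precision."""
    a = (qa * sa).reshape(qa.shape[0], -1)
    b = (qb * sb).reshape(qb.shape[0], -1)
    return a @ b.T
```

The quantized product tracks the full-precision `A @ B.T` up to the per-block quantization error; the hardware kernels fold the dequantization into the tensor-core pipeline rather than materializing it.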
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.

Layouts can also be combined and manipulated via functional composition, on which
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
- [Quick Start Guide](./media/docs/cpp/quickstart.md) - basics of building and running CUTLASS
- [Functionality](./media/docs/cpp/functionality.md) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](./media/docs/cpp/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
- [CUTLASS 3.x Design](./media/docs/cpp/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
- [GEMM API 3.x](./media/docs/cpp/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
- [GEMM API 2.x](./media/docs/cpp/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
- [Implicit GEMM Convolution](./media/docs/cpp/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
- [Code Organization](./media/docs/cpp/code_organization.md) - describes the organization and contents of the CUTLASS project
- [Terminology](./media/docs/cpp/terminology.md) - describes terms used in the code
- [Programming Guidelines](./media/docs/cpp/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
- [Fundamental types](./media/docs/cpp/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- [Layouts](./media/docs/cpp/layout.md) - describes layouts of matrices and tensors in memory
- [Tile Iterators](./media/docs/cpp/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Utilities](./media/docs/cpp/utilities.md) - additional templates used to facilitate rapid development
- [Dependent kernel launch](./media/docs/cpp/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent kernels
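
The Layouts document and the CuTe material above center on one abstraction: a layout is a (shape, stride) pair that maps logical coordinates to memory offsets. A minimal Python sketch of that mapping (illustrative only; CuTe's real `Layout` is a C++ type with hierarchical shapes and strides):

```python
# Sketch of the layout idea: the memory offset of a logical coordinate is
# the dot product of that coordinate with the layout's strides. One
# abstraction covers row-major, column-major, and (via hierarchical
# composition in CuTe) tiled layouts.

def offset(coord, strides):
    """Map a logical coordinate to a linear memory offset."""
    return sum(c * s for c, s in zip(coord, strides))

# A 4x6 matrix stored column-major has shape (4, 6) and strides (1, 4);
# the same matrix row-major has strides (6, 1).
print(offset((2, 3), (1, 4)))  # column-major: 2*1 + 3*4 -> 14
print(offset((2, 3), (6, 1)))  # row-major:    2*6 + 3*1 -> 15
```

Because the strides fully determine the mapping, manipulating a layout (transposing, tiling, composing) reduces to arithmetic on shapes and strides rather than data movement.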