
Commit dee3370

Cutlass 3.9.2 (#371)
This PR adds the changes from Cutlass 3.9.2
2 parents: f6ca3e7 + f909ce4; commit dee3370

351 files changed: +46309 additions, -7732 deletions


CHANGELOG.md

Lines changed: 58 additions & 29 deletions
(Large diff not rendered.)

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -765,6 +765,7 @@ target_include_directories(
   CUTLASS
   SYSTEM INTERFACE
   $<BUILD_INTERFACE:${CUDA_TOOLKIT_ROOT_DIR}/include>
+  $<BUILD_INTERFACE:${CUDA_TOOLKIT_ROOT_DIR}/include/cccl>
   )

 install(
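
Note on the change above: newer CUDA toolkits (12.8 and later) ship the CCCL headers (libcu++, CUB, Thrust) in a separate include/cccl subdirectory, which CUTLASS and CuTe headers reference via <cuda/std/...> includes. A minimal consumer-side sketch of what the added include path enables; the static_assert is illustrative and not part of this commit:

// Any target linking the CUTLASS interface target can now resolve libcu++
// headers from <toolkit>/include/cccl without extra configuration.
#include <cuda/std/type_traits>
#include "cutlass/numeric_types.h"

static_assert(cuda::std::is_trivially_copyable<cutlass::half_t>::value,
              "cutlass::half_t is a trivially copyable 16-bit type");

int main() { return 0; }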

PUBLICATIONS.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@

 - ["ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"](https://arxiv.org/abs/2502.02631). Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra. _arXiv_, February 2025.

+- ["Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light"](https://arxiv.org/abs/2504.16922). Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi. _arXiv_, April 2025.
+
 ## 2024

 - ["DeepSeek-V3 Technical Report"](https://arxiv.org/abs/2412.19437). DeepSeek-AI. _arXiv_, December 2024.

README.md

Lines changed: 49 additions & 31 deletions
@@ -1,8 +1,8 @@
 ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")

-# CUTLASS 3.9.0
+# CUTLASS 3.9.2

-_CUTLASS 3.9.0 - March 2025_
+_CUTLASS 3.9.2 - May 2025_

 **This repository fast-follows NVIDIA CUTLASS repository adding SYCL support for Intel GPUs.**
 The CUDA support is unmodified from upstream and can be used interchangeably.
@@ -39,9 +39,9 @@ the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
 operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
 This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

-See the [Quick Start Guide](./media/docs/quickstart.md) to get started quickly.
+See the [Quick Start Guide](./media/docs/cpp/quickstart.md) to get started quickly.

-See the [functionality docs](./media/docs/functionality.md) for a more comprehensive
+See the [functionality docs](./media/docs/cpp/functionality.md) for a more comprehensive
 list of kernel level features, data types, instructions, and minimum supported by CUTLASS on each GPU
 architecture.

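Note: the implicit GEMM formulation referenced in this hunk can be made concrete with a small host-side reference; this is a plain-C++ sketch rather than CUTLASS code. It assumes NHWC activations, KRSC filters, unit stride, and no padding; all names are illustrative:

#include <cstdio>
#include <vector>

// Conv2d as an implicit GEMM: C[GEMM_M, GEMM_N] = A x B with
//   GEMM_M = N*P*Q (output pixels), GEMM_N = K (filters), GEMM_K = C*R*S,
// where the A operand is the im2col view of x, gathered on the fly.
void conv2d_as_gemm(std::vector<float> const& x,  // [N][H][W][C]
                    std::vector<float> const& w,  // [K][R][S][C]
                    std::vector<float>& y,        // [N][P][Q][K]
                    int N, int H, int W, int C, int K, int R, int S) {
  int P = H - R + 1, Q = W - S + 1;
  for (int n = 0; n < N; ++n)
    for (int p = 0; p < P; ++p)
      for (int q = 0; q < Q; ++q)
        for (int k = 0; k < K; ++k) {
          float acc = 0.f;
          // GEMM reduction over the flattened (r, s, c) mode: the implicit im2col gather.
          for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
              for (int c = 0; c < C; ++c)
                acc += x[((n * H + p + r) * W + q + s) * C + c] *
                       w[((k * R + r) * S + s) * C + c];
          y[((n * P + p) * Q + q) * K + k] = acc;
        }
}

int main() {
  int N = 1, H = 4, W = 4, C = 2, K = 3, R = 3, S = 3;
  int P = H - R + 1, Q = W - S + 1;
  std::vector<float> x(N * H * W * C, 1.f), w(K * R * S * C, 1.f), y(N * P * Q * K);
  conv2d_as_gemm(x, w, y, N, H, W, C, K, R, S);
  std::printf("y[0] = %g (expect %d)\n", y[0], R * S * C);  // 18
  return 0;
}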
@@ -57,18 +57,35 @@ architecture.
 - [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
 - [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](./examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
 - [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
+- [Grouped GEMM with NVFP4 datatype](./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu).
+- [Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu).
+- [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu).
 * Set of unit tests that demonstrate the usage of both [sparse](./test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](./test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
+* Support for Blackwell SM100 Sparse kernels:
+  - Collective mainloop targeting:
+    * [SM100 Sparse GEMM](./include/cutlass/gemm/collective/sm100_sparse_mma_warpspecialized.hpp)
+* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
+  - [Sparse GEMM](./examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu)
+  - [Blockscaled Sparse GEMM with NVFP4 input data type](./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu)
+  - [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu)
+* Set of unit tests that demonstrate the usage of [sparse](./test/unit/gemm/device/sm100_sparse_tensorop_gemm) and [blockscaled sparse](./test/unit/gemm/device/sm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM.
+* A new Multi-head Latent Attention (MLA) kernel for the SM100 Blackwell architecture in the CUTLASS [example](./examples/77_blackwell_fmha/) covers the FlashMLA-like weight-absorbed decoding use-case.
+* A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS [example](./examples/77_blackwell_fmha/) to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
+* A new [distributed GEMM example](./examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu) for the SM100 Blackwell architecture.
 * Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
   - Enhancement of [blockwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.
   - Enhancement of [groupwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.
-  - Support for [grouped GEMM with blockwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
+  - Support for [grouped GEMM with blockwise and groupwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
+  - Support for [groupwise GEMM](./tools/profiler/src/blockwise_gemm_operation_profiler.cu) in the CUTLASS profiler.
   - Support for [blockwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture.
   - Support for [groupwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture.
-* Added support for enhanced kernel performance search in CUTLASS:
+  - Support for [grouped GEMM with blockwise](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.
+* Added support for enhanced kernel performance search (auto-tuning) in the CUTLASS profiler:
   - Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
   - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
   - Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
-  - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
+  - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/cpp/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
+* Support `void` as the D element in sm100 kernel epilogues.

 Note: CUTLASS 3.x builds are known to be down on Windows platforms for all CUDA toolkits.
 CUTLASS team is working on a fix.
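
Note: the blockwise/groupwise scaling entries above are easier to parse with the math written out. A hedged plain-C++ reference of the semantics, not the CUTLASS kernel; the scale-block sizes BM/BN/BK and the divisibility assumptions are illustrative:

#include <vector>

// D[m][n] = sum_k (SFA[m/BM][k/BK] * A[m][k]) * (SFB[n/BN][k/BK] * B[n][k])
// Assumes M % BM == 0, N % BN == 0, K % BK == 0 for brevity.
void blockwise_scaled_gemm_ref(std::vector<float> const& A,    // [M][K] row-major
                               std::vector<float> const& B,    // [N][K] row-major
                               std::vector<float> const& SFA,  // [M/BM][K/BK] scales
                               std::vector<float> const& SFB,  // [N/BN][K/BK] scales
                               std::vector<float>& D,          // [M][N]
                               int M, int N, int K, int BM, int BN, int BK) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        float sa = SFA[(m / BM) * (K / BK) + (k / BK)];
        float sb = SFB[(n / BN) * (K / BK) + (k / BK)];
        acc += (sa * A[m * K + k]) * (sb * B[n * K + k]);
      }
      D[m * N + n] = acc;
    }
}

In these examples, "groupwise" refers to finer-granularity scale blocks along one mode; the reference above covers both cases by varying BM/BN/BK.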
@@ -115,7 +132,7 @@ Layouts can also be combined and manipulated via functional composition, on which
 CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
 This greatly simplifies the design and improves code composability and readability.
 More documentation specific to CuTe can be found in its
-[dedicated documentation directory](./media/docs/cute/00_quickstart.md).
+[dedicated documentation directory](./media/docs/cpp/cute/00_quickstart.md).

 # Compatibility

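Note: the functional composition of layouts mentioned in this hunk's context is a CuTe operation. A minimal sketch, using the layouts from the CuTe layout-algebra documentation rather than anything in this diff; compile with nvcc against the CUTLASS include/ directory:

#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  auto A = make_layout(make_shape(Int<6>{}, Int<2>{}),
                       make_stride(Int<8>{}, Int<2>{}));  // (6,2):(8,2)
  auto B = make_layout(make_shape(Int<4>{}, Int<3>{}),
                       make_stride(Int<3>{}, Int<1>{}));  // (4,3):(3,1)
  // R = A o B satisfies R(c) == A(B(c)) for every coordinate c of B.
  auto R = composition(A, B);
  print(R);
  print("\n");
  return 0;
}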
@@ -162,6 +179,7 @@ CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be
 |NVIDIA H100 Tensor Core GPU |9.0|11.8|
 |NVIDIA H200 Tensor Core GPU |9.0|11.8|
 |NVIDIA B200 Tensor Core GPU |10.0|12.8|
+|NVIDIA GeForce RTX 50x0 series |10.0|12.8|

 ## Target Architecture

@@ -197,30 +215,30 @@ NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
 compiled for Blackwell SM100 architecture with arch conditional features
 (using `sm100a`) are not compatible with RTX 50 series GPUs.

-Please refer to the [functionality documentation](./media/docs/functionality.md)
+Please refer to the [functionality documentation](./media/docs/cpp/functionality.md)
 for details on which kernels require which target architectures.

 # Documentation

 CUTLASS is described in the following documents and the accompanying
 [Doxygen documentation](https://nvidia.github.io/cutlass).

-- [Quick Start Guide](./media/docs/quickstart.md) - basics of building and running CUTLASS
-- [Functionality](./media/docs/functionality.md) - summarizes functionality available in CUTLASS
-- [Efficient GEMM in CUDA](./media/docs/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
-- [CUTLASS 3.x Design](./media/docs/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
-- [GEMM API 3.x](./media/docs/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
-- [GEMM API 2.x](./media/docs/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
-- [Implicit GEMM Convolution](./media/docs/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
-- [Code Organization](./media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
-- [Terminology](./media/docs/terminology.md) - describes terms used in the code
-- [Programming Guidelines](./media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
-- [Fundamental types](./media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
-- [Layouts](./media/docs/layout.md) - describes layouts of matrices and tensors in memory
-- [Tile Iterators](./media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
-- [CUTLASS Profiler](./media/docs/profiler.md) - command-line driven profiling application
-- [CUTLASS Utilities](./media/docs/utilities.md) - additional templates used to facilitate rapid development
-- [Dependent kernel launch](./media/docs/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
+- [Quick Start Guide](./media/docs/cpp/quickstart.md) - basics of building and running CUTLASS
+- [Functionality](./media/docs/cpp/functionality.md) - summarizes functionality available in CUTLASS
+- [Efficient GEMM in CUDA](./media/docs/cpp/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
+- [CUTLASS 3.x Design](./media/docs/cpp/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
+- [GEMM API 3.x](./media/docs/cpp/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
+- [GEMM API 2.x](./media/docs/cpp/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
+- [Implicit GEMM Convolution](./media/docs/cpp/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
+- [Code Organization](./media/docs/cpp/code_organization.md) - describes the organization and contents of the CUTLASS project
+- [Terminology](./media/docs/cpp/terminology.md) - describes terms used in the code
+- [Programming Guidelines](./media/docs/cpp/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
+- [Fundamental types](./media/docs/cpp/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
+- [Layouts](./media/docs/cpp/layout.md) - describes layouts of matrices and tensors in memory
+- [Tile Iterators](./media/docs/cpp/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
+- [CUTLASS Profiler](./media/docs/cpp/profiler.md) - command-line driven profiling application
+- [CUTLASS Utilities](./media/docs/cpp/utilities.md) - additional templates used to facilitate rapid development
+- [Dependent kernel launch](./media/docs/cpp/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
 kernels in the same stream, and how it is used in CUTLASS.

 # Resources
@@ -240,7 +258,7 @@ projects. Client applications should target CUTLASS's `include/` directory in their include
 paths.

 CUTLASS unit tests, examples, and utilities can be built with CMake.
-The minimum version of CMake is given in the [Quickstart guide](./media/docs/quickstart.md).
+The minimum version of CMake is given in the [Quickstart guide](./media/docs/cpp/quickstart.md).
 Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
 on your system.

@@ -285,7 +303,7 @@ CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples,
 and template concepts defined in the CUTLASS project.

 A detailed explanation of the source code organization may be found in the
-[CUTLASS documentation](./media/docs/code_organization.md), but several main components are summarized below.
+[CUTLASS documentation](./media/docs/cpp/code_organization.md), but several main components are summarized below.

 ## CUTLASS Template Library

@@ -359,7 +377,7 @@ tools/
 The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
 basic usage of Core API components and complete tests of the CUTLASS GEMM computations.

-Instructions for building and running the Unit tests are described in the [Quickstart guide](./media/docs/quickstart.md).
+Instructions for building and running the Unit tests are described in the [Quickstart guide](./media/docs/cpp/quickstart.md).

 # Performance Profiling

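Note: a minimal hypothetical test in the Google Test style used under test/unit/ (this exact test is not part of the commit):

#include <gtest/gtest.h>
#include "cutlass/numeric_types.h"

TEST(NumericTypes, HalfRoundTrip) {
  cutlass::half_t x(1.5f);  // exactly representable in FP16
  EXPECT_EQ(float(x), 1.5f);
  EXPECT_EQ(sizeof(cutlass::half_t), 2u);
}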
@@ -575,9 +593,9 @@ reference_device: Passed

 ## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
 - Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
-  - [GEMM CMake Examples](./media/docs/quickstart.md#gemm-cmake-examples)
-  - [Implicit GEMM convolution CMake Examples](./media/docs/quickstart.md#convolution-cmake-examples)
-- [Further details about the CUTLASS Profiler are described here.](./media/docs/profiler.md)
+  - [GEMM CMake Examples](./media/docs/cpp/quickstart.md#gemm-cmake-examples)
+  - [Implicit GEMM convolution CMake Examples](./media/docs/cpp/quickstart.md#convolution-cmake-examples)
+- [Further details about the CUTLASS Profiler are described here.](./media/docs/cpp/profiler.md)

 # About

examples/04_tile_iterator/tile_iterator.cu

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@
   addressable memory, and then store it back into addressable memory.

   TileIterator is a core concept in CUTLASS that enables efficient loading and storing of data to
-  and from addressable memory. The PredicateTileIterator accepts a ThreadMap type, which defines
+  and from addressable memory. The PredicatedTileIterator accepts a ThreadMap type, which defines
   the mapping of threads to a "tile" in memory. This separation of concerns enables user-defined
   thread mappings to be specified.
@@ -124,7 +124,7 @@ __global__ void copy(

 cudaError_t TestTileIterator(int M, int K) {

-  // For this example, we chose a <64, 4> tile shape. The PredicateTileIterator expects
+  // For this example, we chose a <64, 4> tile shape. The PredicatedTileIterator expects
   // PitchLinearShape and PitchLinear layout.
   using Shape = cutlass::layout::PitchLinearShape<64, 4>;
   using Layout = cutlass::layout::PitchLinear;
@@ -136,7 +136,7 @@ cudaError_t TestTileIterator(int M, int K) {
   // dimension then along the strided dimension.
   using ThreadMap = cutlass::transform::PitchLinearStripminedThreadMap<Shape, kThreads>;

-  // Define the PredicateTileIterator, using TileShape, Element, Layout, and ThreadMap types
+  // Define the PredicatedTileIterator, using TileShape, Element, Layout, and ThreadMap types
   using Iterator = cutlass::transform::threadblock::PredicatedTileIterator<
       Shape, Element, Layout, 1, ThreadMap>;
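
Note: assembling the renamed pieces, the example's configuration looks like the following compile-time sketch; Element = int and kThreads = 32 mirror the example, and AdvanceRank = 1 means the iterator advances tile-to-tile along the strided dimension:

#include "cutlass/layout/pitch_linear.h"
#include "cutlass/transform/pitch_linear_thread_map.h"
#include "cutlass/transform/threadblock/predicated_tile_iterator.h"

// A <64, 4> pitch-linear tile cooperatively copied by 32 threads.
using Shape   = cutlass::layout::PitchLinearShape<64, 4>;
using Layout  = cutlass::layout::PitchLinear;
using Element = int;
int const kThreads = 32;

// Strip-mines threads along the contiguous dimension first, then the strided one.
using ThreadMap = cutlass::transform::PitchLinearStripminedThreadMap<Shape, kThreads>;

// "Predicated" means out-of-bounds accesses are masked, so partial tiles at
// matrix boundaries need no special-case code in the caller.
using Iterator = cutlass::transform::threadblock::PredicatedTileIterator<
    Shape, Element, Layout, 1 /*AdvanceRank*/, ThreadMap>;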

examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu

Lines changed: 2 additions & 2 deletions
@@ -402,7 +402,7 @@ struct Options : MixedDtypeOptions {
 void initialize(Options const& options) {

   auto shape_B = cute::make_shape(options.n, options.k, options.l);
-  int const scale_k = (options.k + options.g - 1) / options.g;
+  int const scale_k = cutlass::ceil_div(options.k, options.g);
   stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
   stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
   // Reverse stride here due to swap and transpose
@@ -429,7 +429,7 @@ void initialize(Options const& options) {
   block_zero.reset(scale_k * options.l * options.n);

   initialize_tensor(block_A, seed + 2022);
-  initialize_quant_tensor(block_B, seed + 2021);
+  initialize_tensor(block_B, seed + 2021);
   initialize_tensor(block_C, seed + 2020);
   initialize_scale(block_scale, options);
   initialize_zero(block_zero, options);
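
Note: the same substitution appears in the two sibling examples below. cutlass::ceil_div(a, b) computes (a + b - 1) / b for positive integers, so the change is behavior-preserving; a quick check, assuming ceil_div is reachable through cutlass/fast_math.h as in recent releases:

#include <cassert>
#include "cutlass/fast_math.h"

int main() {
  int k = 4096, g = 128;
  assert(cutlass::ceil_div(k, g) == (k + g - 1) / g);  // 32: exact multiple
  k = 4097;
  assert(cutlass::ceil_div(k, g) == 33);               // rounds up the partial group
  return 0;
}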

examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu

Lines changed: 2 additions & 2 deletions
@@ -318,7 +318,7 @@ struct Options : MixedDtypeOptions {
 void initialize(Options const& options) {

   auto shape_B = cute::make_shape(options.n, options.k, options.l);
-  int const scale_k = (options.k + options.g - 1) / options.g;
+  int const scale_k = cutlass::ceil_div(options.k, options.g);
   stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
   stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
   // Reverse stride here due to swap and transpose
@@ -347,7 +347,7 @@ void initialize(Options const& options) {
   block_zero.reset(scale_k * options.l * options.n);

   initialize_tensor(block_A, seed + 2022);
-  initialize_quant_tensor(block_B, seed + 2021);
+  initialize_tensor(block_B, seed + 2021);
   cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size());
   initialize_tensor(block_C, seed + 2020);
   initialize_scale(block_scale, options);

examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu

Lines changed: 2 additions & 2 deletions
@@ -288,7 +288,7 @@ cutlass::DeviceAllocation<typename GemmScaleWithZeroPoint::EpilogueOutputOp::Ele
 void initialize(MixedDtypeOptions const& options) {

   auto shape_b = cute::make_shape(options.n, options.k, options.l);
-  int const scale_k = (options.k + options.g - 1) / options.g;
+  int const scale_k = cutlass::ceil_div(options.k, options.g);
   stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
   stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_b);
   // Reverse stride here due to swap and transpose
@@ -313,7 +313,7 @@ void initialize(MixedDtypeOptions const& options) {
   block_zero.reset(scale_k * options.l * options.n);

   initialize_tensor(block_A, seed + 2022);
-  initialize_quant_tensor(block_B, seed + 2021);
+  initialize_tensor(block_B, seed + 2021);
   initialize_tensor(block_C, seed + 2020);
   initialize_scale(block_scale, options);
   initialize_zero(block_zero, options);
