Releases · NVIDIA/cutlass
CUTLASS 4.5.0
CuTe DSL
- New features
- New Block API block_copy() to simplify TMA and S2T copies. Users can ignore the multicast and 2-CTA partitioning details for TMA when using block_copy() and no longer need to invoke tma_partition(); the bulk of the S2T initialization can also be removed.
- MXF8F6F4 mixed precision support
- BlockScaled MMA now supports MXF8MXF4 or MXF8MXF6
- Block Scaled MMA for SM120 now works on Spark
- EFC broadcast semantics support
- EFC epilogue functions can now broadcast and remap tensor modes via the C.remap_modes[:, 0, 1] subscript syntax, where : marks a broadcast dimension and integers select source mode indices. This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations (see the sketch after this list).
- Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
- dataclasses.dataclass is now supported for JIT compilation and cute.compile for both the plain and tvm-ffi paths
- cute.copy now supports user-specified loop unrolling
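A minimal sketch of the remap_modes subscript syntax described above. Only the remap_modes[...] spellings follow the release note; the epilogue function shape, its argument names, and how an EFC is registered are illustrative assumptions, not the actual EFC interface.

```python
# Hypothetical EFC-style epilogue function; only the remap_modes
# subscript syntax is taken from the release note above.
def my_epilogue(acc, C, bias, alpha):
    # Scalar broadcast: alpha applies to every element of the accumulator.
    out = acc * alpha

    # Row/column broadcast: bias contributes one source mode (index 0);
    # ":" marks the broadcast dimension of the output.
    out = out + bias.remap_modes[:, 0]

    # Arbitrary mode permutation: integers select source mode indices,
    # so [1, 0] swaps the modes of C (a transpose).
    out = out + C.remap_modes[1, 0]
    return out
```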
- Bug fixing and improvements
- Improved source code correlation for profiling/debugging
- Fixed an aarch64 segfault issue with tvm-ffi
- Reorganized CuTe DSL examples/tutorials for better discoverability
- More examples of authoring peak-performance kernels
- MoE examples
- A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces.
- Expert-wise tensormap descriptor setup is done by a cheap helper kernel (~2 us) to avoid long latency in tile switching; the kernel structure is much closer to a normal GEMM.
- Compared to torch_210_cu13, very few problems have worse performance on B200:
- mxfp8_2dx3d: avg 1.29 speedup;
- mxfp8_2dx2d: avg 1.41 speedup;
- nvfp4_2dx3d: avg 1.11 speedup;
- nvfp4_2dx2d: avg 1.12 speedup (worst case 0.98)
- bf16_2dx3d: avg 1.15 speedup (worst case 0.98)
- bf16_2dx2d: avg 1.17 speedup (worst case 0.96)
- Note: performance is measured with the torch profiler; this implementation includes the helper kernel plus the main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
- API changes
- ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from blackwell_helpers.py. Please specify a_dtype and b_dtype separately instead.
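A hedged before/after sketch of the deprecation above. Only the ab_dtype to a_dtype/b_dtype change comes from the note; the import path is assumed from the file name blackwell_helpers.py, and the dtype and remaining arguments are illustrative placeholders.

```python
from cutlass.utils import blackwell_helpers  # module path assumed from blackwell_helpers.py
import cutlass

# Deprecated: a single dtype for both operands (other arguments omitted).
#   blackwell_helpers.make_trivial_tiled_mma(ab_dtype=cutlass.Float16, ...)

# Preferred: specify the operand dtypes separately.
#   blackwell_helpers.make_trivial_tiled_mma(
#       a_dtype=cutlass.Float16, b_dtype=cutlass.Float16, ...)
```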
CUTLASS C++
- Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
- Mixed TMA+CpAsync can now accept static but non-trivial cluster shapes.
- Uses TMA multicast for A tile when using non-trivial cluster size along N mode.
- Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs.
- Changes included in example 92.
- Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121 related kernels.
- Add support for static loads to tensor memory, included in example 77.
- Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
- Add example 95 to support green context SM partition
- Enables launching GEMM on a stream with a partial SM allocation.
- Add Snake activation functor for EVT.
- Fix some kernel issues:
- Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates
- Fix CUTLASS clang build issues
- Fix atomicCAS read-modify-write loop in ConstSubbyteReference
- Replace __nv_atomic_load_n with volatile for CUDA 11.4 compatibility in subbyte reference
- Remove PipelineStorage shadowing in SM100 complex epilogue
- Fix build issue in SM90 epilogue fusion visitor TMA warpspecialized
- Fix some profiler issues:
- Add missing reference kernels for blockwise GEMM profiler.
CUTLASS 4.4.2
CuTe DSL
- New features
- CuTe DSL now supports Python 3.14 for both x86_64 and aarch64
- Runtime Pointer/Tensor/FakeTensor now supports cache_key, providing a stable, hashable representation that simplifies and improves compiled function caching.
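A minimal caching sketch based on the item above. It assumes cache_key is exposed as an attribute on the runtime Tensor/Pointer/FakeTensor arguments and that cute.compile performs the compilation; the helper name and cache layout are illustrative.

```python
import cutlass.cute as cute

_compiled = {}  # compiled-function cache keyed by argument cache keys

def get_compiled(jit_fn, *tensors):
    # Build a stable, hashable key from each runtime argument's cache_key.
    key = (jit_fn, *(t.cache_key for t in tensors))
    if key not in _compiled:
        _compiled[key] = cute.compile(jit_fn, *tensors)  # JIT-compile once per key
    return _compiled[key]
```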
- Bug fixing and improvements
- Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
- Fixed a kernel loading race condition when multiple GPUs are present in the same process in JAX.
CUTLASS C++
- Enable Blackwell SM120f compilation of examples and expose NVFP4/MX Grouped GEMM in the CUTLASS Profiler.
CUTLASS 4.4.1
CuTe DSL
- Bug fixing and improvements
- Fixed a segfault issue with tvm-ffi on aarch64
CUTLASS 4.4.0
CuTe DSL
- New features
- CuTe DSL now supports CUDA toolkit 13.1!
- Set up with cutlass/python/CuTeDSL/setup.sh --cu13
- Refer to https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html for more details
- GB300 is now supported in CuTe DSL with CTK 13.1
- Refer to SM103 batched 3xFP4 blockscaled GEMM kernel for example kernel
- cute.experimental: introduce a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction), which can be mixed with existing Cute DSL building blocks.
- Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
- Automatic TMA descriptor generation and update insertion.
- Automatic vectorization and predication for SIMT copies.
- New pipeline abstraction with convenience wrappers
- New Partition ops to simplify partitioning logic.
- Device-side TMA descriptor allocation, initialization, and management
- These examples can be found here https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental
- Ahead of Time (AoT) compilation is now available!
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/export for example usage
- JAX support - you can now use CuTeDSL along with JAX
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/jax for example usage
- Introduced versioning support in DSL (see the sketch after this list):
- cutlass.version for a string representation of the DSL version
- cutlass.CUDA_VERSION for a version class reporting the CUDA version used by the DSL
- Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
- Grouped GEMM example now supports device-only problem shapes.
- We allow grid carve-out without problem shapes being available on host.
- TMA+LdMatrix features for loading and unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
- It is possible now to have customized epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. It also provides a PyTorch evaluator to compare the results.
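A quick check of the versioning attributes introduced above; the exact formatting of the printed values may differ.

```python
import cutlass

# String representation of the CuTe DSL version, e.g. "4.4.0".
print(cutlass.version)

# Version class describing the CUDA version used by the DSL;
# printing it is assumed to give a readable representation.
print(cutlass.CUDA_VERSION)
```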
- More examples of authoring peak-performance kernels
- SM103 batched 3xFP4 blockscaled GEMM kernel
- Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
- New acc_scale grouped mixed input gemm kernel variant is introduced to deliver better performance for decoding cases.
- All mixed_input_gemm examples are moved into a separate folder mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.
- Bug fixing and improvements
- Fixed an issue where both branches of an if were executed
- Fixed cute.printf with f-strings
- Fixed an indexing issue with scalar tensors
- Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- API changes
- Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
- Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
- LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling init, to avoid ambiguity in data transposition.
- LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
- Grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, the grid is carved out based upon the host problem shapes; otherwise, we launch the maximum possible number of SMs.
- hardware_info.get_max_active_cluster now supports passing in a specific stream to query. Useful for green context based SM partitioning.
- group_bulk_copy_modes in the async bulk copy example is now deprecated; use group_modes directly instead.
- Deprecated nvvm enum arguments in the nvvm wrapper; use str instead.
- cute.arch.calc_packed_f32x2_op default changed from ftz enabled to ftz disabled.
- In CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum argument (see the sketch after this list):
- fence_proxy
- fence_view_async_tmem_op
- calc_packed_f32x2_op
- warp_redux_sync
- atomic_add
- atomic_and
- atomic_or
- atomic_xor
- atomic_max
- atomic_min
- atomic_exch
- atomic_cas
- store
- load
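A hedged before/after sketch of the enum-to-string change, using fence_proxy as the example. The enum spelling ProxyKind.async_shared and the string "async_shared" are illustrative placeholders; consult the DSL API changelog for the exact values each API accepts.

```python
import cutlass.cute as cute

# Before (pre-CTK 13.1 builds): an enum member selected the variant.
#   cute.arch.fence_proxy(cute.arch.ProxyKind.async_shared)  # illustrative enum spelling

# After (CuTe DSL with CTK 13.1): pass a string literal instead.
#   cute.arch.fence_proxy("async_shared")                    # illustrative string value
```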
- Use 'Advanced control file' for mixed input gemm examples for better performance.
- Advanced control file is an experimental feature of the CUDA compiler. The controls file contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit to get better GPU kernel code. More details and documentation on how to create these controls files will be provided in a future CUDA toolkit release. Note: the advanced compiler control file is not expected to work for kernels that it was not tuned for. There is no compatibility guarantee, and the controls file will not work with a different CUDA toolkit version.
CUTLASS C++
- Add example 93 for Blackwell low latency generation phase GQA kernel.
- Flash Decoding with cluster reduction.
- For kernel design details, please check the Readme.
- Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
- Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
- Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
- Generate additional device/kernel/threadblock files in the CUTLASS include directory that add functionality to carry the scaling tensors and use them in the MMA loop.
- Add gemm_blockwise to include files in default_mma_core_sm80
- Add Hopper e2m1 to fp32 optimized conversion and e2m1 * TF32 tensor core GEMM.
- Set MmaType to tfloat32_t for FP32 mode.
- TF32 provides FP32 inputs with reduced precision (19-bit vs 32-bit)
- Set TileShapeK=64 for TF32 (K must be multiple of 8)
- Shuffle optimization enabled via compute_memory_reordering_atom<tfloat32_t>()
- E2M1 -> FP32 -> TF32 TC path for mixed-precision GEMM
- Enable example 55 with TF32 support
- Add support for arbitrary application-provided strides for block-scale tensors.
- Users and applications now must pass valid block-scale strides in all cases, even when the tensor is packed.
- Support 4x blockscaled public ptx for CUDA 13.1.
- Allow non-static TmaGbasis in AuxTmaParams.
- Some cases in attention kernels may require a non-static tma_gbasis.
- Relax the restriction on the TmaGbasis parameter of AuxTmaParams; users are allowed to manually construct a dynamic gbasis.
- Fix some kernel issues:
- Fix an MSVC preprocessor issue.
- Fix a self-assignment issue in the GEMV kernel.
- Fix a TMA descriptor bug where the CUDA driver was not setting the OOB address generation mode correctly.
- Fix memory fence for clc scheduler in Blackwell SM120 pingpong kernel.
- Fix missing SMEM alignment in Blackwell SM120 scale factors.
- Fix a PDL issue for grouped gemm.
- Fix divide-by-zero issue in canimplement for sm100 implicit gemm kernels.
- Fix cluster swizzle for Grouped GEMMs.
- Move host-side swizzling heuristics to device.
- Apply swizzle per group based on problem shape and max swizzle size.
- Improve examples and unit tests.
- Fix some profiler issues:
- Fix a core dump issue for nvfp4 grouped GEMM kernel.
- Fix inconsistent GEMM verification logic.
- Rework grouped gemm verification logic for different types.
- Fix an API-breaking change in using nvMatmulHeuristics.
- Fix some broken links under media/docs.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.1.
CUTLASS 4.3.5
CuTe DSL
- Bug fixing and improvements
- Fixed an unexpected CPU overhead issue introduced in 4.3.4
- Update copyright to 2026.
CUTLASS C++
- Update copyright to 2026.
- Use the CUDA runtime API to get the driver version rather than the driver API.
CUTLASS 4.3.4
CuTe DSL
- New features
- Added PDL support along with an example: Kernel launch with Programmatic Dependent Launch
- Bug fixing and improvements
- Fixed a frame refcount issue with CUDA graphs
- Enhancement for tvm-ffi AoT case for earlier module unload
- Fixed order issue in make_smem_layout_a in utils/hopper_helpers.py
CUTLASS C++
- Work around a driver TMA-descriptor-related bug which occasionally causes errors on Blackwell when the tensor's backing memory allocation is less than 128KB and it is not a dense, non-overlapping tensor.
CUTLASS 4.3.3
CuTe DSL
- New features
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function arguments in tvm-ffi
- Bug fixing and improvements
- Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
- Clearer error message for the case of runtime error cudaErrorInsufficientDriver
CUTLASS 4.3.2
CuTe DSL
- New features
- New env var CUTE_DSL_CACHE_DIR to specify the path for dumping caches
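A small usage sketch for the new environment variable; the directory value is just an example.

```python
import os

# Point the DSL's compilation cache at a custom directory before any
# CuTe DSL compilation happens in the process.
os.environ["CUTE_DSL_CACHE_DIR"] = "/tmp/cute_dsl_cache"
```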
- Bug fixing and improvements
- Fixed an issue of CUDA JitExecutor when unloading kernels
- Fixed an issue of allocating max smem when there's statically allocated smem
CUTLASS 4.3.1
CuTe DSL
- New features
- Added Blackwell SM103 support
- Multiple dependent DSOs in the wheel have been merged into one single DSO
- Bug fixing and improvements
- Fixed device reset issue with tvm-ffi
- Fixed tvm-ffi export of compiled functions
CUTLASS C++
- Support blockscaled variant of ragged contiguous grouped gemm with the new simplified MoE API in example 92.
- The new example works for all microscaling types.
CUTLASS 4.3.0
CuTe DSL
- New features:
- Supported Apache TVM-FFI for further reduced host runtime overhead for JIT functions and better PyTorch and ML framework interoperability
- Added fake tensor and stream to decouple compiling a JIT function from the "from_dlpack" flow. We no longer require users to have a real tensor when compiling a JIT function.
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping (see the sketch after this list).
- Added L2 cache eviction priority for TMA-related ops. Users can do fine-grained L2 cache control.
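FastDivmodDivisor itself is only named above; as a rough plain-Python illustration of the underlying fast-divmod technique (not the DSL class or its interface), a divisor object with operator overloads can trade each runtime division for a precomputed multiply and shift:

```python
# Plain-Python illustration of fast divmod via a precomputed reciprocal
# multiplier; exact for 0 <= n < 2**31 and 1 <= d < 2**31.
class FastDivmod:
    def __init__(self, d):
        assert 0 < d < 2**31
        self.d = d
        self.m = (1 << 62) // d + 1   # reciprocal multiplier, computed once

    def __rfloordiv__(self, n):       # n // divisor
        return (n * self.m) >> 62

    def __rmod__(self, n):            # n % divisor
        return n - (n // self) * self.d

# Example: map a linear tile index to 2D tile coordinates.
divisor = FastDivmod(48)              # e.g. tiles per row
tile_m, tile_n = 1234567 // divisor, 1234567 % divisor
assert (tile_m, tile_n) == divmod(1234567, 48)
```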
- Debuggability improvements:
- Supported source location tracking for DSL APIs (allows tools like nsight profiling to correlate perf metrics with Python source code)
- Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
- Improved performance of elementwise example:
- Generalize code to handle list of input tensors
- Generalize TV layout computation to handle different data types
- Improved Blackwell SM100 persistent dense GEMM with static scheduling:
- To demonstrate usage of new Pipeline APIs PipelineProducer and PipelineConsumer to simplify code without explicit pipeline state management (existing APIs are still maintained)
- Separated epilogue code for non-TMA and TMA implementations
- Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
- Baseline Blackwell GEMM achieves 84% SOL performance with MNK 8K
- More examples are coming to demo optimizations: Baseline + X
- Tutorial for Async Pipeline API
- Reworked elementwise add notebook with more details and detailed explanation about TV layout
- Updated implementation to handle general data type and multiple inputs
- Updated explanation for TV layout in simpler language
- Added visualization of TV Layout with 3rd party utils
- Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixing and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.reduce to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand: make_layout_tv, is_static, PipelineAsync, SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
- Added overlapping accumulator optimization for the block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.getitem indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug where it needs to opt in to maximum dynamic shared memory explicitly once it exceeds 48KB.
- Fix a dead hang issue caused by an early-returning warp.
- Add support through cmdline argument lists for batch, no_verif, cluster_shape and cluster_shape_fallback in example 89.
- Add Ragged Contiguous Grouped gemm kernel in example 92.
- This kernel uses a TMA 3D load to load the weights matrix and uses the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of API for MoE gemms.
- Instead of requiring users to call several cute utilities to set up the stride, an API moe_stride_utils is introduced to help set up strides in the kernel.
- Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_hosts, a new problem shape struct called MoEProblemShape is introduced which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
- Enable GEMM_K = 0 in grouped gemm.
- Optimize group gemm kernels by enabling async TMA desc update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race check issue of Blackwell SM103 kernels by adding missing elect one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic cutlass::int8_t and replace it with int8_t.
- Fix a few bugs in the distributed GEMM API and examples.
- Fix handling of negative zero in the sparse compressor.
- Add missing wait_on_dependent_grids for the PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factors A and B in function bytes_with_problem_shape of the block scaled profiler.
- Fix an issue when epilogue tile N is not divisible by the default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.