Releases · NVIDIA/cutlass
CUTLASS 4.5.0
CuTe DSL
- New features
- New Block API block_copy() to simplify TMA and S2T copies. Users can ignore the multicast and 2-CTA partitioning details for TMA when using block_copy() and no longer need to invoke tma_partition(); the bulk of the S2T initialization can also be removed.
- MXF8F6F4 mixed precision support
- BlockScaled MMA now supports MXF8MXF4 or MXF8MXF6
- Block Scaled MMA for SM120 now works on Spark
- EFC broadcast semantics support
- EFC epilogue functions can now broadcast and remap tensor modes via the C.remap_modes[:, 0, 1] subscript syntax, where : marks a broadcast dimension and integers select source mode indices. This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations (see the sketch after this list).
- Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
- dataclasses.dataclass is now supported for JIT compilation and cute.compile for both the plain and tvm-ffi paths
- cute.copy now supports user-specified loop unrolling
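A minimal sketch of the remap_modes subscript syntax described above. Only the remap_modes[...] spellings follow the release note; the epilogue function shape, its argument names, and how an EFC is registered are illustrative assumptions, not the actual EFC interface.

```python
# Hypothetical EFC-style epilogue function; only the remap_modes
# subscript syntax is taken from the release note above.
def my_epilogue(acc, C, bias, alpha):
    # Scalar broadcast: alpha applies to every element of the accumulator.
    out = acc * alpha

    # Row/column broadcast: bias contributes one source mode (index 0);
    # ":" marks the broadcast dimension of the output.
    out = out + bias.remap_modes[:, 0]

    # Arbitrary mode permutation: integers select source mode indices,
    # so [1, 0] swaps the modes of C (a transpose).
    out = out + C.remap_modes[1, 0]
    return out
```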
- Bug fixing and improvements
- Improved source code correlation for profiling/debugging
- Fixed an aarch64 segfault issue with tvm-ffi
- Reorganized CuTe DSL examples/tutorials for better discoverability
- More examples of authoring peak-performance kernels
- MoE examples
- A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces.
- Expert-wise tensormap descriptor setup is done by a cheap helper kernel (~2 us) to avoid long latency in tile switching; the kernel structure is much closer to a normal GEMM.
- Compared to torch_210_cu13, very few problems have worse performance on B200:
- mxfp8_2dx3d: avg 1.29 speedup;
- mxfp8_2dx2d: avg 1.41 speedup;
- nvfp4_2dx3d: avg 1.11 speedup;
- nvfp4_2dx2d: avg 1.12 speedup (worst case 0.98)
- bf16_2dx3d: avg 1.15 speedup (worst case 0.98)
- bf16_2dx2d: avg 1.17 speedup (worst case 0.96)
- Note: performance is measured with the torch profiler; this implementation includes the helper kernel plus the main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
- API changes
- ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from blackwell_helpers.py. Please specify a_dtype and b_dtype separately instead.
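A hedged before/after sketch of the deprecation above. Only the ab_dtype to a_dtype/b_dtype change comes from the note; the import path is assumed from the file name blackwell_helpers.py, and the dtype and remaining arguments are illustrative placeholders.

```python
from cutlass.utils import blackwell_helpers  # module path assumed from blackwell_helpers.py
import cutlass

# Deprecated: a single dtype for both operands (other arguments omitted).
#   blackwell_helpers.make_trivial_tiled_mma(ab_dtype=cutlass.Float16, ...)

# Preferred: specify the operand dtypes separately.
#   blackwell_helpers.make_trivial_tiled_mma(
#       a_dtype=cutlass.Float16, b_dtype=cutlass.Float16, ...)
```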
CUTLASS C++
- Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
- Mixed TMA+CpAsync can now accept static but non-trivial cluster shapes.
- Uses TMA multicast for A tile when using non-trivial cluster size along N mode.
- Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs.
- Changes included in example 92.
- Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121 related kernels.
- Add support for static loads to tensor memory, included in example 77.
- Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
- Add example 95 to support green context SM partition
- Enables launching GEMM on a stream with a partial SM allocation.
- Add Snake activation functor for EVT.
- Fix some kernel issues:
- Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates
- Fix CUTLASS clang build issues
- Fix atomicCAS read-modify-write loop in ConstSubbyteReference
- Replace __nv_atomic_load_n with volatile for CUDA 11.4 compatibility in subbyte reference
- Remove PipelineStorage shadowing in SM100 complex epilogue
- Fix build issue in SM90 epilogue fusion visitor TMA warpspecialized
- Fix some profiler issues:
- Add missing reference kernels for blockwise GEMM profiler.
CUTLASS 4.4.2
CuTe DSL
- New features
- CuTe DSL now supports Python 3.14 for both x86_64 and aarch64
- Runtime Pointer/Tensor/FakeTensor now supports cache_key, providing a stable, hashable representation that simplifies and improves compiled function caching.
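A minimal caching sketch based on the item above. It assumes cache_key is exposed as an attribute on the runtime Tensor/Pointer/FakeTensor arguments and that cute.compile performs the compilation; the helper name and cache layout are illustrative.

```python
import cutlass.cute as cute

_compiled = {}  # compiled-function cache keyed by argument cache keys

def get_compiled(jit_fn, *tensors):
    # Build a stable, hashable key from each runtime argument's cache_key.
    key = (jit_fn, *(t.cache_key for t in tensors))
    if key not in _compiled:
        _compiled[key] = cute.compile(jit_fn, *tensors)  # JIT-compile once per key
    return _compiled[key]
```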
- Bug fixing and improvements
- Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
- Fixed a kernel loading race condition when multiple GPUs are present in the same process in JAX.
CUTLASS C++
- Enable Blackwell SM120f compilation of examples and expose NVFP4/MX Grouped GEMM in the CUTLASS Profiler.
CUTLASS 4.4.1
CuTe DSL
- Bug fixing and improvements
- Fixed a segfault issue with tvm-ffi on aarch64
CUTLASS 4.4.0
CuTe DSL
- New features
- CuTe DSL now supports CUDA toolkit 13.1!
- Set up with cutlass/python/CuTeDSL/setup.sh --cu13
- Refer to https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html for more details
- GB300 is now supported in CuTe DSL with CTK 13.1
- Refer to SM103 batched 3xFP4 blockscaled GEMM kernel for example kernel
- cute.experimental: introduce a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction), which can be mixed with existing Cute DSL building blocks.
- Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
- Automatic TMA descriptor generation and update insertion.
- Automatic vectorization and predication for SIMT copies.
- New pipeline abstraction with convenience wrappers
- New Partition ops to simplify partitioning logic.
- Device-side TMA descriptor allocation, initialization, and management
- These examples can be found here https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental
- Ahead of Time (AoT) compilation is now available!
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/export for example usage
- JAX support - you can now use CuTeDSL along with JAX
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/jax for example usage
- Introduced versioning support in DSL (see the sketch after this list):
- cutlass.version for a string representation of the DSL version
- cutlass.CUDA_VERSION for a version class reporting the CUDA version used by the DSL
- Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
- Grouped GEMM example now supports device-only problem shapes.
- We allow grid carve-out without problem shapes being available on host.
- TMA+LdMatrix features for loading and unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
- It is possible now to have customized epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. It also provides a PyTorch evaluator to compare the results.
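A quick check of the versioning attributes introduced above; the exact formatting of the printed values may differ.

```python
import cutlass

# String representation of the CuTe DSL version, e.g. "4.4.0".
print(cutlass.version)

# Version class describing the CUDA version used by the DSL;
# printing it is assumed to give a readable representation.
print(cutlass.CUDA_VERSION)
```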
- More examples of authoring peak-performance kernels
- SM103 batched 3xFP4 blockscaled GEMM kernel
- Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
- New acc_scale grouped mixed input gemm kernel variant is introduced to deliver better performance for decoding cases.
- All mixed_input_gemm examples are moved into a separate folder mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.
- Bug fixing and improvements
- Fixed an issue where both branches of an if were executed
- Fixed cute.printf with f-strings
- Fixed an indexing issue with scalar tensors
- Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- API changes
- Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
- Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
- LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling init, to avoid ambiguity in data transposition.
- LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
- Grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, the grid is carved out based upon the host problem shapes; otherwise, we launch the maximum possible number of SMs.
- hardware_info.get_max_active_cluster now supports passing in a specific stream to query. Useful for green context based SM partitioning.
- group_bulk_copy_modes in the async bulk copy example is now deprecated; use group_modes directly instead.
- Deprecated nvvm enum arguments in the nvvm wrapper; use str instead.
- cute.arch.calc_packed_f32x2_op default changed from ftz enabled to ftz disabled.
- In CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum argument (see the sketch after this list):
- fence_proxy
- fence_view_async_tmem_op
- calc_packed_f32x2_op
- warp_redux_sync
- atomic_add
- atomic_and
- atomic_or
- atomic_xor
- atomic_max
- atomic_min
- atomic_exch
- atomic_cas
- store
- load
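A hedged before/after sketch of the enum-to-string change, using fence_proxy as the example. The enum spelling ProxyKind.async_shared and the string "async_shared" are illustrative placeholders; consult the DSL API changelog for the exact values each API accepts.

```python
import cutlass.cute as cute

# Before (pre-CTK 13.1 builds): an enum member selected the variant.
#   cute.arch.fence_proxy(cute.arch.ProxyKind.async_shared)  # illustrative enum spelling

# After (CuTe DSL with CTK 13.1): pass a string literal instead.
#   cute.arch.fence_proxy("async_shared")                    # illustrative string value
```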
- Use 'Advanced control file' for mixed input gemm examples for better performance.
- Advanced control file is an experimental feature of the CUDA compiler. The controls file contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit to get better GPU kernel code. More details and documentation on how to create these controls files will be provided in a future CUDA toolkit release. Note: the advanced compiler control file is not expected to work for kernels that it was not tuned for. There is no compatibility guarantee, and the controls file will not work with a different CUDA toolkit version.
CUTLASS C++
- Add example 93 for Blackwell low latency generation phase GQA kernel.
- Flash Decoding with cluster reduction.
- For kernel design details, please check the Readme.
- Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
- Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
- Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
- Generate additional device/kernel/threadblock files in the CUTLASS include directory that add functionality to carry the scaling tensors and use them in the MMA loop.
- Add gemm_blockwise to include files in default_mma_core_sm80
- Add Hopper e2m1 to fp32 optimized conversion and e2m1 * TF32 tensor core GEMM.
- Set MmaType to tfloat32_t for FP32 mode.
- TF32 provides FP32 inputs with reduced precision (19-bit vs 32-bit)
- Set TileShapeK=64 for TF32 (K must be multiple of 8)
- Shuffle optimization enabled via compute_memory_reordering_atom<tfloat32_t>()
- E2M1 -> FP32 -> TF32 TC path for mixed-precision GEMM
- Enable example 55 with TF32 support
- Add support for arbitrary application-provided strides for block-scale tensors.
- Users and applications now must pass valid block-scale strides in all cases, even when the tensor is packed.
- Support 4x blockscaled public ptx for CUDA 13.1.
- Allow non-static TmaGbasis in AuxTmaParams.
- Some cases in attention kernels may require a non-static tma_gbasis.
- Relax the restriction on the TmaGbasis parameter of AuxTmaParams; users are allowed to manually construct a dynamic gbasis.
- Fix some kernel issues:
- Fix an MSVC preprocessor issue.
- Fix a self-assignment issue in the GEMV kernel.
- Fix a TMA descriptor bug where the CUDA driver was not setting the OOB address generation mode correctly.
- Fix memory fence for clc scheduler in Blackwell SM120 pingpong kernel.
- Fix missing SMEM alignment in Blackwell SM120 scale factors.
- Fix a PDL issue for grouped gemm.
- Fix divide-by-zero issue in canimplement for sm100 implicit gemm kernels.
- Fix cluster swizzle for Grouped GEMMs.
- Move host-side swizzling heuristics to device.
- Apply swizzle per group based on problem shape and max swizzle size.
- Improve examples and unit tests.
- Fix some profiler issues:
- Fix a core dump issue for nvfp4 grouped GEMM kernel.
- Fix inconsistent GEMM verification logic.
- Rework grouped gemm verification logic for different types.
- Fix an API-breaking change in using nvMatmulHeuristics.
- Fix some broken links under media/docs.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.1.
CUTLASS 4.3.5
CuTe DSL
- Bug fixing and improvements
- Fixed an unexpected CPU overhead issue introduced in 4.3.4
- Update copyright to 2026.
CUTLASS C++
- Update copyright to 2026.
- Use the CUDA runtime API to get the driver version rather than the driver API.
CUTLASS 4.3.4
CuTe DSL
- New features
- Added PDL support along with an example: Kernel launch with Programmatic Dependent Launch
- Bug fixing and improvements
- Fixed a frame refcount issue with CUDA graphs
- Enhancement for tvm-ffi AoT case for earlier module unload
- Fixed order issue in make_smem_layout_a in utils/hopper_helpers.py
CUTLASS C++
- Work around a driver TMA-descriptor-related bug which occasionally causes errors on Blackwell when the tensor's backing memory allocation is less than 128KB and it is not a dense, non-overlapping tensor.
CUTLASS 4.3.3
CuTe DSL
- New features
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function arguments in tvm-ffi
- Bug fixing and improvements
- Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
- Clearer error message for the case of runtime error cudaErrorInsufficientDriver
CUTLASS 4.3.2
CuTe DSL
- New features
- New env var CUTE_DSL_CACHE_DIR to specify the path for dumping caches
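A small usage sketch for the new environment variable; the directory value is just an example.

```python
import os

# Point the DSL's compilation cache at a custom directory before any
# CuTe DSL compilation happens in the process.
os.environ["CUTE_DSL_CACHE_DIR"] = "/tmp/cute_dsl_cache"
```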
- Bug fixing and improvements
- Fixed an issue of CUDA JitExecutor when unloading kernels
- Fixed an issue of allocating max smem when there's statically allocated smem
CUTLASS 4.3.1
CuTe DSL
- New features
- Added Blackwell SM103 support
- Multiple dependent DSOs in the wheel have been merged into one single DSO
- Bug fixing and improvements
- Fixed device reset issue with tvm-ffi
- Fixed tvm-ffi export of compiled functions
CUTLASS C++
- Support blockscaled variant of ragged contiguous grouped gemm with the new simplified MoE API in example 92.
- The new example works for all microscaling types.
CUTLASS 4.3.0
CuTe DSL
- New features:
- Supported Apache TVM-FFI for further reduced host runtime overhead for JIT functions and better PyTorch and ML framework interoperability
- Added fake tensor and stream to decouple compiling a JIT function from the "from_dlpack" flow. We no longer require users to have a real tensor when compiling a JIT function.
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping (see the sketch after this list).
- Added L2 cache eviction priority for TMA-related ops. Users can do fine-grained L2 cache control.
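FastDivmodDivisor itself is only named above; as a rough plain-Python illustration of the underlying fast-divmod technique (not the DSL class or its interface), a divisor object with operator overloads can trade each runtime division for a precomputed multiply and shift:

```python
# Plain-Python illustration of fast divmod via a precomputed reciprocal
# multiplier; exact for 0 <= n < 2**31 and 1 <= d < 2**31.
class FastDivmod:
    def __init__(self, d):
        assert 0 < d < 2**31
        self.d = d
        self.m = (1 << 62) // d + 1   # reciprocal multiplier, computed once

    def __rfloordiv__(self, n):       # n // divisor
        return (n * self.m) >> 62

    def __rmod__(self, n):            # n % divisor
        return n - (n // self) * self.d

# Example: map a linear tile index to 2D tile coordinates.
divisor = FastDivmod(48)              # e.g. tiles per row
tile_m, tile_n = 1234567 // divisor, 1234567 % divisor
assert (tile_m, tile_n) == divmod(1234567, 48)
```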
- Debuggability improvements:
- Supported source location tracking for DSL APIs (allows tools like nsight profiling to correlate perf metrics with Python source code)
- Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
- Improved performance of elementwise example:
- Generalize code to handle list of input tensors
- Generalize TV layout computation to handle different data types
- Improved Blackwell SM100 persistent dense GEMM with static scheduling:
- To demonstrate usage of new Pipeline APIs PipelineProducer and PipelineConsumer to simplify code without explicit pipeline state management (existing APIs are still maintained)
- Separated epilogue code for non-TMA and TMA implementations
- Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
- Baseline Blackwell GEMM achieves 84% SOL performance with MNK 8K
- More examples are coming to demo optimizations: Baseline + X
- Tutorial for Async Pipeline API
- Reworked elementwise add notebook with more details and detailed explanation about TV layout
- Updated implementation to handle general data type and multiple inputs
- Updated explanation for TV layout in simpler language
- Added visualization of TV Layout with 3rd party utils
- Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixing and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.reduce to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand: make_layout_tv, is_static, PipelineAsync, SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
- Added overlapping accumulator optimization for the block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.getitem indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug where it needs to opt in to maximum dynamic shared memory explicitly once it exceeds 48KB.
- Fix a dead hang issue caused by an early-returning warp.
- Add support through cmdline argument lists for batch, no_verif, cluster_shape and cluster_shape_fallback in example 89.
- Add Ragged Contiguous Grouped gemm kernel in example 92.
- This kernel uses a TMA 3D load to load the weights matrix and uses the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of API for MoE gemms.
- Instead of requiring users to call several cute utilities to set up the stride, an API moe_stride_utils is introduced to help set up strides in the kernel.
- Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_hosts, a new problem shape struct called MoEProblemShape is introduced which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
- Enable GEMM_K = 0 in grouped gemm.
- Optimize group gemm kernels by enabling async TMA desc update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race check issue of Blackwell SM103 kernels by adding missing elect one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic cutlass::int8_t and replace it with int8_t.
- Fix a few bugs in the distributed GEMM API and examples.
- Fix handling of negative zero in the sparse compressor.
- Add missing wait_on_dependent_grids for the PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factors A and B in function bytes_with_problem_shape of the block scaled profiler.
- Fix an issue when epilogue tile N is not divisible by the default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.