diff --git a/CHANGELOG.md b/CHANGELOG.md index 9f423fb2da..9ca90d8e9a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,7 +1,17 @@ # NVIDIA CUTLASS Changelog +## [3.9.2](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.2) (2025-05-03) -## [3.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.0) (2025-03-20) +* Fixed [Blockwise](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM hang issue when problem size K is 128. +* Optimal code generation with CUDA toolkit versions 12.9. + + +## [3.9.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.1) (2025-04-30) + +* Fixed Group Gemm hang issue in CUTLASS 3.x +* Improved Hopper [Blockwise](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM performance. + +## [3.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.0) (2025-04-24) * Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API: - Collective mainloops that target for: @@ -13,18 +23,37 @@ - [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu). - [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](./examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu). - [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu). + - [Grouped GEMM with nvfp4 datatype](./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu). + - [Sparse Blockscaled GEMM with mxfp8 input datatype and BF16 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu). + - [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu). * Set of unit tests that demonstrate the usage of both [sparse](./test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](./test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM. 
+* Support for Blackwell SM100 Sparse kernels: + - Collective mainloop that targets: + * [SM100 Sparse GEMM](./include/cutlass/gemm/collective/sm100_sparse_mma_warpspecialized.hpp) +* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM: + - [Sparse GEMM](./examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu) + - [Blockscaled Sparse GEMM with NVFP4 input data type](./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu) + - [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu) +* Set of unit tests that demonstrate the usage of [sparse](./test/unit/gemm/device/sm100_sparse_tensorop_gemm) and [blockscaled sparse](./test/unit/gemm/device/sm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM. +* A new Multi-head Latent Attention (MLA) kernel for SM100 Blackwell architecture in the CUTLASS [example](./examples/77_blackwell_fmha/) covers the flashMLA-like weight-absorbed decoding use-case. +* A new FMHA Backward kernel for SM100 Blackwell architecture extends the CUTLASS [example](./examples/77_blackwell_fmha/) to show how the five backward pass MMAs can be fused into a single kernel to achieve high performance. +* A new [distributed GEMM example](./examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu) for SM100 Blackwell architecture. * Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures: - Enhancement of [blockwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture. - Enhancement of [groupwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture. - - Support for [grouped GEMM with blockwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture. + - Support for [grouped GEMM with blockwise and groupwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture. + - Support for [blockwise and groupwise GEMM](./tools/profiler/src/blockwise_gemm_operation_profiler.cu) in CUTLASS profiler. - Support for [blockwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture. - Support for [groupwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture. -* Added support for enhanced kernel performance search in CUTLASS: + - Support for [grouped GEMM with blockwise](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture. +* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler: - Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels. - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration. - - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss). + - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/cpp/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss). +* Support `void` as the D element in sm100 kernel epilogues. +* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! +* Optimal code generation with CUDA toolkit versions 12.8U1. ## [3.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.8.0) (2025-01-25) @@ -40,7 +69,7 @@ - [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp). - [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp). - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types. - - Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp). + - Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/cpp/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp). - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler. * Full support for Blackwell SM100 kernels in CUTLASS 3.x API: - [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that @@ -78,11 +107,11 @@ - A set of new [Hopper grouped GEMM kernels](./examples/69_hopper_mixed_dtype_grouped_gemm/) that support mixed A and B datatypes. - A new [Hopper FP8 GEMM with groupwise scaling](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu). * Documentation updates: - - [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel). - - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md) - - A new [functionality documentation](./media/docs/functionality.md) specifically for 3.x API comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA tookit support etc for 3.x supported architectures. + - [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/cpp/quickstart.md#instantiating-a-blackwell-gemm-kernel). + - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/cpp/blackwell_functionality.md) + - A new [functionality documentation](./media/docs/cpp/functionality.md) specifically for 3.x API comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA tookit support etc for 3.x supported architectures. 
- Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture). - - Updates to [profiler documentation](./media/docs/profiler.md) for testing mixed input GEMM kernels on Hopper. + - Updates to [profiler documentation](./media/docs/cpp/profiler.md) for testing mixed input GEMM kernels on Hopper. ## [3.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.7.0) (2025-01-11) - [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) uses 2D scaling tensor, assigning one value per threadblock. This allows a finer-grained scaling to be applied for each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are [here](https://github.com/NVIDIA/cutlass/pull/1932#issue-2645398439). @@ -95,7 +124,7 @@ + Fix `cute::SM80_CP_ASYNC_CACHEALWAYS`, `cute::SM80_CP_ASYNC_CACHEGLOBAL`, `cute::SM80_CP_ASYNC_CACHEALWAYS_ZFILL`, `cute::SM80_CP_ASYNC_CACHEGLOBAL_ZFILL` to avoid implicitly selecting `ZFILL` behavior on predication. + Remove `cute::copy_vec` in favor of `cute::copy_aligned` and `cute::copy(AutoVectorizingCopyWithAssumedAlignment,...)`. + A refactor of default epilogue struct `DefaultEpilogue` [API](./include/cutlass/epilogue/collective/default_epilogue.hpp) to avoid reading non-void `ElementC` value for `ElementC = void` kernel. -- New CUTLASS profiler flags: `profiling-duration`, `min-iterations`, and `kernels-file` documented in [profiler.md](./media/docs/profiler.md#cutlass-profiler). +- New CUTLASS profiler flags: `profiling-duration`, `min-iterations`, and `kernels-file` documented in [profiler.md](./media/docs/cpp/profiler.md#cutlass-profiler). - Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! - Optimal code generation with CUDA toolkit versions 12.6. @@ -109,12 +138,12 @@ - A refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. Now the 3.x convolution API is no longer considered as a beta API. - [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode. - [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and [GEMM example using those](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu). -- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](./media/docs/dependent_kernel_launch.md). -- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details. 
+- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](./media/docs/cpp/dependent_kernel_launch.md). +- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](./media/docs/cpp/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details. - A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings significant performance improvement, as well as its EVT support. - A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp). - A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations. -- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper). +- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](./media/docs/cpp/profiler.md#instantiating-more-kernels-with-hopper). - A new hardware support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h) - Fixed use of isnan on Windows for [`half_t`](./test/unit/core/functional.cu). - Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! @@ -124,7 +153,7 @@ - [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu) - [Exposure of L2 `cache_hint`s in TMA copy atoms](./include/cute/arch/copy_sm90_tma.hpp#L48) -- Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/profiler.md#GEMM), and +- Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/cpp/profiler.md#GEMM), and [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu). - [TMA store based and EVT supported epilogues](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](./test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu). - A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inferrence: @@ -137,7 +166,7 @@ - Support for residual add (beta != 0) in convolution kernels. - A new convolution [epilogue](./examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output. - A refactor of [include files throughout CUTLASS core directories](./include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](./test/self_contained_includes/CMakeLists.txt). -- [A guide for setting up VSCode to work well with CUTLASS](./media/docs/ide_setup.md) and [expanded code style guide](./media/docs/programming_guidelines.md). 
+- [A guide for setting up VSCode to work well with CUTLASS](./media/docs/cpp/ide_setup.md) and [expanded code style guide](./media/docs/cpp/programming_guidelines.md). - Better support for MSVC as a host compiler. - Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2. - Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1. @@ -145,7 +174,7 @@ ## [3.5.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.0) (2024-04-09) - Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + [TMA im2col](./include/cute/atom/copy_traits_sm90_im2col.hpp) - + Native implementation in CUTLASS 3.x using CuTe, mirroring the [same design hierarchy as that of GEMMs](./media/docs/gemm_api_3x.md). + + Native implementation in CUTLASS 3.x using CuTe, mirroring the [same design hierarchy as that of GEMMs](./media/docs/cpp/gemm_api_3x.md). + Support for 1D, 2D, and 3D convolutions in a [rank-agnostic fashion](./include/cutlass/conv/convnd_problem_shape.hpp). + Support for [Fprop](./test/unit/conv/device_3x/fprop/sm90_conv3d_fprop_implicit_gemm_s8_s8_s32_tensorop_s32.cu), [Dgrad](./test/unit/conv/device_3x/dgrad/sm90_conv2d_dgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu), and [Wgrad](./test/unit/conv/device_3x/wgrad/sm90_conv1d_wgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu) algorithms + [CUTLASS profiler support](./python/cutlass_library/conv3x_emitter.py) for 2D and 3D convolutions implemented via the 3.x API. @@ -157,7 +186,7 @@ - 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices. + [Ampere FP16 TN](./test/unit/gemm/device/gemm_f16t_f16n_f16t_tensor_op_f32_sm80.cu) and [NT](./test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu#L227-L301), [Ampere INT8 TN](./test/unit/gemm/device/gemm_s8t_s8n_s8t_tensor_op_s32_sm80.cu#L392-L1342), [Ampere INT4 TN](./test/unit/gemm/device/gemm_s4t_s4n_s4t_tensor_op_s32_sm80.cu#L372-L934). + [Turing FP16 TN](./test/unit/gemm/device/gemm_f16t_f16n_f16t_tensor_op_f32_sm75.cu#L55-L394), [Turing INT8 TN](./test/unit/gemm/device/gemm_s8t_s8n_s8t_tensor_op_s32_sm75.cu#L166-L537), [Turing INT4 TN](./test/unit/gemm/device/gemm_s4t_s4n_s4t_tensor_op_s32_sm75.cu#L310-L564). -- Updates to CuTe documentation for [`cute::Tensor<>`](./media/docs/cute/03_tensor.md), [MMA atoms](./media/docs/cute/0t_mma_atom.md), and an overhauled [CuTe GEMM tutorial series](./examples/cute/tutorial). +- Updates to CuTe documentation for [`cute::Tensor<>`](./media/docs/cpp/cute/03_tensor.md), [MMA atoms](./media/docs/cpp/cute/0t_mma_atom.md), and an overhauled [CuTe GEMM tutorial series](./examples/cute/tutorial). - Extensions to CuTe to support [L2 prefetching](./include/cute/algorithm/prefetch.hpp) and [TMA store+reductions](./include/cute/arch/copy_sm90_tma.hpp#L1337). - Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17. - Fixes to greatly reduce build warnings. @@ -176,7 +205,7 @@ * Beta release of [Group-GEMM](./examples/57_hopper_grouped_gemm) utilizing TMA and WGMMA (requires CUDA 12.3 or above). * [Ampere Sparse GEMM](./examples/15_ampere_sparse_tensorop_gemm/ampere_sparse_tensorop_gemm_with_visitor.cu) supports Epilogue Visitor Tree (EVT) now. * NamedBarriers usability improvement and list of [ReservedNamedBarriers](./include/cutlass/arch/barrier.h) has been officially released. 
-* Improved [CuTe documentation](./media/docs/cute/) including improved clarity and depth of [Quickstart](./media/docs/cute/00_quickstart.md), [CuTe Layout](./media/docs/cute/01_layout.md), and [CuTe Layout Algebra](./media/docs/cute/02_layout_algebra.md). Associated code comments, post-conditions, and details in [CuTe Core Unit Tests](./test/unit/cute/core/) also improved. +* Improved [CuTe documentation](./media/docs/cpp/cute/) including improved clarity and depth of [Quickstart](./media/docs/cute/00_quickstart.md), [CuTe Layout](./media/docs/cpp/cute/01_layout.md), and [CuTe Layout Algebra](./media/docs/cpp/cute/02_layout_algebra.md). Associated code comments, post-conditions, and details in [CuTe Core Unit Tests](./test/unit/cute/core/) also improved. ## [3.3](https://github.com/NVIDIA/cutlass/releases/tag/v3.3.0) (2023-10-31) * [Mixed-input Hopper GEMMs](./examples/55_hopper_mixed_dtype_gemm) support covering 16-bit x 8-bit input operand types. @@ -227,7 +256,7 @@ * Epilogue builders. Similar to mainloop builders (see [example 49](./examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu)), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization. * Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler. * Performance optimizations for the [*warp-specialized persistent ping-pong*](./include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp) kernel. -* Changes to the [GEMM API 3.x](./media/docs/gemm_api_3x.md), involving the host-facing arguments and the underlying `Params` structs. +* Changes to the [GEMM API 3.x](./media/docs/cpp/gemm_api_3x.md), involving the host-facing arguments and the underlying `Params` structs. * [FMHA Backward Pass](./examples/41_fused_multi_head_attention/fused_multi_head_attention_backward.cu) from Meta xFormers. * [Streamk GEMM with Broadcast](./examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu) enables epilogue broadcast with StreamK GEMM. * [Batched B2B GEMM](./examples/13_two_tensor_op_fusion) now can run multiple Back-to-Back GEMM with the same problem size in parallel. @@ -239,10 +268,10 @@ * Updates and bugfixes from the community (thanks!) ## [3.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.0.0) (2023-01-23) -* [CuTe](./media/docs/cute/00_quickstart.md), a [new core library and backend](./include/cute) for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors. -* [A new conceptual operation hierarchy](./media/docs/cutlass_3x_design.md) that replaces the architecture-centric hierarchy of CUTLASS 2.x and [documentation for CUTLASS 3.0's GEMM API changes](./media/docs/gemm_api_3x.md). -* Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same [`device::GemmUniversalAdapter`](./include/cutlass/gemm/device/gemm_universal_adapter.h) and [`kernel::GemmUniversal`](./include/cutlass/gemm/kernel/gemm_universal.hpp) types, allowing users to include both APIs in the same translation units. More information can be found in the [3.x backwards compatibility section](./media/docs/cutlass_3x_backwards_compatibility.md). 
-* Updates to [Functionality](./media/docs/functionality.md) which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3. +* [CuTe](./media/docs/cpp/cute/00_quickstart.md), a [new core library and backend](./include/cute) for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors. +* [A new conceptual operation hierarchy](./media/docs/cpp/cutlass_3x_design.md) that replaces the architecture-centric hierarchy of CUTLASS 2.x and [documentation for CUTLASS 3.0's GEMM API changes](./media/docs/cpp/gemm_api_3x.md). +* Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same [`device::GemmUniversalAdapter`](./include/cutlass/gemm/device/gemm_universal_adapter.h) and [`kernel::GemmUniversal`](./include/cutlass/gemm/kernel/gemm_universal.hpp) types, allowing users to include both APIs in the same translation units. More information can be found in the [3.x backwards compatibility section](./media/docs/cpp/cutlass_3x_backwards_compatibility.md). +* Updates to [Functionality](./media/docs/cpp/functionality.md) which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3. * Updates to [Compatibility](./README.md#compatibility) Section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures and [Target Architecture](./README.md#Target-Architecture). * New warp-specialized GEMM [kernel schedules](./include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [mainloops](./include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. * Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations. 
@@ -420,7 +449,7 @@ * Global memory iterators supporting Fprop, Dgrad, and Wgrad * `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture * `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures - * [Documentation](./media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation + * [Documentation](./media/docs/cpp/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation ## [2.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.3.0) (2020-09-23) * [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/) @@ -434,7 +463,7 @@ * NVIDIA Ampere GPU Architecture examples and documentation: * [Tensor Float 32](./examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu) and * [Sparse Tensor Cores](./examples/15_ampere_sparse_tensorop_gemm/ampere_sparse_tensorop_gemm.cu) - * Documentation added on CUTLASS [efficient row-major epilogue](./media/docs/gemm_api.md#efficient-epilogue) + * Documentation added on CUTLASS [efficient row-major epilogue](./media/docs/cpp/gemm_api.md#efficient-epilogue) ## [2.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.2.0) (2020-06-08) * [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/) @@ -454,7 +483,7 @@ * Disabled F16C by default for compatibility - enable on cmake command line with `-DCUTLASS_ENABLE_F16C=ON` ## [2.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.1.0) (2020-04-06) - * BLAS-style host-side API added to [CUTLASS Library](./media/docs/quickstart.md#cutlass-library) + * BLAS-style host-side API added to [CUTLASS Library](./media/docs/cpp/quickstart.md#cutlass-library) * API to launch compiled kernel instances for GEMM and planar complex GEMM * Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores * Computes complex matrix products on matrices stored as disjoint real and imaginary parts @@ -468,10 +497,10 @@ * Encapsulated functionality embodying modern C++11 programming techniques * Optimized containers and data types for efficient, generic, portable device code * Updates to: - * [Quick start guide](./media/docs/quickstart.md) + * [Quick start guide](./media/docs/cpp/quickstart.md) * [Documentation](./README.md#documentation) - * [Utilities](./media/docs/utilities.md) - * [CUTLASS Profiler](./media/docs/profiler.md) + * [Utilities](./media/docs/cpp/utilities.md) + * [CUTLASS Profiler](./media/docs/cpp/profiler.md) * Native Turing Tensor Cores * Efficient GEMM kernels targeting Turing Tensor Cores * Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands diff --git a/CMakeLists.txt b/CMakeLists.txt index 8d913fed5e..df0926dd1c 100755 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -765,6 +765,7 @@ target_include_directories( CUTLASS SYSTEM INTERFACE $ + $ ) install( diff --git a/PUBLICATIONS.md b/PUBLICATIONS.md index 176b42e498..9c89a40f52 100644 --- a/PUBLICATIONS.md +++ b/PUBLICATIONS.md @@ -6,6 +6,8 @@ - ["ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"](https://arxiv.org/abs/2502.02631). Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra. _arXiv_, February 2025. 
+- ["Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light"](https://arxiv.org/abs/2504.16922). Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi. _arXiv_, April 2025. + ## 2024 - ["DeepSeek-V3 Technical Report"](https://arxiv.org/abs/2412.19437). DeepSeek-AI. _arXiv_, December 2024. diff --git a/README.md b/README.md index ed8011e153..24366fa195 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition") -# CUTLASS 3.9.0 +# CUTLASS 3.9.2 -_CUTLASS 3.9.0 - March 2025_ +_CUTLASS 3.9.2 - May 2025_ **This repository fast-follows NVIDIA CUTLASS repository adding SYCL support for Intel GPUs.** The CUDA support is unmodified from upstream and can be used interchangeably. @@ -39,9 +39,9 @@ the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components. -See the [Quick Start Guide](./media/docs/quickstart.md) to get started quickly. +See the [Quick Start Guide](./media/docs/cpp/quickstart.md) to get started quickly. -See the [functionality docs](./media/docs/functionality.md) for a more comprehensive +See the [functionality docs](./media/docs/cpp/functionality.md) for a more comprehensive list of kernel level features, data types, instructions, and minimum supported by CUTLASS on each GPU architecture. @@ -57,18 +57,35 @@ architecture. - [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu). - [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](./examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu). - [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](./examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu). + - [Grouped GEMM with nvfp4 datatype](./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu). + - [Sparse Blockscaled GEMM with mxfp8 input datatype and BF16 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu). + - [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor](./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu). * Set of unit tests that demonstrate the usage of both [sparse](./test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](./test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM. 
+* Support for Blackwell SM100 Sparse kernels: + - Collective mainloop that targets: + * [SM100 Sparse GEMM](./include/cutlass/gemm/collective/sm100_sparse_mma_warpspecialized.hpp) +* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM: + - [Sparse GEMM](./examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu) + - [Blockscaled Sparse GEMM with NVFP4 input data type](./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu) + - [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu) +* Set of unit tests that demonstrate the usage of [sparse](./test/unit/gemm/device/sm100_sparse_tensorop_gemm) and [blockscaled sparse](./test/unit/gemm/device/sm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM. +* A new Multi-head Latent Attention (MLA) kernel for SM100 Blackwell architecture in the CUTLASS [example](./examples/77_blackwell_fmha/) covers the flashMLA-like weight-absorbed decoding use-case. +* A new FMHA Backward kernel for SM100 Blackwell architecture extends the CUTLASS [example](./examples/77_blackwell_fmha/) to show how the five backward pass MMAs can be fused into a single kernel to achieve high performance. +* A new [distributed GEMM example](./examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu) for SM100 Blackwell architecture. * Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures: - Enhancement of [blockwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture. - Enhancement of [groupwise GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture. - - Support for [grouped GEMM with blockwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture. + - Support for [grouped GEMM with blockwise and groupwise scaling](./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture. + - Support for [blockwise and groupwise GEMM](./tools/profiler/src/blockwise_gemm_operation_profiler.cu) in CUTLASS profiler. - Support for [blockwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture. - Support for [groupwise GEMM](./examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture. -* Added support for enhanced kernel performance search in CUTLASS: + - Support for [grouped GEMM with blockwise](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](./examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture. +* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler: - Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels. - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration. - - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss). + - More detailed introductions and examples to leverage this feature can be found in [profiler.md](./media/docs/cpp/profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss). +* Support `void` as the D element in sm100 kernel epilogues. Note: CUTLASS 3.x builds are known to be down on Windows platforms for all CUDA toolkits. CUTLASS team is working on a fix. @@ -115,7 +132,7 @@ Layouts can also be combined and manipulated via functional composition, on whic CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design and improves code composability and readability. More documentation specific to CuTe can be found in its -[dedicated documentation directory](./media/docs/cute/00_quickstart.md). +[dedicated documentation directory](./media/docs/cpp/cute/00_quickstart.md). # Compatibility @@ -162,6 +179,7 @@ CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be |NVIDIA H100 Tensor Core GPU |9.0|11.8| |NVIDIA H200 Tensor Core GPU |9.0|11.8| |NVIDIA B200 Tensor Core GPU |10.0|12.8| +|NVIDIA GeForce RTX 50x0 series |10.0|12.8| ## Target Architecture @@ -197,7 +215,7 @@ NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels compiled for Blackwell SM100 architecture with arch conditional features (using `sm100a`) are not compatible with RTX 50 series GPUs. -Please refer to the [functionality documentation](./media/docs/functionality.md) +Please refer to the [functionality documentation](./media/docs/cpp/functionality.md) for details on which kernels require which target architectures. # Documentation @@ -205,22 +223,22 @@ for details on which kernels require which target architectures. CUTLASS is described in the following documents and the accompanying [Doxygen documentation](https://nvidia.github.io/cutlass). 
-- [Quick Start Guide](./media/docs/quickstart.md) - basics of building and running CUTLASS -- [Functionality](./media/docs/functionality.md) - summarizes functionality available in CUTLASS -- [Efficient GEMM in CUDA](./media/docs/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA -- [CUTLASS 3.x Design](./media/docs/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components -- [GEMM API 3.x](./media/docs/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts -- [GEMM API 2.x](./media/docs/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts -- [Implicit GEMM Convolution](./media/docs/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS -- [Code Organization](./media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project -- [Terminology](./media/docs/terminology.md) - describes terms used in the code -- [Programming Guidelines](./media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++ -- [Fundamental types](./media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays -- [Layouts](./media/docs/layout.md) - describes layouts of matrices and tensors in memory -- [Tile Iterators](./media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory -- [CUTLASS Profiler](./media/docs/profiler.md) - command-line driven profiling application -- [CUTLASS Utilities](./media/docs/utilities.md) - additional templates used to facilitate rapid development -- [Dependent kernel launch](./media/docs/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent +- [Quick Start Guide](./media/docs/cpp/quickstart.md) - basics of building and running CUTLASS +- [Functionality](./media/docs/cpp/functionality.md) - summarizes functionality available in CUTLASS +- [Efficient GEMM in CUDA](./media/docs/cpp/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA +- [CUTLASS 3.x Design](./media/docs/cpp/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components +- [GEMM API 3.x](./media/docs/cpp/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts +- [GEMM API 2.x](./media/docs/cpp/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts +- [Implicit GEMM Convolution](./media/docs/cpp/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS +- [Code Organization](./media/docs/cpp/code_organization.md) - describes the organization and contents of the CUTLASS project +- [Terminology](./media/docs/cpp/terminology.md) - describes terms used in the code +- [Programming Guidelines](./media/docs/cpp/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++ +- [Fundamental types](./media/docs/cpp/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays +- [Layouts](./media/docs/cpp/layout.md) - describes layouts of matrices and tensors in memory +- [Tile Iterators](./media/docs/cpp/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory +- [CUTLASS Profiler](./media/docs/cpp/profiler.md) - command-line driven profiling application +- [CUTLASS 
Utilities](./media/docs/cpp/utilities.md) - additional templates used to facilitate rapid development +- [Dependent kernel launch](./media/docs/cpp/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent kernels in the same stream, and how it is used in CUTLASS. # Resources @@ -240,7 +258,7 @@ projects. Client applications should target CUTLASS's `include/` directory in th paths. CUTLASS unit tests, examples, and utilities can be build with CMake. -The minimum version of CMake is given in the [Quickstart guide](./media/docs/quickstart.md). +The minimum version of CMake is given in the [Quickstart guide](./media/docs/cpp/quickstart.md). Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed on your system. @@ -285,7 +303,7 @@ CUTLASS is arranged as a header-only library along with Utilities, Tools, Exampl and template concepts defined in the CUTLASS project. A detailed explanation of the source code organization may be found in the -[CUTLASS documentation](./media/docs/code_organization.md), but several main components are summarized below. +[CUTLASS documentation](./media/docs/cpp/code_organization.md), but several main components are summarized below. ## CUTLASS Template Library @@ -359,7 +377,7 @@ tools/ The `test/unit/` directory consist of unit tests implemented with Google Test that demonstrate basic usage of Core API components and complete tests of the CUTLASS GEMM computations. -Instructions for building and running the Unit tests are described in the [Quickstart guide](./media/docs/quickstart.md). +Instructions for building and running the Unit tests are described in the [Quickstart guide](./media/docs/cpp/quickstart.md). # Performance Profiling @@ -575,9 +593,9 @@ reference_device: Passed ## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler - Please follow the links for more CMake examples on selectively compiling CUTLASS kernels: - - [GEMM CMake Examples](./media/docs/quickstart.md#gemm-cmake-examples) - - [Implicit GEMM convolution CMake Examples](./media/docs/quickstart.md#convolution-cmake-examples) -- [Further details about the CUTLASS Profiler are described here.](./media/docs/profiler.md) + - [GEMM CMake Examples](./media/docs/cpp/quickstart.md#gemm-cmake-examples) + - [Implicit GEMM convolution CMake Examples](./media/docs/cpp/quickstart.md#convolution-cmake-examples) +- [Further details about the CUTLASS Profiler are described here.](./media/docs/cpp/profiler.md) # About diff --git a/examples/04_tile_iterator/tile_iterator.cu b/examples/04_tile_iterator/tile_iterator.cu index fdfaaac9b2..025eb65f86 100644 --- a/examples/04_tile_iterator/tile_iterator.cu +++ b/examples/04_tile_iterator/tile_iterator.cu @@ -34,7 +34,7 @@ addressable memory, and then store it back into addressable memory. TileIterator is a core concept in CUTLASS that enables efficient loading and storing of data to - and from addressable memory. The PredicateTileIterator accepts a ThreadMap type, which defines + and from addressable memory. The PredicatedTileIterator accepts a ThreadMap type, which defines the mapping of threads to a "tile" in memory. This separation of concerns enables user-defined thread mappings to be specified. @@ -124,7 +124,7 @@ __global__ void copy( cudaError_t TestTileIterator(int M, int K) { - // For this example, we chose a <64, 4> tile shape. The PredicateTileIterator expects + // For this example, we chose a <64, 4> tile shape. 
The PredicatedTileIterator expects // PitchLinearShape and PitchLinear layout. using Shape = cutlass::layout::PitchLinearShape<64, 4>; using Layout = cutlass::layout::PitchLinear; @@ -136,7 +136,7 @@ cudaError_t TestTileIterator(int M, int K) { // dimension then along the strided dimension. using ThreadMap = cutlass::transform::PitchLinearStripminedThreadMap; - // Define the PredicateTileIterator, using TileShape, Element, Layout, and ThreadMap types + // Define the PredicatedTileIterator, using TileShape, Element, Layout, and ThreadMap types using Iterator = cutlass::transform::threadblock::PredicatedTileIterator< Shape, Element, Layout, 1, ThreadMap>; diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu index 6fdcc8363f..c9fbd75643 100644 --- a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu +++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu @@ -402,7 +402,7 @@ struct Options : MixedDtypeOptions{ void initialize(Options const& options) { auto shape_B = cute::make_shape(options.n, options.k, options.l); - int const scale_k = (options.k + options.g - 1) / options.g; + int const scale_k = cutlass::ceil_div(options.k, options.g); stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B); // Reverse stride here due to swap and transpose @@ -429,7 +429,7 @@ void initialize(Options const& options) { block_zero.reset(scale_k * options.l * options.n); initialize_tensor(block_A, seed + 2022); - initialize_quant_tensor(block_B, seed + 2021); + initialize_tensor(block_B, seed + 2021); initialize_tensor(block_C, seed + 2020); initialize_scale(block_scale, options); initialize_zero(block_zero, options); diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu index cc54080393..dcab4a7a49 100644 --- a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu +++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu @@ -318,7 +318,7 @@ struct Options : MixedDtypeOptions { void initialize(Options const& options) { auto shape_B = cute::make_shape(options.n, options.k, options.l); - int const scale_k = (options.k + options.g - 1) / options.g; + int const scale_k = cutlass::ceil_div(options.k, options.g); stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B); // Reverse stride here due to swap and transpose @@ -347,7 +347,7 @@ void initialize(Options const& options) { block_zero.reset(scale_k * options.l * options.n); initialize_tensor(block_A, seed + 2022); - initialize_quant_tensor(block_B, seed + 2021); + initialize_tensor(block_B, seed + 2021); cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size()); initialize_tensor(block_C, seed + 2020); initialize_scale(block_scale, options); diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu index aa114e74d7..15eb469263 100644 --- a/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu +++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu @@ -288,7 +288,7 @@ cutlass::DeviceAllocation -bool initialize_quant_tensor( - cutlass::DeviceAllocation& block, - uint64_t seed = 
2023) { - - float scope_min = float(cutlass::platform::numeric_limits::lowest()); - float scope_max = float(cutlass::platform::numeric_limits::max()); - - cutlass::reference::device::BlockFillRandomUniform( - block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); - - return true; -} - template bool initialize_scale( cutlass::DeviceAllocation& block, @@ -232,10 +218,8 @@ bool initialize_scale( float scope_max = 1.0f, scope_min = 1.0f; if (options.mode != MixedDtypeGemmMode::ConvertOnly) { float elt_max_f = float(cutlass::platform::numeric_limits::max()); - const float max_dequant_val = 4.f; - const float min_dequant_val = 0.5f; - scope_max = max_dequant_val / elt_max_f; - scope_min = min_dequant_val / elt_max_f; + scope_max = 2.f; + scope_min = 0.1f; } cutlass::reference::device::BlockFillRandomUniform( block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); diff --git a/examples/65_distributed_gemm/65_distributed_gemm.cu b/examples/65_distributed_gemm/65_distributed_gemm.cu index 2289d62a8a..6509609f9f 100644 --- a/examples/65_distributed_gemm/65_distributed_gemm.cu +++ b/examples/65_distributed_gemm/65_distributed_gemm.cu @@ -120,8 +120,7 @@ #include "helper.h" // Distributed GEMM helpers -#include "util/benchmark.h" -#include "util/device_copy.h" +#include "dist_gemm_helpers.h" using namespace cute; @@ -834,10 +833,10 @@ int main(int argc, char const **args) { CUDA_CHECK(cudaGetDevice(¤t_device_id)); CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); cudaError_t error = cudaGetDeviceProperties(&props, 0); - if (props.major < 9) { + if (props.major != 9 || props.minor != 0) { std::cerr - << "This example requires a GPU of NVIDIA's Hopper Architecture or " - << "later (compute capability 90 or greater)." << std::endl; + << "This example requires a GPU of NVIDIA's Hopper Architecture " + << "(compute capability 90)." << std::endl; return 0; } diff --git a/examples/65_distributed_gemm/README.md b/examples/65_distributed_gemm/README.md index e3c48a9dd5..6bfff53c2f 100644 --- a/examples/65_distributed_gemm/README.md +++ b/examples/65_distributed_gemm/README.md @@ -63,6 +63,10 @@ procedure is the same, simply modify the following line in the example: using TP = _8; ``` +## References +* [Distributed GEMM Blog](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b) +* [Distributed GEMM Talk on CUDA Mode](https://www.youtube.com/watch?v=NHRTCQBZokg) + ## Copyright Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. diff --git a/examples/65_distributed_gemm/REQUIREMENTS.md b/examples/65_distributed_gemm/REQUIREMENTS.md index 4b8cca3b4d..c6288a91af 100644 --- a/examples/65_distributed_gemm/REQUIREMENTS.md +++ b/examples/65_distributed_gemm/REQUIREMENTS.md @@ -17,6 +17,8 @@ Like all other CUTLASS examples, the NVIDIA driver, runtime, and CUDA Toolkit ar This example specifically requires CUDA Toolkit 12.6 or newer, due to some of the necessary CUDA graph APIs. +The minimum CUDA driver version for running this example is [560.28.03](https://docs.nvidia.com/cuda/archive/12.6.0/cuda-toolkit-release-notes/index.html#id5). + ### Hardware / driver settings This example requires Hopper GPUs with NVLink network. 
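Editorial aside, not part of the patch: the `55_hopper_mixed_dtype_gemm` hunks above replace the hand-written round-up `(options.k + options.g - 1) / options.g` with `cutlass::ceil_div(options.k, options.g)` when sizing `scale_k`. The minimal sketch below spells out the arithmetic both forms compute; `ceil_div_sketch` is a hypothetical stand-in name, not a CUTLASS symbol.

```cpp
// Editorial sketch only -- not part of the patch above.
// "ceil_div_sketch" mirrors the value the old hand-written expression produced.
#include <cassert>

constexpr int ceil_div_sketch(int a, int b) {
  return (a + b - 1) / b;  // rounds the integer quotient up for positive a, b
}

int main() {
  // e.g. k = 1000 quantized elements with group size g = 128 -> 8 scale groups
  assert(ceil_div_sketch(1000, 128) == 8);
  assert(ceil_div_sketch(1024, 128) == 8);  // exact multiples are not rounded up further
  return 0;
}
```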
diff --git a/examples/65_distributed_gemm/util/device_copy.h b/examples/65_distributed_gemm/util/device_copy.h deleted file mode 100644 index 257800a097..0000000000 --- a/examples/65_distributed_gemm/util/device_copy.h +++ /dev/null @@ -1,84 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - * SPDX-License-Identifier: BSD-3-Clause - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * 3. Neither the name of the copyright holder nor the names of its - * contributors may be used to endorse or promote products derived from - * this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR - * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER - * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, - * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/*! \file - \brief generic device-to-device data movement kernel based for CuTe tensors. - - NOTE: this kernel assigns one element copy to every thread, and is by no means - an efficient way of copying tensors. It should only be used for convenience in - reference checks. 
- -*/ - -#pragma once - -#include "cute/layout.hpp" -#include "cute/tensor.hpp" -#include "cutlass/cutlass.h" -#include "cutlass/cuda_host_adapter.hpp" - -namespace cutlass { - -template -void device_copy(TensorSource tensor_source, - TensorDestination tensor_destination, - cudaStream_t stream); - - -template -__global__ void device_copy_kernel(TensorSource const tensor_source, - TensorDestination tensor_destination) { - auto linear_idx = blockIdx.x * blockDim.x + threadIdx.x; - using ElementSrc = typename TensorSource::value_type; - using ElementDst = typename TensorDestination::value_type; - NumericConverter converter; - if (linear_idx < size(tensor_source)) { - tensor_destination(linear_idx) = converter(tensor_source(linear_idx)); - } -} - -template -void device_copy(TensorSource tensor_source, - TensorDestination tensor_destination, - cudaStream_t stream) { - - assert(tensor_source.size() == tensor_destination.size()); - - auto numel = tensor_source.size(); - static constexpr int NumThreads = 128; - auto grid_size = cute::ceil_div(numel, NumThreads); - - dim3 grid(grid_size); - dim3 block(NumThreads); - device_copy_kernel<<>>(tensor_source, tensor_destination); -} - -} //namespace cutlass diff --git a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu index 1c21678f10..5d4fe1a180 100644 --- a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu +++ b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu @@ -75,11 +75,11 @@ #include "cutlass/util/reference/host/tensor_copy.h" #include "cutlass/util/reference/host/tensor_compare.h" #include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/gett.hpp" // Includes from examples directory #include "helper.h" #include "hopper_fp8_commandline.hpp" -#include "reference/host/gemm_with_blockwise_scaling.h" using namespace cute; @@ -123,7 +123,13 @@ using ArchTag = cutlass::arch::Sm90; // T using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster -using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum<>; + +using ScaleConfig = decltype(cutlass::detail::sm90_trivial_blockwise_scale_config(TileShape{})); + +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix operand + +using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum; using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative; using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; @@ -143,8 +149,8 @@ using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBui using CollectiveMainloopWithBlockWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder< ArchTag, OperatorClass, - ElementA, LayoutA, AlignmentA, - ElementB, LayoutB, AlignmentB, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, ElementAccumulator, TileShape, ClusterShape, 
cutlass::gemm::collective::StageCountAutoCarveout< @@ -190,20 +196,22 @@ StrideB stride_B; StrideC stride_C; StrideD stride_D; StrideAux stride_aux; +LayoutSFA layout_SFA; +LayoutSFB layout_SFB; uint64_t seed; +using LayoutScalar = cutlass::layout::PackedVectorLayout; cutlass::HostTensor tensor_A; cutlass::HostTensor tensor_B; cutlass::HostTensor tensor_C; cutlass::HostTensor tensor_D; uint32_t mma_promotion_interval; -cutlass::HostTensor blockscale_tensor_A; -cutlass::HostTensor blockscale_tensor_B; +cutlass::HostTensor blockscale_tensor_A; +cutlass::HostTensor blockscale_tensor_B; cutlass::HostTensor tensor_ref_D; cutlass::HostTensor tensor_aux; cutlass::HostTensor tensor_ref_aux; -using LayoutScalar = cutlass::layout::PackedVectorLayout; cutlass::HostTensor scalar_alpha; cutlass::HostTensor scalar_beta; cutlass::HostTensor scale_A; @@ -342,26 +350,25 @@ bool initialize_scale_tensor( /// Initialize operands to be used in the GEMM and reference GEMM void initialize(const Options &options) { - // Find Block Scaling tensor shapes based on problem shape and TileShape - auto gemm_problem_shape = cute::make_shape(options.m, options.n, options.k); - auto blockscale_shape = shape(get<1>(cute::zipped_divide(cute::make_layout(gemm_problem_shape), TileShape{}))); - auto blockscale_m = cute::get<0>(blockscale_shape); - auto blockscale_n = cute::get<1>(blockscale_shape); - auto blockscale_k = cute::get<2>(blockscale_shape); - stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l)); stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.m, options.n, options.l)); stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l)); stride_aux = stride_D; + // Layout SFA and SFB represent logically broadcasting data in CuTe. + // E.g., if Layout SFA has shape ((ScaleGranularityM, M / ScaleGranularityM), (ScaleGranularityK, K / ScaleGranularityK)) + // and strides ((0, 1), (0, M / ScaleGranularityM)), then each collection of ScaleGranularityM x ScaleGranularityK + // indices in the tensor map to the same offset.
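The zero-stride broadcast described in that comment is easiest to see with a few lines of standalone CuTe. The sketch below is illustrative only: the granularity and problem sizes are made up, and it exercises just the layout algebra, not the GEMM itself.

```cpp
// Minimal sketch of a stride-0 "broadcast" scale-factor layout in CuTe.
// Hypothetical sizes for illustration; not the values used by the example.
#include <cstdio>
#include "cute/tensor.hpp"

int main() {
  using namespace cute;
  constexpr int ScaleGranularityM = 4;  // rows of A sharing one scale factor
  constexpr int M = 8;                  // problem size along M
  // Shape (ScaleGranularityM, M / ScaleGranularityM) with strides (0, 1):
  // the stride-0 mode aliases every row inside a granule onto the same offset.
  auto sf_layout = make_layout(make_shape (Int<ScaleGranularityM>{}, Int<M / ScaleGranularityM>{}),
                               make_stride(Int<0>{},                 Int<1>{}));
  std::printf("%d %d %d\n",
              int(sf_layout(0, 0)),    // 0 : rows 0..3 read stored scale 0
              int(sf_layout(3, 0)),    // 0
              int(sf_layout(1, 1)));   // 1 : rows 4..7 read stored scale 1
  // Number of scale factors actually stored; the example sizes its host tensors
  // the same way via size(filter_zeros(layout_SFA)).
  std::printf("%d\n", int(size(filter_zeros(sf_layout))));  // 2
  return 0;
}
```

The `size(filter_zeros(...))` idiom in the last line is the same one the example uses below to size the host-side scale-factor tensors.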
+ layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(options.m, options.n, options.k, options.l)); + layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(options.m, options.n, options.k, options.l)); auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); - auto blockscale_a_coord = cutlass::make_Coord(blockscale_m * options.l, blockscale_k); - auto blockscale_b_coord = cutlass::make_Coord(blockscale_k, blockscale_n * options.l); + auto blockscale_a_coord = cutlass::make_Coord(size(filter_zeros(layout_SFA))); + auto blockscale_b_coord = cutlass::make_Coord(size(filter_zeros(layout_SFB))); tensor_A.resize(a_coord); blockscale_tensor_A.resize(blockscale_a_coord); @@ -465,7 +472,9 @@ typename Gemm::Arguments args_from_options(const Options &op stride_B, mma_promotion_interval, blockscale_tensor_A.device_data(), - blockscale_tensor_B.device_data() + layout_SFA, + blockscale_tensor_B.device_data(), + layout_SFB }, { {}, // epilogue.thread @@ -519,12 +528,6 @@ bool verify(const Options &options) { // Compute reference output // - // Block scaling tensors shapes based CTA Block (TileShape) and GEMM Problem shape - auto gemm_problem_shape = cute::make_shape(options.m, options.n, options.k); - auto blockscale_m = ceil_div(options.m, get<0>(TileShape{})); - auto blockscale_n = ceil_div(options.n, get<1>(TileShape{})); - auto blockscale_k = ceil_div(options.k, get<2>(TileShape{})); - // Create instantiation for device reference gemm kernel auto A = cute::make_tensor(tensor_A.host_data(), cute::make_layout( @@ -557,28 +560,18 @@ bool verify(const Options &options) { ) ); - auto blockscale_A = cute::make_tensor(blockscale_tensor_A.host_data(), - cute::make_layout( - cute::make_shape(blockscale_m, blockscale_k, options.l), - cute::make_stride(1, blockscale_m, blockscale_m * blockscale_k) - ) - ); - auto blockscale_B = cute::make_tensor(blockscale_tensor_B.host_data(), - cute::make_layout( - cute::make_shape(blockscale_n, blockscale_k, options.l), - cute::make_stride(1, blockscale_n, blockscale_n * blockscale_k) - ) - ); + auto SFA = cute::make_tensor(blockscale_tensor_A.host_data(), layout_SFA); + auto SFB = cute::make_tensor(blockscale_tensor_B.host_data(), layout_SFB); using unused_t = decltype(D); - cutlass::reference::host::GettMainloopParams mainloop_params{ - A, B, // Operand Tensors - blockscale_A, blockscale_B // Blockwise scaling Tensors - }; + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, + decltype(A), + decltype(SFA), + decltype(B), + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; cutlass::reference::host::GettEpilogueParams< ElementScalar, diff --git a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu index b7cdb00a67..096e56a6b8 100644 --- a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu +++ b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu @@ -75,11 +75,11 @@ #include "cutlass/util/reference/host/tensor_copy.h" #include "cutlass/util/reference/host/tensor_compare.h" #include 
"cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/gett.hpp" // Includes from examples directory #include "helper.h" #include "hopper_fp8_commandline.hpp" -#include "reference/host/gemm_with_groupwise_scaling.h" using namespace cute; @@ -120,55 +120,30 @@ using ElementAccumulator = float; // E using ElementBlockScale = float; // Element type for blockscaling during accumulation using ElementCompute = float; // Element type for epilogue computation -using TileShape_ = Shape<_128,_128,_128>; // This one is just to make the compiler happy with verify()... - -// ScaleGranularity{M,N}: number of {rows in A}/{columns in B} that share the same scaling factor -// Given TileShape = Shape<_128,_128,_128>: -// ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D (the shape of the scaling factor) -// ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling -// ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling -// ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling -template -struct GroupScaleConfig { - using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature - using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag - using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size - using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster - - static constexpr int ScaleGranularityM = ScaleGranularityM_; - static constexpr int ScaleGranularityN = ScaleGranularityN_; - static constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM; - static constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN; - - static_assert(size<0>(TileShape{}) == ScaleGranularityM * ScaleMsPerTile, - "FP8 scaling granularity must evenly divide tile shape along M."); - static_assert(size<1>(TileShape{}) == ScaleGranularityN * ScaleNsPerTile, - "FP8 scaling granularity must evenly divide tile shape along N."); - - using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum; - using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative; - using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; - using FusionOperation = cutlass::epilogue::fusion::ScaledLinCombPerRowBiasEltActAmaxAux< +using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag +using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size +using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster + +constexpr int ScaleGranularityM = 1; +constexpr int ScaleGranularityN = 128; +constexpr int ScaleGranularityK = 128; + +constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM; +constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN; + +using ScaleConfig = cutlass::detail::Sm90BlockwiseScaleConfig; + +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix operand + +using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum; +using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative; +using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; +using FusionOperation = 
cutlass::epilogue::fusion::ScaledLinCombPerRowBiasEltActAmaxAux< LayoutAux, cutlass::epilogue::thread::ReLU, ElementD, ElementCompute, ElementAux, ElementAmax, ElementBias, ElementC>; -}; -using GroupScale1D1DConfig = GroupScaleConfig< 1, 1>; -using GroupScale1D2DConfig = GroupScaleConfig< 1, size<1>(TileShape_{})>; -using GroupScale2D1DConfig = GroupScaleConfig(TileShape_{}), 1>; -using GroupScale2D2DConfig = GroupScaleConfig(TileShape_{}), size<1>(TileShape_{})>; - -template -struct GroupScaleGemm { - using ArchTag = typename ScheduleConfig::ArchTag; - using OperatorClass = typename ScheduleConfig::OperatorClass; - using TileShape = typename ScheduleConfig::TileShape; - using ClusterShape = typename ScheduleConfig::ClusterShape; - using KernelSchedule = typename ScheduleConfig::KernelSchedule; - using EpilogueSchedule = typename ScheduleConfig::EpilogueSchedule; - using EpilogueTileType = typename ScheduleConfig::EpilogueTileType; - using FusionOperation = typename ScheduleConfig::FusionOperation; - - using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< ArchTag, OperatorClass, TileShape, ClusterShape, EpilogueTileType, @@ -179,10 +154,10 @@ struct GroupScaleGemm { FusionOperation >::CollectiveOp; - using CollectiveMainloopWithGroupWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder< +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< ArchTag, OperatorClass, - ElementA, LayoutA, AlignmentA, - ElementB, LayoutB, AlignmentB, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, ElementAccumulator, TileShape, ClusterShape, cutlass::gemm::collective::StageCountAutoCarveout< @@ -191,38 +166,26 @@ struct GroupScaleGemm { KernelSchedule >::CollectiveOp; - using GemmKernelDefault = cutlass::gemm::kernel::GemmUniversal< - Shape, - CollectiveMainloopWithGroupWiseScaling, - CollectiveEpilogue - >; - using GemmKernelStreamK = cutlass::gemm::kernel::GemmUniversal< - Shape, - CollectiveMainloopWithGroupWiseScaling, - CollectiveEpilogue, - cutlass::gemm::StreamKScheduler +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape, + CollectiveMainloop, + CollectiveEpilogue, + cutlass::gemm::StreamKScheduler >; - using GemmDefault = cutlass::gemm::device::GemmUniversalAdapter; - using GemmStreamK = cutlass::gemm::device::GemmUniversalAdapter; -}; - -using GroupScale1D1DGemm = GroupScaleGemm; -using GroupScale1D2DGemm = GroupScaleGemm; -using GroupScale2D1DGemm = GroupScaleGemm; -using GroupScale2D2DGemm = GroupScaleGemm; +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; // Extract information from Gemm kernel. 
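With the per-configuration boilerplate gone, trying a different scaling granularity comes down to editing the three constants above and letting the config re-deduce the scale-factor layouts. The fragment below is a hypothetical variant, not part of the example: the 128/1/128 values are arbitrary, and it assumes the same headers and surrounding type definitions as the example file.

```cpp
// Hypothetical alternative granularity (values are illustrative only).
constexpr int ScaleGranularityM = 128;  // one scale factor per 128 rows of A
constexpr int ScaleGranularityN = 1;    // one scale factor per column of B
constexpr int ScaleGranularityK = 128;  // one scale factor per 128-wide K block

using ScaleConfig = cutlass::detail::Sm90BlockwiseScaleConfig<
    ScaleGranularityM, ScaleGranularityN, ScaleGranularityK>;

using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA());
using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB());

// At runtime the layouts are rebuilt from the problem shape exactly as in initialize():
//   layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, L));
//   layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(cute::make_shape(M, N, K, L));
```

The granularities must still divide the 128x128x128 tile shape used by the example, which these illustrative values do.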
-using EpilogueOutputOp = typename GroupScale1D1DGemm::GemmDefault::EpilogueOutputOp; +using EpilogueOutputOp = typename Gemm::EpilogueOutputOp; using ElementScalar = typename EpilogueOutputOp::ElementScalar; using ElementAmax = typename EpilogueOutputOp::ElementAmax; using ActivationFunctor = typename EpilogueOutputOp::ActivationFn; -using StrideA = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideA; -using StrideB = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideB; -using StrideC = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideC; -using StrideD = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideD; +using StrideA = typename Gemm::GemmKernel::StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; using StrideAux = StrideD; constexpr bool IsDFp8 = @@ -242,20 +205,23 @@ StrideB stride_B; StrideC stride_C; StrideD stride_D; StrideAux stride_aux; +LayoutSFA layout_SFA; +LayoutSFB layout_SFB; uint64_t seed; +using LayoutScalar = cutlass::layout::PackedVectorLayout; + cutlass::HostTensor tensor_A; cutlass::HostTensor tensor_B; cutlass::HostTensor tensor_C; cutlass::HostTensor tensor_D; uint32_t mma_promotion_interval; -cutlass::HostTensor blockscale_tensor_A; -cutlass::HostTensor blockscale_tensor_B; +cutlass::HostTensor blockscale_tensor_A; +cutlass::HostTensor blockscale_tensor_B; cutlass::HostTensor tensor_ref_D; cutlass::HostTensor tensor_aux; cutlass::HostTensor tensor_ref_aux; -using LayoutScalar = cutlass::layout::PackedVectorLayout; cutlass::HostTensor scalar_alpha; cutlass::HostTensor scalar_beta; cutlass::HostTensor scale_A; @@ -392,32 +358,25 @@ bool initialize_scale_tensor( } /// Initialize operands to be used in the GEMM and reference GEMM -template void initialize(const Options &options) { - using TileShape = typename GroupScaleConfig::TileShape; - const int ScaleGranularityM = GroupScaleConfig::ScaleGranularityM; - const int ScaleGranularityN = GroupScaleConfig::ScaleGranularityN; - assert(options.m % ScaleGranularityM == 0); assert(options.n % ScaleGranularityN == 0); - // Find Group Scaling tensor shapes based on `ScaleGranularityM`, problem shape, and TileShape - auto groupscale_m = ceil_div(options.m, ScaleGranularityM); - auto groupscale_n = ceil_div(options.n, ScaleGranularityN); - auto blockscale_k = ceil_div(options.k, cute::get<2>(TileShape{})); - stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l)); stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.m, options.n, options.l)); stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l)); stride_aux = stride_D; + layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(options.m, options.n, options.k, options.l)); + layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(options.m, options.n, options.k, options.l)); + auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); - auto groupscale_a_coord = cutlass::make_Coord(groupscale_m * options.l, blockscale_k); - auto groupscale_b_coord = cutlass::make_Coord(groupscale_n * options.l, blockscale_k); + auto groupscale_a_coord = 
cutlass::make_Coord(size(filter_zeros(layout_SFA))); + auto groupscale_b_coord = cutlass::make_Coord(size(filter_zeros(layout_SFB))); tensor_A.resize(a_coord); tensor_B.resize(b_coord); @@ -522,7 +481,9 @@ GemmArguments args_from_options(const Options &options) stride_B, mma_promotion_interval, blockscale_tensor_A.device_data(), - blockscale_tensor_B.device_data() + layout_SFA, + blockscale_tensor_B.device_data(), + layout_SFB }, { {}, // epilogue.thread @@ -572,19 +533,10 @@ GemmArguments args_from_options(const Options &options) } /// Don't know why the compiler does not like verify() being templated... -bool verify(const Options &options, const int ScaleMsPerTile, const int ScaleNsPerTile) { +bool verify(const Options &options) { // // Compute reference output // - const int ScaleGranularityM = get<0>(TileShape_{}) / ScaleMsPerTile; - const int ScaleGranularityN = get<1>(TileShape_{}) / ScaleNsPerTile; - - // Group scaling tensors shapes based `ScaleGranularityM`, CTA Block (TileShape) and GEMM Problem shape - auto blockscale_m = ceil_div(options.m, get<0>(TileShape_{})); - auto blockscale_n = ceil_div(options.n, get<1>(TileShape_{})); - auto blockscale_k = ceil_div(options.k, get<2>(TileShape_{})); - auto groupscale_m = ceil_div(options.m, ScaleGranularityM); - auto groupscale_n = ceil_div(options.n, ScaleGranularityN); // Create instantiation for device reference gemm kernel auto A = cute::make_tensor(tensor_A.host_data(), @@ -618,28 +570,18 @@ bool verify(const Options &options, const int ScaleMsPerTile ) ); - auto blockscale_A = cute::make_tensor(blockscale_tensor_A.host_data(), - cute::make_layout( - cute::make_shape(groupscale_m, blockscale_k, options.l), - cute::make_stride(1, groupscale_m, groupscale_m * blockscale_k) - ) - ); - auto blockscale_B = cute::make_tensor(blockscale_tensor_B.host_data(), - cute::make_layout( - cute::make_shape(groupscale_n, blockscale_k, options.l), - cute::make_stride(1, groupscale_n, groupscale_n * blockscale_k) - ) - ); + auto SFA = cute::make_tensor(blockscale_tensor_A.host_data(), layout_SFA); + auto SFB = cute::make_tensor(blockscale_tensor_B.host_data(), layout_SFB); using unused_t = decltype(D); - cutlass::reference::host::GettMainloopParams mainloop_params{ - A, B, // Operand Tensors - blockscale_A, blockscale_B // Groupwise scaling Tensors - }; + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, + decltype(A), + decltype(SFA), + decltype(B), + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; cutlass::reference::host::GettEpilogueParams< ElementScalar, @@ -713,14 +655,7 @@ bool verify(const Options &options, const int ScaleMsPerTile } /// Execute a given example GEMM computation -template -int run(Options &options) -{ - using TileShape = typename GroupScaleConfig::TileShape; - const int ScaleGranularityM = GroupScaleConfig::ScaleGranularityM; - const int ScaleGranularityN = GroupScaleConfig::ScaleGranularityN; - const int ScaleMsPerTile = GroupScaleConfig::ScaleMsPerTile; - const int ScaleNsPerTile = GroupScaleConfig::ScaleNsPerTile; +int run(Options &options) { bool skip = false; std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl; @@ -747,7 +682,7 @@ int run(Options &options) if (!skip) std::cout << " Running... 
" << std::endl; else return -1; - initialize(options); + initialize(options); // Instantiate CUTLASS kernel depending on templates Gemm gemm; @@ -773,7 +708,7 @@ int run(Options &options) // Check if output from CUTLASS kernel and reference kernel are equal or not Result result; if (options.verify) { - result.passed = verify(options, ScaleMsPerTile, ScaleNsPerTile); + result.passed = verify(options); std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; } @@ -860,28 +795,7 @@ int main(int argc, char const **args) { #if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) bool passed = true; - std::cout << "Basic split-K GEMM kernel" << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - - std::cout << std::endl; - - std::cout << "StreamK GEMM kernel" << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - passed &= run(options); - std::cout << std::endl; - + passed = run(options); if (!passed) return -1; #endif diff --git a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_blockwise_scaling.h b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_blockwise_scaling.h deleted file mode 100644 index 8904060cba..0000000000 --- a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_blockwise_scaling.h +++ /dev/null @@ -1,504 +0,0 @@ -/*************************************************************************************************** - * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - * SPDX-License-Identifier: BSD-3-Clause - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * 3. Neither the name of the copyright holder nor the names of its - * contributors may be used to endorse or promote products derived from - * this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR - * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER - * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, - * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - **************************************************************************************************/ -/*! 
\file - \brief Reference implementation for GETT in host-side code. -*/ - -#pragma once - -///////////////////////////////////////////////////////////////////////////////////////////////// -#include "cutlass/gemm/gemm.h" -#include "cutlass/complex.h" -#include "cutlass/numeric_conversion.h" -#include "cutlass/epilogue/thread/activation.h" -#include "cutlass/relatively_equal.h" -#include -#include "cute/tensor.hpp" - -///////////////////////////////////////////////////////////////////////////////////////////////// - -namespace cutlass::reference::host { - -template -struct ElementTraits { - using type = T; -}; - -template -struct ElementTraits().get()), void> > > { - using type = decltype(std::declval().get()); -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -template< - class ElementAccumulator_, - class TensorA_, // (M, K, L) - class TensorB_, // (N, K, L) - class TensorScaleA_, // (m, k, L) - class TensorScaleB_, // (n, k, L) - class TileShape_ -> -struct GettMainloopParams { - using ElementAccumulator = ElementAccumulator_; - using TensorA = TensorA_; - using TensorB = TensorB_; - using EngineA = typename TensorA::engine_type; - using LayoutA = typename TensorA::layout_type; - using EngineB = typename TensorB::engine_type; - using LayoutB = typename TensorB::layout_type; - - using TensorScaleA = TensorScaleA_; - using TensorScaleB = TensorScaleB_; - using TileShape = TileShape_; - using EngineScaleA = typename TensorScaleA::engine_type; - using EngineScaleB = typename TensorScaleB::engine_type; - - TensorA A{}; - TensorB B{}; - TensorScaleA ScaleA{}; - TensorScaleB ScaleB{}; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// -template< - class ElementScalar_, - class ElementScalingFactor_, - class ElementAccumulator_, - class ElementCompute_, - class TensorC_, // (M, N, L) - class TensorD_, // (M, N, L) - class VectorBias_ = TensorD_, // (M, 1) - class TensorAux_ = TensorD_, // (M, N, L) - class VectorAlpha_ = TensorD_, // (M, 1) - class VectorBeta_ = VectorAlpha_, // (M, 1) - class ActivationFunctor_ = cutlass::epilogue::thread::Identity, - class BiasBinaryOp_ = cutlass::plus, - bool PerColumnBias_ = false -> -struct GettEpilogueParams { - using ElementScalar = ElementScalar_; - using ElementScalingFactor = ElementScalingFactor_; - using ElementAccumulator = ElementAccumulator_; - using ElementCompute = ElementCompute_; - using TensorC = TensorC_; - using TensorD = TensorD_; - using TensorAux = TensorAux_; - using VectorBias = VectorBias_; - using VectorAlpha = VectorAlpha_; - using VectorBeta = VectorBeta_; - using ActivationFunctor = ActivationFunctor_; - using BiasBinaryOp = BiasBinaryOp_; - - using EngineC = typename TensorC::engine_type; - using LayoutC = typename TensorC::layout_type; - using EngineD = typename TensorD::engine_type; - using LayoutD = typename TensorD::layout_type; - static constexpr bool PerColumnBias = PerColumnBias_; - ElementScalar alpha = ElementScalar(1); - ElementScalar beta = ElementScalar(0); - - TensorC C{}; - TensorD D{}; - VectorBias Bias{}; - TensorAux Aux{}; - VectorAlpha Valpha{}; - VectorBeta Vbeta{}; - ElementCompute st = ElementCompute(1); - - ElementAccumulator* abs_max_D = nullptr; - ElementAccumulator* abs_max_Aux = nullptr; - - ElementScalingFactor scale_a = ElementScalingFactor(1); - ElementScalingFactor scale_b = ElementScalingFactor(1); - ElementScalingFactor scale_c = ElementScalingFactor(1); - ElementScalingFactor scale_d = 
ElementScalingFactor(1); - ElementScalingFactor scale_aux = ElementScalingFactor(1); - - bool beta_per_channel_scaling = false; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - General Tensor-Tensor contraction reference kernel with Blockwise scaling -template < - class MainloopParams, - class EpilogueParams -> -void Gett( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - - static int constexpr kBlockM = cute::get<0>(typename MainloopParams::TileShape{}); - static int constexpr kBlockN = cute::get<1>(typename MainloopParams::TileShape{}); - // printf("mainloop_params.ScaleA.layout()"); cute::print(mainloop_params.ScaleA.layout()); printf("\n"); - // printf("mainloop_params.ScaleB.layout()"); cute::print(mainloop_params.ScaleB.layout()); printf("\n"); - -#if defined(_OPENMP) - #pragma omp parallel for collapse(3) -#endif - for (int64_t l = 0; l < cute::size<2>(mainloop_params.A.layout()); ++l) { - for (int64_t m = 0; m < cute::size<0>(mainloop_params.A.layout()); m += kBlockM) { - for (int64_t n = 0; n < cute::size<0>(mainloop_params.B.layout()); n += kBlockN) { - typename MainloopParams::ElementAccumulator acc[kBlockM][kBlockN]; - gett_mainloop(mainloop_params, m, n, l, acc); - gett_epilogue(epilogue_params, m, n, l, acc); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Mainloop -template -void gett_mainloop( - MainloopParams const& mainloop_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "M, K, B"); - static_assert(cute::rank(typename MainloopParams::LayoutB{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementA = typename ElementTraits::type; - using ElementB = typename ElementTraits::type; - using ElementBlockScaleA = typename ElementTraits::type; - using ElementBlockScaleB = typename ElementTraits::type; - - using RingOp = multiply_add; - RingOp fma_op; - - multiplies scale_op; - - static int constexpr kBlockK = cute::get<2>(typename MainloopParams::TileShape{});; - - // Tempo accumulators to seperate blockwise accumulation - typename MainloopParams::ElementAccumulator acc_temp[kBlockM][kBlockN]; - - // Zero out accumulators - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - acc[m_b][n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - - int64_t block_m = m / kBlockM; - int64_t block_n = n / kBlockN; - cute::Tensor blockscale_A = mainloop_params.ScaleA(block_m, _, l); - cute::Tensor blockscale_B = mainloop_params.ScaleB(block_n, _, l); - - // Compute on this k-block - for (int64_t k = 0; k < cute::size<1>(mainloop_params.A.layout()); ++k) { - - // Load Blockwise scaling factor from blockscale Tensors for A and B - int64_t block_k = k / kBlockK; - ElementBlockScaleA scale_a = blockscale_A[block_k]; - ElementBlockScaleB scale_b = blockscale_B[block_k]; - - // Load A - ElementAccumulator a_frag[kBlockM]; - for (int m_b = 0; m_b < kBlockM; ++m_b) { - if (m + m_b < cute::size<0>(mainloop_params.A.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. 
- a_frag[m_b] = static_cast(ElementA(mainloop_params.A(m + m_b, k, l))); - } else { - a_frag[m_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - // Load B - ElementAccumulator b_frag[kBlockN]; - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (n + n_b < cute::size<0>(mainloop_params.B.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. - b_frag[n_b] = static_cast(ElementB(mainloop_params.B(n + n_b, k, l))); - } else { - b_frag[n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - // do compute - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - acc_temp[m_b][n_b] = fma_op(a_frag[m_b], b_frag[n_b], acc_temp[m_b][n_b]); - } - } - - // Apply Blockwise-scaling at kBlockK boundary - // (a) Apply block scaling factors on the partial accumulated results (acc_temp) at the kBlocK boundary - // (b) Zero-out partial temporary (acc_temp), - // (c) Update permanent (accu) - if ((k+1) % kBlockK == 0) { - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - ElementAccumulator blockwise_scaled_accum = acc_temp[m_b][n_b] * scale_a * scale_b; - acc[m_b][n_b] = blockwise_scaled_accum + acc[m_b][n_b]; - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - } - - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Epilogue -template -void gett_epilogue( - EpilogueParams const& epilogue_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == 3, "M, K, B"); - static_assert(cute::rank(typename EpilogueParams::LayoutD{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementCompute = typename EpilogueParams::ElementCompute; - using ElementC = typename EpilogueParams::TensorC::value_type; - using ElementD = typename EpilogueParams::TensorD::value_type; - using ElementAux = typename EpilogueParams::TensorAux::value_type; - using ElementBias = typename EpilogueParams::VectorBias::value_type; - using ElementScalar = typename EpilogueParams::ElementScalar; - using ElementScalingFactor = typename EpilogueParams::ElementScalingFactor; - using ActivationFunctor = typename EpilogueParams::ActivationFunctor; - using BiasBinaryOp = typename EpilogueParams::BiasBinaryOp; - - constexpr bool PerColBias = EpilogueParams::PerColumnBias; - constexpr bool IsScalingAndAmaxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsScalingAndAmaxAuxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsReLUAuxNeeded = - (cute::is_same_v> or - cute::is_same_v>) and - cute::is_same_v; - constexpr bool IsClamp = - cute::is_same_v>; - - constexpr bool IsBackpropFusion = - cute::is_same_v> or - cute::is_same_v>; - - // Input related converter - NumericConverter accumulator_converter; - NumericConverter source_converter; - NumericConverter bias_converter; - [[maybe_unused]] NumericConverter aux_source_converter; - - // Scale related converter - NumericConverter scale_converter; - NumericConverter scaling_factor_converter; - - // Abs max converter - [[maybe_unused]] NumericConverter abs_max_output_converter; - - // Output related converter - NumericConverter destination_converter; - [[maybe_unused]] NumericConverter aux_destination_converter; - NumericConverter dBias_converter; - - // Epilogue operations - multiply_add epilogue_fma; - multiplies 
mul; - plus add; - - // Activation operation - ActivationFunctor activation; - - // Bias binary operation - BiasBinaryOp bias_op; - - // Do conversion - ElementCompute converted_alpha = scale_converter(epilogue_params.alpha); - ElementCompute converted_beta = scale_converter(epilogue_params.beta); - ElementCompute converted_scale_a = scaling_factor_converter(epilogue_params.scale_a); - ElementCompute converted_scale_b = scaling_factor_converter(epilogue_params.scale_b); - ElementCompute converted_scale_c = scaling_factor_converter(epilogue_params.scale_c); - ElementCompute converted_scale_d = scaling_factor_converter(epilogue_params.scale_d); - ElementCompute converted_scale_aux = scaling_factor_converter(epilogue_params.scale_aux); - - // Init local var - [[maybe_unused]] ElementCompute local_abs_max_output = ElementCompute(0); - [[maybe_unused]] ElementCompute local_abs_max_aux_output = ElementCompute(0); - - converted_alpha = mul(converted_alpha, mul(converted_scale_a, converted_scale_b)); - converted_beta = mul(converted_beta, converted_scale_c); - - ElementCompute inter_accum[kBlockM][kBlockN]; - - for (int m_b = 0; m_b < kBlockM; ++m_b) { - ElementCompute local_dBias = ElementCompute(0); - - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - // Convert every type to ElementCompute first, do compute, convert to output type, write it out - ElementCompute converted_acc = accumulator_converter(acc[m_b][n_b]); - // per-row alpha - if (raw_pointer_cast(epilogue_params.Valpha.data())) { - converted_alpha = scale_converter(epilogue_params.Valpha(m + m_b)); - } - ElementCompute output = mul(converted_alpha, converted_acc); - - if (raw_pointer_cast(epilogue_params.Bias.data()) && not IsBackpropFusion) { - ElementCompute converted_bias = bias_converter(epilogue_params.Bias(PerColBias ? n + n_b : m + m_b)); - output = bias_op(output, converted_bias); - } - - if (raw_pointer_cast(epilogue_params.C.data())) { - ElementCompute converted_src = source_converter(epilogue_params.C(m + m_b, n + n_b, l)); - // per-row beta - if (epilogue_params.Vbeta.data()) { - converted_beta = scale_converter(epilogue_params.Vbeta(m + m_b)); - } - output = epilogue_fma(converted_beta, converted_src, output); - } - - if constexpr (IsBackpropFusion) { - ElementAux aux_input = ElementAux(0); - if (raw_pointer_cast(epilogue_params.Aux.data())) { - aux_input = epilogue_params.Aux(m + m_b, n + n_b, l); - } - - output = activation(output, aux_source_converter(aux_input)); - local_dBias = add(local_dBias, output); - } - else { - if (raw_pointer_cast(epilogue_params.Aux.data())) { - auto aux_output = output; - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_aux_output = amax_op(local_abs_max_aux_output, aux_output); - aux_output = epilogue_fma(converted_scale_aux, aux_output, ElementCompute(0)); - } - - if constexpr (IsReLUAuxNeeded) { - epilogue_params.Aux(m + m_b, n + n_b, l) = not (aux_output < 0) ? 
uint1b_t(1) : uint1b_t(0); - } else { - epilogue_params.Aux(m + m_b, n + n_b, l) = aux_destination_converter(aux_output); - } - } - - if constexpr (IsClamp) { // Treat Clamp as ReLU - output = activation(output, {0, std::numeric_limits::max()}); - } - else { - output = activation(output); - } - } - - if constexpr (IsScalingAndAmaxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_output = amax_op(local_abs_max_output, output); - output = epilogue_fma(converted_scale_d, output, ElementCompute(0)); - } - - inter_accum[m_b][n_b] = ElementCompute(output); - } - } // n_b - - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n < cute::size<1>(epilogue_params.D.layout())) { - if (raw_pointer_cast(epilogue_params.Bias.data()) && IsBackpropFusion) { - ElementCompute converted_dBias = bias_converter(epilogue_params.Bias(m + m_b)); - local_dBias = add(local_dBias, converted_dBias); - epilogue_params.Bias(m + m_b) = dBias_converter(local_dBias); - } - } - } // m_b - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - epilogue_params.D(m + m_b, n + n_b, l) = destination_converter(inter_accum[m_b][n_b]); - } - } - } - -#if defined(_OPENMP) - #pragma omp critical(Abs_Max_Data_Update) -#endif - { - if constexpr (IsScalingAndAmaxOutputNeeded) { - if (epilogue_params.abs_max_D) { - *epilogue_params.abs_max_D = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_D, abs_max_output_converter(local_abs_max_output)); - } - } - - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - if (epilogue_params.abs_max_Aux) { - *epilogue_params.abs_max_Aux = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_Aux, abs_max_output_converter(local_abs_max_aux_output)); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GEMM - General Matrix-Matrix contraction without conjugation options -template < - class MainloopParams, - class EpilogueParams -> -void Gemm3x( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - using namespace cute; - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename MainloopParams::LayoutB{})); - static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == cute::rank(typename EpilogueParams::LayoutD{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename EpilogueParams::LayoutC{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "Only Rank3 Tensors (M, K, Batch_Count) " - "with Batchmode are supported"); - // Lower the Matrix-Multiplication with Blockwise scaling (Gemm3x) to a Tensor Contraction (Gett). 
- Gett(mainloop_params, epilogue_params); -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -} // cutlass::reference::host - -///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h b/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h deleted file mode 100644 index 0bf90a4163..0000000000 --- a/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h +++ /dev/null @@ -1,518 +0,0 @@ -/*************************************************************************************************** - * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - * SPDX-License-Identifier: BSD-3-Clause - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * 3. Neither the name of the copyright holder nor the names of its - * contributors may be used to endorse or promote products derived from - * this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR - * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER - * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, - * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - **************************************************************************************************/ -/*! \file - \brief Reference implementation for GETT in host-side code. 
-*/ - -#pragma once - -///////////////////////////////////////////////////////////////////////////////////////////////// -#include "cutlass/gemm/gemm.h" -#include "cutlass/complex.h" -#include "cutlass/numeric_conversion.h" -#include "cutlass/epilogue/thread/activation.h" -#include "cutlass/relatively_equal.h" -#include -#include "cute/tensor.hpp" - -///////////////////////////////////////////////////////////////////////////////////////////////// - -namespace cutlass::reference::host { - -template -struct ElementTraits { - using type = T; -}; - -template -struct ElementTraits().get()), void> > > { - using type = decltype(std::declval().get()); -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -template< - class ElementAccumulator_, - class TensorA_, // (M, K, L) - class TensorB_, // (N, K, L) - class TensorScaleA_, // (m, k, L) - class TensorScaleB_, // (n, k, L) - class TileShape_ -> -struct GettMainloopParams { - using ElementAccumulator = ElementAccumulator_; - using TensorA = TensorA_; - using TensorB = TensorB_; - using EngineA = typename TensorA::engine_type; - using LayoutA = typename TensorA::layout_type; - using EngineB = typename TensorB::engine_type; - using LayoutB = typename TensorB::layout_type; - - using TensorScaleA = TensorScaleA_; - using TensorScaleB = TensorScaleB_; - using TileShape = TileShape_; - using EngineScaleA = typename TensorScaleA::engine_type; - using EngineScaleB = typename TensorScaleB::engine_type; - - TensorA A{}; - TensorB B{}; - TensorScaleA ScaleA{}; - TensorScaleB ScaleB{}; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// -template< - class ElementScalar_, - class ElementScalingFactor_, - class ElementAccumulator_, - class ElementCompute_, - class TensorC_, // (M, N, L) - class TensorD_, // (M, N, L) - class VectorBias_ = TensorD_, // (M, 1) - class TensorAux_ = TensorD_, // (M, N, L) - class VectorAlpha_ = TensorD_, // (M, 1) - class VectorBeta_ = VectorAlpha_, // (M, 1) - class ActivationFunctor_ = cutlass::epilogue::thread::Identity, - class BiasBinaryOp_ = cutlass::plus, - bool PerColumnBias_ = false -> -struct GettEpilogueParams { - using ElementScalar = ElementScalar_; - using ElementScalingFactor = ElementScalingFactor_; - using ElementAccumulator = ElementAccumulator_; - using ElementCompute = ElementCompute_; - using TensorC = TensorC_; - using TensorD = TensorD_; - using TensorAux = TensorAux_; - using VectorBias = VectorBias_; - using VectorAlpha = VectorAlpha_; - using VectorBeta = VectorBeta_; - using ActivationFunctor = ActivationFunctor_; - using BiasBinaryOp = BiasBinaryOp_; - - using EngineC = typename TensorC::engine_type; - using LayoutC = typename TensorC::layout_type; - using EngineD = typename TensorD::engine_type; - using LayoutD = typename TensorD::layout_type; - static constexpr bool PerColumnBias = PerColumnBias_; - ElementScalar alpha = ElementScalar(1); - ElementScalar beta = ElementScalar(0); - - TensorC C{}; - TensorD D{}; - VectorBias Bias{}; - TensorAux Aux{}; - VectorAlpha Valpha{}; - VectorBeta Vbeta{}; - ElementCompute st = ElementCompute(1); - - ElementAccumulator* abs_max_D = nullptr; - ElementAccumulator* abs_max_Aux = nullptr; - - ElementScalingFactor scale_a = ElementScalingFactor(1); - ElementScalingFactor scale_b = ElementScalingFactor(1); - ElementScalingFactor scale_c = ElementScalingFactor(1); - ElementScalingFactor scale_d = ElementScalingFactor(1); - ElementScalingFactor scale_aux = 
ElementScalingFactor(1); - - bool beta_per_channel_scaling = false; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - General Tensor-Tensor contraction reference kernel with Groupwise scaling -template < - class MainloopParams, - class EpilogueParams -> -void Gett( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - - static int constexpr kBlockM = cute::get<0>(typename MainloopParams::TileShape{}); - static int constexpr kBlockN = cute::get<1>(typename MainloopParams::TileShape{}); - // printf("mainloop_params.ScaleA.layout()"); cute::print(mainloop_params.ScaleA.layout()); printf("\n"); - // printf("mainloop_params.ScaleB.layout()"); cute::print(mainloop_params.ScaleB.layout()); printf("\n"); - -#if defined(_OPENMP) - #pragma omp parallel for collapse(3) -#endif - for (int64_t l = 0; l < cute::size<2>(mainloop_params.A.layout()); ++l) { - for (int64_t m = 0; m < cute::size<0>(mainloop_params.A.layout()); m += kBlockM) { - for (int64_t n = 0; n < cute::size<0>(mainloop_params.B.layout()); n += kBlockN) { - typename MainloopParams::ElementAccumulator acc[kBlockM][kBlockN]; - gett_mainloop(mainloop_params, m, n, l, acc); - gett_epilogue(epilogue_params, m, n, l, acc); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Mainloop -template -void gett_mainloop( - MainloopParams const& mainloop_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "M, K, B"); - static_assert(cute::rank(typename MainloopParams::LayoutB{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementA = typename ElementTraits::type; - using ElementB = typename ElementTraits::type; - using ElementBlockScaleA = typename ElementTraits::type; - using ElementBlockScaleB = typename ElementTraits::type; - - using RingOp = multiply_add; - RingOp fma_op; - - multiplies scale_op; - - static int constexpr kBlockK = cute::get<2>(typename MainloopParams::TileShape{});; - - // Tempo accumulators to seperate blockwise accumulation - typename MainloopParams::ElementAccumulator acc_temp[kBlockM][kBlockN]; - - // Zero out accumulators - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - acc[m_b][n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - - const int M = cute::size<0>(mainloop_params.A.layout()); - const int N = cute::size<0>(mainloop_params.B.layout()); - const int ScaleGranularityM = M / cute::size<0>(mainloop_params.ScaleA); - const int ScaleGranularityN = N / cute::size<0>(mainloop_params.ScaleB); - assert(ScaleGranularityM && M % ScaleGranularityM == 0 - && "ScaleGranularityM must divide M"); - assert(ScaleGranularityN && N % ScaleGranularityN == 0 - && "ScaleGranularityN must divide N"); - - cute::Tensor blockscale_A = domain_offset( - make_coord(m / ScaleGranularityM, _0{}), mainloop_params.ScaleA(_, _, l)); - cute::Tensor blockscale_B = domain_offset( - make_coord(n / ScaleGranularityN, _0{}), mainloop_params.ScaleB(_, _, l)); - - // Compute on this k-block - for (int64_t k = 0; k < cute::size<1>(mainloop_params.A.layout()); ++k) { - - // Load Blockwise scaling factor from blockscale Tensors for B - int64_t block_k = k / kBlockK; - cute::Tensor scale_a = blockscale_A(_, block_k); - cute::Tensor scale_b = blockscale_B(_, 
block_k); - - // Load A - ElementAccumulator a_frag[kBlockM]; - for (int m_b = 0; m_b < kBlockM; ++m_b) { - if (m + m_b < cute::size<0>(mainloop_params.A.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. - a_frag[m_b] = static_cast(ElementA(mainloop_params.A(m + m_b, k, l))); - } else { - a_frag[m_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - // Load B - ElementAccumulator b_frag[kBlockN]; - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (n + n_b < cute::size<0>(mainloop_params.B.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. - b_frag[n_b] = static_cast(ElementB(mainloop_params.B(n + n_b, k, l))); - } else { - b_frag[n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - int m_size = std::min(static_cast(kBlockM), cute::size<0>(mainloop_params.A.layout()) - m); - int n_size = std::min(static_cast(kBlockN), cute::size<0>(mainloop_params.B.layout()) - n); - - // do compute - for (int m_b = 0; m_b < m_size; ++m_b) { - for (int n_b = 0; n_b < n_size; ++n_b) { - acc_temp[m_b][n_b] = fma_op(a_frag[m_b], b_frag[n_b], acc_temp[m_b][n_b]); - } - } - - // Apply Groupwise-scaling at kBlockK boundary - // (a) Apply group and block scaling factors on the partial accumulated results (acc_temp) at the kBlocK boundary - // (b) Zero-out partial temporary (acc_temp), - // (c) Update permanent (accu) - if ((k+1) % kBlockK == 0) { - for (int m_b = 0; m_b < m_size; ++m_b) { - auto scale_a_m_b = scale_a[m_b / ScaleGranularityM]; - for (int n_b = 0; n_b < n_size; ++n_b) { - auto scale_b_n_b = scale_b[n_b / ScaleGranularityN]; - ElementAccumulator blockwise_scaled_accum = acc_temp[m_b][n_b] * scale_a_m_b * scale_b_n_b; - acc[m_b][n_b] = blockwise_scaled_accum + acc[m_b][n_b]; - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - } - - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Epilogue -template -void gett_epilogue( - EpilogueParams const& epilogue_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == 3, "M, K, B"); - static_assert(cute::rank(typename EpilogueParams::LayoutD{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementCompute = typename EpilogueParams::ElementCompute; - using ElementC = typename EpilogueParams::TensorC::value_type; - using ElementD = typename EpilogueParams::TensorD::value_type; - using ElementAux = typename EpilogueParams::TensorAux::value_type; - using ElementBias = typename EpilogueParams::VectorBias::value_type; - using ElementScalar = typename EpilogueParams::ElementScalar; - using ElementScalingFactor = typename EpilogueParams::ElementScalingFactor; - using ActivationFunctor = typename EpilogueParams::ActivationFunctor; - using BiasBinaryOp = typename EpilogueParams::BiasBinaryOp; - - constexpr bool PerColBias = EpilogueParams::PerColumnBias; - constexpr bool IsScalingAndAmaxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsScalingAndAmaxAuxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsReLUAuxNeeded = - (cute::is_same_v> or - cute::is_same_v>) and - cute::is_same_v; - constexpr bool IsClamp = - cute::is_same_v>; - - constexpr bool IsBackpropFusion = - cute::is_same_v> or - cute::is_same_v>; - - // Input related converter - NumericConverter 
accumulator_converter; - NumericConverter source_converter; - NumericConverter bias_converter; - [[maybe_unused]] NumericConverter aux_source_converter; - - // Scale related converter - NumericConverter scale_converter; - NumericConverter scaling_factor_converter; - - // Abs max converter - [[maybe_unused]] NumericConverter abs_max_output_converter; - - // Output related converter - NumericConverter destination_converter; - [[maybe_unused]] NumericConverter aux_destination_converter; - NumericConverter dBias_converter; - - // Epilogue operations - multiply_add epilogue_fma; - multiplies mul; - plus add; - - // Activation operation - ActivationFunctor activation; - - // Bias binary operation - BiasBinaryOp bias_op; - - // Do conversion - ElementCompute converted_alpha = scale_converter(epilogue_params.alpha); - ElementCompute converted_beta = scale_converter(epilogue_params.beta); - ElementCompute converted_scale_a = scaling_factor_converter(epilogue_params.scale_a); - ElementCompute converted_scale_b = scaling_factor_converter(epilogue_params.scale_b); - ElementCompute converted_scale_c = scaling_factor_converter(epilogue_params.scale_c); - ElementCompute converted_scale_d = scaling_factor_converter(epilogue_params.scale_d); - ElementCompute converted_scale_aux = scaling_factor_converter(epilogue_params.scale_aux); - - // Init local var - [[maybe_unused]] ElementCompute local_abs_max_output = ElementCompute(0); - [[maybe_unused]] ElementCompute local_abs_max_aux_output = ElementCompute(0); - - converted_alpha = mul(converted_alpha, mul(converted_scale_a, converted_scale_b)); - converted_beta = mul(converted_beta, converted_scale_c); - - ElementCompute inter_accum[kBlockM][kBlockN]; - - for (int m_b = 0; m_b < kBlockM; ++m_b) { - ElementCompute local_dBias = ElementCompute(0); - - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - // Convert every type to ElementCompute first, do compute, convert to output type, write it out - ElementCompute converted_acc = accumulator_converter(acc[m_b][n_b]); - // per-row alpha - if (raw_pointer_cast(epilogue_params.Valpha.data())) { - converted_alpha = scale_converter(epilogue_params.Valpha(m + m_b)); - } - ElementCompute output = mul(converted_alpha, converted_acc); - - if (raw_pointer_cast(epilogue_params.Bias.data()) && not IsBackpropFusion) { - ElementCompute converted_bias = bias_converter(epilogue_params.Bias(PerColBias ? 
n + n_b : m + m_b)); - output = bias_op(output, converted_bias); - } - - if (raw_pointer_cast(epilogue_params.C.data())) { - ElementCompute converted_src = source_converter(epilogue_params.C(m + m_b, n + n_b, l)); - // per-row beta - if (epilogue_params.Vbeta.data()) { - converted_beta = scale_converter(epilogue_params.Vbeta(m + m_b)); - } - output = epilogue_fma(converted_beta, converted_src, output); - } - - if constexpr (IsBackpropFusion) { - ElementAux aux_input = ElementAux(0); - if (raw_pointer_cast(epilogue_params.Aux.data())) { - aux_input = epilogue_params.Aux(m + m_b, n + n_b, l); - } - - output = activation(output, aux_source_converter(aux_input)); - local_dBias = add(local_dBias, output); - } - else { - if (raw_pointer_cast(epilogue_params.Aux.data())) { - auto aux_output = output; - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_aux_output = amax_op(local_abs_max_aux_output, aux_output); - aux_output = epilogue_fma(converted_scale_aux, aux_output, ElementCompute(0)); - } - - if constexpr (IsReLUAuxNeeded) { - epilogue_params.Aux(m + m_b, n + n_b, l) = not (aux_output < 0) ? uint1b_t(1) : uint1b_t(0); - } else { - epilogue_params.Aux(m + m_b, n + n_b, l) = aux_destination_converter(aux_output); - } - } - - if constexpr (IsClamp) { // Treat Clamp as ReLU - output = activation(output, {0, std::numeric_limits::max()}); - } - else { - output = activation(output); - } - } - - if constexpr (IsScalingAndAmaxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_output = amax_op(local_abs_max_output, output); - output = epilogue_fma(converted_scale_d, output, ElementCompute(0)); - } - - inter_accum[m_b][n_b] = ElementCompute(output); - } - } // n_b - - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n < cute::size<1>(epilogue_params.D.layout())) { - if (raw_pointer_cast(epilogue_params.Bias.data()) && IsBackpropFusion) { - ElementCompute converted_dBias = bias_converter(epilogue_params.Bias(m + m_b)); - local_dBias = add(local_dBias, converted_dBias); - epilogue_params.Bias(m + m_b) = dBias_converter(local_dBias); - } - } - } // m_b - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - epilogue_params.D(m + m_b, n + n_b, l) = destination_converter(inter_accum[m_b][n_b]); - } - } - } - -#if defined(_OPENMP) - #pragma omp critical(Abs_Max_Data_Update) -#endif - { - if constexpr (IsScalingAndAmaxOutputNeeded) { - if (epilogue_params.abs_max_D) { - *epilogue_params.abs_max_D = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_D, abs_max_output_converter(local_abs_max_output)); - } - } - - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - if (epilogue_params.abs_max_Aux) { - *epilogue_params.abs_max_Aux = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_Aux, abs_max_output_converter(local_abs_max_aux_output)); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GEMM - General Matrix-Matrix contraction without conjugation options -template < - class MainloopParams, - class EpilogueParams -> -void Gemm3x( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - using namespace cute; - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename MainloopParams::LayoutB{})); - static_assert(cute::rank(typename 
EpilogueParams::LayoutC{}) == cute::rank(typename EpilogueParams::LayoutD{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename EpilogueParams::LayoutC{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "Only Rank3 Tensors (M, K, Batch_Count) " - "with Batchmode are supported"); - // Lower the Matrix-Multiplication with Groupwise scaling (Gemm3x) to a Tensor Contraction (Gett). - Gett(mainloop_params, epilogue_params); -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -} // cutlass::reference::host - -///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling.cu b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling.cu index d20bad5827..d14360deb6 100644 --- a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling.cu +++ b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling.cu @@ -87,11 +87,11 @@ #include "cutlass/util/reference/host/tensor_compare.h" #include "cutlass/util/reference/host/tensor_norm.h" #include "cutlass/util/reference/device/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" // Includes from examples directory #include "helper.h" #include "hopper_fp8_commandline.hpp" -#include "reference/host/gemm_with_groupwise_scaling.h" using namespace cute; @@ -128,54 +128,29 @@ using ElementAccumulator = float; // E using ElementBlockScale = float; // Element type for blockscaling during accumulation using ElementCompute = float; // Element type for epilogue computation -using TileShape_ = Shape<_128,_128,_128>; // This one is just to make the compiler happy with verify()... 
- -// ScaleGranularity{M,N}: number of {rows in A}/{columns in B} that share the same scaling factor -// Given TileShape = Shape<_128,_128,_128>: -// ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D (the shape of the scaling factor) -// ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling -// ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling -// ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling -template -struct GroupScaleConfig { - using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature - using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag - using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size - using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster - - static constexpr int ScaleGranularityM = ScaleGranularityM_; - static constexpr int ScaleGranularityN = ScaleGranularityN_; - static constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM; - static constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN; - - static_assert(size<0>(TileShape{}) == ScaleGranularityM * ScaleMsPerTile, - "FP8 scaling granularity must evenly divide tile shape along M."); - static_assert(size<1>(TileShape{}) == ScaleGranularityN * ScaleNsPerTile, - "FP8 scaling granularity must evenly divide tile shape along N."); - - using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8BlockScaledAccum; - using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; - using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; - using FusionOperation = cutlass::epilogue::fusion::LinearCombination; -}; +using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag +using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size +using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster + +constexpr int ScaleGranularityM = 1; +constexpr int ScaleGranularityN = 128; +constexpr int ScaleGranularityK = 128; + +constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM; +constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN; + +using ScaleConfig = cutlass::detail::Sm90BlockwiseScaleConfig; -using GroupScale1D1DConfig = GroupScaleConfig< 1, 1>; -using GroupScale1D2DConfig = GroupScaleConfig< 1, size<1>(TileShape_{})>; -using GroupScale2D1DConfig = GroupScaleConfig(TileShape_{}), 1>; -using GroupScale2D2DConfig = GroupScaleConfig(TileShape_{}), size<1>(TileShape_{})>; - -template -struct GroupScaleGemm { - using ArchTag = typename ScheduleConfig::ArchTag; - using OperatorClass = typename ScheduleConfig::OperatorClass; - using TileShape = typename ScheduleConfig::TileShape; - using ClusterShape = typename ScheduleConfig::ClusterShape; - using KernelSchedule = typename ScheduleConfig::KernelSchedule; - using EpilogueSchedule = typename ScheduleConfig::EpilogueSchedule; - using EpilogueTileType = typename ScheduleConfig::EpilogueTileType; - using FusionOperation = typename ScheduleConfig::FusionOperation; - - using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix 
operand + +using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8BlockScaledAccum; +using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; +using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; +using FusionOperation = cutlass::epilogue::fusion::LinearCombination; + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< ArchTag, OperatorClass, TileShape, ClusterShape, EpilogueTileType, @@ -186,10 +161,10 @@ struct GroupScaleGemm { FusionOperation >::CollectiveOp; - using CollectiveMainloopWithGroupWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder< +using CollectiveMainloopWithGroupWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder< ArchTag, OperatorClass, - ElementA, LayoutA *, AlignmentA, - ElementB, LayoutB *, AlignmentB, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, ElementAccumulator, TileShape, ClusterShape, cutlass::gemm::collective::StageCountAutoCarveout< @@ -198,29 +173,23 @@ struct GroupScaleGemm { KernelSchedule >::CollectiveOp; - using GemmKernel = cutlass::gemm::kernel::GemmUniversal< - ProblemShape, - CollectiveMainloopWithGroupWiseScaling, - CollectiveEpilogue +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloopWithGroupWiseScaling, + CollectiveEpilogue >; - using Gemm = cutlass::gemm::device::GemmUniversalAdapter; -}; +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; -using GroupScale1D1DGemm = GroupScaleGemm; -using GroupScale1D2DGemm = GroupScaleGemm; -using GroupScale2D1DGemm = GroupScaleGemm; -using GroupScale2D2DGemm = GroupScaleGemm; // Extract information from Gemm kernel. -using EpilogueOutputOp = typename GroupScale1D1DGemm::Gemm::EpilogueOutputOp; +using EpilogueOutputOp = typename Gemm::EpilogueOutputOp; using ElementScalar = typename EpilogueOutputOp::ElementScalar; -using ActivationFunctor = typename EpilogueOutputOp::ActivationFn; -using StrideA = typename GroupScale1D1DGemm::Gemm::GemmKernel::InternalStrideA; -using StrideB = typename GroupScale1D1DGemm::Gemm::GemmKernel::InternalStrideB; -using StrideC = typename GroupScale1D1DGemm::Gemm::GemmKernel::InternalStrideC; -using StrideD = typename GroupScale1D1DGemm::Gemm::GemmKernel::InternalStrideD; +using StrideA = typename Gemm::GemmKernel::InternalStrideA; +using StrideB = typename Gemm::GemmKernel::InternalStrideB; +using StrideC = typename Gemm::GemmKernel::InternalStrideC; +using StrideD = typename Gemm::GemmKernel::InternalStrideD; static_assert(cute::is_same_v, "ElementAccumulator and ElementBlockScale should be same datatype"); @@ -240,6 +209,8 @@ std::vector stride_A_host; std::vector stride_B_host; std::vector stride_C_host; std::vector stride_D_host; +std::vector layout_SFA_host; +std::vector layout_SFB_host; std::vector alpha_host; std::vector beta_host; @@ -265,6 +236,8 @@ cutlass::DeviceAllocation stride_A; cutlass::DeviceAllocation stride_B; cutlass::DeviceAllocation stride_C; cutlass::DeviceAllocation stride_D; +cutlass::DeviceAllocation layout_SFA; +cutlass::DeviceAllocation layout_SFB; cutlass::DeviceAllocation alpha_device; cutlass::DeviceAllocation beta_device; @@ -343,10 +316,6 @@ bool initialize_block( template void allocate(const OptionType &options) { - using TileShape = typename OptionType::GroupScaleConfig::TileShape; - const int ScaleMsPerTile = OptionType::GroupScaleConfig::ScaleMsPerTile; - const int ScaleNsPerTile = OptionType::GroupScaleConfig::ScaleNsPerTile; 
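For reference, the 1x128x128 granularity chosen above (ScaleGranularityM = 1, ScaleGranularityN = 128, ScaleGranularityK = 128) means A carries one scale factor per row per 128-deep K block, while B carries one scale factor per 128-wide N block per 128-deep K block. The example derives the exact (possibly padded) scale-factor layouts from ScaleConfig::tile_atom_to_shape_SFA/SFB and sizes the allocations with size(filter_zeros(...)); the snippet below is only a rough back-of-the-envelope sketch of those per-group element counts, with hypothetical helper names.

```cpp
// Illustrative only: approximate per-group scale-factor counts for the
// 1x128x128 granularity above. The example itself uses
// ScaleConfig::tile_atom_to_shape_SFA/SFB + size(filter_zeros(...)) to get
// the exact extents; this helper is a hypothetical sketch.
#include <cstdint>
#include <cstdio>

constexpr int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

struct ScaleFactorCounts {
  int64_t sfa;  // one scale per (gM rows of A) x (gK columns of K)
  int64_t sfb;  // one scale per (gN columns of B) x (gK rows of K)
};

ScaleFactorCounts approx_scale_counts(int64_t M, int64_t N, int64_t K,
                                      int64_t gM = 1, int64_t gN = 128, int64_t gK = 128) {
  return { ceil_div(M, gM) * ceil_div(K, gK),
           ceil_div(N, gN) * ceil_div(K, gK) };
}

int main() {
  auto c = approx_scale_counts(2816, 3072, 16384);
  std::printf("SFA elements ~%lld, SFB elements ~%lld\n",
              (long long)c.sfa, (long long)c.sfb);
}
```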
- int64_t total_elements_A = 0; int64_t total_elements_B = 0; int64_t total_elements_C = 0; @@ -372,10 +341,8 @@ void allocate(const OptionType &options) { auto N = get<1>(problem); auto K = get<2>(problem); - auto blockscale_shape = shape(get<1>(cute::zipped_divide(cute::make_layout(problem), TileShape{}))); - auto groupscale_m = cute::get<0>(blockscale_shape) * ScaleMsPerTile; // We need to pad along M in scale tensor of A to prevent illegal memory access. - auto groupscale_n = cute::get<1>(blockscale_shape) * ScaleNsPerTile; // We need to pad along N in scale tensor of A to prevent illegal memory access. - auto blockscale_k = cute::get<2>(blockscale_shape); + auto group_layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(M, N, K, 1)); + auto group_layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(M, N, K, 1)); offset_A.push_back(total_elements_A); offset_B.push_back(total_elements_B); @@ -388,8 +355,8 @@ void allocate(const OptionType &options) { int64_t elements_B = K * N; int64_t elements_C = M * N; int64_t elements_D = M * N; - int64_t elements_blockscale_A = groupscale_m * blockscale_k; - int64_t elements_blockscale_B = groupscale_n * blockscale_k; + int64_t elements_blockscale_A = size(filter_zeros(group_layout_SFA)); + int64_t elements_blockscale_B = size(filter_zeros(group_layout_SFB)); total_elements_A += elements_A; total_elements_B += elements_B; @@ -402,6 +369,8 @@ void allocate(const OptionType &options) { stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1})); stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1})); stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1})); + layout_SFA_host.push_back(group_layout_SFA); + layout_SFB_host.push_back(group_layout_SFB); } @@ -477,6 +446,12 @@ void initialize(const OptionType &options) { stride_D.reset(options.groups); stride_D.copy_from_host(stride_D_host.data()); + layout_SFA.reset(options.groups); + layout_SFA.copy_from_host(layout_SFA_host.data()); + + layout_SFB.reset(options.groups); + layout_SFB.copy_from_host(layout_SFB_host.data()); + alpha_device.reset(options.groups); alpha_device.copy_from_host(ptr_alpha_host.data()); beta_device.reset(options.groups); @@ -500,14 +475,14 @@ GemmArguments args_from_options(const OptionType &options, bool host_problem_sha // Change device_id to another value if you are running on a machine with multiple GPUs and wish // to use a GPU other than that with device ID 0. int device_id = 0; - cutlass::KernelHardwareInfo kernel_hw_info = cutlass::KernelHardwareInfo::make_kernel_hardware_info(device_id); + cutlass::KernelHardwareInfo kernel_hw_info = cutlass::KernelHardwareInfo::make_kernel_hardware_info(device_id); GemmArguments arguments{ cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), host_problem_shapes_available ? 
options.problem_sizes_host.data() : (decltype(options.problem_sizes_host.data())) nullptr}, {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(), - ptr_blockscale_A.get(), - ptr_blockscale_B.get() + ptr_blockscale_A.get(), layout_SFA.get(), + ptr_blockscale_B.get(), layout_SFB.get() }, { {}, // epilogue.thread @@ -577,12 +552,6 @@ bool verify(const OptionType &options) { // Group scaling tensors shapes based `ScaleGranularityM`, CTA Block (TileShape) and GEMM Problem shape auto [m, n, k] = options.problem_sizes_host.at(group_idx); auto gemm_problem_shape = cute::make_shape(m, n, k); - auto blockscale_shape = shape(get<1>(cute::zipped_divide(cute::make_layout(gemm_problem_shape), TileShape_{}))); - auto blockscale_m = cute::get<0>(blockscale_shape); - auto blockscale_n = cute::get<1>(blockscale_shape); - auto blockscale_k = cute::get<2>(blockscale_shape); - auto groupscale_m = blockscale_m * OptionType::GroupScaleConfig::ScaleMsPerTile; - auto groupscale_n = blockscale_n * OptionType::GroupScaleConfig::ScaleNsPerTile; // Create instantiation for device reference gemm kernel auto A = cute::make_tensor(block_A_host.data() + offset_A.at(group_idx), @@ -610,32 +579,20 @@ bool verify(const OptionType &options) { ) ); - auto blockscale_A = cute::make_tensor(blockscale_block_A_host.data() + offset_blockscale_A.at(group_idx), - cute::make_layout( - cute::make_shape(groupscale_m, blockscale_k, 1), - cute::make_stride(1, groupscale_m, groupscale_m * blockscale_k) - ) - ); - auto blockscale_B = cute::make_tensor(blockscale_block_B_host.data() + offset_blockscale_B.at(group_idx), - cute::make_layout( - cute::make_shape(groupscale_n, blockscale_k, 1), - cute::make_stride(1, groupscale_n, groupscale_n * blockscale_k) - ) - ); + auto SFA = cute::make_tensor(blockscale_block_A_host.data() + offset_blockscale_A.at(group_idx), + layout_SFA_host.at(group_idx)); + auto SFB = cute::make_tensor(blockscale_block_B_host.data() + offset_blockscale_B.at(group_idx), + layout_SFB_host.at(group_idx)); using unused_t = decltype(D); - cutlass::reference::host::GettMainloopParams< + cutlass::reference::host::GettBlockScalingMainloopParams< ElementAccumulator, - decltype(A), + decltype(A), + decltype(SFA), decltype(B), - decltype(blockscale_A), - decltype(blockscale_B), - TileShape_ - > mainloop_params{ - A, B, // Operand Tensors - blockscale_A, blockscale_B // Groupwise scaling Tensors - }; + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; cutlass::reference::host::GettEpilogueParams< ElementScalar, @@ -647,8 +604,7 @@ bool verify(const OptionType &options) { unused_t, // bias unused_t, // Aux unused_t, // valpha - unused_t, // vbeta - ActivationFunctor + unused_t // vbeta > epilogue_params; epilogue_params.C = C; @@ -679,15 +635,9 @@ bool verify(const OptionType &options) { } /// Execute a given example GEMM computation -template +template int run(OptionType &options, bool host_problem_shapes_available = true) { - using TileShape = typename OptionType::GroupScaleConfig::TileShape; - const int ScaleGranularityM = OptionType::GroupScaleConfig::ScaleGranularityM; - const int ScaleGranularityN = OptionType::GroupScaleConfig::ScaleGranularityN; - const int ScaleMsPerTile = OptionType::GroupScaleConfig::ScaleMsPerTile; - const int ScaleNsPerTile = OptionType::GroupScaleConfig::ScaleNsPerTile; - allocate(options); initialize(options); @@ -797,18 +747,12 @@ int main(int argc, char const **args) { // Parse options // - Options options_1d1d; - Options options_1d2d; - Options options_2d1d; - Options options_2d2d; + Options 
options; - options_1d1d.parse(argc, args); - options_1d2d.parse(argc, args); - options_2d1d.parse(argc, args); - options_2d2d.parse(argc, args); + options.parse(argc, args); - if (options_1d1d.help) { - options_1d1d.print_usage(std::cout) << std::endl; + if (options.help) { + options.print_usage(std::cout) << std::endl; return 0; } @@ -816,22 +760,10 @@ int main(int argc, char const **args) { // Evaluate CUTLASS kernels // - auto run_tests = [&] (bool host_problem_shapes_available = true) { - std::cout << "Grouped GEMM kernel with 1D1D group scale" << std::endl; - run(options_1d1d, host_problem_shapes_available); - std::cout << "Grouped GEMM kernel with 1D2D group scale" << std::endl; - run(options_1d2d, host_problem_shapes_available); - std::cout << "Grouped GEMM kernel with 2D1D group scale" << std::endl; - run(options_2d1d, host_problem_shapes_available); - std::cout << "Grouped GEMM kernel with 2D2D group scale" << std::endl; - run(options_2d2d, host_problem_shapes_available); - std::cout << std::endl; - }; - std::cout << "Running tests with host problem shapes:" << std::endl; - run_tests(true); + run(options, true); std::cout << "Running tests without host problem shapes:" << std::endl; - run_tests(false); + run(options, false); #endif diff --git a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu new file mode 100644 index 0000000000..2ea42bbf58 --- /dev/null +++ b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu @@ -0,0 +1,781 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Grouped scale Hopper FP8 Grouped GEMM example using CUTLASS 3.0 APIs for NVIDIA Hopper architecture + This example demonstrates a grouped scaled FP8 Grouped GEMM using the new CUTLASS 3.0. + APIs on NVIDIA Hopper architecture. New features that will be showcased in this example are as follows: + 1. NVIDIA Hopper architecture introduces a new series of tensor core instructions (GMMA) + which are more efficient than the Ampere tensor core instructions. + 2. NVIDIA Hopper architecture includes new Tensor Memory Accelerator (TMA) unit to transfer large + blocks of data efficiently between global memory and shared memory. TMA also supports asynchronous + copies between thread blocks in a cluster. This example also showcases on-the-fly modification of TMA + descriptors to move between groups/problem_count (represented by groups). + 3. This example uses the Warp Specialized kernel design (see /media/docs/efficient_gemm.md for details). + 4. A simple way to tune the CTA rasterization direction and swizzle pattern of Hopper kernels. Both the + CTA rasterization direction and swizzle pattern impact cross-CTA locality of accesses. By tuning we can + improve performance. + 5. This example is tuned specifically for the sparse groups case, where the number of active groups (groups + with non-zero problem count) is much smaller than the total number of groups. 
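Point 5 above is the distinguishing feature of this variant: most groups may be empty, and only groups with a non-zero problem size contribute any work. The sketch below uses hypothetical types and values (not taken from the example) to illustrate the "active groups" bookkeeping, matching the `M > 0` checks the example applies when accounting bytes and FLOPs.

```cpp
// Hypothetical sketch: iterate only "active" (non-empty) groups.
#include <cstdio>
#include <vector>

struct ProblemSize { int m, n, k; };

int main() {
  // Mostly-empty group list: only two of the groups actually have work.
  std::vector<ProblemSize> problems(16, ProblemSize{0, 0, 0});
  problems[3]  = {256, 512, 128};
  problems[11] = {512, 256, 128};

  int active = 0;
  for (std::size_t g = 0; g < problems.size(); ++g) {
    if (problems[g].m > 0) {  // skip empty groups, as the example does when counting bytes/FLOPs
      ++active;
      std::printf("group %zu: %dx%dx%d\n", g, problems[g].m, problems[g].n, problems[g].k);
    }
  }
  std::printf("%d of %zu groups are active\n", active, problems.size());
}
```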
+ Examples: + $ ./examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups \ + --m=2816 --n=3072 --k=16384 --save_aux=false --save_amax=false \ + --raster=h --swizzle=2 --benchmark=./test_benchmark.txt + + Where the test_benchmark.txt may look as such: + 0 256x512x128 + 1 256x512x512 + 2 512x256x128 + 3 256x256x128 + 4 256x512x1024 + 5 1024x512x128 and so on +*/ + +#include +#include +#include +#include +#include +#include + +#include "cutlass/cutlass.h" +#include "cutlass/numeric_types.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" +#include "cutlass/epilogue/dispatch_policy.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/device/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" + +// Includes from examples directory +#include "helper.h" +#include "hopper_fp8_commandline.hpp" + +using namespace cute; + +using ProblemShape = cutlass::gemm::GroupProblemShape>; // per group + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand +using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// C matrix configuration +using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands +using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// D matrix configuration +using ElementD = ElementC; +using LayoutD = LayoutC; +constexpr int AlignmentD = AlignmentC; + +// Core kernel configurations +using ElementAccumulator = float; // Element type for internal accumulation +using ElementBlockScale = float; // Element type for blockscaling during accumulation +using ElementCompute = float; // Element type for epilogue computation + +using ArchTag 
= cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag + +using TileShape = Shape<_128,_128,_128>; // This one is just to make the compiler happy with verify()... +using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster + +static constexpr int ScaleGranularityM = 1; +static constexpr int ScaleGranularityN = 128; +static constexpr int ScaleGranularityK = 128; +static constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM; +static constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN; + +using ScaleConfig = cutlass::detail::Sm90BlockwiseScaleConfig; + +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix operand + + +using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8BlockScaledAccum; +using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; +using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; +using FusionOperation = cutlass::epilogue::fusion::LinearCombination; + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + TileShape, ClusterShape, + EpilogueTileType, + ElementAccumulator, ElementCompute, + ElementC, LayoutC *, AlignmentC, + ElementD, LayoutD *, AlignmentD, + EpilogueSchedule, + FusionOperation +>::CollectiveOp; + +using CollectiveMainloopWithGroupWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, + ElementAccumulator, + TileShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage)) + >, + KernelSchedule +>::CollectiveOp; + +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloopWithGroupWiseScaling, + CollectiveEpilogue +>; + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// Extract information from Gemm kernel. 
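The `FP8BlockScaledAccum` schedule selected above accumulates each 128-deep K block and then folds in the per-block scale factors of A and B before adding to the running sum, which is also how the host-side `Gett` reference (removed further down in this diff in favor of the shared `gett.hpp`) computes its result. Below is a scalar sketch of that accumulation order under those assumptions; the names and the tail-block handling are illustrative, not the CUTLASS implementation.

```cpp
// Scalar sketch of blockwise-scaled accumulation (illustrative only):
// partial products are accumulated per 128-wide K block, then scaled by
// scale_a[k_block] * scale_b[k_block] before being added to the accumulator.
#include <cstdio>
#include <vector>

float blockscaled_dot(const std::vector<float>& a_row,   // K values of A (already converted from FP8)
                      const std::vector<float>& b_col,   // K values of B
                      const std::vector<float>& scale_a, // one scale per K block for this row group
                      const std::vector<float>& scale_b, // one scale per K block for this column group
                      int block_k = 128) {
  float acc = 0.f, partial = 0.f;
  const int K = static_cast<int>(a_row.size());
  for (int k = 0; k < K; ++k) {
    partial += a_row[k] * b_col[k];
    if ((k + 1) % block_k == 0 || k + 1 == K) {  // K-block boundary: apply the block scales
      int kb = k / block_k;
      acc += partial * scale_a[kb] * scale_b[kb];
      partial = 0.f;
    }
  }
  return acc;
}

int main() {
  std::vector<float> a(256, 1.f), b(256, 1.f), sa(2, 0.5f), sb(2, 0.5f);
  std::printf("blockscaled dot = %f (expected 64)\n", blockscaled_dot(a, b, sa, sb));
}
```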
+using EpilogueOutputOp = typename Gemm::EpilogueOutputOp; +using ElementScalar = typename EpilogueOutputOp::ElementScalar; +using ActivationFunctor = typename EpilogueOutputOp::ActivationFn; + +using StrideA = typename Gemm::GemmKernel::InternalStrideA; +using StrideB = typename Gemm::GemmKernel::InternalStrideB; +using StrideC = typename Gemm::GemmKernel::InternalStrideC; +using StrideD = typename Gemm::GemmKernel::InternalStrideD; + +static_assert(cute::is_same_v, + "ElementAccumulator and ElementBlockScale should be same datatype"); + +/// Initialization + +cutlass::DeviceAllocation problem_sizes; + +std::vector offset_A; +std::vector offset_B; +std::vector offset_C; +std::vector offset_D; +std::vector offset_blockscale_A; +std::vector offset_blockscale_B; + +std::vector stride_A_host; +std::vector stride_B_host; +std::vector stride_C_host; +std::vector stride_D_host; +std::vector layout_SFA_host; +std::vector layout_SFB_host; + +std::vector alpha_host; +std::vector beta_host; + +uint64_t seed; + +cutlass::DeviceAllocation block_A; +cutlass::DeviceAllocation block_B; +cutlass::DeviceAllocation block_C; +cutlass::DeviceAllocation block_D; +cutlass::DeviceAllocation blockscale_block_A; +cutlass::DeviceAllocation blockscale_block_B; + +cutlass::DeviceAllocation ptr_A; +cutlass::DeviceAllocation ptr_B; +cutlass::DeviceAllocation ptr_C; +cutlass::DeviceAllocation ptr_D; +cutlass::DeviceAllocation ptr_ref_D; +cutlass::DeviceAllocation ptr_blockscale_A; +cutlass::DeviceAllocation ptr_blockscale_B; + +cutlass::DeviceAllocation stride_A; +cutlass::DeviceAllocation stride_B; +cutlass::DeviceAllocation stride_C; +cutlass::DeviceAllocation stride_D; +cutlass::DeviceAllocation layout_SFA; +cutlass::DeviceAllocation layout_SFB; + +cutlass::DeviceAllocation alpha_device; +cutlass::DeviceAllocation beta_device; +cutlass::DeviceAllocation block_alpha; +cutlass::DeviceAllocation block_beta; + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams>::RasterOrderOptions; + +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + double gbps; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + double gbps = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), gbps(gbps), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_block( + cutlass::DeviceAllocation& block, + uint64_t seed=2023, + ScopeMin scope_min = std::nullopt, ScopeMax scope_max = std::nullopt) { + + double _scope_max, _scope_min; + int bits_input = cutlass::sizeof_bits::value; + if (bits_input == 1) { + _scope_max = 2; + _scope_min = 0; + } else if 
(bits_input <= 8) { + _scope_max = 2; + _scope_min = -2; + } else if (bits_input == 16) { + _scope_max = 5; + _scope_min = -5; + } else { + _scope_max = 8; + _scope_min = -8; + } + if constexpr (!std::is_same_v) { + _scope_max = scope_max; + } + if constexpr (!std::is_same_v) { + _scope_min = scope_min; + } + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, (Element) _scope_max, (Element) _scope_min, 0); + + return true; +} + +/// Allocates device-side data +template +void allocate(const OptionType &options) { + + int64_t total_elements_A = 0; + int64_t total_elements_B = 0; + int64_t total_elements_C = 0; + int64_t total_elements_D = 0; + int64_t total_elements_blockscale_A = 0; + int64_t total_elements_blockscale_B = 0; + + offset_A.clear(); + offset_B.clear(); + offset_C.clear(); + offset_D.clear(); + offset_blockscale_A.clear(); + offset_blockscale_B.clear(); + stride_A_host.clear(); + stride_B_host.clear(); + stride_C_host.clear(); + stride_D_host.clear(); + + for (int32_t i = 0; i < options.groups; ++i) { + + auto problem = options.problem_sizes_after_alignment_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + auto group_layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(M, N, K, 1)); + auto group_layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(M, N, K, 1)); + + offset_A.push_back(total_elements_A); + offset_B.push_back(total_elements_B); + offset_C.push_back(total_elements_C); + offset_D.push_back(total_elements_D); + offset_blockscale_A.push_back(total_elements_blockscale_A); + offset_blockscale_B.push_back(total_elements_blockscale_B); + + int64_t elements_A = M * K; + int64_t elements_B = K * N; + int64_t elements_C = M * N; + int64_t elements_D = M * N; + int64_t elements_blockscale_A = size(filter_zeros(group_layout_SFA)); + int64_t elements_blockscale_B = size(filter_zeros(group_layout_SFB)); + + total_elements_A += elements_A; + total_elements_B += elements_B; + total_elements_C += elements_C; + total_elements_D += elements_D; + total_elements_blockscale_A += elements_blockscale_A; + total_elements_blockscale_B += elements_blockscale_B; + + stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1})); + stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1})); + stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1})); + stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1})); + layout_SFA_host.push_back(group_layout_SFA); + layout_SFB_host.push_back(group_layout_SFB); + + } + + block_A.reset(total_elements_A); + block_B.reset(total_elements_B); + block_C.reset(total_elements_C); + block_D.reset(total_elements_D); + block_alpha.reset(options.groups); + block_beta.reset(options.groups); + blockscale_block_A.reset(total_elements_blockscale_A); + blockscale_block_B.reset(total_elements_blockscale_B); +} + +/// Initialize operands to be used in the GEMM and reference GEMM +template +void initialize(const OptionType &options) { + + problem_sizes.reset(options.groups); + problem_sizes.copy_from_host(options.problem_sizes_after_alignment_host.data()); + + std::vector ptr_A_host(options.groups); + std::vector ptr_B_host(options.groups); + std::vector ptr_C_host(options.groups); + std::vector ptr_D_host(options.groups); + std::vector ptr_alpha_host(options.groups); + std::vector ptr_beta_host(options.groups); + std::vector ptr_blockscale_A_host(options.groups); + std::vector 
ptr_blockscale_B_host(options.groups); + + alpha_host.clear(); + beta_host.clear(); + + for (int i = 0; i < options.groups; i++) { + ptr_A_host.at(i) = block_A.get() + offset_A.at(i); + ptr_B_host.at(i) = block_B.get() + offset_B.at(i); + ptr_C_host.at(i) = block_C.get() + offset_C.at(i); + ptr_D_host.at(i) = block_D.get() + offset_D.at(i); + ptr_blockscale_A_host.at(i) = blockscale_block_A.get() + offset_blockscale_A.at(i); + ptr_blockscale_B_host.at(i) = blockscale_block_B.get() + offset_blockscale_B.at(i); + alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast((rand() % 5) + 1) : options.alpha); + beta_host.push_back((options.beta == FLT_MAX) ? static_cast(rand() % 5) : options.beta); + ptr_alpha_host.at(i) = block_alpha.get() + i; + ptr_beta_host.at(i) = block_beta.get() + i; + } + + ptr_A.reset(options.groups); + ptr_A.copy_from_host(ptr_A_host.data()); + + ptr_B.reset(options.groups); + ptr_B.copy_from_host(ptr_B_host.data()); + + ptr_C.reset(options.groups); + ptr_C.copy_from_host(ptr_C_host.data()); + + ptr_D.reset(options.groups); + ptr_D.copy_from_host(ptr_D_host.data()); + + ptr_blockscale_A.reset(options.groups); + ptr_blockscale_A.copy_from_host(ptr_blockscale_A_host.data()); + + ptr_blockscale_B.reset(options.groups); + ptr_blockscale_B.copy_from_host(ptr_blockscale_B_host.data()); + + stride_A.reset(options.groups); + stride_A.copy_from_host(stride_A_host.data()); + + stride_B.reset(options.groups); + stride_B.copy_from_host(stride_B_host.data()); + + stride_C.reset(options.groups); + stride_C.copy_from_host(stride_C_host.data()); + + stride_D.reset(options.groups); + stride_D.copy_from_host(stride_D_host.data()); + + layout_SFA.reset(options.groups); + layout_SFA.copy_from_host(layout_SFA_host.data()); + + layout_SFB.reset(options.groups); + layout_SFB.copy_from_host(layout_SFB_host.data()); + + alpha_device.reset(options.groups); + alpha_device.copy_from_host(ptr_alpha_host.data()); + beta_device.reset(options.groups); + beta_device.copy_from_host(ptr_beta_host.data()); + + initialize_block(block_A, seed + 2022); + initialize_block(block_B, seed + 2023); + initialize_block(block_C, seed + 2024); + initialize_block(blockscale_block_A, seed + 2025, -1, 1); + initialize_block(blockscale_block_B, seed + 2026, -1, 1); + + block_alpha.copy_from_host(alpha_host.data()); + block_beta.copy_from_host(beta_host.data()); + +} + +/// Populates a Gemm::Arguments structure from the given commandline options +template +GemmArguments args_from_options(const OptionType &options, bool host_problem_shapes_available = true) +{ + // Change device_id to another value if you are running on a machine with multiple GPUs and wish + // to use a GPU other than that with device ID 0. + int device_id = 0; + cutlass::KernelHardwareInfo kernel_hw_info = cutlass::KernelHardwareInfo::make_kernel_hardware_info(device_id); + + GemmArguments arguments{ + cutlass::gemm::GemmUniversalMode::kGrouped, + {options.groups, problem_sizes.get(), host_problem_shapes_available ? 
options.problem_sizes_after_alignment_host.data() : (decltype(options.problem_sizes_after_alignment_host.data())) nullptr}, + {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(), + ptr_blockscale_A.get(), layout_SFA.get(), + ptr_blockscale_B.get(), layout_SFB.get() + }, + { + {}, // epilogue.thread + ptr_C.get(), stride_C.get(), + ptr_D.get(), stride_D.get() + }, + kernel_hw_info + }; + + auto &fusion_args = arguments.epilogue.thread; + if (options.alpha != FLT_MAX && options.beta != FLT_MAX) { + // If both alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches. + fusion_args.alpha = options.alpha; + fusion_args.beta = options.beta; + fusion_args.alpha_ptr = nullptr; + fusion_args.beta_ptr = nullptr; + fusion_args.alpha_ptr_array = nullptr; + fusion_args.beta_ptr_array = nullptr; + // Single alpha and beta for all groups + fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 0}; + fusion_args.dBeta = {cute::_0{}, cute::_0{}, 0}; + } + else { + // If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups. + fusion_args.alpha = 0; + fusion_args.beta = 0; + fusion_args.alpha_ptr = nullptr; + fusion_args.beta_ptr = nullptr; + fusion_args.alpha_ptr_array = alpha_device.get(); + fusion_args.beta_ptr_array = beta_device.get(); + // One alpha and beta per each group + fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 1}; + fusion_args.dBeta = {cute::_0{}, cute::_0{}, 1}; + } + + arguments.scheduler.raster_order = options.raster; + // The tile scheduler will swizzle up to 8 and with the nearest multiple of 2 (i.e., 1, 2, 4, and 8) + arguments.scheduler.max_swizzle_size = options.swizzle; + + return arguments; +} + +template +bool verify(const OptionType &options) { + + // + // Compute reference output + // + + std::vector block_A_host(block_A.size()); + std::vector block_B_host(block_B.size()); + std::vector block_C_host(block_C.size()); + std::vector block_D_host_kernel(block_D.size()); + std::vector block_D_host_ref(block_D.size()); + std::vector blockscale_block_A_host(blockscale_block_A.size()); + std::vector blockscale_block_B_host(blockscale_block_B.size()); + + block_A.copy_to_host(block_A_host.data()); + block_B.copy_to_host(block_B_host.data()); + block_C.copy_to_host(block_C_host.data()); + block_D.copy_to_host(block_D_host_kernel.data()); + blockscale_block_A.copy_to_host(blockscale_block_A_host.data()); + blockscale_block_B.copy_to_host(blockscale_block_B_host.data()); + + bool passed = true; + for (int group_idx = 0; group_idx < options.groups; group_idx++) { + // Group scaling tensors shapes based `ScaleGranularityM`, CTA Block (TileShape) and GEMM Problem shape + auto [m, n, k] = options.problem_sizes_after_alignment_host.at(group_idx); + auto gemm_problem_shape = cute::make_shape(m, n, k); + + // Create instantiation for device reference gemm kernel + auto A = cute::make_tensor(block_A_host.data() + offset_A.at(group_idx), + cute::make_layout( + cute::make_shape(m, k, 1), + stride_A_host.at(group_idx) + ) + ); + auto B = cute::make_tensor(block_B_host.data() + offset_B.at(group_idx), + cute::make_layout( + cute::make_shape(n, k, 1), + stride_B_host.at(group_idx) + ) + ); + auto C = cute::make_tensor(block_C_host.data() + offset_C.at(group_idx), + cute::make_layout( + cute::make_shape(m, n, 1), + stride_C_host.at(group_idx) + ) + ); + auto D = cute::make_tensor(block_D_host_ref.data() + offset_D.at(group_idx), + cute::make_layout( + cute::make_shape(m, n, 1), + stride_D_host.at(group_idx) + ) + ); + + auto 
SFA = cute::make_tensor(blockscale_block_A_host.data() + offset_blockscale_A.at(group_idx), + layout_SFA_host.at(group_idx)); + auto SFB = cute::make_tensor(blockscale_block_B_host.data() + offset_blockscale_B.at(group_idx), + layout_SFB_host.at(group_idx)); + + using unused_t = decltype(D); + + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, + decltype(A), + decltype(SFA), + decltype(B), + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; + + cutlass::reference::host::GettEpilogueParams< + ElementScalar, + ElementScalar, + ElementAccumulator, + ElementCompute, + decltype(C), + decltype(D) + > epilogue_params; + + epilogue_params.C = C; + epilogue_params.D = D; + epilogue_params.alpha = alpha_host.at(group_idx); + epilogue_params.beta = beta_host.at(group_idx); + + // get reference result + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + auto this_group_passed = std::equal( + // std::execution::par_unseq, + block_D_host_ref.data() + offset_D.at(group_idx), + block_D_host_ref.data() + offset_D.at(group_idx) + m * n, + block_D_host_kernel.data() + offset_D.at(group_idx) + ); + + passed &= this_group_passed; + +#if 0 + std::cout << "Group: " << group_idx << " M: " << m << " N: " << n << " K: " << k << " Status: " << this_group_passed << std::endl; +#endif + + } + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(OptionType &options, bool host_problem_shapes_available = true) +{ + + allocate(options); + initialize(options); + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options, host_problem_shapes_available); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + + if (!result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. 
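The profiling block below averages the timed loop over `options.iterations` and reports throughput via `options.gflops(...)` and the new `options.gbps(...)` added to hopper_fp8_commandline.hpp later in this diff. As a rough standalone sketch, grouped-GEMM throughput accounting looks like the following; the helper names are hypothetical, and the block-scale traffic that the real `gbps()` also counts is omitted here for brevity.

```cpp
// Hypothetical sketch of grouped-GEMM performance accounting:
// 2*M*N*K FLOPs per active group, and A/B/C reads plus D writes for bandwidth.
#include <cstdint>
#include <vector>

struct ProblemSize { int64_t m, n, k; };

double grouped_gflops(const std::vector<ProblemSize>& problems, double runtime_s) {
  double flops = 0.0;
  for (const auto& p : problems) {
    if (p.m > 0) flops += 2.0 * double(p.m) * double(p.n) * double(p.k);  // one multiply-add = 2 FLOPs
  }
  return flops / 1.0e9 / runtime_s;
}

double grouped_gbps(const std::vector<ProblemSize>& problems, double runtime_s,
                    int bytes_ab = 1 /*FP8*/, int bytes_cd = 1 /*FP8*/) {
  double bytes = 0.0;
  for (const auto& p : problems) {
    if (p.m == 0) continue;                              // only active groups move data
    bytes += double(p.m * p.k + p.k * p.n) * bytes_ab;   // read A and B
    bytes += double(p.m * p.n) * bytes_cd;               // read C (beta path)
    bytes += double(p.m * p.n) * bytes_cd;               // write D
  }
  return bytes / 1.0e9 / runtime_s;
}
```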
+ float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + result.gbps = options.template gbps(result.avg_runtime_ms / 1000.0); + + std::string raster = "Heuristic"; + + if (options.raster == RasterOrderOptions::AlongN) { + raster = "Along N"; + } + else if (options.raster == RasterOrderOptions::AlongM) { + raster = "Along M"; + } + + std::cout << " Problem Sizes, Alpha, Beta " << std::endl; + for (int32_t i = 0; i < options.groups; ++i) { + std::cout << " " << options.problem_sizes_host.at(i); + std::cout << ", " << alpha_host.at(i) << ", " << beta_host.at(i) << std::endl; + } + std::cout << " Groups : " << options.groups << std::endl; + std::cout << " Tile shape (M, N, K): " << size<0>(TileShape{}) << ", " << size<1>(TileShape{}) << ", " << size<2>(TileShape{}) << std::endl; + std::cout << " ScaleGranularityM: " << ScaleGranularityM << " (ScaleMsPerTile: " << ScaleMsPerTile << ")" << std::endl; + std::cout << " ScaleGranularityN: " << ScaleGranularityN << " (ScaleNsPerTile: " << ScaleNsPerTile << ")" << std::endl; + std::cout << " Rasterization: " << raster << " with a maximum CTA swizzle of " << options.swizzle << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example + // and must have compute capability at least 90. + if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) { + std::cerr << "This example requires CUDA 12.3 or newer.\n"; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. 
+ return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (props.major != 9) { + std::cerr + << "This example requires a GPU of NVIDIA's Hopper Architecture or " + << "later (compute capability 90 or greater).\n"; + return 0; + } + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // + + run(options, true); + + std::cout << "Running tests without host problem shapes:" << std::endl; + run(options, false); + +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/CMakeLists.txt b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/CMakeLists.txt index f88b31674d..09d506dee1 100644 --- a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/CMakeLists.txt +++ b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/CMakeLists.txt @@ -59,3 +59,26 @@ cutlass_example_add_executable( TEST_SMALL TEST_SMALL_LARGE_GROUP ) + +# MSVC will fail to compile this example with the following error: +# fatal error C1083: Cannot open source file: : No such file or directory [...\examples\68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling\68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups.vcxproj] +# This is a known issue and we are working on a fix. 
+if (NOT MSVC) + +cutlass_example_add_executable( + 68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups + 68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu + TEST_COMMAND_OPTIONS + TEST_RANDOM + TEST_RANDOM_LARGE_GROUP + TEST_EPILOGUE + TEST_EPILOGUE_LARGE_GROUP + TEST_EPILOGUE_OP + TEST_EPILOGUE_OP_LARGE_GROUP + TEST_FIXED + TEST_FIXED_LARGE_GROUP + TEST_SMALL + TEST_SMALL_LARGE_GROUP + ) + +endif() diff --git a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/hopper_fp8_commandline.hpp b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/hopper_fp8_commandline.hpp index 3e425fe23e..19497176db 100644 --- a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/hopper_fp8_commandline.hpp +++ b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/hopper_fp8_commandline.hpp @@ -30,12 +30,11 @@ **************************************************************************************************/ // Command line options parsing -template +template struct Options { using RasterOrderOptions = _RasterOrderOptions; using ProblemShape = _ProblemShape; - using GroupScaleConfig = _GroupScaleConfig; bool help = false; @@ -43,6 +42,7 @@ struct Options { int iterations = 1000; int m = 1024, n = 512, k = 1024, groups = 10; std::string benchmark_path; + std::vector problem_sizes_after_alignment_host; std::vector problem_sizes_host; int const tma_alignment_bits = 128; int const alignment = tma_alignment_bits / cutlass::sizeof_bits::value; @@ -89,6 +89,7 @@ struct Options { // Decide how to initialize the problems if (!benchmark_path.empty()) { if (!benchmark_problems()) { + problem_sizes_after_alignment_host.clear(); problem_sizes_host.clear(); return; } @@ -105,8 +106,8 @@ struct Options { cmd.get_cmd_line_argument("n", cmd_line_n); cmd.get_cmd_line_argument("k", cmd_line_k); + problem_sizes_after_alignment_host.reserve(groups); problem_sizes_host.reserve(groups); - for (int i = groups; i > 0; i--) { int m = cmd_line_m; int n = cmd_line_n; @@ -120,6 +121,7 @@ struct Options { if (k < 1) { k = k_alignment * ((rand() % (32 * alignment / k_alignment)) + 1); } + problem_sizes_after_alignment_host.push_back({m, n, k}); problem_sizes_host.push_back({m, n, k}); } } @@ -142,7 +144,7 @@ struct Options { break; } - cutlass::gemm::GemmCoord extent; + cutlass::gemm::GemmCoord extent_after_alignment, extent; std::vector tokens; cutlass::CommandLine::tokenize(tokens, extent_str, 'x'); @@ -150,23 +152,81 @@ struct Options { for (int i = 0; i < int(tokens.size()); ++i) { int x = std::atoi(tokens.at(i).c_str()); + extent.at(i) = x; // round up if (x % alignment) { x += (alignment - (x % alignment)); } - extent.at(i) = x; + extent_after_alignment.at(i) = x; } - if (extent.product()) { - problem_sizes_host.push_back({extent.m(), extent.n(), extent.k()}); - } + problem_sizes_after_alignment_host.push_back({extent_after_alignment.m(), extent_after_alignment.n(), extent_after_alignment.k()}); + problem_sizes_host.push_back({extent.m(), extent.n(), extent.k()}); } - groups = static_cast(problem_sizes_host.size()); + groups = static_cast(problem_sizes_after_alignment_host.size()); return true; } + /// Calculate memory bandwidth statistics + template + auto gbps(double runtime_s) const { + double total_read_bytes = 0; + double total_write_bytes = 0; + + // Calculate bytes read and written for each problem + for (int i = 0; i < groups; ++i) { + auto problem = 
problem_sizes_host.at(i); + auto M = cute::get<0>(problem); + auto N = cute::get<1>(problem); + auto K = cute::get<2>(problem); + + if (M > 0) { // Only count active problems + // Matrix A: M*K elements read + total_read_bytes += M * K * sizeof(ElementA); + + // Matrix B: K*N elements read + total_read_bytes += K * N * sizeof(ElementB); + + // Matrix C: M*N elements read (for beta operation) + total_read_bytes += M * N * sizeof(ElementC); + + // Block scales for A and B + auto blockscale_shape = cute::shape(cute::get<1>(cute::zipped_divide(cute::make_layout(problem), TileShape{}))); + auto blockscale_m = cute::get<0>(blockscale_shape); + auto blockscale_n = cute::get<1>(blockscale_shape); + auto blockscale_k = cute::get<2>(blockscale_shape); + auto groupscale_m = blockscale_m * ScaleMsPerTile; + auto groupscale_n = blockscale_n * ScaleNsPerTile; + + total_read_bytes += groupscale_m * blockscale_k * sizeof(ElementBlockScale); // A scales + total_read_bytes += groupscale_n * blockscale_k * sizeof(ElementBlockScale); // B scales + + // Matrix D: M*N elements written + total_write_bytes += M * N * sizeof(ElementD); + } + } + + return (total_read_bytes + total_write_bytes) / 1.0e9 / runtime_s; + } + + double bandwidth_util(double eff_bandwidth) const { + int memoryClockRate; + int memoryBusWidth; + cudaDeviceGetAttribute(&memoryClockRate, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&memoryBusWidth, cudaDevAttrGlobalMemoryBusWidth , 0); + double bw = 2.0 * memoryClockRate * (memoryBusWidth / 8) / 1.0e6; + return eff_bandwidth / bw * 100.0; + } + /// Prints the usage statement. std::ostream & print_usage(std::ostream &out) const { diff --git a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h b/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h deleted file mode 100644 index 1a94af670b..0000000000 --- a/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/reference/host/gemm_with_groupwise_scaling.h +++ /dev/null @@ -1,520 +0,0 @@ -/*************************************************************************************************** - * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - * SPDX-License-Identifier: BSD-3-Clause - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * 3. Neither the name of the copyright holder nor the names of its - * contributors may be used to endorse or promote products derived from - * this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR - * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER - * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, - * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - **************************************************************************************************/ -/*! \file - \brief Reference implementation for GETT in host-side code. -*/ - -#pragma once - -///////////////////////////////////////////////////////////////////////////////////////////////// -#include "cutlass/gemm/gemm.h" -#include "cutlass/complex.h" -#include "cutlass/numeric_conversion.h" -#include "cutlass/epilogue/thread/activation.h" -#include "cutlass/relatively_equal.h" -#include -#include "cute/tensor.hpp" - -///////////////////////////////////////////////////////////////////////////////////////////////// - -namespace cutlass::reference::host { - -template -struct ElementTraits { - using type = T; -}; - -template -struct ElementTraits().get()), void> > > { - using type = decltype(std::declval().get()); -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -template< - class ElementAccumulator_, - class TensorA_, // (M, K, L) - class TensorB_, // (N, K, L) - class TensorScaleA_, // (m, k, L) - class TensorScaleB_, // (n, k, L) - class TileShape_ -> -struct GettMainloopParams { - using ElementAccumulator = ElementAccumulator_; - using TensorA = TensorA_; - using TensorB = TensorB_; - using EngineA = typename TensorA::engine_type; - using LayoutA = typename TensorA::layout_type; - using EngineB = typename TensorB::engine_type; - using LayoutB = typename TensorB::layout_type; - - using TensorScaleA = TensorScaleA_; - using TensorScaleB = TensorScaleB_; - using TileShape = TileShape_; - using EngineScaleA = typename TensorScaleA::engine_type; - using EngineScaleB = typename TensorScaleB::engine_type; - - TensorA A{}; - TensorB B{}; - TensorScaleA ScaleA{}; - TensorScaleB ScaleB{}; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// -template< - class ElementScalar_, - class ElementScalingFactor_, - class ElementAccumulator_, - class ElementCompute_, - class TensorC_, // (M, N, L) - class TensorD_, // (M, N, L) - class VectorBias_ = TensorD_, // (M, 1) - class TensorAux_ = TensorD_, // (M, N, L) - class VectorAlpha_ = TensorD_, // (M, 1) - class VectorBeta_ = VectorAlpha_, // (M, 1) - class ActivationFunctor_ = cutlass::epilogue::thread::Identity, - class BiasBinaryOp_ = cutlass::plus, - bool PerColumnBias_ = false -> -struct GettEpilogueParams { - using ElementScalar = ElementScalar_; - using ElementScalingFactor = ElementScalingFactor_; - using ElementAccumulator = ElementAccumulator_; - using ElementCompute = ElementCompute_; - using TensorC = TensorC_; - using TensorD = TensorD_; - using TensorAux = TensorAux_; - using VectorBias = VectorBias_; - using VectorAlpha = VectorAlpha_; - using VectorBeta = VectorBeta_; - using ActivationFunctor = ActivationFunctor_; - using BiasBinaryOp = BiasBinaryOp_; - - using EngineC = typename TensorC::engine_type; - using LayoutC = typename TensorC::layout_type; - using EngineD = typename 
TensorD::engine_type; - using LayoutD = typename TensorD::layout_type; - static constexpr bool PerColumnBias = PerColumnBias_; - ElementScalar alpha = ElementScalar(1); - ElementScalar beta = ElementScalar(0); - - TensorC C{}; - TensorD D{}; - VectorBias Bias{}; - TensorAux Aux{}; - VectorAlpha Valpha{}; - VectorBeta Vbeta{}; - ElementCompute st = ElementCompute(1); - - ElementAccumulator* abs_max_D = nullptr; - ElementAccumulator* abs_max_Aux = nullptr; - - ElementScalingFactor scale_a = ElementScalingFactor(1); - ElementScalingFactor scale_b = ElementScalingFactor(1); - ElementScalingFactor scale_c = ElementScalingFactor(1); - ElementScalingFactor scale_d = ElementScalingFactor(1); - ElementScalingFactor scale_aux = ElementScalingFactor(1); - - bool beta_per_channel_scaling = false; -}; - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - General Tensor-Tensor contraction reference kernel with Groupwise scaling -template < - class MainloopParams, - class EpilogueParams -> -void Gett( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - - static int constexpr kBlockM = cute::get<0>(typename MainloopParams::TileShape{}); - static int constexpr kBlockN = cute::get<1>(typename MainloopParams::TileShape{}); - // printf("mainloop_params.ScaleA.layout()"); cute::print(mainloop_params.ScaleA.layout()); printf("\n"); - // printf("mainloop_params.ScaleB.layout()"); cute::print(mainloop_params.ScaleB.layout()); printf("\n"); - -#if defined(_OPENMP) - #pragma omp parallel for collapse(3) -#endif - for (int64_t l = 0; l < cute::size<2>(mainloop_params.A.layout()); ++l) { - for (int64_t m = 0; m < cute::size<0>(mainloop_params.A.layout()); m += kBlockM) { - for (int64_t n = 0; n < cute::size<0>(mainloop_params.B.layout()); n += kBlockN) { - typename MainloopParams::ElementAccumulator acc[kBlockM][kBlockN]; - gett_mainloop(mainloop_params, m, n, l, acc); - gett_epilogue(epilogue_params, m, n, l, acc); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Mainloop -template -void gett_mainloop( - MainloopParams const& mainloop_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "M, K, B"); - static_assert(cute::rank(typename MainloopParams::LayoutB{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementA = typename ElementTraits::type; - using ElementB = typename ElementTraits::type; - using ElementBlockScaleA = typename ElementTraits::type; - using ElementBlockScaleB = typename ElementTraits::type; - - using RingOp = multiply_add; - RingOp fma_op; - - multiplies scale_op; - - static int constexpr kBlockK = cute::get<2>(typename MainloopParams::TileShape{});; - - // Tempo accumulators to seperate blockwise accumulation - typename MainloopParams::ElementAccumulator acc_temp[kBlockM][kBlockN]; - - // Zero out accumulators - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - acc[m_b][n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - - const int M = cute::size<0>(mainloop_params.A.layout()); - const int N = cute::size<0>(mainloop_params.B.layout()); - - const int ScaleGranularityM = M / cute::size<0>(mainloop_params.ScaleA.layout()); - const int ScaleGranularityN = N / 
cute::size<0>(mainloop_params.ScaleB.layout()); - - assert(ScaleGranularityM && M % ScaleGranularityM == 0 && "ScaleGranularityM must divide M"); - assert(ScaleGranularityN && N % ScaleGranularityN == 0 && "ScaleGranularityN must divide N"); - - cute::Tensor blockscale_A = domain_offset(make_coord(m / ScaleGranularityM, _0{}), mainloop_params.ScaleA(_, _, l)); - cute::Tensor blockscale_B = domain_offset(make_coord(n / ScaleGranularityN, _0{}), mainloop_params.ScaleB(_, _, l)); - - // Compute on this k-block - for (int64_t k = 0; k < cute::size<1>(mainloop_params.A.layout()); ++k) { - - // Load Blockwise scaling factor from blockscale Tensors for B - int64_t block_k = k / kBlockK; - cute::Tensor scale_a = blockscale_A(_, block_k); - cute::Tensor scale_b = blockscale_B(_, block_k); - - // Load A - ElementAccumulator a_frag[kBlockM]; - for (int m_b = 0; m_b < kBlockM; ++m_b) { - if (m + m_b < cute::size<0>(mainloop_params.A.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. - a_frag[m_b] = static_cast(ElementA(mainloop_params.A(m + m_b, k, l))); - } else { - a_frag[m_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - // Load B - ElementAccumulator b_frag[kBlockN]; - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (n + n_b < cute::size<0>(mainloop_params.B.layout())) { - // Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type. - b_frag[n_b] = static_cast(ElementB(mainloop_params.B(n + n_b, k, l))); - } else { - b_frag[n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity - } - } - - // do compute - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - acc_temp[m_b][n_b] = fma_op(a_frag[m_b], b_frag[n_b], acc_temp[m_b][n_b]); - } - } - - // Apply Groupwise-scaling at kBlockK boundary - // (a) Apply group and block scaling factors on the partial accumulated results (acc_temp) at the kBlocK boundary - // (b) Zero-out partial temporary (acc_temp), - // (c) Update permanent (accu) - if ((k+1) % kBlockK == 0) { - for (int m_b = 0; m_b < kBlockM; ++m_b) { - auto scale_a_m_b = scale_a[m_b / ScaleGranularityM]; - for (int n_b = 0; n_b < kBlockN; ++n_b) { - auto scale_b_n_b = scale_b[n_b / ScaleGranularityN]; - ElementAccumulator blockwise_scaled_accum = acc_temp[m_b][n_b] * scale_a_m_b * scale_b_n_b; - acc[m_b][n_b] = blockwise_scaled_accum + acc[m_b][n_b]; - acc_temp[m_b][n_b] = ElementAccumulator(0); - } - } - } - - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GETT - Epilogue -template -void gett_epilogue( - EpilogueParams const& epilogue_params, - int64_t m, - int64_t n, - int64_t l, - ElementAccumulator (&acc)[kBlockM][kBlockN]) -{ - static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == 3, "M, K, B"); - static_assert(cute::rank(typename EpilogueParams::LayoutD{}) == 3, "N, K, B"); - - using cute::raw_pointer_cast; - - using ElementCompute = typename EpilogueParams::ElementCompute; - using ElementC = typename EpilogueParams::TensorC::value_type; - using ElementD = typename EpilogueParams::TensorD::value_type; - using ElementAux = typename EpilogueParams::TensorAux::value_type; - using ElementBias = typename EpilogueParams::VectorBias::value_type; - using ElementScalar = typename EpilogueParams::ElementScalar; - using ElementScalingFactor = typename EpilogueParams::ElementScalingFactor; - using ActivationFunctor = typename 
EpilogueParams::ActivationFunctor; - using BiasBinaryOp = typename EpilogueParams::BiasBinaryOp; - - constexpr bool PerColBias = EpilogueParams::PerColumnBias; - constexpr bool IsScalingAndAmaxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsScalingAndAmaxAuxOutputNeeded = - cute::is_same_v or - cute::is_same_v; - - constexpr bool IsReLUAuxNeeded = - (cute::is_same_v> or - cute::is_same_v>) and - cute::is_same_v; - constexpr bool IsClamp = - cute::is_same_v>; - - constexpr bool IsBackpropFusion = - cute::is_same_v> or - cute::is_same_v>; - - // Input related converter - NumericConverter accumulator_converter; - NumericConverter source_converter; - NumericConverter bias_converter; - [[maybe_unused]] NumericConverter aux_source_converter; - - // Scale related converter - NumericConverter scale_converter; - NumericConverter scaling_factor_converter; - - // Abs max converter - [[maybe_unused]] NumericConverter abs_max_output_converter; - - // Output related converter - NumericConverter destination_converter; - [[maybe_unused]] NumericConverter aux_destination_converter; - NumericConverter dBias_converter; - - // Epilogue operations - multiply_add epilogue_fma; - multiplies mul; - plus add; - - // Activation operation - - auto activation = [] (ElementCompute x, ElementCompute y = ElementCompute(0)) { - if constexpr (std::is_same_v) { - return x + y; - } else { - return ActivationFunctor()(x, y); - } - }; - - // Bias binary operation - BiasBinaryOp bias_op; - - // Do conversion - ElementCompute converted_alpha = scale_converter(epilogue_params.alpha); - ElementCompute converted_beta = scale_converter(epilogue_params.beta); - ElementCompute converted_scale_a = scaling_factor_converter(epilogue_params.scale_a); - ElementCompute converted_scale_b = scaling_factor_converter(epilogue_params.scale_b); - ElementCompute converted_scale_c = scaling_factor_converter(epilogue_params.scale_c); - ElementCompute converted_scale_d = scaling_factor_converter(epilogue_params.scale_d); - ElementCompute converted_scale_aux = scaling_factor_converter(epilogue_params.scale_aux); - - // Init local var - [[maybe_unused]] ElementCompute local_abs_max_output = ElementCompute(0); - [[maybe_unused]] ElementCompute local_abs_max_aux_output = ElementCompute(0); - - converted_alpha = mul(converted_alpha, mul(converted_scale_a, converted_scale_b)); - converted_beta = mul(converted_beta, converted_scale_c); - - ElementCompute inter_accum[kBlockM][kBlockN]; - - for (int m_b = 0; m_b < kBlockM; ++m_b) { - ElementCompute local_dBias = ElementCompute(0); - - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - // Convert every type to ElementCompute first, do compute, convert to output type, write it out - ElementCompute converted_acc = accumulator_converter(acc[m_b][n_b]); - // per-row alpha - if (raw_pointer_cast(epilogue_params.Valpha.data())) { - converted_alpha = scale_converter(epilogue_params.Valpha(m + m_b)); - } - ElementCompute output = mul(converted_alpha, converted_acc); - - if (raw_pointer_cast(epilogue_params.Bias.data()) && not IsBackpropFusion) { - ElementCompute converted_bias = bias_converter(epilogue_params.Bias(PerColBias ? 
n + n_b : m + m_b)); - output = bias_op(output, converted_bias); - } - - if (raw_pointer_cast(epilogue_params.C.data())) { - ElementCompute converted_src = source_converter(epilogue_params.C(m + m_b, n + n_b, l)); - // per-row beta - if (epilogue_params.Vbeta.data()) { - converted_beta = scale_converter(epilogue_params.Vbeta(m + m_b)); - } - output = epilogue_fma(converted_beta, converted_src, output); - } - - if constexpr (IsBackpropFusion) { - ElementAux aux_input = ElementAux(0); - if (raw_pointer_cast(epilogue_params.Aux.data())) { - aux_input = epilogue_params.Aux(m + m_b, n + n_b, l); - } - - output = activation(output, aux_source_converter(aux_input)); - local_dBias = add(local_dBias, output); - } - else { - if (raw_pointer_cast(epilogue_params.Aux.data())) { - auto aux_output = output; - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_aux_output = amax_op(local_abs_max_aux_output, aux_output); - aux_output = epilogue_fma(converted_scale_aux, aux_output, ElementCompute(0)); - } - - if constexpr (IsReLUAuxNeeded) { - epilogue_params.Aux(m + m_b, n + n_b, l) = not (aux_output < 0) ? uint1b_t(1) : uint1b_t(0); - } else { - epilogue_params.Aux(m + m_b, n + n_b, l) = aux_destination_converter(aux_output); - } - } - - if constexpr (IsClamp) { // Treat Clamp as ReLU - output = activation(output, {0, std::numeric_limits::max()}); - } - else { - output = activation(output); - } - } - - if constexpr (IsScalingAndAmaxOutputNeeded) { - maximum_absolute_value_reduction amax_op; - local_abs_max_output = amax_op(local_abs_max_output, output); - output = epilogue_fma(converted_scale_d, output, ElementCompute(0)); - } - - inter_accum[m_b][n_b] = ElementCompute(output); - } - } // n_b - - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n < cute::size<1>(epilogue_params.D.layout())) { - if (raw_pointer_cast(epilogue_params.Bias.data()) && IsBackpropFusion) { - ElementCompute converted_dBias = bias_converter(epilogue_params.Bias(m + m_b)); - local_dBias = add(local_dBias, converted_dBias); - epilogue_params.Bias(m + m_b) = dBias_converter(local_dBias); - } - } - } // m_b - for (int m_b = 0; m_b < kBlockM; ++m_b) { - for (int n_b = 0; n_b < kBlockN; ++n_b) { - if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) { - epilogue_params.D(m + m_b, n + n_b, l) = destination_converter(inter_accum[m_b][n_b]); - } - } - } - -#if defined(_OPENMP) - #pragma omp critical(Abs_Max_Data_Update) -#endif - { - if constexpr (IsScalingAndAmaxOutputNeeded) { - if (epilogue_params.abs_max_D) { - *epilogue_params.abs_max_D = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_D, abs_max_output_converter(local_abs_max_output)); - } - } - - if constexpr (IsScalingAndAmaxAuxOutputNeeded) { - if (epilogue_params.abs_max_Aux) { - *epilogue_params.abs_max_Aux = maximum_with_nan_propogation{}( - *epilogue_params.abs_max_Aux, abs_max_output_converter(local_abs_max_aux_output)); - } - } - } -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -/// GEMM - General Matrix-Matrix contraction without conjugation options -template < - class MainloopParams, - class EpilogueParams -> -void Gemm3x( - MainloopParams const& mainloop_params, - EpilogueParams const& epilogue_params) -{ - using namespace cute; - - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename MainloopParams::LayoutB{})); - static_assert(cute::rank(typename 
EpilogueParams::LayoutC{}) == cute::rank(typename EpilogueParams::LayoutD{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename EpilogueParams::LayoutC{})); - static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "Only Rank3 Tensors (M, K, Batch_Count) " - "with Batchmode are supported"); - // Lower the Matrix-Multiplication with Groupwise scaling (Gemm3x) to a Tensor Contraction (Gett). - Gett(mainloop_params, epilogue_params); -} - -///////////////////////////////////////////////////////////////////////////////////////////////// - -} // cutlass::reference::host - -///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu index c1978c3212..9b56697bdc 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu @@ -374,7 +374,7 @@ void allocate(Options const& options) { auto N = get<1>(problem); auto K = get<2>(problem); - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); offset_A.push_back(total_elements_A); offset_B.push_back(total_elements_B * cutlass::sizeof_bits::value / 8); @@ -510,7 +510,7 @@ void initialize(Options &options) { beta_device.copy_from_host(ptr_beta_host.data()); initialize_tensor(block_A, seed + 2023); - initialize_quant_tensor(block_B, seed + 2022); + initialize_tensor(block_B, seed + 2022); initialize_tensor(block_C, seed + 2021); initialize_scale(block_scale, options); initialize_zero(block_zero, options); @@ -519,13 +519,13 @@ void initialize(Options &options) { for (int32_t i = 0; i < options.groups; ++i) { - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); auto shape_B = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), cute::get<2>(options.problem_sizes_host[i]), Int<1>{}); auto shape_scale = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), scale_k, Int<1>{}); auto layout_B = make_layout(shape_B, stride_B_host.at(i)); auto layout_scale = make_layout(shape_scale, stride_S_host_ref.at(i)); cudaStream_t stream = cudaStreamDefault; - cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale, options.k, stream); + cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale, options.c, stream); } problem_sizes.reset(options.groups); @@ -619,7 +619,7 @@ typename Gemm::Arguments args_from_options(Options const& options, bool host_pro arguments = Args { cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), nullptr}, - {ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.k}, + {ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.c}, {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, hw_info }; @@ -676,6 +676,7 @@ bool verify(Options const& options) { for (int32_t i = 0; i < options.groups; ++i) { auto problem = options.problem_sizes_host.at(i); + // we don't swap and transpose in the verify so revert the problem shape. 
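+      // The kernel arguments built in args_from_options above pass B as the first operand and A as
+      // the second (operands swapped), so the device computes the transposed problem; the host
+      // reference below keeps the original orientation, hence N is read from position 0 and M from
+      // position 1 of the stored problem shape.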
auto N = get<0>(problem); auto M = get<1>(problem); auto K = get<2>(problem); @@ -712,7 +713,7 @@ bool verify(Options const& options) { CUDA_CHECK(cudaDeviceSynchronize()); passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor); - std::cout << "Group: " << i << " Status: " << passed << std::endl; + std::cout << "Group " << i << ": " << options.problem_sizes_host[i] << ", alpha: " << alpha_host[i] << ", beta: " << beta_host[i] << " Status: " << passed << std::endl; } } return passed; diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_fp8_grouped_gemm.cu b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_fp8_grouped_gemm.cu index 07ff66b31a..8407cdad5e 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_fp8_grouped_gemm.cu +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_fp8_grouped_gemm.cu @@ -341,7 +341,7 @@ void allocate(Options const& options) { auto N = get<1>(problem); auto K = get<2>(problem); - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); offset_A.push_back(total_elements_A); offset_B.push_back(total_elements_B * cutlass::sizeof_bits::value / 8); @@ -479,7 +479,7 @@ void initialize(Options& options) { beta_device.copy_from_host(ptr_beta_host.data()); initialize_tensor(block_A, seed + 2023); - initialize_quant_tensor(block_B, seed + 2022); + initialize_tensor(block_B, seed + 2022); cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size()); initialize_tensor(block_C, seed + 2021); initialize_scale(block_scale, options); @@ -565,7 +565,7 @@ typename Gemm::Arguments args_from_options(Options const& options, bool host_pro arguments = Args { cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), nullptr}, - {ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale_packed.get(), stride_S.get(), options.k}, + {ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale_packed.get(), stride_S.get(), options.c}, {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, hw_info }; @@ -617,6 +617,7 @@ bool verify(Options const& options) { for (int32_t i = 0; i < options.groups; ++i) { auto problem = options.problem_sizes_host.at(i); + // we don't swap and transpose in the verify so revert the problem shape. 
auto N = get<0>(problem); auto M = get<1>(problem); auto K = get<2>(problem); @@ -630,11 +631,11 @@ bool verify(Options const& options) { stride_A_verif = cutlass::make_cute_packed_stride(StrideA_verif{}, cute::make_shape(M, K, 1)); stride_B_verif = cutlass::make_cute_packed_stride(StrideB_verif{}, cute::make_shape(N, K, 1)); - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); auto layout_B = make_layout(cute::make_shape(N, K, Int<1>{}), stride_B_host.at(i)); auto layout_scale_zero = make_layout(cute::make_shape(N, scale_k, Int<1>{}), stride_S_host_ref.at(i)); cudaStream_t stream = cudaStreamDefault; - cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.k, stream); + cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.c, stream); // // Compute reference output @@ -659,7 +660,7 @@ bool verify(Options const& options) { CUDA_CHECK(cudaDeviceSynchronize()); passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor); - std::cout << "Group: " << i << " Status: " << passed << std::endl; + std::cout << "Group " << i << ": " << options.problem_sizes_host[i] << ", alpha: " << alpha_host[i] << ", beta: " << beta_host[i] << " Status: " << passed << std::endl; } } return passed; diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_mixed_dtype_grouped_gemm.cu b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_mixed_dtype_grouped_gemm.cu index ffeb233ea5..41cccfbbf1 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_mixed_dtype_grouped_gemm.cu +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_mixed_dtype_grouped_gemm.cu @@ -282,7 +282,7 @@ void allocate(Options const& options) { auto N = get<1>(problem); auto K = get<2>(problem); - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); offset_A.push_back(total_elements_A); offset_B.push_back(total_elements_B * cutlass::sizeof_bits::value / 8); @@ -418,7 +418,7 @@ void initialize(Options &options) { beta_device.copy_from_host(ptr_beta_host.data()); initialize_tensor(block_A, seed + 2023); - initialize_quant_tensor(block_B, seed + 2022); + initialize_tensor(block_B, seed + 2022); initialize_tensor(block_C, seed + 2021); initialize_scale(block_scale, options); initialize_zero(block_zero, options); @@ -485,7 +485,7 @@ typename Gemm::Arguments args_from_options(Options const& options, bool host_pro arguments = typename Gemm::Arguments { cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), nullptr}, - {ptr_B.get(), stride_B.get(), ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.k}, + {ptr_B.get(), stride_B.get(), ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.c}, {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, hw_info }; @@ -542,6 +542,7 @@ bool verify(Options const& options) { for (int32_t i = 0; i < options.groups; ++i) { auto problem = options.problem_sizes_host.at(i); + // we don't swap and transpose in the verify so revert the problem shape. 
auto N = get<0>(problem); auto M = get<1>(problem); auto K = get<2>(problem); @@ -555,11 +556,11 @@ bool verify(Options const& options) { stride_A_verif = cutlass::make_cute_packed_stride(StrideA_verif{}, cute::make_shape(M, K, 1)); stride_B_verif = cutlass::make_cute_packed_stride(StrideB_verif{}, cute::make_shape(N, K, 1)); - const int scale_k = 1; + int const scale_k = cutlass::ceil_div(options.k, options.c); auto layout_B = make_layout(cute::make_shape(N, K, Int<1>{}), stride_B_host.at(i)); auto layout_scale_zero = make_layout(cute::make_shape(N, scale_k, Int<1>{}), stride_S_host_ref.at(i)); cudaStream_t stream = cudaStreamDefault; - cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.k, stream); + cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.c, stream); // // Compute reference output @@ -584,7 +585,7 @@ bool verify(Options const& options) { CUDA_CHECK(cudaDeviceSynchronize()); passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor); - std::cout << "Group: " << i << " Status: " << passed << std::endl; + std::cout << "Group " << i << ": " << options.problem_sizes_host[i] << ", alpha: " << alpha_host[i] << ", beta: " << beta_host[i] << " Status: " << passed << std::endl; } } return passed; diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/CMakeLists.txt b/examples/69_hopper_mixed_dtype_grouped_gemm/CMakeLists.txt index 4c21cd4854..f32c5d527f 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/CMakeLists.txt +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/CMakeLists.txt @@ -50,6 +50,7 @@ set(TEST_RANDOM_PERF_LARGE_GROUP --groups=100 --iterations=10) set(TEST_DIRECT_BATCHED --m=2048 --n=5120 --k=8192 --mode=0 --iterations=0) # Direct conversion set(TEST_SCALE_PERCOL --m=4096 --n=5120 --k=8192 --c=8192 --mode=1 --iterations=0) # Per Column scaling +set(TEST_SCALE_GROUP --m=2048 --n=5120 --k=8192 --c=512 --mode=1 --iterations=0) # Group-wise scaling cutlass_example_add_executable( 69_hopper_mixed_dtype_grouped_gemm @@ -69,6 +70,7 @@ cutlass_example_add_executable( TEST_RANDOM_PERF_LARGE_GROUP TEST_DIRECT_BATCHED TEST_SCALE_PERCOL + TEST_SCALE_GROUP ) cutlass_example_add_executable( @@ -89,6 +91,7 @@ cutlass_example_add_executable( TEST_RANDOM_PERF_LARGE_GROUP TEST_DIRECT_BATCHED TEST_SCALE_PERCOL + TEST_SCALE_GROUP ) cutlass_example_add_executable( @@ -109,4 +112,5 @@ cutlass_example_add_executable( TEST_RANDOM_PERF_LARGE_GROUP TEST_DIRECT_BATCHED TEST_SCALE_PERCOL + TEST_SCALE_GROUP ) diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/README.md b/examples/69_hopper_mixed_dtype_grouped_gemm/README.md index f4d71ea3f1..10b57aa08c 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/README.md +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/README.md @@ -7,11 +7,11 @@ This example shows how to perform Grouped GEMMs on Hopper when A and B have diff - in the arguments, pass the group size, array of the problem sizes, and the array of strides for matrix A and B. - if scales and zero-points are included, also pass the array of their strides in the arguments. -Note that in Example 55, the argument `--g` is used to determine the block scale size. 
It is important not to confuse this with the `--groups` argument in this example, which specifies the number of GEMMs. +Note that in Example 55, the argument `--g` is used to determine the group size of scaling. To avoid confusion with the `--groups` argument in this example, which defines the number of GEMMs, `--c` is used here to represent the group size for scaling. ## Upcoming features -Currently, the Mixed-input Grouped GEMM only supports row-wise scaling. Please contact us if zero-points or block-wise scaling are needed. +Currently, the Mixed-input Grouped GEMM only supports row-wise scaling, and group-wise scaling for identical problem shapes across all groups. Please contact us if zero-points or block-wise scaling are needed. ## Copyright diff --git a/examples/69_hopper_mixed_dtype_grouped_gemm/grouped_mixed_dtype_utils.hpp b/examples/69_hopper_mixed_dtype_grouped_gemm/grouped_mixed_dtype_utils.hpp index db391cce8f..8568b467dd 100644 --- a/examples/69_hopper_mixed_dtype_grouped_gemm/grouped_mixed_dtype_utils.hpp +++ b/examples/69_hopper_mixed_dtype_grouped_gemm/grouped_mixed_dtype_utils.hpp @@ -58,6 +58,7 @@ class GroupedMixedDtypeOptions : public MixedDtypeOptions { void parse(int argc, char const **args) { cutlass::CommandLine cmd(argc, args); cmd.get_cmd_line_argument("groups", groups); + cmd.get_cmd_line_argument("benchmark", benchmark_path); cmd.get_cmd_line_argument("c", c); MixedDtypeOptions::parse(argc, args); @@ -71,6 +72,7 @@ class GroupedMixedDtypeOptions : public MixedDtypeOptions { << " --m= Sets the M extent of the GEMM for all groups\n" << " --n= Sets the N extent of the GEMM for all groups\n" << " --k= Sets the K extent of the GEMM for all groups\n" + << " --c= Sets the chunk size for scaling the quantized weights\n" << " --groups= Sets the number of individual GEMM problems\n" << " --mode= The mode to run the gemm\n" << " --alpha= Epilogue scalar alpha\n" @@ -183,11 +185,6 @@ void grouped_mixed_dtype_profiling( result.avg_runtime_ms = std::accumulate(runtimes.begin(), runtimes.end(), 0.0f) / runtimes.size(); result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); - - std::cout << " Problem Sizes, Alpha, Beta\n"; - for (int32_t i = 0; i < options.groups; ++i) { - std::cout << " " << options.problem_sizes_host[i] << ", " << alpha_host[i] << ", " << beta_host[i] << '\n'; - } std::cout << " Groups : " << options.groups << '\n' << " Avg runtime : " << result.avg_runtime_ms << " ms\n" << " GFLOPS : " << result.gflops << '\n'; diff --git a/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu b/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu index 75d3437d1b..8be4f6395d 100644 --- a/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu +++ b/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu @@ -480,7 +480,12 @@ bool verify(const Options &options) { passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0); passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0); - return passed; + block_SFD.sync_host(); + bool passed_sfd = cutlass::reference::host::TensorEquals(block_reference_SFD.host_view(), block_SFD.host_view()); + passed_sfd &= (cutlass::reference::host::TensorNorm(block_reference_SFD.host_view()) > 0); + passed_sfd &= (cutlass::reference::host::TensorNorm(block_SFD.host_view()) > 0); + + return passed && passed_sfd; } /// Execute a given example GEMM computation diff --git 
a/examples/77_blackwell_fmha/77_blackwell_fmha.cu b/examples/77_blackwell_fmha/77_blackwell_fmha.cu index 1d1314d145..c879212223 100644 --- a/examples/77_blackwell_fmha/77_blackwell_fmha.cu +++ b/examples/77_blackwell_fmha/77_blackwell_fmha.cu @@ -67,9 +67,6 @@ --b=2048 --h=2048 --d=2048 --q=2048 --k=2048 */ -#define DSHOW(x) print(#x ": "); print(x); print("\n"); -#define DSHOWT(x) print(#x ": "); print_tensor(x); print("\n"); - #include #include #include @@ -247,8 +244,8 @@ struct Options { << " and are split B-ways, alternatingly +10% and -10%\n" << " with the last batch sized to make it fit\n" << " implies at least residual masking for correctness\n" - << " --sm-count Sets SM count rather than querying it\n" - << " --kernel-filter= Sets regexp to match kernel against\n" + << " --sm-count Sets SM count rather than querying it\n" + << " --kernel-filter= Sets regexp to match kernel against\n" << "\n"; return out; diff --git a/examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu b/examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu new file mode 100644 index 0000000000..1c02a29ef0 --- /dev/null +++ b/examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu @@ -0,0 +1,865 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file + \brief Example implementation of fused multi-head attention for Blackwell using CUTLASS 3. + + This example showcases the use of CUTLASS to build backward fused + multi-head attantion (FMHA) collectives from existing CUTLASS collectives targeting + the NVIDIA Blackwell architecture. + + Background and motivation + ------------------------- + CUTLASS is a highly flexible library that provides open-source building blocks + for tensor core programming for GEMM or GEMM-like problems. 
Fused multi-head + attention (FMHA) is a foundational kernel for large language models (LLMs) since it + makes long sequence lengths feasible from a memory-usage perspective. It also + improves computational efficiency since it transforms an outer-product-like and + a matrix-vector-like GEMM into a fused operation with much higher arithmetic + intensity. For more details, see Dao et al, 2022; Dao, 2023. + Implementing this kernel in CUTLASS enabled easy customization and high + performance. + + Introduction + ------------ + The example targets the NVIDIA Blackwell architecture, and takes advantage of + 5th gen tensor cores and the Tensor Memory Accelerator (TMA), just like + GEMMs do. It provides a backward pass (often abbreviated + bwd in the code). + The code is structured into three layers: The runner (and the reference kernels) + takes care of initialization, measurement, and testing; the device layer + orchestrates kernel calls and partitions workspace; and the kernel layer (just + like the CUTLASS kernel layer. + + Support + --------- + + We support fp16 and fp8 data types with a head dimension of 128. + + Example usage: + $ ./examples/77_blackwell_fmha/77_blackwell_fmha_bwd_fp16 \ + --b=2048 --h=2048 --d=2048 --q=2048 --k=2048 +*/ + +#include +#include +#include + +#include "cute/tensor.hpp" + +#include "cutlass/cutlass.h" +#include "cutlass/kernel_hardware_info.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/reference/device/tensor_fill.h" + +#include "reference/fmha_fwd_reference.hpp" +#include "reference/fmha_bwd_reference.hpp" +#include "reference/reference_abs_error.hpp" + +#include "collective/fmha_fusion.hpp" +#include "device/fmha_device_bwd.hpp" + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +using namespace cute; +using namespace cutlass::fmha::kernel; +using namespace cutlass::fmha::collective; +using namespace cutlass::fmha; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +enum class InitStyle { + kOne, kZero, kLinearStride128, kLinearStride1, kRandom, kNone +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Command line options parsing +struct Options { + + bool help = false; + bool error = false; + + int b = 16; + int h = 16; + int h_k = 1; + int q = 1024; + int k = 1024; + int d = 128; + int iterations = 3; + bool verify = false; + bool verbose = false; + + bool causal = false; + int sm_count = 0; + + std::string kernel_filter; + + InitStyle init_style_q = InitStyle::kRandom; + InitStyle init_style_k = InitStyle::kRandom; + InitStyle init_style_v = InitStyle::kRandom; + InitStyle init_style_do = InitStyle::kRandom; + bool skip_reference = false; + + static void get_init_style_argument(cutlass::CommandLine& cmd, const char* name, InitStyle& dst, InitStyle const& src) { + std::string s; + cmd.get_cmd_line_argument(name, s, s); + if (s.empty()) { + dst = src; + } + else { + if (s == "r") { + dst = InitStyle::kRandom; + } + else if (s == "0") { + dst = InitStyle::kZero; + } + else if (s == "1") { + dst = InitStyle::kOne; + } + else if (s == "d") { + dst = InitStyle::kLinearStride1; + } + else if (s == "s") { + dst = InitStyle::kLinearStride128; + } + else if (s == "n") { + dst = InitStyle::kNone; + } + else { + std::cout << "Error: " << s << " is not a valid input type.\n"; + std::exit(-1); + } + } + } + + // Parses the command line + 
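+  // Unless overridden on the command line, h defaults to 2048/d, b defaults to 16384/k, and q and k
+  // default to each other (1024 if neither is given); see parse() below. The init-style flags
+  // (--init-style, --init-style-q/-k/-v/-do) accept r (random), 0, 1, d (linear, stride 1),
+  // s (linear, stride 128), or n (no initialization), as handled in get_init_style_argument above.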
void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + Options defaults; + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("d", d, defaults.d); + cmd.get_cmd_line_argument("h", h, -1); + if (h == -1) h = 2048 / d; + + cmd.get_cmd_line_argument("q", q, -1); + cmd.get_cmd_line_argument("k", k, -1); + if (q == -1) q = k; + if (k == -1) k = q; + if (q == -1 && k == -1) q = k = defaults.q; + + cmd.get_cmd_line_argument("b", b, -1); + if (b == -1) b = 16384 / k; + if (b == 0) b = 1; + + cmd.get_cmd_line_argument("iterations", iterations, defaults.iterations); + verify = cmd.check_cmd_line_flag("verify"); + verbose = cmd.check_cmd_line_flag("verbose"); + std::string mask; + cmd.get_cmd_line_argument("mask", mask, ""); + if (mask == "causal") { + causal = true; + } + else { + causal = defaults.causal; + } + + skip_reference = cmd.check_cmd_line_flag("skip-reference"); + cmd.get_cmd_line_argument("sm-count", sm_count, defaults.sm_count); + + get_init_style_argument(cmd, "init-style", init_style_q, defaults.init_style_q); + get_init_style_argument(cmd, "init-style", init_style_k, defaults.init_style_k); + get_init_style_argument(cmd, "init-style", init_style_v, defaults.init_style_v); + get_init_style_argument(cmd, "init-style", init_style_do, defaults.init_style_do); + get_init_style_argument(cmd, "init-style-q", init_style_q, init_style_q); + get_init_style_argument(cmd, "init-style-k", init_style_k, init_style_k); + get_init_style_argument(cmd, "init-style-v", init_style_v, init_style_v); + get_init_style_argument(cmd, "init-style-do", init_style_do, init_style_do); + + cmd.get_cmd_line_argument("kernel-filter", kernel_filter, defaults.kernel_filter); + } + + /// Prints the usage statement.
+ std::ostream & print_usage(std::ostream &out) const { + + out << "77_blackwell_fmha_bwd\n\n" + << " This example showcases the use of CUTLASS's collective operation builders to easily construct\n" + << " fused multi-head attention kernels for the backward pass targeting NVIDIA's Blackwell architecture.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --b= Sets the B extent\n" + << " --h= Sets the H extent\n" + << " --q= Sets the Q extent\n" + << " --k= Sets the K extent\n" + << " --d= Sets the D extentn" + << " --iterations= Benchmarking iterations\n" + << " --verify Verify results\n" + << " --verbose Print smem and execution time per kernel\n" + << " --mask= Enables masking\n" + << " --sm-count Sets SM count rather than querying it\n" + << " --kernel-filter= Sets regexp to match kernel against\n" + << "\n"; + + return out; + } +}; + + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +void initialize_block( + DeviceAllocation& block, + uint64_t seed=2023, InitStyle init_style = InitStyle::kRandom) { + + switch (init_style) { + case InitStyle::kOne: { + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, (Element) 1, (Element) 1); + break; + } + case InitStyle::kZero: { + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, (Element) 0, (Element) 0); + break; + } + case InitStyle::kRandom: { + cutlass::reference::device::BlockFillRandomGaussian( + block.get(), block.size(), seed, (Element) 0, (Element) 1); + break; + } + case InitStyle::kLinearStride1: { + std::vector data(block.size()); + for (size_t i = 0; i < block.size() / 128; i ++) { + for (int j = 0; j < 128; j++) { + data[j + 128*i] = static_cast((double) (j % 4)); + } + } + block.copy_from_host(data.data(), data.size()); + break; + } + case InitStyle::kLinearStride128: { + std::vector data(block.size()); + for (size_t i = 0; i < block.size() / 128; i ++) { + for (int j = 0; j < 128; j++) { + data[j + 128*i] = static_cast((double) (i % 4)); + } + } + block.copy_from_host(data.data(), data.size()); + break; + } + case InitStyle::kNone: { + break; + } + } +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +struct ExampleResult { + bool passed = false; + bool verified = false; + float runtime_ms = 0; + double tflops_tc_s = 0; + size_t smem_size = 0; +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class TileShape, + class DispatchPolicy, + class ActiveMask, + class... 
KernelOptions +> +struct BwdRunner { + +#ifdef FP8 + using Element = cutlass::float_e4m3_t; +#else + using Element = cutlass::half_t; +#endif + using ElementAccumulator = float; + + // Q K D (H B) + using ProblemShapeType = cute::tuple>; + + using Operation = cutlass::fmha::device::Sm100FmhaBwd; + + using TensorStride = Stride>; // Seq D (H B) + using StrideQ = TensorStride; + using StrideK = TensorStride; + using StrideV = TensorStride; + using StrideO = TensorStride; + using StrideLSE = Stride<_1, Stride>; // Seq (H B) + + // Backwards specific + using StrideDQ = TensorStride; + using StrideDK = TensorStride; + using StrideDV = TensorStride; + using StrideDO = TensorStride; + + // + // Data members + // + + /// Initialization + StrideQ stride_Q; + StrideK stride_K; + StrideV stride_V; + StrideO stride_O; + StrideLSE stride_LSE; + + StrideDQ stride_dQ; + StrideDK stride_dK; + StrideDV stride_dV; + StrideDO stride_dO; + + uint64_t seed = 0; + + DeviceAllocation block_Q; + DeviceAllocation block_K; + DeviceAllocation block_V; + DeviceAllocation block_O; + DeviceAllocation block_LSE; + + DeviceAllocation block_dQ; + DeviceAllocation block_dK; + DeviceAllocation block_dV; + DeviceAllocation block_dO; + + DeviceAllocation block_ref_dQ; + DeviceAllocation block_ref_dK; + DeviceAllocation block_ref_dV; + + // + // Methods + // + bool verify(const ProblemShapeType& problem_shape) { + auto [Q, K, D, HB] = problem_shape; + auto [H, B] = HB; + + Tensor mQ = make_tensor(make_gmem_ptr(block_Q.get()), + select<0,2,3>(problem_shape), + stride_Q); + + Tensor mK = make_tensor(make_gmem_ptr(block_K.get()), + select<1,2,3>(problem_shape), + stride_K); + + Tensor mV = make_tensor(make_gmem_ptr(block_V.get()), + select<1,2,3>(problem_shape), + stride_V); + + Tensor mO = make_tensor(make_gmem_ptr(block_O.get()), + select<0,2,3>(problem_shape), + stride_O); + + Tensor mLSE = make_tensor(make_gmem_ptr(block_LSE.get()), + select<0,3>(problem_shape), + stride_LSE); + + Tensor mDQ = make_tensor(make_gmem_ptr(block_ref_dQ.get()), + select<0,2,3>(problem_shape), + stride_dQ); + + Tensor mDK = make_tensor(make_gmem_ptr(block_ref_dK.get()), + select<1,2,3>(problem_shape), + stride_dK); + + Tensor mDV = make_tensor(make_gmem_ptr(block_ref_dV.get()), + select<1,2,3>(problem_shape), + stride_dV); + + Tensor mDO = make_tensor(make_gmem_ptr(block_dO.get()), + select<0,2,3>(problem_shape), + stride_dO); + + fmha_bwd_reference(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDQ, mDK, mDV, ActiveMask{}); + + cudaError_t result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Reference kernel failed. Last CUDA error: " + << cudaGetErrorString(result) << std::endl; + return false; + } + + const double kMaxDiffThresh = sizeof(Element) == 1 ? 1e-0 : 1e-2; + const double kMeanDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-3; + + // Check if output from CUTLASS kernel and reference kernel are equal or not + double max_diff = 0; + double mean_diff = 0; + reference_abs_diff(block_dQ, block_ref_dQ, max_diff, mean_diff); + + bool passed_dQ = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh); + if (! passed_dQ) { + std::cerr << "failed dQ: max diff " << max_diff + << " mean " << mean_diff << std::endl; + } + + reference_abs_diff(block_dK, block_ref_dK, max_diff, mean_diff); + + bool passed_dK = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh); + if (!
passed_dK) { + std::cerr << "failed dK: max diff " << max_diff + << " mean " << mean_diff << std::endl; + } + + reference_abs_diff(block_dV, block_ref_dV, max_diff, mean_diff); + + bool passed_dV = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh); + if (! passed_dV) { + std::cerr << "failed dV: max diff " << max_diff + << " mean " << mean_diff << std::endl; + } + + return passed_dQ && passed_dK && passed_dV; + } + + /// Initialize operands to be used in the GEMM and reference GEMM + void initialize(const ProblemShapeType& problem_shape, Options const& options) { + auto [Q, K, D, HB] = problem_shape; + auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + + auto shape_QO = select<0,2,3>(problem_shape); + auto shape_KV = select<1,2,3>(problem_shape); + auto shape_LSE = select<0,3>(problem_shape); + + stride_Q = make_stride(D, _1{}, make_stride(D*Q, D*Q*H)); + stride_K = make_stride(D, _1{}, make_stride(D*K, D*K*H)); + stride_V = stride_K; + stride_O = stride_Q; + stride_LSE = make_stride(_1{}, make_stride(Q, Q*H)); + + stride_dQ = stride_Q; + stride_dK = stride_K; + stride_dV = stride_V; + stride_dO = stride_O; + + auto lsize = [](auto shape) { + return size(make_shape(1ull, shape)); + }; + + block_Q.reset(lsize(shape_QO)); + block_K.reset(lsize(shape_KV)); + block_V.reset(lsize(shape_KV)); + block_O.reset(lsize(shape_QO)); + block_LSE.reset(lsize(shape_LSE)); + + block_dQ.reset(lsize(shape_QO)); + block_dK.reset(lsize(shape_KV)); + block_dV.reset(lsize(shape_KV)); + block_dO.reset(lsize(shape_QO)); + + block_ref_dQ.reset(lsize(shape_QO)); + block_ref_dK.reset(lsize(shape_KV)); + block_ref_dV.reset(lsize(shape_KV)); + + initialize_block(block_Q, seed + 2023, options.init_style_q); + initialize_block(block_K, seed + 2022, options.init_style_k); + initialize_block(block_V, seed + 2021, options.init_style_v); + initialize_block(block_dO, seed + 2020, options.init_style_do); + + Tensor mQ = make_tensor(make_gmem_ptr(block_Q.get()), + select<0,2,3>(problem_shape), + stride_Q); + + Tensor mK = make_tensor(make_gmem_ptr(block_K.get()), + select<1,2,3>(problem_shape), + stride_K); + + Tensor mV = make_tensor(make_gmem_ptr(block_V.get()), + select<1,2,3>(problem_shape), + stride_V); + + Tensor mO = make_tensor(make_gmem_ptr(block_O.get()), + select<0,2,3>(problem_shape), + stride_O); + + Tensor mLSE = make_tensor(make_gmem_ptr(block_LSE.get()), + select<0,3>(problem_shape), + stride_LSE); + + if (! 
options.skip_reference) { + fmha_reference(problem_shape, mQ, mK, mV, mO, mLSE, ActiveMask{}); + } + } + + ExampleResult run(const Options& options, const cutlass::KernelHardwareInfo& hw_info) { + auto problem_shape = make_shape(options.q, options.k, options.d, make_shape(options.h, options.b)); + + initialize(problem_shape, options); + + ElementAccumulator softmax_scale = 1.0f / sqrtf(options.d); + + typename Operation::Arguments arguments{ + problem_shape, + block_Q.get(), stride_Q, + block_K.get(), stride_K, + block_V.get(), stride_V, + block_O.get(), stride_O, + block_LSE.get(), stride_LSE, + block_dO.get(), stride_dO, + block_dQ.get(), stride_dQ, + block_dK.get(), stride_dK, + block_dV.get(), stride_dV, + softmax_scale, + hw_info + }; + + Operation op; + + ExampleResult example_result; + + example_result.smem_size = Operation::Kernel::SharedStorageSize; + + size_t workspace_size = 0; + workspace_size = Operation::get_workspace_size(arguments); + DeviceAllocation workspace(workspace_size); + + cutlass::Status status = cutlass::Status::kSuccess; + status = op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + std::cerr << "This kernel is not supported. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + + status = op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to initialize the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + + // Run + status = op.run(); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + + cudaError_t result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // + // Construct events + // + + cudaEvent_t events[2]; + + for (auto & event : events) { + result = cudaEventCreate(&event); + if (result != cudaSuccess) { + std::cerr << "cudaEventCreate() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + } + + // Record an event at the start of a series of GEMMs + result = cudaEventRecord(events[0]); + if (result != cudaSuccess) { + std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + for (int i = 0; i < options.iterations; i++) { + status = op.run(); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + } + + // + // Stop profiling loop + // + + // Record an event when the GEMMs are complete + result = cudaEventRecord(events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Wait for work on the device to complete. 
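+    // cudaEventSynchronize blocks the host until the stop event recorded above has completed, so the
+    // elapsed time measured next covers all options.iterations kernel launches; it is averaged by
+    // dividing by the iteration count below.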
+ result = cudaEventSynchronize(events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventSynchronize() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Measure elapsed runtime + float runtime_ms = 0; + result = cudaEventElapsedTime(&runtime_ms, events[0], events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventElapsed() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + runtime_ms /= static_cast(options.iterations); + + double flops = 10.0 * (std::is_same_v ? 0.5 : 1.0); + flops *= static_cast(get<0>(problem_shape)); + flops *= static_cast(get<1>(problem_shape)); + flops *= static_cast(get<2>(problem_shape)); + flops *= static_cast(get<3,0>(problem_shape)); + flops *= static_cast(get<3,1>(problem_shape)); + double tflops_s = flops * 1e-12 /*tera*/ / (runtime_ms * 1e-3 /*ms*/); + example_result.tflops_tc_s = tflops_s; + example_result.runtime_ms = runtime_ms; + + result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Verify that the result is correct + bool passed = true; + if (options.verify) { + passed = verify(problem_shape); + if (passed) example_result.verified = true; + } + + if (!passed) { + std::cerr << "Reference check failed" << std::endl; + return example_result; + } + + example_result.passed = true; + + return example_result; + } + +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to print a description of the example run and its result +void print_result(const std::string& description, ExampleResult result, bool verbose) { + std::ios fmt(nullptr); + fmt.copyfmt(std::cout); + std::cout << (result.passed ? (result.verified ? " [OK] " : " [--] ") : "[FAIL] "); + std::cout << std::setw(32) << std::left << description; + std::cout.copyfmt(fmt); + std::cout << " : " << result.tflops_tc_s << " TFLOPS/s" << std::endl; + if (verbose) { + std::cout << " t=" << result.runtime_ms << "ms, " + "smem=" << result.smem_size << "b" << std::endl; + } +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +struct KernelCoop {}; + +////////////////////////////////////////////////////////////////////////////////////////////////// + +template +void run_bwd_64(Mask fusion, Options const & options, cutlass::KernelHardwareInfo const& hw_info) { + auto run = [&](auto shape, auto kernel, const char* name, auto... kernel_options) { + BwdRunner runner; + auto result = runner.run(options, hw_info); + print_result(name, result, options.verbose); + }; + + using HeadDim = _64; + + run(Shape<_128, _128, HeadDim>{}, KernelCoop{}, "tma"); +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +template +void run_bwd_128(Mask fusion, Options const & options, cutlass::KernelHardwareInfo const& hw_info) { + auto run = [&](auto shape, auto kernel, const char* name, auto... 
kernel_options) { + BwdRunner runner; + auto result = runner.run(options, hw_info); + print_result(name, result, options.verbose); + }; + + using HeadDim = _128; + + run(Shape<_128, _128, HeadDim>{}, KernelCoop{}, "tma"); +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main_single(int argc, char const **args) { + + cudaDeviceProp props; + + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (error != cudaSuccess) { + std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl; + return -1; + } + + if (__CUDACC_VER_MAJOR__ < 12 || props.major != 10) { + std::cout + << "This example requires a GPU of NVIDIA's Blackwell Architecture " + << "(compute capability 100a) and CUDA 12.8 or greater.\n"; + return 0; + } + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + if (options.error) { + std::cerr << "Aborting execution." << std::endl; + return -1; + } + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + + // + // Run examples + // + + // The KernelHardwareInfo struct holds the number of SMs on the GPU with a given device ID. This + // information is used by the underlying kernel. + cutlass::KernelHardwareInfo hw_info; + + // Change device_id to another value if you are running on a machine with multiple GPUs and wish + // to use a GPU other than that with device ID 0. + hw_info.device_id = 0; + if (options.sm_count == 0) { + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + } + else { + hw_info.sm_count = options.sm_count; + } + + std::cout << "###### B " << options.b << " H " << options.h << " Q " << options.q << " K " << options.k << " D " << options.d << " "; + std::cout << "Backward" << " " << (options.causal ? "Causal" : "Full") << " "; + std::cout << "#SM " << hw_info.sm_count << std::endl; + + auto with_causal = [&](auto fn) { + if (options.causal) { + fn(CausalMask{}); + } + else { + fn(NoMask{}); + } + }; + + with_causal([&](auto fusion) { + if (options.d <= 64) { + run_bwd_64(fusion, options, hw_info); + } + else if (options.d <= 128) { + run_bwd_128(fusion, options, hw_info); + } + else { + std::cout << "No kernel instantiated for d=" << options.d << std::endl; + } + }); +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + std::vector full_arguments(args, args + argc); + + int result = 0; + + bool recursed = false; + for (size_t i = 1; i < full_arguments.size(); i++) { + if (full_arguments[i].find(',') != std::string::npos) { + auto arg = full_arguments[i]; + size_t eq_pos = arg.find('='); + std::string prefix = eq_pos == std::string::npos ? "" : arg.substr(0, eq_pos+1); + std::string rest = eq_pos == std::string::npos ? 
arg : arg.substr(eq_pos+1); + for (;;) { + size_t comma_pos = rest.find(','); + std::string current = rest.substr(0, comma_pos); + full_arguments[i] = prefix + current; + std::vector next_args; + for (auto& elem : full_arguments) { next_args.push_back(elem.data()); } + main(argc, next_args.data()); + if (comma_pos == std::string::npos) break; + rest = rest.substr(comma_pos+1); + } + recursed = true; + break; + } + } + + if (! recursed) { + main_single(argc, args); + } + + return result; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/77_blackwell_fmha/77_blackwell_mla.cu b/examples/77_blackwell_fmha/77_blackwell_mla.cu new file mode 100644 index 0000000000..baa70fce18 --- /dev/null +++ b/examples/77_blackwell_fmha/77_blackwell_mla.cu @@ -0,0 +1,832 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file A MLA (Multi-Head Latent Attention) inference kernel sample for the + NVIDIA Blackwell Architecture. 
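+
+    The sample loads the latent and rope portions of Q and of the cache C (optionally
+    paged) via TMA or cp.async, uses 2SM tensor core MMAs to handle the large latent
+    head dimension, and can split the KV dimension across CTAs (--split_kv), merging
+    the partial results with a separate reduction kernel.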
+*/ + +#include +#include +#include +#include + +#include "cute/tensor.hpp" + +#include "cutlass/cutlass.h" +#include "cutlass/kernel_hardware_info.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/reference/device/tensor_fill.h" +#include "reference/fmha_mla_reference.hpp" +#include "reference/reference_abs_error.hpp" + +#include "device/sm100_mla.hpp" +#include "kernel/sm100_mla_tile_scheduler.hpp" + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +using namespace cute; +using namespace cutlass::fmha::kernel; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +enum class InitStyle { + kOne, kLinearStride128, kLinearStride1, kRandom, kNone +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Command line options parsing +struct Options { + + bool help = false; + bool error = false; + + int b = 1; + int k = 256; + int split_kv = -1; // number of split along k dim. + bool is_var_split_kv = false; + int max_split_kv = 16; + int page = -1; + float spread = 0.2f; + int iterations = 3; + bool verify = false; + bool verbose = false; + + int sm_count = 0; + + std::string kernel_filter; + + InitStyle init_style_q = InitStyle::kRandom; + InitStyle init_style_c = InitStyle::kRandom; + + static void get_init_style_argument(cutlass::CommandLine& cmd, const char* name, InitStyle& dst, InitStyle const& src) { + std::string s; + cmd.get_cmd_line_argument(name, s, s); + if (s.empty()) { + dst = src; + } + else { + if (s == "r") { + dst = InitStyle::kRandom; + } + else if (s == "1") { + dst = InitStyle::kOne; + } + else if (s == "d") { + dst = InitStyle::kLinearStride1; + } + else if (s == "s") { + dst = InitStyle::kLinearStride128; + } + else if (s == "n") { + dst = InitStyle::kNone; + } + else { + std::cout << "Error: " << s << " is not a valid input type.\n"; + std::exit(-1); + } + } + } + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + Options defaults; + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("k", k, -1); + if (k == -1) k = defaults.k; + + cmd.get_cmd_line_argument("b", b, -1); + if (b == -1) b = 16384 / k; + if (b == 0) b = 1; + + cmd.get_cmd_line_argument("split_kv", split_kv, defaults.split_kv); + cmd.get_cmd_line_argument("page", page, defaults.page); + cmd.get_cmd_line_argument("spread", spread, defaults.spread); + cmd.get_cmd_line_argument("is_var_split_kv", is_var_split_kv, false); + if (page == -1) { + is_var_split_kv = false; + } + cmd.get_cmd_line_argument("max_split_kv", max_split_kv, defaults.max_split_kv); + if (is_var_split_kv == true) { + split_kv = max_split_kv; + } + cmd.get_cmd_line_argument("iterations", iterations, defaults.iterations); + verify = cmd.check_cmd_line_flag("verify"); + verbose = cmd.check_cmd_line_flag("verbose"); + cmd.get_cmd_line_argument("sm-count", sm_count, defaults.sm_count); + + get_init_style_argument(cmd, "init-style", init_style_q, defaults.init_style_q); + get_init_style_argument(cmd, "init-style", init_style_c, defaults.init_style_c); + get_init_style_argument(cmd, "init-style-q", init_style_q, init_style_q); + get_init_style_argument(cmd, "init-style-c", init_style_c, init_style_c); + + cmd.get_cmd_line_argument("kernel-filter", kernel_filter, defaults.kernel_filter); + } + + /// Prints the usage 
statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "77_blackwell_mla\n\n" + << " This example showcases the use of CUTLASS for fused multi-head latent\n" + << " attention kernels targeting NVIDIA's Blackwell architecture.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --b= Sets the B extent\n" + << " --k= Sets the K extent\n" + << " --page= Enables paging and sets the page size\n" + << " --iterations= Benchmarking iterations\n" + << " --spread= Relative spread away from K for paging\n" + << " --split_kv= Split KV factor\n" + << " --verify Verify results\n" + << " --verbose Print smem and execution time per kernel\n" + << " --sm-count Sets SM count rather than querying it\n" + << " --kernel-filter= Sets regexp to match kernel against\n" + << "\n"; + + return out; + } +}; + + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +void initialize_block( + DeviceAllocation& block, + uint64_t seed=2023, InitStyle init_style = InitStyle::kRandom) { + + switch (init_style) { + case InitStyle::kOne: { + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, (Element) 1, (Element) 1); + break; + } + case InitStyle::kRandom: { + cutlass::reference::device::BlockFillRandomGaussian( + block.get(), block.size(), seed, (Element) -1, (Element) 1); + break; + } + case InitStyle::kLinearStride1: { + std::vector data(block.size()); + for (size_t i = 0; i < block.size() / 128; i ++) { + for (int j = 0; j < 128; j++) { + data[j + 128*i] = static_cast((double) (j % 4)); + } + } + block.copy_from_host(data.data(), data.size()); + break; + } + case InitStyle::kLinearStride128: { + std::vector data(block.size()); + for (size_t i = 0; i < block.size() / 64; i ++) { + for (int j = 0; j < 64; j++) { + data[j + 64*i] = static_cast((double) (i % 9)); + } + } + block.copy_from_host(data.data(), data.size()); + break; + } + case InitStyle::kNone: { + break; + } + } +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +struct ExampleResult { + bool passed = false; + bool verified = false; + float runtime_ms = 0; + double tflops_tc_s = 0; + double tbytes_s = 0; + size_t smem_size = 0; +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct IsPersistent { + static const bool value = v; +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class TileShape, + class PersistenceOption = IsPersistent +> +struct Runner { + +#ifdef FP8 + using Element = cutlass::float_e4m3_t; +#elif FP16 + using Element = cutlass::half_t; +#else + #error "Must either define FP8 or FP16" +#endif + + using ElementAcc = float; + using ElementOut = cutlass::half_t; + + using TileShapeH = cute::tuple_element_t<0, TileShape>; + using TileShapeD = cute::tuple_element_t<2, TileShape>; + + // H K (D_latent D_rope) B + using ProblemShape = cute::tuple; + + using StrideQ = cute::tuple; // H D B + using StrideK = cute::tuple; // K D B + using StrideO = StrideK; // H D B + using StrideLSE = cute::tuple<_1, int>; // H B + + using TileScheduler = std::conditional_t< + PersistenceOption::value, + 
Sm100MlaPersistentTileScheduler, + Sm100MlaIndividualTileScheduler + >; + + using Kernel = cutlass::fmha::kernel::Sm100FmhaMlaKernelTmaWarpspecialized< + TileShape, Element, ElementAcc, ElementOut, ElementAcc, TileScheduler + >; + using Operation = cutlass::fmha::device::MLA; + + // + // Data members + // + + /// Initialization + StrideQ stride_Q_latent; + StrideK stride_C_latent; + StrideQ stride_Q_rope; + StrideK stride_K_rope; + StrideO stride_O; + StrideLSE stride_LSE; + StrideLSE stride_PT; + + uint64_t seed = 0; + + int page_size = -1; + int page_count = -1; + + // We allocate Q and C as first latent, then rope + // This means that we offset the pointer by HeadDim_latent to get the rope + // portion + DeviceAllocation block_Q; + DeviceAllocation block_C; + DeviceAllocation block_O; + DeviceAllocation block_seq; + DeviceAllocation block_PT; + DeviceAllocation block_split_kv; + DeviceAllocation block_accum_split_len; + DeviceAllocation block_LSE; + DeviceAllocation block_ref_O; + DeviceAllocation block_ref_LSE; + + ElementAcc scale; + + // + // Methods + // + + bool verify(const ProblemShape& problem_shape) { + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + int page_K = K; + int page_B = B; + if (block_PT.get() != nullptr) { + page_K = page_size; + page_B = page_count; + } + + Tensor mQ_latent = make_tensor(make_gmem_ptr(block_Q.get()), + cute::make_tuple(H, D_latent, B), + stride_Q_latent); + + Tensor mQ_rope = make_tensor(make_gmem_ptr(block_Q.get() + D_latent), + cute::make_tuple(H, D_rope, B), + stride_Q_rope); + + Tensor mC_latent = make_tensor(make_gmem_ptr(block_C.get()), + cute::make_tuple(page_K, D_latent, page_B), + stride_C_latent); + + Tensor mK_rope = make_tensor(make_gmem_ptr(block_C.get() + D_latent), + cute::make_tuple(page_K, D_rope, page_B), + stride_K_rope); + + Tensor mO = make_tensor(make_gmem_ptr(block_ref_O.get()), + cute::make_tuple(H, D_latent, B), + stride_O); + + Tensor mLSE = make_tensor(make_gmem_ptr(block_ref_LSE.get()), + cute::make_tuple(H, B), + stride_LSE); + + Tensor mSeq = make_tensor(make_gmem_ptr(static_cast(block_seq.get())), make_shape(B)); + Tensor mPT = make_tensor(make_gmem_ptr(static_cast(block_PT.get())), make_shape(ceil_div(K, page_size), B), stride_PT); + + fmha_mla_reference(problem_shape, mSeq, mPT, mQ_latent, mQ_rope, mC_latent, mK_rope, mO, mLSE, scale); + + cudaError_t result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Reference kernel failed. Last CUDA error: " + << cudaGetErrorString(result) << std::endl; + return false; + } + + const double kMaxDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-2; + const double kMeanDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-3; + + // Check if output from CUTLASS kernel and reference kernel are equal or not + double max_diff = 0; + double mean_diff = 0; +#ifdef B2B + reference_rel_diff(block_O, block_ref_O, max_diff, mean_diff); +#else + reference_abs_diff(block_O, block_ref_O, max_diff, mean_diff); +#endif + + bool passed_O = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh); + if (! passed_O) { + std::cerr << "failed O: max diff " << max_diff + << " mean " << mean_diff << std::endl; + } + + bool passed_LSE = true; +#ifndef B2B + reference_abs_diff(block_LSE, block_ref_LSE, max_diff, mean_diff); + + passed_LSE = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh); + if ( ! 
passed_LSE) { + std::cerr << "failed LSE: max diff " << max_diff + << " mean " << mean_diff << std::endl; + } +#endif + + return passed_O && passed_LSE; + } + + ProblemShape initialize(const Options& options) { + auto problem_shape = cute::make_tuple(TileShapeH{}, options.k, TileShapeD{}, options.b); + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + // the scale is based on the non-absorbed sizes, change as appropriate + // we can't determine this parameter from the info we have, it's an input + int D_non_latent = 128; + scale = static_cast(1.0 / sqrt(1.0 * (D_non_latent + D_rope))); + // Shape (H, D, B) + stride_Q_latent = cute::make_tuple(static_cast(0 + D_latent + D_rope), _1{}, static_cast(H * (0 + D_latent + D_rope))); + stride_Q_rope = stride_Q_latent; + stride_O = cute::make_tuple(static_cast(0 + D_latent), _1{}, static_cast(0 + H * D_latent)); + stride_LSE = cute::make_tuple(_1{}, 0 + H); + + block_Q.reset(static_cast(options.b) * H * (D_latent + D_rope)); + block_O.reset(static_cast(options.b) * H * D_latent); + block_LSE.reset(static_cast(options.b) * H); + block_ref_O.reset(static_cast(options.b) * H * D_latent); + block_ref_LSE.reset(static_cast(options.b) * H); + + if (options.page == -1) { + + stride_C_latent = cute::make_tuple(static_cast(0 + D_latent + D_rope), _1{}, static_cast(options.k) * (D_latent + D_rope)); + stride_K_rope = stride_C_latent; + + block_C.reset(static_cast(options.b) * options.k * (D_latent + D_rope)); + + } + else { + + float spread = options.spread; + int max_K = static_cast((1 + spread) * K); + int min_K = static_cast((1 - spread) * K); + page_size = options.page; + page_count = B * ceil_div(max_K, page_size); + stride_PT = cute::make_stride(_1{}, page_count); + + std::vector host_seq(B); + std::vector host_PT(page_count * B); + + for (int i = 0; i < B; i++) { + int seq = min_K + rand() % (max_K - min_K + 1); + host_seq[i] = seq; + for (int j = 0; j < ceil_div(seq, page_size); j++) { + host_PT[page_count * i + j] = i + j * B; + } + } + + block_seq.reset(host_seq.size()); + block_seq.copy_from_host(host_seq.data(), host_seq.size()); + block_PT.reset(host_PT.size()); + block_PT.copy_from_host(host_PT.data(), host_PT.size()); + + get<1>(problem_shape) = max_K; + + stride_C_latent = cute::make_tuple(static_cast(0 + D_latent + D_rope), _1{}, page_size * static_cast((D_latent + D_rope))); + stride_K_rope = stride_C_latent; + + block_C.reset(page_count * page_size * static_cast((D_latent + D_rope))); + + if (options.is_var_split_kv == true) { + std::vector host_split_kv(B); + for(int i = 0; i < B; ++i) { + auto len = host_seq[i]; + int split = ceil_div(options.max_split_kv, ceil_div(max_K, len)); + host_split_kv[i] = split; + } + block_split_kv.reset(B); + block_split_kv.copy_from_host(host_split_kv.data(), host_split_kv.size()); + } + } + + initialize_block(block_Q, seed + 2023, options.init_style_q); + initialize_block(block_C, seed + 2022, options.init_style_c); + + return problem_shape; + } + + ExampleResult run(const Options& options, const cutlass::KernelHardwareInfo& hw_info) { + + ProblemShape problem_shape = initialize(options); + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + typename Operation::Arguments arguments{ + problem_shape, + { scale, + block_Q.get(), stride_Q_latent, + block_Q.get() + D_latent, stride_Q_rope, + block_C.get(), stride_C_latent, + block_C.get() + D_latent, stride_K_rope, + block_seq.get(), + block_PT.get(), stride_PT, + page_count, page_size}, + { block_O.get(), + stride_O, + 
block_LSE.get(), + stride_LSE}, + hw_info, + options.split_kv, + options.is_var_split_kv ? block_split_kv.get() : nullptr + }; + if (options.split_kv < 0 && !options.is_var_split_kv) { + Operation::set_split_kv(arguments); + } + + Operation op; + + ExampleResult example_result; + + example_result.smem_size = Operation::Kernel::SharedStorageSize; + + size_t workspace_size = 0; + workspace_size = Operation::get_workspace_size(arguments); + DeviceAllocation workspace(workspace_size); + + cutlass::Status status = cutlass::Status::kSuccess; + status = op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + std::cerr << "This kernel is not supported. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + + status = op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to initialize the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + // Run + status = op.run(); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + + cudaError_t result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // + // Construct events + // + + cudaEvent_t events[2]; + + for (auto & event : events) { + result = cudaEventCreate(&event); + if (result != cudaSuccess) { + std::cerr << "cudaEventCreate() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + } + + // Record an event at the start of a series of GEMMs + result = cudaEventRecord(events[0]); + if (result != cudaSuccess) { + std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + for (int i = 0; i < options.iterations; i++) { + status = op.run(); + if (status != cutlass::Status::kSuccess) { + std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(cudaGetLastError()) << std::endl; + return example_result; + } + } + + // + // Stop profiling loop + // + + // Record an event when the GEMMs are complete + result = cudaEventRecord(events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Wait for work on the device to complete. 
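+    // cudaEventSynchronize() blocks the host until the stop event has completed, so the
+    // elapsed time measured below covers all options.iterations launches of op.run().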
+ result = cudaEventSynchronize(events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventSynchronize() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Measure elapsed runtime + float runtime_ms = 0; + result = cudaEventElapsedTime(&runtime_ms, events[0], events[1]); + if (result != cudaSuccess) { + std::cerr << "cudaEventElapsed() failed: " << cudaGetErrorString(result) << std::endl; + return example_result; + } + + runtime_ms /= static_cast(options.iterations); + + double flops = 1.0; + flops *= B; + flops *= K; + flops *= H; + flops *= 2.0; + flops *= (2.0 * D_latent + D_rope); + + double bytes_q = sizeof(Element); + bytes_q *= B; + bytes_q *= H; + bytes_q *= (D_latent + D_rope); + double bytes_c = sizeof(Element); + bytes_c *= B; + bytes_c *= options.k; // K may be max_K here + bytes_c *= (D_latent + D_rope); + double bytes_o = sizeof(ElementOut); + bytes_o *= B; + bytes_o *= H; + bytes_o *= D_latent; + double bytes = bytes_q + bytes_c + bytes_o; + + double tflops_s = flops * 1e-12 /*tera*/ / (runtime_ms * 1e-3 /*ms*/); + double tbytes_s = bytes * 1e-12 /*tera*/ / (runtime_ms * 1e-3 /*ms*/); + example_result.tflops_tc_s = tflops_s; + example_result.tbytes_s = tbytes_s; + example_result.runtime_ms = runtime_ms; + + result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: " + << cudaGetErrorString(result) << std::endl; + return example_result; + } + + // Verify that the result is correct + bool passed = true; + if (options.verify) { + passed = verify(problem_shape); + if (passed) example_result.verified = true; + } + + if (!passed) { + std::cerr << "Reference check failed" << std::endl; + return example_result; + } + + example_result.passed = true; + + return example_result; + } + +}; + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to print a description of the example run and its result +void print_result(const std::string& description, ExampleResult result, bool verbose) { + std::ios fmt(nullptr); + fmt.copyfmt(std::cout); + std::cout << (result.passed ? (result.verified ? " [OK] " : " [--] ") : "[FAIL] "); + std::cout << std::setw(32) << std::left << description; + std::cout.copyfmt(fmt); + std::cout << " : " << result.tflops_tc_s << " TFLOPS/s " << result.tbytes_s << " TB/s" << std::endl; + if (verbose) { + std::cout << " t=" << result.runtime_ms * 1e3 << " us, " + "smem=" << result.smem_size << "b" << std::endl; + } +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +void run_mla(Options const & options, cutlass::KernelHardwareInfo const& hw_info) { + auto run = [&](auto shape, const char* name, auto... kernel_options) { + if ((! options.kernel_filter.empty()) && (! 
std::regex_search(name, std::basic_regex(options.kernel_filter)))) { + return; + } + Runner runner; + auto result = runner.run(options, hw_info); + print_result(name, result, options.verbose); + }; + + using NumHeads = _128; + using HeadDimLatent = _512; + using HeadDim = Shape; + + std::cout << "###### B " << options.b << " MLA H " << 0 + NumHeads{} << " "; + std::cout << "D_rope " << 0 + get<1>(HeadDim{}) << " D_latent " << 0 + get<0>(HeadDim{}) << " "; + std::cout << "Q 1 K " << options.k << " Gen None "; + std::cout << "Split " << options.split_kv << " Gen None "; + std::cout << "#SM " << hw_info.sm_count << std::endl; + + using Blocking = _128; + std::string name = std::to_string((int) NumHeads{}) + "x" + std::to_string((int) Blocking{}); + std::string individual = " individual"; + std::string persistent = " persistent"; +#if FP8 + name += " fp8"; + // Persistent Tile Scheduler + run(Shape{}, (name + persistent).c_str(), IsPersistent{}); + // Individual Tile Scheduler + run(Shape{}, (name + individual).c_str(), IsPersistent{}); +#elif FP16 + name += " fp16"; + // Persistent Tile Scheduler + run(Shape{}, (name + persistent).c_str(), IsPersistent{}); + // Individual Tile Scheduler + run(Shape{}, (name + individual).c_str(), IsPersistent{}); +#endif +} + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + + +int main_single(int argc, char const **args) { + + cudaDeviceProp props; + + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (error != cudaSuccess) { + std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl; + return -1; + } + + if (__CUDACC_VER_MAJOR__ < 12 || props.major != 10) { + std::cout + << "This example requires a GPU of NVIDIA's Blackwell Architecture " + << "(compute capability major 10) and CUDA 12.8 or greater.\n"; + return 0; + } + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + if (options.error) { + std::cerr << "Aborting execution." << std::endl; + return -1; + } + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + + // + // Run examples + // + + // The KernelHardwareInfo struct holds the number of SMs on the GPU with a given device ID. This + // information is used by the underlying kernel. + cutlass::KernelHardwareInfo hw_info; + + // Change device_id to another value if you are running on a machine with multiple GPUs and wish + // to use a GPU other than that with device ID 0. + hw_info.device_id = 0; + if (options.sm_count == 0) { + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + } + else { + hw_info.sm_count = options.sm_count; + } + + run_mla(options, hw_info); +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + std::vector full_arguments(args, args + argc); + + int result = 0; + + bool recursed = false; + for (size_t i = 1; i < full_arguments.size(); i++) { + if (full_arguments[i].find(',') != std::string::npos) { + auto arg = full_arguments[i]; + size_t eq_pos = arg.find('='); + std::string prefix = eq_pos == std::string::npos ? 
"" : arg.substr(0, eq_pos+1); + std::string rest = eq_pos == std::string::npos ? arg : arg.substr(eq_pos+1); + for (;;) { + size_t comma_pos = rest.find(','); + std::string current = rest.substr(0, comma_pos); + full_arguments[i] = prefix + current; + std::vector next_args; + for (auto& elem : full_arguments) { next_args.push_back(elem.data()); } + main(argc, next_args.data()); + if (comma_pos == std::string::npos) break; + rest = rest.substr(comma_pos+1); + } + recursed = true; + break; + } + } + + if (! recursed) { + main_single(argc, args); + } + + return result; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/77_blackwell_fmha/CMakeLists.txt b/examples/77_blackwell_fmha/CMakeLists.txt index 90b4738760..f04ebe417b 100644 --- a/examples/77_blackwell_fmha/CMakeLists.txt +++ b/examples/77_blackwell_fmha/CMakeLists.txt @@ -28,12 +28,14 @@ set_property( - SOURCE 77_blackwell_fmha.cu - PROPERTY COMPILE_FLAGS "--use_fast_math -ftemplate-backtrace-limit=0") - -set_property( - SOURCE 77_blackwell_fmha_gen.cu - PROPERTY COMPILE_FLAGS "--use_fast_math -ftemplate-backtrace-limit=0") + SOURCE + 77_blackwell_fmha.cu + 77_blackwell_fmha_gen.cu + 77_blackwell_mla.cu + 77_blackwell_fmha_bwd.cu + PROPERTY + COMPILE_FLAGS "--use_fast_math -ftemplate-backtrace-limit=0" +) set(TEST_BASIC --b=1 --h=4 --q=512 --k=512 --d=128 --verify --mask=no) set(TEST_CAUSAL --b=1 --h=4 --q=512 --k=512 --d=128 --verify --mask=causal) @@ -48,58 +50,98 @@ set(TEST_GEN_GQA --b=2 --h=4 --h_k=2 --k=512 --d=64 --verify) set(TEST_GEN_REMAP --b=2 --h=4 --h_k=2 --k=512 --d=128 --verify --remap) set(TEST_GEN_CACHEONLY --b=2 --h=4 --h_k=2 --k=512 --d=128 --verify --cache-only) -if(NOT WIN32 AND (NOT (CMAKE_CXX_COMPILER_ID MATCHES "Clang"))) - if (CUTLASS_NVCC_ARCHS MATCHES 100a) - cutlass_example_add_executable( - 77_blackwell_fmha_fp8 - 77_blackwell_fmha.cu - TEST_COMMAND_OPTIONS - TEST_BASIC - # TEST_CAUSAL - # TEST_VARLEN - # TEST_HDIM64 - # TEST_GQA) - ) - target_include_directories(77_blackwell_fmha_fp8 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) - target_compile_definitions(77_blackwell_fmha_fp8 PRIVATE FP8) +set(TEST_MLA_BASIC --b=1 --k=512 --verify) - cutlass_example_add_executable( - 77_blackwell_fmha_gen_fp8 - 77_blackwell_fmha_gen.cu - TEST_COMMAND_OPTIONS - TEST_GEN_BASIC - # TEST_GEN_VARLEN - # TEST_GEN_HDIM64 - # TEST_GEN_GQA - # TEST_GEN_REMAP - # TEST_GEN_CACHEONLY) - ) - target_include_directories(77_blackwell_fmha_gen_fp8 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) - target_compile_definitions(77_blackwell_fmha_gen_fp8 PRIVATE FP8) +if(NOT WIN32 AND (NOT (CMAKE_CXX_COMPILER_ID MATCHES "Clang")) AND (CUTLASS_NVCC_ARCHS MATCHES 100a)) - cutlass_example_add_executable( - 77_blackwell_fmha_fp16 - 77_blackwell_fmha.cu - TEST_COMMAND_OPTIONS - TEST_BASIC - # TEST_CAUSAL - # TEST_VARLEN - # TEST_HDIM64 - # TEST_GQA) - ) - target_include_directories(77_blackwell_fmha_fp16 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + foreach(PREC fp8 fp16) + string(TOUPPER "${PREC}" PREC_MACRO) - cutlass_example_add_executable( - 77_blackwell_fmha_gen_fp16 - 77_blackwell_fmha_gen.cu - TEST_COMMAND_OPTIONS - TEST_GEN_BASIC - # TEST_GEN_VARLEN - # TEST_GEN_HDIM64 - # TEST_GEN_GQA - # TEST_GEN_REMAP - # TEST_GEN_CACHEONLY) - ) - target_include_directories(77_blackwell_fmha_gen_fp16 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) - endif() + cutlass_example_add_executable( + 77_blackwell_fmha_${PREC} + 77_blackwell_fmha.cu + TEST_COMMAND_OPTIONS + TEST_BASIC + # TEST_CAUSAL + # TEST_VARLEN + # 
TEST_HDIM64 + # TEST_GQA) + ) + target_include_directories(77_blackwell_fmha_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_fmha_${PREC} PRIVATE ${PREC_MACRO}) + + cutlass_example_add_executable( + 77_blackwell_fmha_gen_${PREC} + 77_blackwell_fmha_gen.cu + TEST_COMMAND_OPTIONS + TEST_GEN_BASIC + # TEST_GEN_VARLEN + # TEST_GEN_HDIM64 + # TEST_GEN_GQA + # TEST_GEN_REMAP + # TEST_GEN_CACHEONLY) + ) + target_include_directories(77_blackwell_fmha_gen_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_fmha_gen_${PREC} PRIVATE ${PREC_MACRO}) + + cutlass_example_add_executable( + 77_blackwell_mla_2sm_${PREC} + 77_blackwell_mla.cu + TEST_COMMAND_OPTIONS + TEST_MLA_BASIC + ) + target_include_directories(77_blackwell_mla_2sm_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_mla_2sm_${PREC} PRIVATE ${PREC_MACRO}) + target_compile_options(77_blackwell_mla_2sm_${PREC} PRIVATE -Xptxas -v) + + cutlass_example_add_executable( + 77_blackwell_mla_2sm_cpasync_${PREC} + 77_blackwell_mla.cu + TEST_COMMAND_OPTIONS + TEST_MLA_BASIC + ) + target_include_directories(77_blackwell_mla_2sm_cpasync_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_mla_2sm_cpasync_${PREC} PRIVATE ${PREC_MACRO} CPASYNC) + target_compile_options(77_blackwell_mla_2sm_cpasync_${PREC} PRIVATE -Xptxas -v) + + cutlass_example_add_executable( + 77_blackwell_mla_b2b_2sm_${PREC} + 77_blackwell_mla.cu + TEST_COMMAND_OPTIONS + TEST_MLA_BASIC + ) + target_include_directories(77_blackwell_mla_b2b_2sm_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_mla_b2b_2sm_${PREC} PRIVATE ${PREC_MACRO} B2B) + target_compile_options(77_blackwell_mla_b2b_2sm_${PREC} PRIVATE -Xptxas -v) + + cutlass_example_add_executable( + 77_blackwell_fmha_bwd_${PREC} + 77_blackwell_fmha_bwd.cu + TEST_COMMAND_OPTIONS + TEST_BASIC + # TEST_GEN_VARLEN + # TEST_GEN_HDIM64 + # TEST_GEN_GQA + # TEST_GEN_REMAP + # TEST_GEN_CACHEONLY) + ) + target_include_directories(77_blackwell_fmha_bwd_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_fmha_bwd_${PREC} PRIVATE ${PREC_MACRO}) + target_compile_options(77_blackwell_fmha_bwd_${PREC} PRIVATE -Xptxas -v) + + cutlass_example_add_executable( + 77_blackwell_fmha_bwd_sat_${PREC} + 77_blackwell_fmha_bwd.cu + TEST_COMMAND_OPTIONS + TEST_BASIC + # TEST_GEN_VARLEN + TEST_GEN_HDIM64 + # TEST_GEN_GQA + # TEST_GEN_REMAP + # TEST_GEN_CACHEONLY) + ) + target_include_directories(77_blackwell_fmha_bwd_sat_${PREC} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) + target_compile_definitions(77_blackwell_fmha_bwd_sat_${PREC} PRIVATE ${PREC_MACRO} SKIP_ATOMIC) + target_compile_options(77_blackwell_fmha_bwd_sat_${PREC} PRIVATE -Xptxas -v) + endforeach() endif() diff --git a/examples/77_blackwell_fmha/README.md b/examples/77_blackwell_fmha/README.md index 2f4c9c760b..a1536dc8b8 100644 --- a/examples/77_blackwell_fmha/README.md +++ b/examples/77_blackwell_fmha/README.md @@ -22,6 +22,39 @@ The `apply_mask` function is called with the accumulator of the first GEMM and t It is well-suited for applying masks or activations. More complex fusions that require memory loads would require modifying the mainloop collective to orchestrate the load via TMA. +# FMHA for Blackwell: Backward + +This sample provides code for fused multi-head attention backward pass. +It supports HeadDims of 64 and 128, and fp8, fp16, and bf16 input data types. 
+The blocking in sequence length Q and K is 128, and loads are done via TMA.
+We support causal masking.
+The structure of this code is very similar to the forward pass, and the techniques are analogous.
+
+The backward pass is computed by three kernels, launched in this order:
+1. `FmhaKernelBwdSumOdO` computes the row-wise dot product of O and dO.
+2. `Sm100FmhaBwdKernelTmaWarpSpecialized` computes the backward pass proper.
+3. `FmhaKernelBwdConvert` converts dQ from fp32 accumulation to the final output precision.
+
+`Sm100FmhaBwdKernelTmaWarpSpecialized` is the main point of this sample, as it demonstrates how to use tensor cores to achieve a high-performance fused kernel.
+
+# MLA Inference for Blackwell
+
+This sample provides code for fused multi-head latent attention inference in
+the weight-absorbed regime, i.e. for latent head dim 512 and rope head dim 64.
+It supports fp16, bf16, and fp8 input and output types.
+
+To accommodate the large output accumulator due to the large latent head dimension,
+the sample demonstrates how to leverage 2SM Blackwell tensor cores.
+
+Loading can be done via TMA (either without paging or with page size 128), or using `cp.async`,
+which supports any power-of-two page size less than or equal to 128.
+With paging, the code also supports variable sequence length.
+
+The approach of this implementation is to reuse the selection logic of the collective GEMM builder and recombine the result into an MLA kernel.
+
+The example builds six binaries, showcasing TMA and `cp.async` usage, as well as a back-to-back GEMM (essentially turning the softmax into a no-op) for fp8 and fp16.
+For details on how to invoke them, see the tests in `CMakeLists.txt` or run the binaries with `--help`.
+
 # Copyright
 
 Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
diff --git a/examples/77_blackwell_fmha/common/pow_2.hpp b/examples/77_blackwell_fmha/common/pow_2.hpp
new file mode 100644
index 0000000000..eca93250f4
--- /dev/null
+++ b/examples/77_blackwell_fmha/common/pow_2.hpp
@@ -0,0 +1,92 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include +#include + +#include + +namespace cutlass::fmha { + +struct Pow2 { + int n; + int log2_n; + + explicit CUTE_DEVICE Pow2(int n) : n(n) { +#ifdef __CUDA_ARCH__ + log2_n = __ffs(n) - 1; +#endif + } + + template + CUTE_HOST_DEVICE T operator *(T const& b) const { + return n * b; + } + + template + CUTE_HOST_DEVICE auto operator *(Int const&) const { + if constexpr (N & (N - 1) == 0) { + return Pow2{n * N}; + } + return n * N; + } + +}; + +template +CUTE_HOST_DEVICE auto operator/(T const& a, Pow2 const& b) { + return a >> b.log2_n; +} + +template +CUTE_HOST_DEVICE auto operator%(T const& a, Pow2 const& b) { + return a & (b.n - 1); +} + +template +CUTE_HOST_DEVICE bool operator<(T const& a, Pow2 const& b) { + return a < b.n; +} + +CUTE_HOST_DEVICE void print(Pow2 const& a) { + printf("2^%d", a.log2_n); +} + +} // end namespace cutlass::fmha + +namespace cute { + +template <> +struct is_integral : true_type {}; + +} // end namespace cute diff --git a/examples/77_blackwell_fmha/device/fmha_device_bwd.hpp b/examples/77_blackwell_fmha/device/fmha_device_bwd.hpp new file mode 100644 index 0000000000..80fcdf9fdf --- /dev/null +++ b/examples/77_blackwell_fmha/device/fmha_device_bwd.hpp @@ -0,0 +1,320 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + + +#pragma once + +// common +#include "cutlass/cutlass.h" +#include "cutlass/kernel_hardware_info.hpp" +#include "cute/tensor.hpp" + +#include "../device/fmha.hpp" +#include "../kernel/sm100_fmha_bwd_kernel_tma_warpspecialized.hpp" +#include "../kernel/fmha_kernel_bwd_sum_OdO.hpp" +#include "../kernel/fmha_kernel_bwd_convert.hpp" + +//////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::fmha::device { + +//////////////////////////////////////////////////////////////////////////////// +////////////////////////////// CUTLASS 3.x API ///////////////////////////////// +//////////////////////////////////////////////////////////////////////////////// + +template +class Sm100FmhaBwd { +public: + /// Argument structure: User API + struct Arguments { + // Q K D HB + cute::tuple> problem_size; + + const Element* ptr_Q; + cute::tuple> stride_Q; + const Element* ptr_K; + cute::tuple> stride_K; + const Element* ptr_V; + cute::tuple> stride_V; + + const Element* ptr_O; + cute::tuple> stride_O; + const ElementAccumulator* ptr_LSE; + cute::tuple> stride_LSE; + + const Element* ptr_dO; + cute::tuple> stride_dO; + + Element* ptr_dQ; + cute::tuple> stride_dQ; + Element* ptr_dK; + cute::tuple> stride_dK; + Element* ptr_dV; + cute::tuple> stride_dV; + + ElementAccumulator softmax_scale; + + cutlass::KernelHardwareInfo hw_info; + }; + + using OperationSumOdO = cutlass::fmha::device::FMHA< + cutlass::fmha::kernel::FmhaKernelBwdSumOdO + >; + using OperationConvert = cutlass::fmha::device::FMHA< + cutlass::fmha::kernel::FmhaKernelBwdConvert + >; + + using Operation = cutlass::fmha::device::FMHA< + cutlass::fmha::kernel::Sm100FmhaBwdKernelTmaWarpSpecialized + >; + using Kernel = typename Operation::Kernel; + + struct Params { + OperationSumOdO op_sum_OdO; + Operation op; + OperationConvert op_convert; + ElementAccumulator* dQ_acc; + size_t dQ_acc_size; + }; + +private: + Params params_; + + static typename OperationSumOdO::Arguments to_sum_OdO_arguments( + Arguments const& args, + ElementAccumulator* sum_odo = nullptr, + ElementAccumulator* scaled_lse = nullptr) { + using namespace cute; + auto [Q, K, D, HB] = args.problem_size; + auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + auto stride_sum_OdO = make_stride(_1{}, make_stride(Q, Q*H)); + auto stride_scaled_lse = make_stride(_1{}, make_stride(Q, Q*H)); + auto log2_e = log2f(expf(1.0f)); + return typename OperationSumOdO::Arguments { + args.problem_size, + args.ptr_O, args.stride_O, + args.ptr_dO, args.stride_dO, + sum_odo, stride_sum_OdO, + args.ptr_LSE, args.stride_LSE, + scaled_lse, stride_scaled_lse, + -1.0f, -log2_e + }; + } + + static typename OperationConvert::Arguments to_convert_arguments(Arguments const& args, ElementAccumulator* src = nullptr) { + using namespace cute; + auto [Q, K, D, HB] = args.problem_size; + 
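+    // problem_size is (Q, K, D, (H, B)); D and Q are rounded up to multiples of 8 below,
+    // matching the alignment used when sizing the fp32 dQ accumulator workspace.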
auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + auto stride_src_dQ = make_stride(D, _1{}, make_stride(D*Q, D*Q*H)); + return typename OperationConvert::Arguments { + args.problem_size, + src, stride_src_dQ, + nullptr, stride_src_dQ, + nullptr, stride_src_dQ, + args.ptr_dQ, args.stride_dQ, + nullptr, args.stride_dK, + nullptr, args.stride_dV, + args.softmax_scale + }; + } + + static typename Operation::Arguments to_bwd_arguments( + Arguments const& args, + ElementAccumulator* sum_OdO = nullptr, cute::tuple> const& stride_sum_OdO = {}, + ElementAccumulator* scaled_lse = nullptr, cute::tuple> const& stride_scaled_lse = {}, + ElementAccumulator* dQ_acc = nullptr, cute::tuple> const& stride_dQ = {}) { + return typename Operation::Arguments{ + args.problem_size, + { args.ptr_Q, args.stride_Q, + args.ptr_K, args.stride_K, + args.ptr_V, args.stride_V, + args.ptr_dO, args.stride_dO, + scaled_lse, stride_scaled_lse, + sum_OdO, stride_sum_OdO, + dQ_acc, stride_dQ, + args.softmax_scale }, + { args.ptr_dK, args.stride_dK, + args.ptr_dV, args.stride_dV }, + args.hw_info + }; + } + +public: + + /// Determines whether the GEMM can execute the given problem. + static Status + can_implement(Arguments const& args) { + Status status = Status::kSuccess; + + status = OperationSumOdO::can_implement(to_sum_OdO_arguments(args)); + if (status != Status::kSuccess) { + return status; + } + + status = OperationConvert::can_implement(to_convert_arguments(args)); + if (status != Status::kSuccess) { + return status; + } + + status = Operation::can_implement(to_bwd_arguments(args)); + if (status != Status::kSuccess) { + return status; + } + + return status; + } + + /// Gets the workspace size + static size_t + get_workspace_size(Arguments const& args) { + auto [Q, K, D, HB] = args.problem_size; + auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + size_t workspace_bytes = 0; + // OdO vector + workspace_bytes += B*H*Q * sizeof(ElementAccumulator); + // scaled LSE vector + workspace_bytes += B*H*Q * sizeof(ElementAccumulator); + // FP32 versions of outputs that are churned (start off with Q only) + workspace_bytes += B*H*Q*D * sizeof(ElementAccumulator); + return workspace_bytes; + } + + /// Initializes state from arguments. + Status + initialize_split(Arguments const& args, void* workspace_dQ, void* workspace_sum_OdO, void* workspace_scaled_lse, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("Universal::initialize_split() - workspace_dQ=" + << workspace_dQ << ", workspace_sum_OdO=" << workspace_sum_OdO << "stream: " << (stream ? 
"non-null" : "null")); + + auto [Q, K, D, HB] = args.problem_size; + auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + ElementAccumulator* sum_OdO = reinterpret_cast(workspace_sum_OdO); + ElementAccumulator* scaled_lse = reinterpret_cast(workspace_scaled_lse); + ElementAccumulator* dQ_acc = reinterpret_cast(workspace_dQ); + params_.dQ_acc = dQ_acc; + params_.dQ_acc_size = B*H*Q*D * sizeof(ElementAccumulator); + auto args_sum_OdO = to_sum_OdO_arguments(args, sum_OdO, scaled_lse); + auto args_convert = to_convert_arguments(args, dQ_acc); + params_.op_sum_OdO.initialize(args_sum_OdO, nullptr, stream); + params_.op_convert.initialize(args_convert, nullptr, stream); + auto args_bwd = to_bwd_arguments( + args, sum_OdO, args_sum_OdO.stride_sum_OdO, + scaled_lse, args_sum_OdO.stride_scaled_lse, + dQ_acc, args_convert.stride_src_dQ + ); + params_.op.initialize(args_bwd, nullptr, stream); + + return Status::kSuccess; + } + + /// Initializes state from arguments. + Status + initialize(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("Universal::initialize() - workspace " + << workspace << ", stream: " << (stream ? "non-null" : "null")); + + auto [Q, K, D, HB] = args.problem_size; + auto [H, B] = HB; + D = cutlass::round_up(D, 8); // Alignment + Q = cutlass::round_up(Q, 8); // Alignment + char* workspace_chr = reinterpret_cast(workspace); + ElementAccumulator* sum_OdO = reinterpret_cast(workspace_chr); + workspace_chr += B*H*Q * sizeof(ElementAccumulator); + ElementAccumulator* scaled_lse = reinterpret_cast(workspace_chr); + workspace_chr += B*H*Q * sizeof(ElementAccumulator); + ElementAccumulator* dQ_acc = reinterpret_cast(workspace_chr); + return initialize_split(args, dQ_acc, sum_OdO, scaled_lse, stream); + } + + /// Primary run() entry point API that is static allowing users to create and manage their own params. + /// Supplied params struct must be construct by calling Kernel::to_underling_arguments() + static Status + run(Params& params, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("FmhaDeviceBwd::run()"); + + Status result = Status::kSuccess; + result = params.op_sum_OdO.run(stream); + if (result != Status::kSuccess) { + return result; + } + + auto cuda_result = cudaMemsetAsync(params.dQ_acc, 0, params.dQ_acc_size, stream); + if (cuda_result != cudaSuccess) { + return Status::kErrorInternal; + } + + result = params.op.run(stream); + if (result != Status::kSuccess) { + return result; + } + + result = params.op_convert.run(stream); + if (result != Status::kSuccess) { + return result; + } + + return Status::kSuccess; + } + + // + // Non-static launch overloads that first create and set the internal params struct of this kernel handle. + // + + /// Launches the kernel after first constructing Params internal state from supplied arguments. + Status + run(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + Status status = initialize(args, workspace, stream); + if (Status::kSuccess == status) { + status = run(params_, stream); + } + return status; + } + + /// Overload that allows a user to re-launch the same kernel without updating internal params struct. 
+ Status + run(cudaStream_t stream = nullptr) { + return run(params_, stream); + } + +}; + +//////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::device + +//////////////////////////////////////////////////////////////////////////////// diff --git a/examples/77_blackwell_fmha/device/sm100_mla.hpp b/examples/77_blackwell_fmha/device/sm100_mla.hpp new file mode 100644 index 0000000000..4e09809007 --- /dev/null +++ b/examples/77_blackwell_fmha/device/sm100_mla.hpp @@ -0,0 +1,357 @@ +/*************************************************************************************************** + * Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! + \file + \brief An universal device layer for cutlass 3.x-style kernels. 
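+
+         For MLA, this device layer pairs the SM100 FMHA MLA kernel with a reduction
+         kernel that merges the per-split partial results whenever split_kv > 1.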
+*/ + +#pragma once + +// common +#include "cutlass/cutlass.h" +#include "cutlass/device_kernel.h" + +#if !defined(__CUDACC_RTC__) +#include "cutlass/cluster_launch.hpp" +#include "cutlass/trace.h" +#endif // !defined(__CUDACC_RTC__) + +#include "kernel/sm100_fmha_mla_tma_warpspecialized.hpp" +#include "kernel/sm100_fmha_mla_reduction.hpp" + +//////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::fmha::device { + +using namespace cute; +using namespace cutlass::fmha::kernel; + + +//////////////////////////////////////////////////////////////////////////////// +////////////////////////////// CUTLASS 3.x API ///////////////////////////////// +//////////////////////////////////////////////////////////////////////////////// + +template< + class Kernel_ +> +class MLA { +public: + + using Kernel = Kernel_; + + using ReductionKernel = cutlass::fmha::kernel::Sm100FmhaMlaReductionKernel< + typename Kernel::ElementOut, + typename Kernel::ElementAcc, + typename Kernel::ElementAcc, + Kernel::TileShapeH::value, + Kernel::TileShapeL::value, + 256 /*Max split*/ + >; + + /// Argument structure: User API + using KernelArguments = typename Kernel::Arguments; + using ReductionArguments = typename ReductionKernel::Arguments; + + using Arguments = KernelArguments; + + /// Argument structure: Kernel API + using KernelParams = typename Kernel::Params; + using ReductionParams = typename ReductionKernel::Params; + struct Params { + KernelParams fmha_params; + ReductionParams reduction_params; + }; + +private: + + /// Kernel API parameters object + Params params_; + + bool is_initialized(bool set = false) { + static bool initialized = false; + if (set) initialized = true; + return initialized; + } + + static ReductionArguments to_reduction_args(Arguments const& args) { + auto [H, K, D, B] = args.problem_shape; + return ReductionArguments{ + nullptr, args.epilogue.ptr_o, nullptr, args.epilogue.ptr_lse, + args.mainloop.softmax_scale, B, args.split_kv, K, args.mainloop.ptr_seq, + args.ptr_split_kv, Kernel::TileShapeS::value + }; + } + +public: + + /// Access the Params structure + Params const& params() const { + return params_; + } + + static void set_split_kv (KernelArguments& args) { + if (args.split_kv >= 1) return; + auto [H, K, D, B] = args.problem_shape; + int sm_count = args.hw_info.sm_count; + int max_splits = ceil_div(K, 128); + int sms_per_batch = max(1, sm_count / B); + int split_heur = min(max_splits, sms_per_batch); + int waves = ceil_div(B * split_heur, sm_count); + int k_waves = ceil_div(max_splits, split_heur); + int split_wave_aware = ceil_div(max_splits, k_waves); + args.split_kv = split_wave_aware; + } + + /// Determines whether the GEMM can execute the given problem. + static Status + can_implement(Arguments const& args) { + if (! Kernel::can_implement(args)) { + return Status::kInvalid; + } + if (! 
ReductionKernel::can_implement(to_reduction_args(args))) { + return Status::kInvalid; + } + return Status::kSuccess; + } + + /// Gets the workspace size + static size_t + get_workspace_size(Arguments const& args) { + size_t workspace_bytes = 0; + workspace_bytes += Kernel::get_workspace_size(args); + workspace_bytes += ReductionKernel::get_workspace_size(to_reduction_args(args)); + return workspace_bytes; + } + + /// Computes the maximum number of active blocks per multiprocessor + static int maximum_active_blocks(int /* smem_capacity */ = -1) { + CUTLASS_TRACE_HOST("MLA::maximum_active_blocks()"); + int max_active_blocks = -1; + int smem_size = Kernel::SharedStorageSize; + + // first, account for dynamic smem capacity if needed + cudaError_t result; + if (smem_size >= (48 << 10)) { + CUTLASS_TRACE_HOST(" Setting smem size to " << smem_size); + result = cudaFuncSetAttribute( + device_kernel, + cudaFuncAttributeMaxDynamicSharedMemorySize, + smem_size); + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST( + " cudaFuncSetAttribute() returned error: " + << cudaGetErrorString(result)); + return -1; + } + } + + // query occupancy after setting smem size + result = cudaOccupancyMaxActiveBlocksPerMultiprocessor( + &max_active_blocks, + device_kernel, + Kernel::MaxThreadsPerBlock, + smem_size); + + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST( + " cudaOccupancyMaxActiveBlocksPerMultiprocessor() returned error: " + << cudaGetErrorString(result)); + return -1; + } + + CUTLASS_TRACE_HOST(" max_active_blocks: " << max_active_blocks); + return max_active_blocks; + } + + /// Initializes GEMM state from arguments. + Status + initialize(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) { + CUTLASS_TRACE_HOST("MLA::initialize() - workspace " + << workspace << ", stream: " << (stream ? 
"non-null" : "null")); + + // Initialize the workspace + Status status = Kernel::initialize_workspace(args, workspace, stream); + if (status != Status::kSuccess) { + return status; + } + status = ReductionKernel::initialize_workspace(to_reduction_args(args), workspace, stream); + if (status != Status::kSuccess) { + return status; + } + KernelParams kernel_params = Kernel::to_underlying_arguments(args, workspace); + + ReductionArguments reduction_args = to_reduction_args(args); + if (reduction_args.split_kv > 1) { + reduction_args.ptr_oaccum = kernel_params.epilogue.ptr_o_acc; + reduction_args.ptr_lseaccum = kernel_params.epilogue.ptr_lse_acc; + } + ReductionParams reduction_params = ReductionKernel::to_underlying_arguments(reduction_args, workspace); + // Initialize the Params structure + params_ = Params {kernel_params, reduction_params}; + + if (is_initialized()) return Status::kSuccess; + + // account for dynamic smem capacity if needed + // no dynamic smem is needed for reduction kernel + int smem_size = Kernel::SharedStorageSize; + if (smem_size >= (48 << 10)) { + CUTLASS_TRACE_HOST(" Setting smem size to " << smem_size); + cudaError_t result = cudaFuncSetAttribute( + device_kernel, + cudaFuncAttributeMaxDynamicSharedMemorySize, + smem_size); + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + CUTLASS_TRACE_HOST(" cudaFuncSetAttribute() returned error: " << cudaGetErrorString(result)); + return Status::kErrorInternal; + } + } + + is_initialized(true); + + return Status::kSuccess; + } + + /// Update API is preserved in 3.0, but does not guarantee a lightweight update of params. + Status + update(Arguments const& args, void* workspace = nullptr) { + CUTLASS_TRACE_HOST("MLA()::update() - workspace: " << workspace); + + size_t workspace_bytes = get_workspace_size(args); + if (workspace_bytes > 0 && nullptr == workspace) { + return Status::kErrorWorkspaceNull; + } + + auto fmha_params = Kernel::to_underlying_arguments(args, workspace); + + ReductionArguments reduction_args = to_reduction_args(args); + if (reduction_args.split_kv > 1) { + reduction_args.ptr_oaccum = fmha_params.epilogue.ptr_o_acc; + reduction_args.ptr_lseaccum = fmha_params.epilogue.ptr_lse_acc; + } + ReductionParams reduction_params = ReductionKernel::to_underlying_arguments(reduction_args, workspace); + // Initialize the Params structure + params_ = Params {fmha_params, reduction_params}; + + return Status::kSuccess; + } + + /// Primary run() entry point API that is static allowing users to create and manage their own params. 
+  /// Primary run() entry point API that is static allowing users to create and manage their own params.
+  /// Supplied params struct must be constructed by calling Kernel::to_underlying_arguments()
+  static Status
+  run(Params& params, cudaStream_t stream = nullptr) {
+    CUTLASS_TRACE_HOST("MLA::run()");
+    dim3 const block = Kernel::get_block_shape();
+    dim3 const grid = Kernel::get_grid_shape(params.fmha_params);
+
+    // configure smem size and carveout
+    int smem_size = Kernel::SharedStorageSize;
+
+    Status launch_result;
+    // Use extended launch API only for mainloops that use it
+    if constexpr(Kernel::ArchTag::kMinComputeCapability >= 90) {
+      dim3 cluster(cute::size<0>(typename Kernel::ClusterShape{}),
+                   cute::size<1>(typename Kernel::ClusterShape{}),
+                   cute::size<2>(typename Kernel::ClusterShape{}));
+      void const* kernel = (void const*) device_kernel<Kernel>;
+      void* kernel_params[] = {&params.fmha_params};
+      launch_result = ClusterLauncher::launch(grid, cluster, block, smem_size, stream, kernel, kernel_params);
+    }
+    else {
+      launch_result = Status::kSuccess;
+      device_kernel<Kernel><<<grid, block, smem_size, stream>>>(params.fmha_params);
+    }
+
+    cudaError_t result = cudaGetLastError();
+    if (cudaSuccess != result or Status::kSuccess != launch_result) {
+      //return Status::kSuccess;
+      CUTLASS_TRACE_HOST(" Kernel launch failed. Reason: " << result);
+      return Status::kErrorInternal;
+    }
+    if (params.reduction_params.split_kv > 1) {
+      // launch reduction kernel
+      dim3 const block = ReductionKernel::get_block_shape();
+      dim3 const grid = ReductionKernel::get_grid_shape(params.reduction_params);
+      device_kernel<ReductionKernel><<<grid, block, 0, stream>>>(params.reduction_params);
+      cudaError_t result = cudaGetLastError();
+      if (cudaSuccess == result) {
+        return Status::kSuccess;
+      }
+      else {
+        CUTLASS_TRACE_HOST(" Kernel launch failed. Reason: " << result);
+        return Status::kErrorInternal;
+      }
+    }
+    else {
+      return Status::kSuccess;
+    }
+  }
+
+  //
+  // Non-static launch overloads that first create and set the internal params struct of this kernel handle.
+  //
+
+  /// Launches the kernel after first constructing Params internal state from supplied arguments.
+  Status
+  run(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) {
+    Status status = initialize(args, workspace, stream);
+    if (Status::kSuccess == status) {
+      status = run(params_, stream);
+    }
+    return status;
+  }
+
+  /// Launches the kernel after first constructing Params internal state from supplied arguments.
+  Status
+  operator()(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr) {
+    return run(args, workspace, stream);
+  }
+
+  /// Overload that allows a user to re-launch the same kernel without updating internal params struct.
+  Status
+  run(cudaStream_t stream = nullptr) {
+    return run(params_, stream);
+  }
+
+  /// Overload that allows a user to re-launch the same kernel without updating internal params struct.
+  Status
+  operator()(cudaStream_t stream = nullptr) {
+    return run(params_, stream);
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////
+
+} // namespace cutlass::fmha::device
+
+////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_convert.hpp b/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_convert.hpp
new file mode 100644
index 0000000000..c2618bcb70
--- /dev/null
+++ b/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_convert.hpp
@@ -0,0 +1,146 @@
+/***************************************************************************************************
+ * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + + +#pragma once + +#include "cutlass/cutlass.h" +#include "cute/layout.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; + +template +struct FmhaKernelBwdConvert { + + struct Arguments { + tuple> problem_size; + + const ElementAcc* ptr_src_dQ; + tuple> stride_src_dQ; + const ElementAcc* ptr_src_dK; + tuple> stride_src_dK; + const ElementAcc* ptr_src_dV; + tuple> stride_src_dV; + + Element* ptr_dest_dQ; + tuple> stride_dest_dQ; + Element* ptr_dest_dK; + tuple> stride_dest_dK; + Element* ptr_dest_dV; + tuple> stride_dest_dV; + + ElementAcc scale = 1.0; + }; + + using Params = Arguments; + + using ClusterShape = Shape<_1, _1, _1>; + static constexpr int SharedStorageSize = 0; + + static const int MinBlocksPerMultiprocessor = 1; + static const int MaxThreadsPerBlock = 128; + using ArchTag = cutlass::arch::Sm90; + + static const int kBlockSeq = 8; + + static size_t get_workspace_size(Arguments const& args) { return 0; } + static cutlass::Status initialize_workspace(Arguments const&, void*, cudaStream_t) { + return cutlass::Status::kSuccess; + } + + static const int kNumThreadsD = 16; + static const int kNumThreadsSeq = MaxThreadsPerBlock / kNumThreadsD; + static const int kElementsPerLoad = 4; + + static const int kIterationsSeq = kBlockSeq / kNumThreadsSeq; + + static bool can_implement(Arguments const& args) { + return get<2>(args.problem_size) % kElementsPerLoad == 0; + } + + static dim3 get_grid_shape(Params const& params) { + dim3 grid(size<3,0>(params.problem_size), size<3,1>(params.problem_size), ceil_div(std::max(size<0>(params.problem_size), size<1>(params.problem_size)), kBlockSeq)); + return grid; + } + + static dim3 get_block_shape() { + dim3 block(kNumThreadsD, kNumThreadsSeq, 1); + return block; + } + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + return args; + } + 
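  // Illustrative scalar reference for this kernel (not from this file): it rescales and downcasts
  // the fp32 dQ/dK/dV accumulators to Element. The grid covers the two (head, batch)-like modes
  // plus ceil_div(max(seqlen_Q, seqlen_K), kBlockSeq) sequence blocks, the block is
  // kNumThreadsD x kNumThreadsSeq = 16 x 8 threads, and each thread moves kElementsPerLoad = 4
  // contiguous values per step (a 16B load when ElementAcc is float).
  #if 0
  // per (batch, head) slice; `seqlen`, `head_dim`, and the strides are placeholder names
  for (int s = 0; s < seqlen; ++s) {
    for (int d = 0; d < head_dim; ++d) {
      dest[s * stride_dest_seq + d] = static_cast<Element>(scale * src[s * stride_src_seq + d]);
    }
  }
  #endif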
+ template + CUTLASS_DEVICE void copy(Params const& params, const ElementAcc* ptr_src, StrideSrc const& stride_src, Element* ptr_dest, StrideDest const& stride_dest, int count) { + auto ptr_src_bh = ptr_src + get<2,0>(stride_src) * blockIdx.x + get<2,1>(stride_src) * blockIdx.y; + auto ptr_dest_bh = ptr_dest + get<2,0>(stride_dest) * blockIdx.x + get<2,1>(stride_dest) * blockIdx.y; + + for (int idx_s_t = threadIdx.y; idx_s_t < kBlockSeq; idx_s_t += kNumThreadsSeq) { + int idx_s = idx_s_t + kBlockSeq * blockIdx.z; + if (idx_s >= count) continue; + auto ptr_src_bhs = ptr_src_bh + idx_s * get<0>(stride_src); + auto ptr_dest_bhs = ptr_dest_bh + idx_s * get<0>(stride_dest); + + for (int idx_d = threadIdx.x * kElementsPerLoad; idx_d < get<2>(params.problem_size); idx_d += kElementsPerLoad * kNumThreadsD) { + ElementAcc value_src[kElementsPerLoad]; + Element value_dest[kElementsPerLoad]; + + using VecSrc = uint_bit_t * kElementsPerLoad>; + using VecDest = uint_bit_t * kElementsPerLoad>; + *reinterpret_cast(value_src) = *reinterpret_cast(&ptr_src_bhs[idx_d]); + + for (int v = 0; v < kElementsPerLoad; v++) { + value_dest[v] = static_cast(params.scale * value_src[v]); + } + + *reinterpret_cast(&ptr_dest_bhs[idx_d]) = *reinterpret_cast(value_dest); + } + } + } + + CUTLASS_DEVICE void operator()(const Params ¶ms, char* smem) { + if (params.ptr_src_dQ != nullptr) { + copy(params, params.ptr_src_dQ, params.stride_src_dQ, params.ptr_dest_dQ, params.stride_dest_dQ, get<0>(params.problem_size)); + } + if (params.ptr_src_dK != nullptr) { + copy(params, params.ptr_src_dK, params.stride_src_dK, params.ptr_dest_dK, params.stride_dest_dK, get<1>(params.problem_size)); + } + if (params.ptr_src_dV != nullptr) { + copy(params, params.ptr_src_dV, params.stride_src_dV, params.ptr_dest_dV, params.stride_dest_dV, get<1>(params.problem_size)); + } + } +}; + +} // namespace cutlass::fmha::kernel diff --git a/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_sum_OdO.hpp b/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_sum_OdO.hpp new file mode 100644 index 0000000000..44080e2d10 --- /dev/null +++ b/examples/77_blackwell_fmha/kernel/fmha_kernel_bwd_sum_OdO.hpp @@ -0,0 +1,151 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + + +#pragma once + +#include "cutlass/cutlass.h" +#include "cute/layout.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; + +template +struct FmhaKernelBwdSumOdO { + + struct Arguments { + cute::tuple> problem_size; + + const Element* ptr_O; + cute::tuple> stride_O; + const Element* ptr_dO; + cute::tuple> stride_dO; + + ElementAcc* ptr_sum_OdO; + cute::tuple> stride_sum_OdO; + + const ElementAcc* ptr_lse = nullptr; + cute::tuple> stride_lse; + + ElementAcc* ptr_scaled_lse = nullptr; + cute::tuple> stride_scaled_lse; + + ElementAcc sum_odo_scale = 1.0; + ElementAcc lse_scale = 1.0; + }; + + using Params = Arguments; + + using ClusterShape = Shape<_1, _1, _1>; + static constexpr int SharedStorageSize = 0; + + static const int MinBlocksPerMultiprocessor = 1; + static const int MaxThreadsPerBlock = 128; + using ArchTag = cutlass::arch::Sm100; + + static size_t get_workspace_size(Arguments const& args) { return 0; } + static cutlass::Status initialize_workspace(Arguments const&, void*, cudaStream_t) { + return cutlass::Status::kSuccess; + } + + static const int kBlockQ = 16; + + static const int kNumThreadsD = 8; + static const int kNumThreadsQ = MaxThreadsPerBlock / kNumThreadsD; + static const int kElementsPerLoad = 2; + + static const int kIterationsQ = kBlockQ / kNumThreadsQ; + + static bool can_implement(Arguments const& args) { + return get<2>(args.problem_size) % kElementsPerLoad == 0; + } + + static dim3 get_grid_shape(Params const& params) { + dim3 grid(ceil_div(size<0>(params.problem_size), kBlockQ), size<3,0>(params.problem_size), size<3,1>(params.problem_size)); + return grid; + } + + static dim3 get_block_shape() { + dim3 block(kNumThreadsD, kNumThreadsQ, 1); + return block; + } + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + return args; + } + + CUTLASS_DEVICE void operator()(const Params ¶ms, char* smem) { + auto ptr_O_bh = params.ptr_O + blockIdx.y * get<2,0>(params.stride_O) + blockIdx.z * get<2,1>(params.stride_O); + auto ptr_dO_bh = params.ptr_dO + blockIdx.y * get<2,0>(params.stride_dO) + blockIdx.z * get<2,1>(params.stride_dO); + auto ptr_sum_OdO_bh = params.ptr_sum_OdO + blockIdx.y * get<1,0>(params.stride_sum_OdO) + blockIdx.z * get<1,1>(params.stride_sum_OdO); + auto ptr_lse_bh = params.ptr_lse + blockIdx.y * get<1,0>(params.stride_lse) + blockIdx.z * get<1,1>(params.stride_lse); + auto ptr_scaled_lse_bh = params.ptr_scaled_lse + blockIdx.y * get<1,0>(params.stride_scaled_lse) + blockIdx.z * get<1,1>(params.stride_scaled_lse); + + CUTLASS_PRAGMA_UNROLL + for (int idx_q_t = threadIdx.y; idx_q_t < kBlockQ; idx_q_t += kNumThreadsQ) { + int idx_q = idx_q_t + kBlockQ * blockIdx.x; + if (idx_q >= get<0>(params.problem_size)) continue; + ElementAcc acc = 0; + auto ptr_O_bhq = ptr_O_bh + idx_q * get<0>(params.stride_O); + auto ptr_dO_bhq = ptr_dO_bh + idx_q * get<0>(params.stride_dO); + auto 
ptr_sum_OdO_bhq = ptr_sum_OdO_bh + idx_q * get<0>(params.stride_sum_OdO); + auto ptr_lse_bhq = ptr_lse_bh + idx_q * get<0>(params.stride_lse); + auto ptr_scaled_lse_bhq = ptr_scaled_lse_bh + idx_q * get<0>(params.stride_scaled_lse); + + for (int idx_d = threadIdx.x * kElementsPerLoad; idx_d < get<2>(params.problem_size); idx_d += kElementsPerLoad * kNumThreadsD) { + Element value_O[kElementsPerLoad]; + Element value_dO[kElementsPerLoad]; + + using Vec = uint_bit_t * kElementsPerLoad>; + *reinterpret_cast(value_O) = *reinterpret_cast(&ptr_O_bhq[idx_d]); + *reinterpret_cast(value_dO) = *reinterpret_cast(&ptr_dO_bhq[idx_d]); + + for (int v = 0; v < kElementsPerLoad; v++) { + acc += value_O[v] * value_dO[v]; + } + } + + for (int i = 1; i < kNumThreadsD; i *= 2) { + acc += __shfl_xor_sync((uint32_t)-1, acc, i, kNumThreadsD); + } + + if (threadIdx.x == 0) { + *ptr_sum_OdO_bhq = params.sum_odo_scale * acc; + if (params.ptr_scaled_lse) { + *ptr_scaled_lse_bhq = params.lse_scale * *ptr_lse_bhq; + } + } + } + } +}; + +} // namespace cutlass::fmha::kernel diff --git a/examples/77_blackwell_fmha/kernel/sm100_fmha_bwd_kernel_tma_warpspecialized.hpp b/examples/77_blackwell_fmha/kernel/sm100_fmha_bwd_kernel_tma_warpspecialized.hpp new file mode 100644 index 0000000000..e1bd43d5e5 --- /dev/null +++ b/examples/77_blackwell_fmha/kernel/sm100_fmha_bwd_kernel_tma_warpspecialized.hpp @@ -0,0 +1,1699 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + + +#pragma once + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cute/arch/simd_sm100.hpp" + +#include "cutlass/arch/arch.h" +#include "cutlass/arch/memory_sm80.h" +#include "cutlass/gemm/collective/collective_builder.hpp" + +#include "collective/fmha_common.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cutlass::fmha::collective; + +using namespace cute; + +template< + class Element, + class ElementAcc, + class TileShape, + class Mask +> +struct Sm100FmhaBwdKernelTmaWarpSpecialized { + + using TileShapeQ = decltype(get<0>(TileShape{})); + static_assert(std::is_same_v, "tile shape K must be 128"); + using TileShapeK = decltype(get<1>(TileShape{})); + static_assert(std::is_same_v, "tile shape K must be 128"); + using TileShapeDQK = decltype(get<2>(TileShape{})); + using TileShapeDVO = decltype(get<2>(TileShape{})); + + using TmemAllocator = cute::TMEM::Allocator1Sm; + struct TmemAllocation { + static constexpr uint32_t kDK = 0; // TileShapeK x TileShapeDQK x acc + static constexpr uint32_t kDV = kDK + TileShapeDQK{}; // TileShapeK x TileShapeDVO x acc + static constexpr uint32_t kDQ = kDV + TileShapeDVO{}; // TileShapeQ x TileShapeDQK x acc + static constexpr uint32_t kDP = kDQ; // TileShapeK x TileShapeQ x inp + static constexpr uint32_t kS = kDQ + max(TileShapeQ{}, TileShapeDQK{}); + static constexpr uint32_t kP = kS; + static constexpr uint32_t kTotal = kS + TileShapeQ{}; + }; + + static_assert( + static_cast(TmemAllocation::kTotal) <= TmemAllocator::Sm100TmemCapacityColumns, + "using too much tmem" + ); + + enum class WarpRole { + Empty = 0x0, Load = 0x1, Mma = 0x2, Compute = 0x3, Reduce = 0x4 + }; + + static constexpr unsigned long long kWarpAssignment = 0x12'3333'3333'4444ull; + static constexpr int kNumComputeWarps = 8; + static constexpr int kNumReduceWarps = 4; + CUTLASS_DEVICE WarpRole warp_idx_to_role(int warp_idx) { + return static_cast((kWarpAssignment >> (4 * warp_idx)) & 0xF); + } + + struct RegisterAllocation { + static constexpr int kWarpgroup0 = 160-8; + static constexpr int kWarpgroup1 = 128; + static constexpr int kWarpgroup2 = 96; + static constexpr int kReduce = kWarpgroup0; + static constexpr int kCompute = kWarpgroup1; + static constexpr int kMma = kWarpgroup2; + static constexpr int kEmpty = kWarpgroup2; + static constexpr int kLoad = kWarpgroup2; + + static_assert(kWarpgroup0 + 2 * kWarpgroup1 + kWarpgroup2 <= 512); + }; + + using ArchTag = cutlass::arch::Sm100; + + using ClusterShape = Shape<_1, _1, _1>; + using Schedule = cutlass::gemm::KernelTmaWarpSpecialized1SmSm100; + + static constexpr int MinBlocksPerMultiprocessor = 1; + static constexpr int kNumWarps = kNumComputeWarps + kNumReduceWarps + 4; + static constexpr int MaxThreadsPerBlock = NumThreadsPerWarp * kNumWarps; + + static constexpr int Alignment = 128 / sizeof_bits_v; + static constexpr int kStages = 2; + + using TensorStrideContiguousK = Stride>; + using TensorStrideContiguousMN = Stride<_1, int, Stride>; + + // compute S + using CollectiveMmaKQ = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStrideContiguousK, Alignment, + Element, TensorStrideContiguousK, Alignment, + ElementAcc, + Shape, + ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TileShapeKQ = typename CollectiveMmaKQ::TileShape; + using TiledMmaKQ = typename 
CollectiveMmaKQ::TiledMma; + + // compute dP + using CollectiveMmaVDO = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStrideContiguousK, Alignment, + Element, TensorStrideContiguousK, Alignment, + ElementAcc, + Shape, + ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TileShapeVDO = typename CollectiveMmaVDO::TileShape; + using TiledMmaVDO = typename CollectiveMmaVDO::TiledMma; + + // compute dV + using CollectiveMmaPDO = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + // needs to match ordering of S calculation + Element, TensorStrideContiguousK, Alignment, + Element, TensorStrideContiguousMN, Alignment, + ElementAcc, + Shape, + ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TileShapePDO = typename CollectiveMmaPDO::TileShape; + using TiledMmaPDO = decltype(to_tiled_mma_sm100_ts(typename CollectiveMmaPDO::TiledMma{})); + + // compute dK + using CollectiveMmaDSQ = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + // somewhat arbitrary since we dump to smem, need to agree with the next one + Element, TensorStrideContiguousK , Alignment, + Element, TensorStrideContiguousMN, Alignment, + ElementAcc, + Shape, + ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TileShapeDSQ = typename CollectiveMmaDSQ::TileShape; + using TiledMmaDSQ = typename CollectiveMmaDSQ::TiledMma; + + // compute dQ + using CollectiveMmaDSK = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + // somewhat arbitrary since we dump to smem, need to agree with the previous one + Element, TensorStrideContiguousMN, Alignment, + Element, TensorStrideContiguousMN, Alignment, + ElementAcc, + Shape, + ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TileShapeDSK = typename CollectiveMmaDSK::TileShape; + using TiledMmaDSK = typename CollectiveMmaDSK::TiledMma; + + // pipelines are named Pipeline + static constexpr int kStagesComputeSmem = 1; + using PipelineLoadMmaQ = PipelineTmaUmmaAsync<2, ClusterShape>; + using PipelineLoadMmaDO = PipelineTmaUmmaAsync<1, ClusterShape>; + using PipelineLoadComputeLSE = PipelineAsync<1>; + using PipelineLoadComputeSumOdO = PipelineAsync<1>; + using PipelineMmaComputeS = PipelineUmmaAsync<1>; + using PipelineMmaComputeDP = PipelineUmmaAsync<1>; + using PipelineMmaReduceDQ = PipelineUmmaAsync<1>; + using PipelineComputeMmaP = PipelineUmmaConsumerAsync<1>; + using PipelineComputeMmaDS = PipelineUmmaConsumerAsync; + using PipelineMmaComputeDKDV = PipelineUmmaAsync<2>; + static constexpr int kStagesReduceTmaStore = 2; + using PipelineReduceTmaStore = PipelineTmaStore; + + struct PipelineStorage { + alignas(16) typename PipelineLoadMmaQ::SharedStorage load_mma_q; + alignas(16) typename PipelineLoadMmaDO::SharedStorage load_mma_do; + alignas(16) typename PipelineLoadComputeLSE::SharedStorage load_compute_lse; + alignas(16) typename PipelineLoadComputeSumOdO::SharedStorage load_compute_sum_odo; + alignas(16) typename PipelineMmaComputeS::SharedStorage mma_compute_s; + alignas(16) typename PipelineMmaComputeDP::SharedStorage mma_compute_dp; + alignas(16) typename PipelineMmaReduceDQ::SharedStorage mma_reduce_dq; + alignas(16) typename PipelineComputeMmaP::SharedStorage compute_mma_p; + 
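  // Reading of the producer/consumer graph these pipelines encode (annotation, not from this
  // file), based on the pipeline names and their use in load()/mma()/compute() below:
  //   Load    -> MMA     : Q tiles (load_mma_q), dO tiles (load_mma_do)                         [TMA]
  //   Load    -> Compute : LSE rows (load_compute_lse), sum(O*dO) rows (load_compute_sum_odo)   [cp.async]
  //   MMA     -> Compute : S = Q*K (mma_compute_s), dP = dO*V (mma_compute_dp), dK/dV ready (mma_compute_dkdv)
  //   Compute -> MMA     : P (compute_mma_p, via TMEM), dS (compute_mma_ds, via SMEM)
  //   MMA     -> Reduce  : dQ partials (mma_reduce_dq), written out with TMA reduce-add (reduce_tma_store)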
alignas(16) typename PipelineComputeMmaDS::SharedStorage compute_mma_ds; + alignas(16) typename PipelineMmaComputeDKDV::SharedStorage mma_compute_dkdv; + }; + + template + static CUTE_DEVICE constexpr auto restage(Layout const& layout, Stages stages = {}) { + return composition(layout, make_tuple(_, _, _, make_layout(stages))); + } + + using SmemLayoutK = decltype(restage(typename CollectiveMmaKQ::SmemLayoutA{})); + using SmemLayoutV = decltype(restage(typename CollectiveMmaVDO::SmemLayoutA{})); + using SmemLayoutQ = decltype(restage(typename CollectiveMmaKQ::SmemLayoutB{}, _2{})); + using SmemLayoutDO = decltype(restage(typename CollectiveMmaVDO::SmemLayoutB{}, _1{})); + using SmemLayoutDS = decltype(restage(typename CollectiveMmaDSK::SmemLayoutA{}, Int{})); + using SmemLayoutLSE = Layout>; + using SmemLayoutSumOdO = Layout>; + + using SmemLayoutQT = decltype(restage(typename CollectiveMmaDSQ::SmemLayoutB{}, _2{})); + using SmemLayoutKT = decltype(restage(typename CollectiveMmaDSK::SmemLayoutB{})); + using SmemLayoutDST = decltype(restage(typename CollectiveMmaDSQ::SmemLayoutA{}, Int{})); + using SmemLayoutDOT = decltype(restage(typename CollectiveMmaPDO::SmemLayoutB{}, _1{})); + + using TileShapeDQ = _32; + using SmemAtomDQ = decltype(cutlass::gemm::collective::detail::sm100_smem_selector< + cute::UMMA::Major::K, ElementAcc, TileShapeQ, TileShapeDQ + >()); + using SmemShapeDQ = Shape>; + using SmemLayoutDQ = decltype(tile_to_shape(SmemAtomDQ{}, SmemShapeDQ{}, Step<_2, _1, _3>{})); + + struct TensorStorage { + union { + alignas(2048) cute::array> smem_k; + alignas(2048) cute::array> smem_k_t; + }; + alignas(2048) cute::array> smem_v; + union { + alignas(2048) cute::array> smem_q; + alignas(2048) cute::array> smem_q_t; + }; + union { + alignas(2048) cute::array> smem_do; + alignas(2048) cute::array> smem_do_t; + }; + union { + alignas(2048) cute::array> smem_ds; + alignas(2048) cute::array> smem_ds_t; + }; + alignas(1024) cute::array> smem_dq; + alignas(16) cute::array> smem_lse; + alignas(16) cute::array> smem_sum_odo; + }; + + static constexpr int kTransactionsBytesLoadQ = cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutQ{})) * cute::sizeof_bits_v); + static constexpr int kTransactionsBytesLoadDO = cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutDO{})) * cute::sizeof_bits_v); + + static constexpr int kTransactionsBytesLoadK = cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutK{})) * cute::sizeof_bits_v); + static constexpr int kTransactionsBytesLoadV = cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutV{})) * cute::sizeof_bits_v); + + struct SharedStorage { + TensorStorage tensors; + PipelineStorage pipelines; + uint32_t tmem_base_ptr; + }; + + // this is tight enough that it won't work with sizeof due to padding for alignment + static constexpr int SharedStorageSize = offsetof(SharedStorage, tmem_base_ptr) + sizeof(uint32_t); + static_assert(SharedStorageSize <= cutlass::arch::sm100_smem_capacity_bytes, "using too much smem"); + + using ProblemShape = Shape>; // Q K D (H B), eventuall D = (D_QK, D_VO) + using TensorStride = TensorStrideContiguousK; // S D (H B) + using RowTensorStride = Stride<_1, Stride>; // S (H B) + + struct MainloopArguments { + const Element* ptr_q; + TensorStride stride_q; + const Element* ptr_k; + TensorStride stride_k; + const Element* ptr_v; + TensorStride stride_v; + const Element* ptr_do; + TensorStride stride_do; + + const ElementAcc* ptr_lse; + RowTensorStride stride_lse; + + const ElementAcc* ptr_sum_odo; + RowTensorStride stride_sum_odo; + + 
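  // Annotation (not from this file): dQ is not written in Element directly. Every K-tile CTA
  // produces a partial dQ for the same query rows, so partials are accumulated into this fp32
  // buffer via TMA reduce-add (see TMA_DQ / SM90_TMA_REDUCE_ADD below); a separate pass,
  // presumably FmhaKernelBwdConvert above, then applies the final scaling and downcasts to
  // Element. The softmax_scale default of 1 / sqrt(TileShapeDQK) is the usual attention scale.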
ElementAcc* ptr_dq_acc; + TensorStride stride_dq_acc; + + ElementAcc softmax_scale = 1.0f / sqrtf(TileShapeDQK{}); + }; + + using TMA_K = typename CollectiveMmaKQ::Params::TMA_A; + using TMA_V = typename CollectiveMmaVDO::Params::TMA_A; + using TMA_Q = typename CollectiveMmaKQ::Params::TMA_B; + using TMA_DO = typename CollectiveMmaVDO::Params::TMA_B; + + using TMA_DQ = decltype(make_tma_copy(SM90_TMA_REDUCE_ADD{}, + make_tensor((const ElementAcc*)nullptr, make_shape(1, 1, make_shape(1, 1)), TensorStride{}), + SmemLayoutDQ{}(_, _, _0{}) + )); + + struct MainloopParams { + TMA_K tma_load_k; + TMA_V tma_load_v; + TMA_Q tma_load_q; + TMA_DO tma_load_do; + TMA_DQ tma_red_dq; + }; + + struct EpilogueArguments { + Element* ptr_dk; + TensorStride stride_dk; + Element* ptr_dv; + TensorStride stride_dv; + }; + + struct Arguments { + ProblemShape problem_shape; + MainloopArguments mainloop; + EpilogueArguments epilogue; + KernelHardwareInfo hw_info; + }; + + struct Params { + ProblemShape problem_shape; + MainloopArguments mainloop; + MainloopParams mainloop_params; + EpilogueArguments epilogue; + KernelHardwareInfo hw_info; + }; + + + static bool can_implement(Arguments const& args) { + auto [Q, K, D, HB] = args.problem_shape; + auto [H, B] = HB; + if (Q <= 0 || K <= 0 || D <= 0 || H <= 0 || B <= 0) { + return false; + } + if (D % Alignment != 0) { + return false; + } + return true; + } + + + static Status initialize_workspace(Arguments const&, void*, cudaStream_t) { + return Status::kSuccess; + } + + + static Params to_underlying_arguments(Arguments const& args, void*) { + auto [Q, K, D, HB] = args.problem_shape; + + auto params_kq = CollectiveMmaKQ::to_underlying_arguments( + make_shape(K, Q, D, HB), + typename CollectiveMmaKQ::Arguments { + args.mainloop.ptr_k, args.mainloop.stride_k, + args.mainloop.ptr_q, args.mainloop.stride_q, + }, /*workspace=*/nullptr); + + auto params_vdo = CollectiveMmaVDO::to_underlying_arguments( + make_shape(K, Q, D, HB), + typename CollectiveMmaVDO::Arguments { + args.mainloop.ptr_v, args.mainloop.stride_v, + args.mainloop.ptr_do, args.mainloop.stride_do, + }, /*workspace=*/nullptr); + + TMA_DQ tma_red_dq = make_tma_copy( + SM90_TMA_REDUCE_ADD{}, + make_tensor(args.mainloop.ptr_dq_acc, make_shape(Q, D, HB), args.mainloop.stride_dq_acc), + SmemLayoutDQ{}(_, _, _0{}) + ); + + return Params{ + args.problem_shape, + args.mainloop, + MainloopParams{ + params_kq.tma_load_a, + params_vdo.tma_load_a, + params_kq.tma_load_b, + params_vdo.tma_load_b, + tma_red_dq + }, + args.epilogue, + args.hw_info + }; + } + + + template + static CUTLASS_DEVICE auto quantize(T const& input) { + constexpr int AlignmentS = 4; + auto output = make_tensor(shape(input)); + auto input_vec = recast>(input); + auto output_vec = recast>(output); + + cutlass::NumericArrayConverter epilogue_op; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(input_vec); i++) { + output_vec(i) = epilogue_op(input_vec(i)); + } + + return output; + } + + + template + CUTLASS_DEVICE void load( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + int iter_index, + int iter_count, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineLoadMmaQ& pipeline_load_mma_q, + typename PipelineLoadMmaQ::PipelineState& pipeline_load_mma_q_producer_state, + PipelineLoadMmaDO& pipeline_load_mma_do, + typename PipelineLoadMmaDO::PipelineState& pipeline_load_mma_do_producer_state, + PipelineLoadComputeLSE& pipeline_load_compute_lse, + typename 
PipelineLoadComputeLSE::PipelineState& pipeline_load_compute_lse_producer_state, + PipelineLoadComputeSumOdO& pipeline_load_compute_sum_odo, + typename PipelineLoadComputeSumOdO::PipelineState& pipeline_load_compute_sum_odo_producer_state) { + + auto [Q, K, D, HB] = problem_shape; + + using X = Underscore; + + uint16_t mcast_mask = 0; + + auto mK = mainloop_params.tma_load_k.get_tma_tensor(make_shape(K, D, HB)); + auto mQ = mainloop_params.tma_load_q.get_tma_tensor(make_shape(Q, D, HB)); + auto mV = mainloop_params.tma_load_v.get_tma_tensor(make_shape(K, D, HB)); + auto mDO = mainloop_params.tma_load_do.get_tma_tensor(make_shape(Q, D, HB)); + + auto gK = local_tile(mK, TileShapeKQ{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gQ = local_tile(mQ, TileShapeKQ{}, make_coord(_,_,_), Step{}); + auto gV = local_tile(mV, TileShapeVDO{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gDO = local_tile(mDO, TileShapeVDO{}, make_coord(_,_,_), Step{}); + + ThrMMA cta_mma_kq = TiledMmaKQ{}.get_slice(_0{}); + ThrMMA cta_mma_vdo = TiledMmaVDO{}.get_slice(_0{}); + + auto tSTgK = cta_mma_kq.partition_A(gK); + auto tSTgQ = cta_mma_kq.partition_B(gQ); + auto tDPTgV = cta_mma_vdo.partition_A(gV); + auto tDPTgDO = cta_mma_vdo.partition_B(gDO); + + auto sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + auto sK = make_tensor(make_smem_ptr(shared_tensors.smem_k.begin()), SmemLayoutK{}); + auto sV = make_tensor(make_smem_ptr(shared_tensors.smem_v.begin()), SmemLayoutV{}); + auto sDO = make_tensor(make_smem_ptr(shared_tensors.smem_do.begin()), SmemLayoutDO{}); + + auto [tKgK_mkl, tKsK] = tma_partition( + mainloop_params.tma_load_k, _0{}, make_layout(_1{}), + group_modes<0,3>(sK), group_modes<0,3>(tSTgK)); + auto [tQgQ_mkl, tQsQ] = tma_partition( + mainloop_params.tma_load_q, _0{}, make_layout(_1{}), + group_modes<0,3>(sQ), group_modes<0,3>(tSTgQ)); + auto [tVgV_mkl, tVsV] = tma_partition( + mainloop_params.tma_load_v, _0{}, make_layout(_1{}), + group_modes<0,3>(sV), group_modes<0,3>(tDPTgV)); + auto [tDOgDO_mkl, tDOsDO] = tma_partition( + mainloop_params.tma_load_do, _0{}, make_layout(_1{}), + group_modes<0,3>(sDO), group_modes<0,3>(tDPTgDO)); + + // set up lse and sum_odo + + auto [blk_coord_q, blk_coord_k, blk_coord_batch] = blk_coord; + + pipeline_load_mma_q.producer_acquire(pipeline_load_mma_q_producer_state); + auto tma_barrier = pipeline_load_mma_q.producer_get_barrier(pipeline_load_mma_q_producer_state); + + pipeline_load_mma_q.producer_expect_transaction(pipeline_load_mma_q_producer_state, kTransactionsBytesLoadK); + + // load K + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_k.with(*tma_barrier, mcast_mask), + tKgK_mkl(_, blk_coord_k, _0{}, blk_coord_batch), + tKsK(_, _0{}) + ); + } + + // load Q + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_q.with(*tma_barrier, mcast_mask), + tQgQ_mkl(_, iter_index, _0{}, blk_coord_batch), + tQsQ(_, pipeline_load_mma_q_producer_state.index()) + ); + } + + ++pipeline_load_mma_q_producer_state; + + pipeline_load_compute_lse.producer_acquire(pipeline_load_compute_lse_producer_state); + + // load LSE + // 32 threads loading 128 values of 32b each + // so 4*32b=128b + + int thread_idx = threadIdx.x % NumThreadsPerWarp; + int smem_idx = TileShapeQ{} * pipeline_load_compute_lse_producer_state.index() + thread_idx * 4; + int gmem_idx = TileShapeQ{} * iter_index + thread_idx * 4; + auto mLSE = make_tensor(mainloop_args.ptr_lse, make_shape(Q, HB), mainloop_args.stride_lse); + 
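    // Worked example of the indexing above (annotation, not from this file): one warp (32 threads)
    // covers the TileShapeQ = 128 LSE rows of this Q tile. Thread t issues one 16B cp.async for
    // rows [4*t, 4*t + 4), i.e. gmem_idx = 128 * iter_index + 4 * t, and the _zfill variant below
    // writes zeros instead of loading when the `gmem_idx < Q` predicate fails at the tail.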
cutlass::arch::cp_async_zfill<16>( + shared_tensors.smem_lse.begin() + smem_idx, + &mLSE(gmem_idx, blk_coord_batch), + gmem_idx < Q + ); + + pipeline_load_compute_lse.producer_commit(pipeline_load_compute_lse_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_load_compute_lse_producer_state; + + + pipeline_load_mma_do.producer_acquire(pipeline_load_mma_do_producer_state); + tma_barrier = pipeline_load_mma_do.producer_get_barrier(pipeline_load_mma_do_producer_state); + + pipeline_load_mma_do.producer_expect_transaction(pipeline_load_mma_do_producer_state, kTransactionsBytesLoadV); + + // load V + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_v.with(*tma_barrier, mcast_mask), + tVgV_mkl(_, blk_coord_k, _0{}, blk_coord_batch), + tVsV(_, _0{}) + ); + } + + // load dO + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_do.with(*tma_barrier, mcast_mask), + tDOgDO_mkl(_, iter_index, _0{}, blk_coord_batch), + tDOsDO(_, pipeline_load_mma_do_producer_state.index()) + ); + } + + ++pipeline_load_mma_do_producer_state; + + pipeline_load_compute_sum_odo.producer_acquire(pipeline_load_compute_sum_odo_producer_state); + + // load sum_OdO + smem_idx = TileShapeQ{} * pipeline_load_compute_sum_odo_producer_state.index() + thread_idx * 4; + gmem_idx = TileShapeQ{} * iter_index + thread_idx * 4; + auto mSumOdO = make_tensor(mainloop_args.ptr_sum_odo, make_shape(Q, HB), mainloop_args.stride_sum_odo); + cutlass::arch::cp_async<16>( + shared_tensors.smem_sum_odo.begin() + smem_idx, + &mSumOdO(gmem_idx, blk_coord_batch), + gmem_idx < Q + ); + + pipeline_load_compute_sum_odo.producer_commit(pipeline_load_compute_sum_odo_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_load_compute_sum_odo_producer_state; + + iter_count -= 1; + iter_index += 1; + + while (iter_count > 0) { + pipeline_load_mma_q.producer_acquire(pipeline_load_mma_q_producer_state); + tma_barrier = pipeline_load_mma_q.producer_get_barrier(pipeline_load_mma_q_producer_state); + + // load Q + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_q.with(*tma_barrier, mcast_mask), + tQgQ_mkl(_, iter_index, _0{}, blk_coord_batch), + tQsQ(_, pipeline_load_mma_q_producer_state.index()) + ); + } + + ++pipeline_load_mma_q_producer_state; + + pipeline_load_compute_lse.producer_acquire(pipeline_load_compute_lse_producer_state); + + // load LSE + smem_idx = TileShapeQ{} * pipeline_load_compute_lse_producer_state.index() + thread_idx * 4; + gmem_idx = TileShapeQ{} * iter_index + thread_idx * 4; + cutlass::arch::cp_async<16>( + shared_tensors.smem_lse.begin() + smem_idx, + &mLSE(gmem_idx, blk_coord_batch), + gmem_idx < Q + ); + + pipeline_load_compute_lse.producer_commit(pipeline_load_compute_lse_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_load_compute_lse_producer_state; + + pipeline_load_mma_do.producer_acquire(pipeline_load_mma_do_producer_state); + tma_barrier = pipeline_load_mma_do.producer_get_barrier(pipeline_load_mma_do_producer_state); + + // load dO + if (cute::elect_one_sync()) { + cute::copy( + mainloop_params.tma_load_do.with(*tma_barrier, mcast_mask), + tDOgDO_mkl(_, iter_index, _0{}, blk_coord_batch), + tDOsDO(_, pipeline_load_mma_do_producer_state.index()) + ); + } + + ++pipeline_load_mma_do_producer_state; + + pipeline_load_compute_sum_odo.producer_acquire(pipeline_load_compute_sum_odo_producer_state); + + // load sum_OdO + smem_idx = TileShapeQ{} * pipeline_load_compute_sum_odo_producer_state.index() + thread_idx * 4; + 
gmem_idx = TileShapeQ{} * iter_index + thread_idx * 4; + cutlass::arch::cp_async_zfill<16>( + shared_tensors.smem_sum_odo.begin() + smem_idx, + &mSumOdO(gmem_idx, blk_coord_batch), + gmem_idx < Q + ); + + pipeline_load_compute_sum_odo.producer_commit(pipeline_load_compute_sum_odo_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_load_compute_sum_odo_producer_state; + + iter_count -= 1; + iter_index += 1; + } + } + + + template + CUTLASS_DEVICE void mma( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + int iter_index, + int iter_count, + MainloopArguments const& mainloop_args, + TensorStorage& shared_tensors, + PipelineLoadMmaQ& pipeline_load_mma_q, + typename PipelineLoadMmaQ::PipelineState& pipeline_load_mma_q_consumer_state, + PipelineLoadMmaDO& pipeline_load_mma_do, + typename PipelineLoadMmaDO::PipelineState& pipeline_load_mma_do_consumer_state, + PipelineMmaComputeS& pipeline_mma_compute_s, + typename PipelineMmaComputeS::PipelineState& pipeline_mma_compute_s_producer_state, + PipelineMmaComputeDP& pipeline_mma_compute_dp, + typename PipelineMmaComputeDP::PipelineState& pipeline_mma_compute_dp_producer_state, + PipelineMmaReduceDQ& pipeline_mma_reduce_dq, + typename PipelineMmaReduceDQ::PipelineState& pipeline_mma_reduce_dq_producer_state, + PipelineComputeMmaP& pipeline_compute_mma_p, + typename PipelineComputeMmaP::PipelineState& pipeline_compute_mma_p_consumer_state, + PipelineComputeMmaDS& pipeline_compute_mma_ds, + typename PipelineComputeMmaDS::PipelineState& pipeline_compute_mma_ds_consumer_state, + PipelineMmaComputeDKDV& pipeline_mma_compute_dkdv, + typename PipelineMmaComputeDKDV::PipelineState& pipeline_mma_compute_dkdv_producer_state) { + + auto [Q, K, D, HB] = problem_shape; + + auto sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + auto sK = make_tensor(make_smem_ptr(shared_tensors.smem_k.begin()), SmemLayoutK{}); + auto sV = make_tensor(make_smem_ptr(shared_tensors.smem_v.begin()), SmemLayoutV{}); + auto sDO = make_tensor(make_smem_ptr(shared_tensors.smem_do.begin()), SmemLayoutDO{}); + + auto sQT = make_tensor(make_smem_ptr(shared_tensors.smem_q_t.begin()), SmemLayoutQT{}); + auto sKT = make_tensor(make_smem_ptr(shared_tensors.smem_k_t.begin()), SmemLayoutKT{}); + auto sDS = make_tensor(make_smem_ptr(shared_tensors.smem_ds.begin()), SmemLayoutDS{}); + auto sDST = make_tensor(make_smem_ptr(shared_tensors.smem_ds_t.begin()), SmemLayoutDST{}); + auto sP = make_tensor(make_smem_ptr((Element*) nullptr), typename CollectiveMmaPDO::SmemLayoutA{}); + auto sDOT = make_tensor(make_smem_ptr(shared_tensors.smem_do_t.begin()), SmemLayoutDOT{}); + + Tensor tSTrK = TiledMmaKQ::make_fragment_A(sK); + Tensor tSTrQ = TiledMmaKQ::make_fragment_B(sQ); + + Tensor tDPTrV = TiledMmaVDO::make_fragment_A(sV); + Tensor tDPTrDO = TiledMmaVDO::make_fragment_B(sDO); + + Tensor tDQrDS = TiledMmaDSK::make_fragment_A(sDS); + Tensor tDQrKT = TiledMmaDSK::make_fragment_B(sKT); + + Tensor tDKrDST = TiledMmaDSQ::make_fragment_A(sDST); + Tensor tDKrQT = TiledMmaDSQ::make_fragment_B(sQT); + + Tensor tDVrP = TiledMmaPDO::make_fragment_A(sP)(_, _, _, _0{}); + tDVrP.data() = TmemAllocation::kP; + Tensor tDVrDOT = TiledMmaPDO::make_fragment_B(sDOT); + + TiledMmaKQ tiled_mma_kq; + TiledMmaVDO tiled_mma_vdo; + TiledMmaDSK tiled_mma_dsk; + TiledMmaDSQ tiled_mma_dsq; + TiledMmaPDO tiled_mma_pdo; + + tiled_mma_dsq.accumulate_ = UMMA::ScaleOut::Zero; + tiled_mma_pdo.accumulate_ = UMMA::ScaleOut::Zero; + + Tensor tSTtST = 
partition_fragment_C(tiled_mma_kq, select<0,1>(TileShapeKQ{})); + tSTtST.data() = TmemAllocation::kS; + + Tensor tDPTtDPT = partition_fragment_C(tiled_mma_vdo, select<0,1>(TileShapeVDO{})); + tDPTtDPT.data() = TmemAllocation::kDP; + + Tensor tDQtDQ = partition_fragment_C(tiled_mma_dsk, select<0,1>(TileShapeDSK{})); + tDQtDQ.data() = TmemAllocation::kDQ; + + Tensor tDKtDK = partition_fragment_C(tiled_mma_dsq, select<0,1>(TileShapeDSQ{})); + tDKtDK.data() = TmemAllocation::kDK; + + Tensor tDVtDV = partition_fragment_C(tiled_mma_pdo, select<0,1>(TileShapePDO{})); + tDVtDV.data() = TmemAllocation::kDV; + + auto pipeline_load_mma_q_release_state = pipeline_load_mma_q_consumer_state; + + pipeline_load_mma_q.consumer_wait(pipeline_load_mma_q_consumer_state); + pipeline_mma_compute_s.producer_acquire(pipeline_mma_compute_s_producer_state); + + // S = Q*K + tiled_mma_kq.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSTrQ); ++k_block) { + cute::gemm(tiled_mma_kq, + tSTrK(_,_,k_block,_0{}), + tSTrQ(_,_,k_block,pipeline_load_mma_q_consumer_state.index()), + tSTtST); + tiled_mma_kq.accumulate_ = UMMA::ScaleOut::One; + } + + ++pipeline_load_mma_q_consumer_state; + + pipeline_mma_compute_s.producer_commit(pipeline_mma_compute_s_producer_state); + ++pipeline_mma_compute_s_producer_state; + + pipeline_load_mma_do.consumer_wait(pipeline_load_mma_do_consumer_state); + + pipeline_mma_compute_dp.producer_acquire(pipeline_mma_compute_dp_producer_state); + pipeline_mma_reduce_dq.producer_acquire(pipeline_mma_reduce_dq_producer_state); + + // dP = dO*V + tiled_mma_vdo.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDPTrV); ++k_block) { + cute::gemm(tiled_mma_vdo, + tDPTrV(_,_,k_block,_0{}), + tDPTrDO(_,_,k_block,pipeline_load_mma_do_consumer_state.index()), + tDPTtDPT); + tiled_mma_vdo.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_mma_compute_dp.producer_commit(pipeline_mma_compute_dp_producer_state); + ++pipeline_mma_compute_dp_producer_state; + + pipeline_compute_mma_p.consumer_wait(pipeline_compute_mma_p_consumer_state); + + // dV = P*dO + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDVrP); ++k_block) { + cute::gemm(tiled_mma_pdo, + tDVrP(_,_,k_block), + tDVrDOT(_,_,k_block,pipeline_load_mma_do_consumer_state.index()), + tDVtDV); + tiled_mma_pdo.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_compute_mma_p.consumer_release(pipeline_compute_mma_p_consumer_state); + ++pipeline_compute_mma_p_consumer_state; + + pipeline_load_mma_do.consumer_release(pipeline_load_mma_do_consumer_state); + ++pipeline_load_mma_do_consumer_state; + + iter_count -= 1; + + // in tmem, S & P overlap + // and dP and dQ overlap + // so we need to acquire dQ and dP at the same time + while (iter_count > 0) { + pipeline_load_mma_q.consumer_wait(pipeline_load_mma_q_consumer_state); + pipeline_mma_compute_s.producer_acquire(pipeline_mma_compute_s_producer_state); + + // S = Q*K + tiled_mma_kq.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSTrQ); ++k_block) { + cute::gemm(tiled_mma_kq, + tSTrK(_,_,k_block,_0{}), + tSTrQ(_,_,k_block,pipeline_load_mma_q_consumer_state.index()), + tSTtST); + tiled_mma_kq.accumulate_ = UMMA::ScaleOut::One; + } + + ++pipeline_load_mma_q_consumer_state; + + pipeline_mma_compute_s.producer_commit(pipeline_mma_compute_s_producer_state); + ++pipeline_mma_compute_s_producer_state; + + 
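      // Annotation (not from this file): why the dP acquire below must precede the dQ GEMM. With
      // TileShapeQ = TileShapeK = 128 and (assuming) 128-wide head dimensions, TmemAllocation puts
      // dK at column 0, dV at 128, dQ/dP at 256, and S/P at 384, totalling 512 columns, i.e. the
      // full SM100 TMEM capacity. Because dQ aliases dP (and the next S aliases P), the MMA warp
      // has to re-own those buffers via producer_acquire before issuing GEMMs that overwrite them.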
pipeline_compute_mma_ds.consumer_wait(pipeline_compute_mma_ds_consumer_state); + + // we need to acquire dP here, because tmem dQ == tmem dP + pipeline_mma_compute_dp.producer_acquire(pipeline_mma_compute_dp_producer_state); + + // dQ = dS*K + tiled_mma_dsk.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDQrDS); ++k_block) { + cute::gemm(tiled_mma_dsk, + tDQrDS(_,_,k_block,pipeline_compute_mma_ds_consumer_state.index()), + tDQrKT(_,_,k_block,_0{}), + tDQtDQ); + tiled_mma_dsk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_mma_reduce_dq.producer_commit(pipeline_mma_reduce_dq_producer_state); + ++pipeline_mma_reduce_dq_producer_state; + + // dK = dS*Q + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDKrDST); ++k_block) { + cute::gemm(tiled_mma_dsq, + tDKrDST(_,_,k_block,pipeline_compute_mma_ds_consumer_state.index()), + tDKrQT(_,_,k_block,pipeline_load_mma_q_release_state.index()), + tDKtDK); + tiled_mma_dsq.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_mma_q.consumer_release(pipeline_load_mma_q_release_state); + ++pipeline_load_mma_q_release_state; + + pipeline_compute_mma_ds.consumer_release(pipeline_compute_mma_ds_consumer_state); + ++pipeline_compute_mma_ds_consumer_state; + + // we grab dq here, because in tmem dq == dp + pipeline_mma_reduce_dq.producer_acquire(pipeline_mma_reduce_dq_producer_state); + + pipeline_load_mma_do.consumer_wait(pipeline_load_mma_do_consumer_state); + + // dP = dO*V + tiled_mma_vdo.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDPTrV); ++k_block) { + cute::gemm(tiled_mma_vdo, + tDPTrV(_,_,k_block,_0{}), + tDPTrDO(_,_,k_block,pipeline_load_mma_do_consumer_state.index()), + tDPTtDPT); + tiled_mma_vdo.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_mma_compute_dp.producer_commit(pipeline_mma_compute_dp_producer_state); + ++pipeline_mma_compute_dp_producer_state; + + pipeline_compute_mma_p.consumer_wait(pipeline_compute_mma_p_consumer_state); + + // dV = P*dO + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDVrP); ++k_block) { + cute::gemm(tiled_mma_pdo, + tDVrP(_,_,k_block), + tDVrDOT(_,_,k_block,pipeline_load_mma_do_consumer_state.index()), + tDVtDV); + tiled_mma_pdo.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_compute_mma_p.consumer_release(pipeline_compute_mma_p_consumer_state); + ++pipeline_compute_mma_p_consumer_state; + + pipeline_load_mma_do.consumer_release(pipeline_load_mma_do_consumer_state); + ++pipeline_load_mma_do_consumer_state; + + iter_count -= 1; + } + + // signal to the epilogue that dV is ready + pipeline_mma_compute_dkdv.producer_acquire(pipeline_mma_compute_dkdv_producer_state); + pipeline_mma_compute_dkdv.producer_commit(pipeline_mma_compute_dkdv_producer_state); + ++pipeline_mma_compute_dkdv_producer_state; + + pipeline_mma_compute_dkdv.producer_acquire(pipeline_mma_compute_dkdv_producer_state); + + pipeline_compute_mma_ds.consumer_wait(pipeline_compute_mma_ds_consumer_state); + + // dK = dS*Q + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDKrDST); ++k_block) { + cute::gemm(tiled_mma_dsq, + tDKrDST(_,_,k_block,pipeline_compute_mma_ds_consumer_state.index()), + tDKrQT(_,_,k_block,pipeline_load_mma_q_release_state.index()), + tDKtDK); + tiled_mma_dsq.accumulate_ = UMMA::ScaleOut::One; + } + + // signal to epilgue that dK is ready + pipeline_mma_compute_dkdv.producer_commit(pipeline_mma_compute_dkdv_producer_state); + 
++pipeline_mma_compute_dkdv_producer_state; + + // we've already acquired mma_reduce_dq in the loop + + // dQ = dS*K + tiled_mma_dsk.accumulate_ = UMMA::ScaleOut::Zero; + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tDQrDS); ++k_block) { + cute::gemm(tiled_mma_dsk, + tDQrDS(_,_,k_block,pipeline_compute_mma_ds_consumer_state.index()), + tDQrKT(_,_,k_block,_0{}), + tDQtDQ); + tiled_mma_dsk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_mma_reduce_dq.producer_commit(pipeline_mma_reduce_dq_producer_state); + ++pipeline_mma_reduce_dq_producer_state; + + pipeline_load_mma_q.consumer_release(pipeline_load_mma_q_release_state); + ++pipeline_load_mma_q_release_state; + + pipeline_compute_mma_ds.consumer_release(pipeline_compute_mma_ds_consumer_state); + ++pipeline_compute_mma_ds_consumer_state; + } + + + + template + CUTLASS_DEVICE void store( + TensorG gmem, + TensorR const& regs, + TensorC const& coord, + TensorShape const& tensor_shape) { + + auto copy_op = make_cotiled_copy( + Copy_Atom, Element>{}, + make_layout(make_shape(_1{}, Int{})), + regs.layout() + ); + auto thr_copy = copy_op.get_slice(_0{}); + + auto tCg = thr_copy.partition_D(gmem); + auto tCr = thr_copy.partition_S(quantize(regs)); + auto tCc = thr_copy.partition_D(coord); + + constexpr int R = decltype(tCr.layout())::rank; + auto tCg_v = group_modes<1, R>(tCg); + auto tCr_v = group_modes<1, R>(tCr); + auto tCc_v = group_modes<1, R>(tCc); + auto tCp_v = make_tensor(shape<1>(tCc_v)); + + for (int i = 0; i < size(tCp_v); ++i) { + tCp_v(i) = elem_less(tCc_v(_0{},i), tensor_shape); + } + + copy_if(copy_op, tCp_v, tCr_v, tCg_v); + } + + + template + CUTLASS_DEVICE void epilogue( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + EpilogueArguments const& epilogue_args, + PipelineMmaComputeDKDV& pipeline_mma_compute_dkdv, + typename PipelineMmaComputeDKDV::PipelineState& pipeline_mma_compute_dkdv_consumer_state) { + + auto [Q, K, D, HB] = problem_shape; + auto [blk_coord_q, blk_coord_k, blk_coord_batch] = blk_coord; + + auto load_op = SM100_TMEM_LOAD_32dp32b16x{}; + + auto tDKtDK = partition_fragment_C(TiledMmaDSQ{}, select<0,1>(TileShapeDSQ{}))(make_coord(_,_),_0{},_0{}); + tDKtDK.data() = TmemAllocation::kDK; + + auto mDK = make_tensor(make_gmem_ptr(epilogue_args.ptr_dk), make_shape(K, TileShapeDQK{}, HB), epilogue_args.stride_dk); + auto gDK = local_tile(mDK, TileShapeDSQ{}, make_coord(_,_,_), Step<_1, _1, X>{}) + (_, _, blk_coord_k, _0{}, blk_coord_batch); + + Tensor cDK = domain_offset( + make_coord(get<1>(blk_coord) * TileShapeK{}, _0{}), + make_identity_tensor(take<0,2>(TileShapeDSQ{})) + ); + + constexpr int kNumWarpgroups = kNumComputeWarps / 4; + int dp_idx = threadIdx.x % 128; + int wg_idx = (threadIdx.x % (kNumComputeWarps * NumThreadsPerWarp)) / 128; + + auto split_wg = [&](auto const& t) { + if constexpr (decltype(rank(t))::value == 3) { + auto p = t.compose(make_layout(make_shape(size<0>(t), size<1>(t), make_shape(Int{}, size<2>(t) / Int{})))); + return p(_, _, make_coord(wg_idx, _)); + } + else { + auto p = t.compose(make_layout(make_shape(size<0>(t), size<1>(t), size<2>(t), make_shape(Int{}, size<3>(t) / Int{})))); + return p(_, _, _, make_coord(wg_idx, _)); + } + }; + + auto tiled_t2r_dk = make_tmem_copy(load_op, tDKtDK); + auto thread_t2r_dk = tiled_t2r_dk.get_slice(dp_idx); + + Tensor tTR_cDK = split_wg(thread_t2r_dk.partition_D(cDK)); + Tensor tTR_gDK = split_wg(thread_t2r_dk.partition_D(gDK)); + Tensor tTR_rDK = 
make_tensor(shape(tTR_cDK)); + Tensor tTR_tDK = split_wg(thread_t2r_dk.partition_S(tDKtDK)); + + auto tDVtDV = partition_fragment_C(TiledMmaDSQ{}, select<0,1>(TileShapeDSQ{}))(make_coord(_,_),_0{},_0{}); + tDVtDV.data() = TmemAllocation::kDV; + + auto mDV = make_tensor(make_gmem_ptr(epilogue_args.ptr_dv), make_shape(K, TileShapeDVO{}, HB), epilogue_args.stride_dv); + auto gDV = local_tile(mDV, TileShapePDO{}, make_coord(_,_,_), Step<_1, _1, X>{}) + (_, _, blk_coord_k, _0{}, blk_coord_batch); + + Tensor cDV = domain_offset( + make_coord(get<1>(blk_coord) * TileShapeK{}, _0{}), + make_identity_tensor(take<0,2>(TileShapePDO{})) + ); + + auto tiled_t2r_dv = make_tmem_copy(load_op, tDVtDV); + auto thread_t2r_dv = tiled_t2r_dv.get_slice(dp_idx); + + Tensor tTR_cDV = split_wg(thread_t2r_dv.partition_D(cDV)); + Tensor tTR_gDV = split_wg(thread_t2r_dv.partition_D(gDV)); + Tensor tTR_rDV = make_tensor(shape(tTR_cDV)); + Tensor tTR_tDV = split_wg(thread_t2r_dv.partition_S(tDVtDV)); + + pipeline_mma_compute_dkdv.consumer_wait(pipeline_mma_compute_dkdv_consumer_state); + + // load tDVtDV + cute::copy(tiled_t2r_dv, tTR_tDV, tTR_rDV); + + // store tDVgDV + store(tTR_gDV, tTR_rDV, tTR_cDV, select<1,2>(problem_shape)); + + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_compute_dkdv.consumer_release(pipeline_mma_compute_dkdv_consumer_state); + ++pipeline_mma_compute_dkdv_consumer_state; + + pipeline_mma_compute_dkdv.consumer_wait(pipeline_mma_compute_dkdv_consumer_state); + + // load tDKtDK + cute::copy(tiled_t2r_dk, tTR_tDK, tTR_rDK); + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rDK); i++) { + tTR_rDK(i) = mainloop_args.softmax_scale * tTR_rDK(i); + } + + // store tDKgDK + store(tTR_gDK, tTR_rDK, tTR_cDK, select<1,2>(problem_shape)); + + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_compute_dkdv.consumer_release(pipeline_mma_compute_dkdv_consumer_state); + ++pipeline_mma_compute_dkdv_consumer_state; + + } + + + template + CUTLASS_DEVICE void compute( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + int iter_index, + int iter_count, + MainloopArguments const& mainloop_args, + EpilogueArguments const& epilogue_args, + TensorStorage& shared_tensors, + PipelineLoadComputeLSE& pipeline_load_compute_lse, + typename PipelineLoadComputeLSE::PipelineState& pipeline_load_compute_lse_consumer_state, + PipelineLoadComputeSumOdO& pipeline_load_compute_sum_odo, + typename PipelineLoadComputeSumOdO::PipelineState& pipeline_load_compute_sum_odo_consumer_state, + PipelineMmaComputeS& pipeline_mma_compute_s, + typename PipelineMmaComputeS::PipelineState& pipeline_mma_compute_s_consumer_state, + PipelineMmaComputeDP& pipeline_mma_compute_dp, + typename PipelineMmaComputeDP::PipelineState& pipeline_mma_compute_dp_consumer_state, + PipelineComputeMmaP& pipeline_compute_mma_p, + typename PipelineComputeMmaP::PipelineState& pipeline_compute_mma_p_producer_state, + PipelineComputeMmaDS& pipeline_compute_mma_ds, + typename PipelineComputeMmaDS::PipelineState& pipeline_compute_mma_ds_producer_state, + PipelineMmaComputeDKDV& pipeline_mma_compute_dkdv, + typename PipelineMmaComputeDKDV::PipelineState& pipeline_mma_compute_dkdv_consumer_state) { + + + auto [Q, K, D, HB] = problem_shape; + + // in tmem, S & P overlap + // and dP and dQ overlap + + // there are two compute wg's that cooperatively compute softmax + // they are striped by this tmem atom, i.e. 
wg0 has 16 elems, then wg1 etc + + auto load_op = SM100_TMEM_LOAD_32dp32b16x{}; + auto store_op = SM100_TMEM_STORE_32dp32b8x{}; + + Tensor tSTtST = partition_fragment_C(TiledMmaKQ{}, select<0,1>(TileShapeKQ{}))(make_coord(_,_),_0{},_0{}); + tSTtST.data() = TmemAllocation::kS; + + Tensor tDPTtDPT = partition_fragment_C(TiledMmaVDO{}, select<0,1>(TileShapeVDO{}))(make_coord(_,_),_0{},_0{}); + tDPTtDPT.data() = TmemAllocation::kDP; + + Tensor cST = make_identity_tensor(take<0,2>(TileShapeKQ{})); + Tensor cDPT = make_identity_tensor(take<0,2>(TileShapeVDO{})); + + constexpr int kNumWarpgroups = kNumComputeWarps / 4; + int dp_idx = threadIdx.x % 128; + int wg_idx = (threadIdx.x % (kNumComputeWarps * NumThreadsPerWarp)) / 128; + auto tiled_t2r = make_tmem_copy(load_op, tSTtST); + auto thread_t2r = tiled_t2r.get_slice(dp_idx); + + auto split_wg = [&](auto const& t) { + if constexpr (decltype(rank(t))::value == 3) { + auto p = t.compose(make_layout(make_shape(size<0>(t), size<1>(t), make_shape(Int{}, size<2>(t) / Int{})))); + return p(_, _, make_coord(wg_idx, _)); + } + else { + auto p = t.compose(make_layout(make_shape(size<0>(t), size<1>(t), size<2>(t), make_shape(Int{}, size<3>(t) / Int{})))); + return p(_, _, _, make_coord(wg_idx, _)); + } + }; + + Tensor tTR_cST = split_wg(thread_t2r.partition_D(cST)); + Tensor tTR_rST = make_tensor(shape(tTR_cST)); + Tensor tTR_tST = split_wg(thread_t2r.partition_S(tSTtST)); + + Tensor tTR_cDPT_p = thread_t2r.partition_D(cDPT); + Tensor tTR_cDPT = split_wg(tTR_cDPT_p); + Tensor tTR_rDPT = make_tensor(shape(tTR_cDPT)); + Tensor tTR_tDPT = split_wg(thread_t2r.partition_S(tDPTtDPT)); + + Tensor sLSE = make_tensor(make_smem_ptr(shared_tensors.smem_lse.begin()), SmemLayoutLSE{}); + Tensor sSumOdO = make_tensor(make_smem_ptr(shared_tensors.smem_sum_odo.begin()), SmemLayoutSumOdO{}); + + auto sP = make_tensor(make_smem_ptr((Element*) nullptr), typename CollectiveMmaPDO::SmemLayoutA{}); + + + auto tDVrP = TiledMmaPDO::make_fragment_A(sP)(_, _, _, _0{}); + auto tDVcST = TiledMmaPDO{}.get_slice(_0{}).partition_A(cST); + tDVrP.data() = TmemAllocation::kP; + + auto tiled_r2t = make_tmem_copy(store_op, tDVrP); + auto thread_r2t = tiled_r2t.get_slice(dp_idx); + + auto tRT_tP = split_wg(thread_r2t.partition_D(tDVrP)); + auto tRT_cST = split_wg(thread_r2t.partition_S(tDVcST)); + + CUTLASS_PRAGMA_NO_UNROLL + while (iter_count > 0) { + // wait for S and P + pipeline_mma_compute_s.consumer_wait(pipeline_mma_compute_s_consumer_state); + pipeline_compute_mma_p.producer_acquire(pipeline_compute_mma_p_producer_state); + // wait for LSE + pipeline_load_compute_lse.consumer_wait(pipeline_load_compute_lse_consumer_state); + + auto dispatch_bool = [](bool b, auto fn) { + if (b) { + fn(cute::true_type{}); + } + else { + fn(cute::false_type{}); + } + }; + + dispatch_bool(std::is_base_of_v && + warp_uniform(iter_index == get<1>(blk_coord)), [&](auto is_causal_masked_tile) { + + // compute P = softmax(S, LSE) + cute::copy(tiled_t2r, tTR_tST, tTR_rST); + + if constexpr (std::is_base_of_v && decltype(is_causal_masked_tile)::value) { + Mask{}.apply_mask(tTR_rST, [&](int i) { + auto c_transpose = tTR_cST(i); + return make_coord(get<1>(c_transpose) + iter_index * TileShapeQ{}, get<0>(c_transpose) + get<1>(blk_coord) * TileShapeK{}); + }, problem_shape); + } + + ElementAcc log2_e = static_cast(M_LOG2E); + float2 softmax_scale_log2_e; + softmax_scale_log2_e.x = mainloop_args.softmax_scale * log2_e; + softmax_scale_log2_e.y = mainloop_args.softmax_scale * log2_e; + + CUTLASS_PRAGMA_UNROLL + for 
(int i = 0; i < size(tTR_rST); i += 2) { + float2 acc; + float2 lse; + float2 out; + acc.x = tTR_rST(i); + acc.y = tTR_rST(i + 1); + lse.x = sLSE(get<1>(tTR_cST(i)), pipeline_load_compute_lse_consumer_state.index()); + lse.y = sLSE(get<1>(tTR_cST(i+1)), pipeline_load_compute_lse_consumer_state.index()); + cute::fma(out, softmax_scale_log2_e, acc, lse); + tTR_rST(i) = ::exp2f(out.x); + tTR_rST(i+1) = ::exp2f(out.y); + } + + auto tRT_rST = quantize(tTR_rST); + auto tRT_rST_reshaped = make_tensor(tRT_rST.data(), shape(tRT_cST)); + + cutlass::arch::fence_view_async_tmem_load(); + cutlass::arch::NamedBarrier( + kNumComputeWarps * NumThreadsPerWarp, + cutlass::arch::ReservedNamedBarriers::TransformBarrier + ).arrive_and_wait(); + + cute::copy(tiled_r2t, tRT_rST_reshaped, tRT_tP); + }); + + // notify for P + cutlass::arch::fence_view_async_tmem_store(); + pipeline_compute_mma_p.producer_commit(pipeline_compute_mma_p_producer_state); + ++pipeline_compute_mma_p_producer_state; + // release S + pipeline_mma_compute_s.consumer_release(pipeline_mma_compute_s_consumer_state); + ++pipeline_mma_compute_s_consumer_state; + // release LSE + pipeline_load_compute_lse.consumer_release(pipeline_load_compute_lse_consumer_state); + ++pipeline_load_compute_lse_consumer_state; + + // wait for OdO + pipeline_load_compute_sum_odo.consumer_wait(pipeline_load_compute_sum_odo_consumer_state); + // wait for dP + pipeline_mma_compute_dp.consumer_wait(pipeline_mma_compute_dp_consumer_state); + + // wait for dS + // in principle, we could defer waiting for dS, and move in the freeing of dP + // however, that would force us to keep dS in registers longer + pipeline_compute_mma_ds.producer_acquire(pipeline_compute_mma_ds_producer_state); + + // compute dS = dsoftmax(P, dP, sum_OdO) + cute::copy(tiled_t2r, tTR_tDPT, tTR_rDPT); + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rDPT); i += 2) { + float2 st; + st.x = tTR_rST(i); + st.y = tTR_rST(i+1); + float2 dpt; + dpt.x = tTR_rDPT(i); + dpt.y = tTR_rDPT(i+1); + float2 odo; + odo.x = sSumOdO(get<1>(tTR_cDPT(i)), pipeline_load_compute_sum_odo_consumer_state.index()); + odo.y = sSumOdO(get<1>(tTR_cDPT(i+1)), pipeline_load_compute_sum_odo_consumer_state.index()); + float2 dif; + // sum odo is negated during preprocess + cute::add(dif, dpt, odo); + float2 out; + cute::mul(out, dif, st); + tTR_rDPT(i) = out.x; + tTR_rDPT(i+1) = out.y; + } + + auto tTR_rDST = quantize(tTR_rDPT); + + // release dP + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_compute_dp.consumer_release(pipeline_mma_compute_dp_consumer_state); + ++pipeline_mma_compute_dp_consumer_state; + + Tensor sDS = make_tensor(make_smem_ptr((Element*) shared_tensors.smem_ds.begin()), SmemLayoutDS{}) + (_, _, _, pipeline_compute_mma_ds_producer_state.index()); + + auto thread_layout = make_ordered_layout( + make_shape(_128{}, _128{}), + make_stride(_1{}, _0{}) + ); + + auto sDS_pi = as_position_independent_swizzle_tensor(sDS); + auto sDS_pi_slice_p = sDS_pi.compose(thread_layout)(dp_idx, _).compose(make_layout(shape(tTR_cDPT_p))); + auto sDS_pi_slice = split_wg(sDS_pi_slice_p); + + copy_aligned(tTR_rDST, sDS_pi_slice); + + // notify for dS + cutlass::arch::fence_view_async_shared(); + pipeline_compute_mma_ds.producer_commit(pipeline_compute_mma_ds_producer_state); + ++pipeline_compute_mma_ds_producer_state; + // release OdO + pipeline_load_compute_sum_odo.consumer_release(pipeline_load_compute_sum_odo_consumer_state); + ++pipeline_load_compute_sum_odo_consumer_state; + + iter_count -= 1; + iter_index += 1; 
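+        // Per-tile math recap for the loop body above (standard FMHA-backward identities;
+        // the sLSE / sSumOdO conventions are assumed from the preprocess step):
+        //   P_ij  = exp(softmax_scale * S_ij - LSE_i), evaluated as exp2f(fma(softmax_scale * log2e, S_ij, lse_i)),
+        //           where the staged lse_i is assumed to already fold in -LSE_i * log2(e).
+        //   dS_ij = P_ij * (dP_ij - sum_d dO_id * O_id); sum_OdO arrives pre-negated, hence the cute::add.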
+ } + + epilogue( + blk_coord, problem_shape, mainloop_args, epilogue_args, + pipeline_mma_compute_dkdv, pipeline_mma_compute_dkdv_consumer_state + ); + } + + template + CUTLASS_DEVICE void reduce( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + int iter_index, + int iter_count, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineMmaReduceDQ& pipeline_mma_reduce_dq, + typename PipelineMmaReduceDQ::PipelineState& pipeline_mma_reduce_dq_consumer_state, + PipelineReduceTmaStore& pipeline_reduce_tma_store, + typename PipelineReduceTmaStore::PipelineState& pipeline_reduce_tma_store_producer_state) { + + using X = Underscore; + + auto [Q, K, D, HB] = problem_shape; + + auto [blk_coord_q, blk_coord_k, blk_coord_batch] = blk_coord; + + // must match TileShapeDQ + auto load_op = SM100_TMEM_LOAD_32dp32b32x{}; + + auto tDQtDQ = partition_fragment_C(TiledMmaDSK{}, select<0,1>(TileShapeDSK{}))(make_coord(_,_),_0{},_0{}); + tDQtDQ.data() = TmemAllocation::kDQ; + + Tensor mDQ = mainloop_params.tma_red_dq.get_tma_tensor(make_shape(Q, D, HB)); + auto gDQ = local_tile(mDQ, TileShapeKQ{}, make_coord(_,_,_), Step<_1, _1, X>{}) + (_, _, _, _0{}, blk_coord_batch); + + Tensor cDQ = make_identity_tensor(take<0,2>(TileShapeDSK{})); + + Tensor sDQ = make_tensor(make_smem_ptr(shared_tensors.smem_dq.begin()), SmemLayoutDQ{}); + + int thread_idx = threadIdx.x % (kNumComputeWarps * NumThreadsPerWarp); + auto tiled_t2r = make_tmem_copy(load_op, tDQtDQ); + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + + Tensor tTR_cDQ = thread_t2r.partition_D(cDQ); + Tensor tTR_gDQ = thread_t2r.partition_D(gDQ); + Tensor tTR_sDQ = thread_t2r.partition_D(sDQ); + Tensor tTR_tDQ = thread_t2r.partition_S(tDQtDQ); + + auto block_tma = mainloop_params.tma_red_dq.get_slice(_0{}); + + Tensor tDQsDQ = block_tma.partition_S(sDQ); + Tensor tDQcDQ = block_tma.partition_S(cDQ); + Tensor tDQgDQ = block_tma.partition_D(gDQ); + + int lane_predicate = (threadIdx.x % (kNumReduceWarps * NumThreadsPerWarp)) == 0; + + while (iter_count > 0) { + pipeline_mma_reduce_dq.consumer_wait(pipeline_mma_reduce_dq_consumer_state); + + Tensor tTR_rDQ = make_tensor(shape(tTR_cDQ)); + + // load dQ from tmem to rmem + cute::copy(tiled_t2r, tTR_tDQ, tTR_rDQ); + + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_reduce_dq.consumer_release(pipeline_mma_reduce_dq_consumer_state); + ++pipeline_mma_reduce_dq_consumer_state; + + // we don't have enough smem to dump it all to smem, so we do it in stages + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size<2>(tTR_cDQ); i++) { + if (lane_predicate) { + pipeline_reduce_tma_store.producer_acquire(pipeline_reduce_tma_store_producer_state); + } + // wait in all threads for the acquire to complete + cutlass::arch::NamedBarrier( + kNumReduceWarps * NumThreadsPerWarp, + cutlass::arch::ReservedNamedBarriers::TransposeBarrier + ).arrive_and_wait(); + + cute::copy(tTR_rDQ(_, _, i), tTR_sDQ(_, _, _0{}, pipeline_reduce_tma_store_producer_state.index())); + + // wait for the stores to all be visible to the TMA + cutlass::arch::fence_view_async_shared(); + cutlass::arch::NamedBarrier( + kNumReduceWarps * NumThreadsPerWarp, + cutlass::arch::ReservedNamedBarriers::TransposeBarrier + ).arrive_and_wait(); + if (lane_predicate) { + // launch tma store + copy(mainloop_params.tma_red_dq, tDQsDQ(_,_,_0{}, pipeline_reduce_tma_store_producer_state.index()), tDQgDQ(_,_,i,iter_index)); + 
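+              // tma_red_dq performs a reduce-add into gDQ, since every K-tile CTA contributes a partial dQ
+              // for the same Q rows; the commit below lets this smem stage be reacquired once the store drains.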
pipeline_reduce_tma_store.producer_commit(pipeline_reduce_tma_store_producer_state); + } + + ++pipeline_reduce_tma_store_producer_state; + } + + iter_count -= 1; + iter_index += 1; + } + } + + + CUTLASS_DEVICE void operator()(Params const& params, char* smem) { + int warp_idx = cutlass::canonical_warp_idx_sync(); + auto role = warp_idx_to_role(warp_idx); + uint32_t lane_predicate = cute::elect_one_sync(); + + if (role == WarpRole::Load && lane_predicate) { + prefetch_tma_descriptor(params.mainloop_params.tma_load_q.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_k.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_v.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_do.get_tma_descriptor()); + } + + SharedStorage& shared_storage = *reinterpret_cast(smem); + + int initializing_warp = 0; + typename PipelineLoadMmaQ::Params pipeline_load_mma_q_params; + if (role == WarpRole::Load) { + pipeline_load_mma_q_params.role = PipelineLoadMmaQ::ThreadCategory::Producer; + } + if (role == WarpRole::Mma) { + pipeline_load_mma_q_params.role = PipelineLoadMmaQ::ThreadCategory::Consumer; + } + pipeline_load_mma_q_params.is_leader = lane_predicate && (role == WarpRole::Load); + // Also loads K in the first iteration + pipeline_load_mma_q_params.transaction_bytes = kTransactionsBytesLoadQ; + pipeline_load_mma_q_params.initializing_warp = initializing_warp++; + PipelineLoadMmaQ pipeline_load_mma_q(shared_storage.pipelines.load_mma_q, pipeline_load_mma_q_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineLoadMmaDO::Params pipeline_load_mma_do_params; + if (role == WarpRole::Load) { + pipeline_load_mma_do_params.role = PipelineLoadMmaDO::ThreadCategory::Producer; + } + if (role == WarpRole::Mma) { + pipeline_load_mma_do_params.role = PipelineLoadMmaDO::ThreadCategory::Consumer; + } + pipeline_load_mma_do_params.is_leader = lane_predicate && (role == WarpRole::Load); + // Also loads V in the first iteration + pipeline_load_mma_do_params.transaction_bytes = kTransactionsBytesLoadDO; + pipeline_load_mma_do_params.initializing_warp = initializing_warp++; + PipelineLoadMmaDO pipeline_load_mma_do(shared_storage.pipelines.load_mma_do, pipeline_load_mma_do_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineLoadComputeLSE::Params pipeline_load_compute_lse_params; + if (role == WarpRole::Load) { + pipeline_load_compute_lse_params.role = PipelineLoadComputeLSE::ThreadCategory::Producer; + } + if (role == WarpRole::Compute) { + pipeline_load_compute_lse_params.role = PipelineLoadComputeLSE::ThreadCategory::Consumer; + } + pipeline_load_compute_lse_params.producer_arv_count = NumThreadsPerWarp; + pipeline_load_compute_lse_params.consumer_arv_count = kNumComputeWarps * NumThreadsPerWarp; + pipeline_load_compute_lse_params.initializing_warp = initializing_warp++; + PipelineLoadComputeLSE pipeline_load_compute_lse( + shared_storage.pipelines.load_compute_lse, + pipeline_load_compute_lse_params, + /*barrier init*/ cute::true_type{}); + + typename PipelineLoadComputeSumOdO::Params pipeline_load_compute_sum_odo_params; + if (role == WarpRole::Load) { + pipeline_load_compute_sum_odo_params.role = PipelineLoadComputeSumOdO::ThreadCategory::Producer; + } + if (role == WarpRole::Compute) { + pipeline_load_compute_sum_odo_params.role = PipelineLoadComputeSumOdO::ThreadCategory::Consumer; + } + 
pipeline_load_compute_sum_odo_params.producer_arv_count = NumThreadsPerWarp; + pipeline_load_compute_sum_odo_params.consumer_arv_count = kNumComputeWarps * NumThreadsPerWarp; + pipeline_load_compute_sum_odo_params.initializing_warp = initializing_warp++; + PipelineLoadComputeSumOdO pipeline_load_compute_sum_odo( + shared_storage.pipelines.load_compute_sum_odo, + pipeline_load_compute_sum_odo_params, + /*barrier init*/ cute::true_type{}); + + typename PipelineMmaComputeS::Params pipeline_mma_compute_s_params; + if (role == WarpRole::Mma) { + pipeline_mma_compute_s_params.role = PipelineMmaComputeS::ThreadCategory::Producer; + } + if (role == WarpRole::Compute) { + pipeline_mma_compute_s_params.role = PipelineMmaComputeS::ThreadCategory::Consumer; + } + pipeline_mma_compute_s_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp; + pipeline_mma_compute_s_params.initializing_warp = initializing_warp++; + PipelineMmaComputeS pipeline_mma_compute_s( + shared_storage.pipelines.mma_compute_s, + pipeline_mma_compute_s_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineMmaComputeDP::Params pipeline_mma_compute_dp_params; + if (role == WarpRole::Mma) { + pipeline_mma_compute_dp_params.role = PipelineMmaComputeDP::ThreadCategory::Producer; + } + if (role == WarpRole::Compute) { + pipeline_mma_compute_dp_params.role = PipelineMmaComputeDP::ThreadCategory::Consumer; + } + pipeline_mma_compute_dp_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp; + pipeline_mma_compute_dp_params.initializing_warp = initializing_warp++; + PipelineMmaComputeDP pipeline_mma_compute_dp( + shared_storage.pipelines.mma_compute_dp, + pipeline_mma_compute_dp_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineMmaReduceDQ::Params pipeline_mma_reduce_dq_params; + if (role == WarpRole::Mma) { + pipeline_mma_reduce_dq_params.role = PipelineMmaReduceDQ::ThreadCategory::Producer; + } + if (role == WarpRole::Reduce) { + pipeline_mma_reduce_dq_params.role = PipelineMmaReduceDQ::ThreadCategory::Consumer; + } + pipeline_mma_reduce_dq_params.consumer_arv_count = kNumReduceWarps * cutlass::NumThreadsPerWarp; + pipeline_mma_reduce_dq_params.initializing_warp = initializing_warp++; + PipelineMmaReduceDQ pipeline_mma_reduce_dq( + shared_storage.pipelines.mma_reduce_dq, + pipeline_mma_reduce_dq_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineComputeMmaP::Params pipeline_compute_mma_p_params; + if (role == WarpRole::Mma) { + pipeline_compute_mma_p_params.role = PipelineComputeMmaP::ThreadCategory::Consumer; + } + if (role == WarpRole::Compute) { + pipeline_compute_mma_p_params.role = PipelineComputeMmaP::ThreadCategory::Producer; + } + pipeline_compute_mma_p_params.producer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp; + pipeline_compute_mma_p_params.consumer_arv_count = 1; + pipeline_compute_mma_p_params.initializing_warp = initializing_warp++; + PipelineComputeMmaP pipeline_compute_mma_p( + shared_storage.pipelines.compute_mma_p, + pipeline_compute_mma_p_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineComputeMmaDS::Params pipeline_compute_mma_ds_params; + if (role == WarpRole::Mma) { + pipeline_compute_mma_ds_params.role = PipelineComputeMmaDS::ThreadCategory::Consumer; + } + if (role == WarpRole::Compute) { + 
pipeline_compute_mma_ds_params.role = PipelineComputeMmaDS::ThreadCategory::Producer; + } + pipeline_compute_mma_ds_params.producer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp; + pipeline_compute_mma_ds_params.consumer_arv_count = 1; + pipeline_compute_mma_ds_params.initializing_warp = initializing_warp++; + PipelineComputeMmaDS pipeline_compute_mma_ds( + shared_storage.pipelines.compute_mma_ds, + pipeline_compute_mma_ds_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineMmaComputeDKDV::Params pipeline_mma_compute_dkdv_params; + if (role == WarpRole::Mma) { + pipeline_mma_compute_dkdv_params.role = PipelineMmaComputeDKDV::ThreadCategory::Producer; + } + if (role == WarpRole::Compute) { + pipeline_mma_compute_dkdv_params.role = PipelineMmaComputeDKDV::ThreadCategory::Consumer; + } + pipeline_mma_compute_dkdv_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp; + pipeline_mma_compute_dkdv_params.initializing_warp = initializing_warp++; + PipelineMmaComputeDKDV pipeline_mma_compute_dkdv( + shared_storage.pipelines.mma_compute_dkdv, + pipeline_mma_compute_dkdv_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + PipelineReduceTmaStore pipeline_reduce_tma_store; + + TmemAllocator tmem_allocator; + + pipeline_init_arrive_relaxed(size(ClusterShape{})); + + pipeline_load_mma_q.init_masks(ClusterShape{}); + pipeline_load_mma_do.init_masks(ClusterShape{}); + pipeline_mma_compute_s.init_masks(ClusterShape{}); + pipeline_mma_compute_dp.init_masks(ClusterShape{}); + pipeline_mma_reduce_dq.init_masks(ClusterShape{}); + pipeline_compute_mma_p.init_masks(ClusterShape{}); + pipeline_compute_mma_ds.init_masks(ClusterShape{}); + pipeline_mma_compute_dkdv.init_masks(ClusterShape{}); + + typename decltype(pipeline_load_mma_q)::PipelineState pipeline_load_mma_q_consumer_state; + typename decltype(pipeline_load_mma_do)::PipelineState pipeline_load_mma_do_consumer_state; + typename decltype(pipeline_load_compute_lse)::PipelineState pipeline_load_compute_lse_consumer_state; + typename decltype(pipeline_load_compute_sum_odo)::PipelineState pipeline_load_compute_sum_odo_consumer_state; + typename decltype(pipeline_mma_compute_s)::PipelineState pipeline_mma_compute_s_consumer_state; + typename decltype(pipeline_mma_compute_dp)::PipelineState pipeline_mma_compute_dp_consumer_state; + typename decltype(pipeline_mma_reduce_dq)::PipelineState pipeline_mma_reduce_dq_consumer_state; + typename decltype(pipeline_compute_mma_p)::PipelineState pipeline_compute_mma_p_consumer_state; + typename decltype(pipeline_compute_mma_ds)::PipelineState pipeline_compute_mma_ds_consumer_state; + typename decltype(pipeline_mma_compute_dkdv)::PipelineState pipeline_mma_compute_dkdv_consumer_state; + + auto pipeline_load_mma_q_producer_state = make_producer_start_state(); + auto pipeline_load_mma_do_producer_state = make_producer_start_state(); + auto pipeline_load_compute_lse_producer_state = make_producer_start_state(); + auto pipeline_load_compute_sum_odo_producer_state = make_producer_start_state(); + auto pipeline_mma_compute_s_producer_state = make_producer_start_state(); + auto pipeline_mma_compute_dp_producer_state = make_producer_start_state(); + auto pipeline_mma_reduce_dq_producer_state = make_producer_start_state(); + auto pipeline_compute_mma_p_producer_state = make_producer_start_state(); + auto pipeline_compute_mma_ds_producer_state = make_producer_start_state(); + auto 
pipeline_mma_compute_dkdv_producer_state = make_producer_start_state(); + auto pipeline_reduce_tma_store_producer_state = make_producer_start_state(); + + pipeline_init_wait(size(ClusterShape{})); + + auto blk_coord = make_coord(_0{}, blockIdx.x, make_coord(blockIdx.y, blockIdx.z)); + auto problem_shape = params.problem_shape; + int iter_count = ceil_div(get<0>(problem_shape), TileShapeQ{}); + int iter_start = 0; + if constexpr (std::is_base_of_v) { + iter_start = (get<1>(blk_coord) * TileShapeK{}) / TileShapeQ{}; + } + iter_count -= iter_start; + + if (role == WarpRole::Load) { + warpgroup_reg_set(); + + load( + blk_coord, + problem_shape, + iter_start, + iter_count, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_mma_q, pipeline_load_mma_q_producer_state, + pipeline_load_mma_do, pipeline_load_mma_do_producer_state, + pipeline_load_compute_lse, pipeline_load_compute_lse_producer_state, + pipeline_load_compute_sum_odo, pipeline_load_compute_sum_odo_producer_state + ); + + } + else if (role == WarpRole::Mma) { + warpgroup_reg_set(); + + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + __syncwarp(); + + mma( + blk_coord, + problem_shape, + iter_start, + iter_count, + params.mainloop, + shared_storage.tensors, + pipeline_load_mma_q, pipeline_load_mma_q_consumer_state, + pipeline_load_mma_do, pipeline_load_mma_do_consumer_state, + pipeline_mma_compute_s, pipeline_mma_compute_s_producer_state, + pipeline_mma_compute_dp, pipeline_mma_compute_dp_producer_state, + pipeline_mma_reduce_dq, pipeline_mma_reduce_dq_producer_state, + pipeline_compute_mma_p, pipeline_compute_mma_p_consumer_state, + pipeline_compute_mma_ds, pipeline_compute_mma_ds_consumer_state, + pipeline_mma_compute_dkdv, pipeline_mma_compute_dkdv_producer_state + ); + + } + else if (role == WarpRole::Compute) { + warpgroup_reg_set(); + + compute( + blk_coord, + problem_shape, + iter_start, + iter_count, + params.mainloop, + params.epilogue, + shared_storage.tensors, + pipeline_load_compute_lse, pipeline_load_compute_lse_consumer_state, + pipeline_load_compute_sum_odo, pipeline_load_compute_sum_odo_consumer_state, + pipeline_mma_compute_s, pipeline_mma_compute_s_consumer_state, + pipeline_mma_compute_dp, pipeline_mma_compute_dp_consumer_state, + pipeline_compute_mma_p, pipeline_compute_mma_p_producer_state, + pipeline_compute_mma_ds, pipeline_compute_mma_ds_producer_state, + pipeline_mma_compute_dkdv, pipeline_mma_compute_dkdv_consumer_state + ); + + cutlass::arch::NamedBarrier( + kNumComputeWarps * NumThreadsPerWarp, + cutlass::arch::ReservedNamedBarriers::EpilogueBarrier + ).arrive_and_wait(); + + if (warp_idx % kNumComputeWarps == 0) { + uint32_t free_stage_ptr = shared_storage.tmem_base_ptr; + tmem_allocator.free(free_stage_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + + } + else if (role == WarpRole::Reduce) { + warpgroup_reg_set(); + + reduce( + blk_coord, + problem_shape, + iter_start, + iter_count, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_mma_reduce_dq, pipeline_mma_reduce_dq_consumer_state, + pipeline_reduce_tma_store, pipeline_reduce_tma_store_producer_state + ); + + pipeline_reduce_tma_store.producer_tail(pipeline_reduce_tma_store_producer_state); + } + else { + warpgroup_reg_set(); + + /* no-op */ + + } + } + + static dim3 get_block_shape() { + dim3 block(MaxThreadsPerBlock, 1, 1); + return block; + } + + static dim3 get_grid_shape(Params const& params) { + auto [Q, K, D, HB] = 
params.problem_shape; + auto [H, B] = HB; + dim3 grid(ceil_div(K, TileShapeK{}), H, B); + return grid; + } +}; + +} // namespace cutlass::fmha::kernel diff --git a/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_reduction.hpp b/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_reduction.hpp new file mode 100644 index 0000000000..c6a0575013 --- /dev/null +++ b/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_reduction.hpp @@ -0,0 +1,197 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/arch/arch.h" +#include "cute/tensor.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; +template< + class ElementOut, + class ElementAcc, + class ElementScale, + size_t kNumHeads, + size_t kHeadDimLatent, + int kMaxSplits +> +struct Sm100FmhaMlaReductionKernel { + + static const int SharedStorageSize = 0; + static const int MaxThreadsPerBlock = 128; + static const int MinBlocksPerMultiprocessor = 1; + + using ArchTag = cutlass::arch::Sm100; + + static_assert(kHeadDimLatent % MaxThreadsPerBlock == 0); + struct Arguments { + ElementAcc* ptr_oaccum = nullptr; + ElementOut* ptr_o = nullptr; + ElementAcc* ptr_lseaccum = nullptr; + ElementAcc* ptr_lse = nullptr; + ElementScale scale = 1.f; + int num_batches = 0; + int split_kv = -1; + int dim_k = -1; + int* ptr_seq = nullptr; + int* ptr_split_kv = nullptr; + int tile_shape_s = 128; + }; + using Params = Arguments; + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + return {args.ptr_oaccum, args.ptr_o, args.ptr_lseaccum, args.ptr_lse, + args.scale, args.num_batches, args.split_kv, args.dim_k, args.ptr_seq, + args.ptr_split_kv, args.tile_shape_s}; + } + + static size_t get_workspace_size(Arguments const& /*args*/) { + return 0; + } + + static Status initialize_workspace( + Arguments const& /*args*/, void* /*ws*/, cudaStream_t /*stream*/) { + return Status::kSuccess; + } + + static dim3 get_grid_shape(Params const& params) { + return dim3(kNumHeads, 1, params.num_batches); + } + + static dim3 get_block_shape() { + return dim3(MaxThreadsPerBlock, 1, 1); + } + + static bool can_implement(Arguments const& args) { + if (args.num_batches <= 0) return false; + if (args.split_kv <= 0) return false; + return true; + } + + CUTLASS_DEVICE void operator() (Params const& params, char* smem_raw) { + if (params.split_kv <= 1) return; + auto blk_coord = make_coord(blockIdx.x, _0{}, blockIdx.z); + + __shared__ ElementAcc sLseScale[kMaxSplits]; + const size_t offset_lseaccum = get<0>(blk_coord) + kNumHeads * params.split_kv * get<2>(blk_coord); + const size_t offset_lse = get<0>(blk_coord) + kNumHeads * get<2>(blk_coord); + + Tensor gLSEaccum = make_tensor(make_gmem_ptr(params.ptr_lseaccum + offset_lseaccum), + make_shape(params.split_kv), Stride>{}); + + Tensor gLSE = make_tensor(make_gmem_ptr(params.ptr_lse + offset_lse), + Shape<_1>{}, Stride<_1>{}); + + auto dim_k = params.ptr_seq == nullptr ? params.dim_k : params.ptr_seq[get<2>(blk_coord)]; + auto local_split_kv = params.ptr_split_kv == nullptr ? params.split_kv : params.ptr_split_kv[get<2>(blk_coord)]; + auto k_tile_total = ceil_div(dim_k, params.tile_shape_s); + auto k_tile_per_cta = ceil_div(k_tile_total, local_split_kv); + local_split_kv = ceil_div(k_tile_total, k_tile_per_cta); + + int warp_idx = cutlass::canonical_warp_idx_sync(); + if (warp_idx == 0) { + constexpr int kNLsePerThread = cute::ceil_div(kMaxSplits, 32); + + ElementAcc local_lse[kNLsePerThread]; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + const int split = i * 32 + threadIdx.x; + local_lse[i] = split < local_split_kv ? 
gLSEaccum(split) : -std::numeric_limits::infinity(); + } + + ElementAcc lse_max = -std::numeric_limits::infinity(); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + lse_max = max(lse_max, local_lse[i]); + } + CUTLASS_PRAGMA_UNROLL + for (int offset = 16; offset >= 1; offset /= 2) { + lse_max = max(lse_max, __shfl_xor_sync(0xffffffff, lse_max, offset)); + } + lse_max = lse_max == -std::numeric_limits::infinity() ? 0.0f : lse_max; // In case all local LSEs are -inf + lse_max = __shfl_sync(0xffffffff, lse_max, 0); + + ElementAcc sum_lse = 0; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + sum_lse = sum_lse + expf(local_lse[i] - params.scale * lse_max); + } + + CUTLASS_PRAGMA_UNROLL + for (int offset = 16; offset >= 1; offset /= 2) { + sum_lse = sum_lse + __shfl_xor_sync(0xffffffff, sum_lse, offset); + } + + sum_lse = __shfl_sync(0xffffffff, sum_lse, 0); + + ElementAcc global_lse = (sum_lse == 0.f || sum_lse != sum_lse) ? std::numeric_limits::infinity() : logf(sum_lse) + params.scale * lse_max; + if (threadIdx.x == 0 and params.ptr_lse != nullptr) { + gLSE(0) = global_lse; + } + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < kNLsePerThread; ++i) { + const int split = i * 32 + threadIdx.x; + if (split < local_split_kv) { + sLseScale[split] = expf(local_lse[i] - global_lse); + } + } + } + __syncthreads(); + + constexpr int Elements = kHeadDimLatent / MaxThreadsPerBlock; + const size_t offset_oaccum = kHeadDimLatent * params.split_kv * (get<0>(blk_coord) + kNumHeads * get<2>(blk_coord)); + Tensor gOaccum = make_tensor(make_gmem_ptr(params.ptr_oaccum + offset_oaccum), + Shape>{}, Stride<_1>{}); + ElementAcc local_val[Elements] = {0}; + for (int split = 0; split < local_split_kv; ++split) { + ElementAcc lse_scale = sLseScale[split]; + CUTLASS_PRAGMA_UNROLL + for(int i = 0; i < Elements; ++i) { + local_val[i] += lse_scale * gOaccum(threadIdx.x + MaxThreadsPerBlock * i); + } + gOaccum.data() = gOaccum.data() + kHeadDimLatent; + } + auto ptr_o_local = params.ptr_o + (get<0>(blk_coord) + get<2>(blk_coord) * kNumHeads) * kHeadDimLatent; + Tensor gO = make_tensor(make_gmem_ptr(ptr_o_local), Shape>{}, Stride<_1>{}); + + CUTLASS_PRAGMA_UNROLL + for(int i = 0; i < Elements; ++i) { + gO(threadIdx.x + MaxThreadsPerBlock * i) = static_cast(local_val[i]); + } + } +}; + +} // namespace cutlass::fmha::kernel diff --git a/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp b/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp new file mode 100644 index 0000000000..acb89a9def --- /dev/null +++ b/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp @@ -0,0 +1,2018 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. 
Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + + +#pragma once + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cute/arch/simd_sm100.hpp" + +#include "cutlass/arch/arch.h" +#include "cutlass/arch/memory_sm80.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/collective/collective_builder.hpp" + +#include "gather_tensor.hpp" // from examples/common +#include "common/pow_2.hpp" + +namespace cutlass::fmha::kernel { + +using namespace cute; + +template< + class TileShape, + class Element_, + class ElementAcc_, + class ElementOut_, + class ElementLSE_, + class TileScheduler, +#ifdef CPASYNC + bool kIsCpAsync = true +#else + bool kIsCpAsync = false +#endif +> +struct Sm100FmhaMlaKernelTmaWarpspecialized { + + using Element = Element_; + using ElementAcc = ElementAcc_; + using ElementOut = ElementOut_; + using ElementLSE = ElementLSE_; + + // only 2Sm mode is supported + static const bool kIs2Sm = true; + static const int MaxThreadsPerBlock = 256; + static const int MinBlocksPerMultiprocessor = 1; + static const int TotalSNum = 2; + static const int TotalPNum = 2; + using ArchTag = cutlass::arch::Sm100; + + using ClusterShape = cute::conditional_t, Shape<_1, _1, _1>>; + + using TileShapeH = tuple_element_t<0, TileShape>; + using TileShapeS = tuple_element_t<1, TileShape>; + using TileShapeD = tuple_element_t<2, TileShape>; + + using TileShapeL = tuple_element_t<0, TileShapeD>; + using TileShapeR = tuple_element_t<1, TileShapeD>; + static_assert(TileShapeL{} % TileShapeR{} == 0, "Rope head dim must divide latent head dim"); + + using ProblemShape = Shape; + using TensorStride = Stride; + using TmemAllocator = cute::conditional_t; + + static_assert(TileShapeH{} == 128); + static const int kWarpsInN = kIs2Sm ? 2 : 1; + + static const int kNumComputeWarps = 4; + static const int kNumLoadWarps = kIsCpAsync ? 2 : 1; + + enum class WarpRole { + kMma = 0x1, kLoad = 0x2, kCompute = 0x3, kLoadPageTable = 0x4, kEmpty=0x0 + }; + + static const long long unsigned int kWarpAssignment = kIsCpAsync ? 
0x4221'3333ull : 0x0021'3333ull; + + static CUTLASS_DEVICE WarpRole warp_idx_to_role(int warp_idx) { + return static_cast((kWarpAssignment >> (4 * warp_idx)) & 0xF); + } + + static const int Alignment = 128 / sizeof_bits_v; + static const int AlignmentOut = 128 / sizeof_bits_v; + + using TileShapeQK = Shape; + static const int StagesQK = 24 / sizeof(Element); // free parameter + static const int IterationsQKLatent = decltype(TileShapeL{} / get<2>(TileShapeQK{}))::value; + static const int IterationsQKRope = decltype(TileShapeR{} / get<2>(TileShapeQK{}))::value; + static const int IterationsQK = IterationsQKLatent + IterationsQKRope; + + using Schedule = cute::conditional_t; + using CollectiveMmaQK = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStride, Alignment, + Element, TensorStride, Alignment, + ElementAcc, + TileShapeQK, ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using TiledMmaQK = typename CollectiveMmaQK::TiledMma; + using CtaShapeQK = typename CollectiveMmaQK::CtaShape_MNK; + + // chosen for unified smem staging between K and V + using TileShapePV = Shape; + using TransposeTensorStride = decltype(select<1,0,2>(TensorStride{})); + static const int StagesPV = StagesQK; // not sure why, but must be at least two. check pipes + static const int IterationsPV_K = decltype(TileShapeS{} / get<2>(TileShapePV{}))::value; + static const int IterationsPV_N = decltype(TileShapeL{} / get<1>(TileShapePV{}))::value; + + using CollectiveMmaPV = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + Element, TensorStride, Alignment, + Element, TransposeTensorStride, Alignment, + ElementAcc, + TileShapePV, ClusterShape, cutlass::gemm::collective::StageCount, + Schedule>::CollectiveOp; + using CtaShapePV = typename CollectiveMmaPV::CtaShape_MNK; + static_assert(std::is_same_v); + + using TiledMmaPV = typename CollectiveMmaPV::TiledMma; + + using AtomThrShapeMNK = typename CollectiveMmaQK::AtomThrShapeMNK; + static_assert(typename CollectiveMmaQK::AtomThrShapeMNK{} == typename CollectiveMmaPV::AtomThrShapeMNK{}, "schedule must match"); + + static const int StagesPageTable = kIsCpAsync ? 
StagesPV : 1; + + // pipelines from load to mma, PipelineTmaUmmaAsync, stages tbd + // use expect_tx for Q load + using PipelineLoadQK = cute::conditional_t, PipelineTmaUmmaAsync>; + using PipelineLoadPV = PipelineLoadQK; + // pipeline from mma (Q@K) to softmax, PipelineUmmaAsync, 2 stages + using PipelineS = PipelineUmmaAsync; + // pipeline from softmax (P) to mma (bmm2), PipelineUmmaAsync, 2 stages + using PipelineP = PipelineUmmaConsumerAsync; + // pipeline from mma to softmax (for rescale), PipelineUmmaAsync, 1 stage + using PipelineO = PipelineUmmaAsync<1, AtomThrShapeMNK>; + + using PipelinePT = PipelineAsync; + + struct PipelineStorage { + alignas(16) typename PipelineLoadQK::SharedStorage load_qk; + alignas(16) typename PipelineS::SharedStorage mma_s; + alignas(16) typename PipelineP::SharedStorage p_mma; + alignas(16) typename PipelineO::SharedStorage mma_o; + alignas(16) typename PipelinePT::SharedStorage load_page_table; + }; + + template + static CUTE_DEVICE constexpr auto unstageSmemLayout(Layout const& layout, Stages stages = {}) { + return composition(layout, make_tuple(_, _, _, make_layout(stages))); + } + + using SmemLayoutQ = decltype(unstageSmemLayout(typename CollectiveMmaQK::SmemLayoutA{}, Int{})); + using SmemLayoutKC = typename CollectiveMmaQK::SmemLayoutB; + using SmemLayoutVC = typename CollectiveMmaPV::SmemLayoutB; + using SmemLayoutP = decltype(unstageSmemLayout(typename CollectiveMmaPV::SmemLayoutA{}, make_shape(Int{}, _2{}))); + + static const int kBytesLoadQ = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutQ{})) * cute::sizeof_bits_v); + static const int kBytesLoadKC = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutKC{})) * cute::sizeof_bits_v); + static const int kBytesLoadVC = size(AtomThrShapeMNK{}) * cutlass::bits_to_bytes(cosize(take<0,3>(SmemLayoutVC{})) * cute::sizeof_bits_v); + // pre-condition for overlapped smem staging + static_assert(kBytesLoadKC == kBytesLoadVC); + static_assert(StagesQK == StagesPV); + + static const int kTransactionsBytesLoadQK = kBytesLoadKC; + static const int kTransactionsBytesLoadExtraQ = kBytesLoadQ; + static const int kTransactionsBytesLoadPV = kBytesLoadVC; + + static const int kNamedBarrierExchange = (int) cutlass::arch::ReservedNamedBarriers::TransformBarrier; + // This Named Barrier is introduced to solve Q tile loading overwritten issue when enable persistent + // tile scheduler for FP8 MLA. 
+ static const int kNamedBarrierEpilogue = (int) cutlass::arch::ReservedNamedBarriers::EpilogueBarrier; + // + static const int kNamedBarrierTmemDealloc = (int) cutlass::arch::ReservedNamedBarriers::TmemAllocBarrier; + + enum class TmemAllocation : uint32_t { + kSizeS = TileShapeS::value / kWarpsInN, + // Overall + kSizeO = TileShapeL::value / kWarpsInN, + // Between accumulators we loop over + kSizeAccO = decltype(get<1>(TileShapePV{}))::value / kWarpsInN, + kNumS = TotalSNum, + kNumP = TotalPNum, + kNumO = 1, + kS0 = 0, + kS1 = kS0 + kSizeS, + kO0 = kS1 + kSizeS, + kTotal = kO0 + kSizeO + }; + + static_assert(static_cast(TmemAllocation::kTotal) <= TmemAllocator::Sm100TmemCapacityColumns, "using too much tmem"); + + struct TensorStorage { + // to communicate max and row_sum + cute::array smem_exchange; + cute::array smem_page_table; + alignas(2048) cute::array> smem_q; + union { + alignas(2048) cute::array> smem_kc; + alignas(2048) cute::array> smem_vc; + }; + alignas(2048) cute::array> smem_p; + }; + + struct SharedStorage { + PipelineStorage pipelines; + TensorStorage tensors; + uint32_t tmem_base_ptr; + }; + + static const int SharedStorageSize = sizeof(SharedStorage); + static_assert(SharedStorageSize <= cutlass::arch::sm100_smem_capacity_bytes, "using too much smem"); + + struct MainloopArguments { + ElementAcc softmax_scale; + + // all tensors strides are (num_heads or seqlen, head_dim, batch) + // head_dim stride is always 1 + Element* ptr_q_latent; + TensorStride stride_q_latent; + Element* ptr_q_rope; + TensorStride stride_q_rope; + + Element* ptr_c_latent; + TensorStride stride_c_latent; + Element* ptr_k_rope; + TensorStride stride_k_rope; + + // for paged attention, we interpret what was previously [batch, seqlen] + // as [page_count, page_size], and index according to page_table + int* ptr_seq = nullptr; + int* ptr_page_table = nullptr; + // page table is [batch, seqlen or similar] + Stride<_1, int> stride_page_table = {}; + int page_count = 0; + int page_size = TileShapeS{}; // powers of two if kIsCpAsync, otherwise TileShapeS + }; + + struct EpilogueArguments { + ElementOut* ptr_o = nullptr; + TensorStride stride_o; + ElementLSE* ptr_lse = nullptr; + Stride<_1, int> stride_lse; + ElementAcc output_scale = 1.0f; + }; + + struct Arguments { + // (num_heads=128, seqlen, (d_latent=512, d_rope=64), batch_count) + // for paged attention, seqlen is max seqlen + ProblemShape problem_shape; + MainloopArguments mainloop; + EpilogueArguments epilogue; + KernelHardwareInfo hw_info; + int split_kv = -1; + int* ptr_split_kv = nullptr; + }; + + using TmaLoadQLatent = typename CollectiveMmaQK::Params::TMA_A; + using TmaLoadQRope = typename CollectiveMmaQK::Params::TMA_A; + using TmaLoadCLatent = typename CollectiveMmaQK::Params::TMA_B; + using TmaLoadKRope = typename CollectiveMmaQK::Params::TMA_B; + using TmaLoadCLatentTranspose = typename CollectiveMmaPV::Params::TMA_B; + + struct MainloopParams { + TmaLoadQLatent tma_load_q_latent; + TmaLoadQRope tma_load_q_rope; + TmaLoadCLatent tma_load_c_latent; + TmaLoadKRope tma_load_k_rope; + TmaLoadCLatentTranspose tma_load_c_latent_transpose; + }; + + struct EpilogueParams { + ElementOut* ptr_o = nullptr; + ElementAcc* ptr_o_acc = nullptr; + TensorStride stride_o; + TensorStride stride_o_acc; + ElementLSE* ptr_lse = nullptr; + ElementLSE* ptr_lse_acc = nullptr; + Stride<_1, int> stride_lse; + Stride<_1, int> stride_lse_acc; + ElementAcc output_scale = 1.0f; + }; + + struct Params { + ProblemShape problem_shape; + MainloopArguments mainloop; + 
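+    // Unlike EpilogueArguments, EpilogueParams also carries the split-KV accumulator views
+    // (ptr_o_acc / ptr_lse_acc and their strides) that to_underlying_arguments points into the
+    // workspace when split_kv > 1; they are consumed by the separate MLA reduction kernel.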
EpilogueParams epilogue; + MainloopParams mainloop_params; + typename TileScheduler::Params tile_scheduler; + int split_kv = -1; + int* ptr_split_kv = nullptr; + }; + + static Params to_underlying_arguments(Arguments const& args, void* workspace) { + //workspace = nullptr; // let's get an error if one of these needs workspace + + auto [H, K, D, B] = args.problem_shape; + auto [L, R] = D; + + int paged_B = B; + int paged_K = K; + if (args.mainloop.ptr_page_table != nullptr) { + paged_B = args.mainloop.page_count; + paged_K = args.mainloop.page_size; + } + + auto params_qk_latent = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, K, L, B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, + args.mainloop.ptr_c_latent, args.mainloop.stride_c_latent, + }, nullptr); + + auto params_qk_latent_paged = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, paged_K, L, paged_B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, + args.mainloop.ptr_c_latent, args.mainloop.stride_c_latent, + }, nullptr); + + auto params_qk_rope = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, K, R, B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_rope, args.mainloop.stride_q_rope, + args.mainloop.ptr_k_rope, args.mainloop.stride_k_rope, + }, nullptr); + + auto params_qk_rope_paged = CollectiveMmaQK::to_underlying_arguments( + make_shape(H, paged_K, R, paged_B), + typename CollectiveMmaQK::Arguments { + args.mainloop.ptr_q_rope, args.mainloop.stride_q_rope, + args.mainloop.ptr_k_rope, args.mainloop.stride_k_rope, + }, nullptr); + + + auto stride_c_latent_transpose = select<1,0,2>(args.mainloop.stride_c_latent); + auto params_pv_latent = CollectiveMmaPV::to_underlying_arguments( + make_shape(H, L, paged_K, paged_B), + typename CollectiveMmaPV::Arguments { + args.mainloop.ptr_q_latent, args.mainloop.stride_q_latent, // dummy, never used + args.mainloop.ptr_c_latent, stride_c_latent_transpose, + }, nullptr); + + MainloopParams mainloop_params { + params_qk_latent.tma_load_a, + params_qk_rope.tma_load_a, + params_qk_latent_paged.tma_load_b, + params_qk_rope_paged.tma_load_b, + params_pv_latent.tma_load_b + }; + + EpilogueParams epilogue_params; + + epilogue_params.ptr_o = args.epilogue.ptr_o; + epilogue_params.stride_o = args.epilogue.stride_o; + epilogue_params.ptr_lse = args.epilogue.ptr_lse; + epilogue_params.stride_lse = args.epilogue.stride_lse; + epilogue_params.output_scale = args.epilogue.output_scale; + + if (args.split_kv > 1) { + ElementAcc* ptr_o_acc = reinterpret_cast(workspace); + ElementLSE* ptr_lse_acc = reinterpret_cast(ptr_o_acc + H * L * args.split_kv * B); + epilogue_params.ptr_o_acc = ptr_o_acc; + epilogue_params.ptr_lse_acc = ptr_lse_acc; + + epilogue_params.stride_o_acc = make_tuple(static_cast(0 + L) * args.split_kv, _1{}, static_cast(0 + H * L) * args.split_kv); + epilogue_params.stride_lse_acc = make_tuple(_1{}, (0 + H) * args.split_kv); + } + + return {args.problem_shape, args.mainloop, epilogue_params, mainloop_params, + TileScheduler::to_underlying_arguments(args.problem_shape, args.hw_info, ClusterShape{}, args.split_kv), args.split_kv, args.ptr_split_kv}; + } + + static size_t get_workspace_size(Arguments const& args) { + ProblemShape problem_shape = args.problem_shape; + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + auto split_kv = args.split_kv; + return (sizeof(ElementAcc) * D_latent + sizeof(ElementLSE)) * H * split_kv * B; + 
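+    // i.e. the workspace is laid out (as assumed by to_underlying_arguments above) as
+    //   [ ElementAcc o_accum : H * D_latent * split_kv * B ][ ElementLSE lse_accum : H * split_kv * B ]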
} + static Status initialize_workspace( + Arguments const& /*args*/, void* /*ws*/, cudaStream_t /*stream*/) { + return Status::kSuccess; + } + + static dim3 get_grid_shape(Params const& params) { + return TileScheduler::get_grid_shape(params.tile_scheduler); + } + + static dim3 get_block_shape() { + dim3 block(MaxThreadsPerBlock, 1, 1); + return block; + } + + static bool can_implement(Arguments const& args) { + if (kIsCpAsync) { + if ((args.mainloop.page_size & (args.mainloop.page_size - 1)) != 0) { + return false; + } + if (args.mainloop.page_size > TileShapeS{}) { + return false; + } + } + else { + if (args.mainloop.ptr_page_table != nullptr && args.mainloop.page_size != TileShapeS{}) { + return false; + } + } + if (get<0>(args.problem_shape) != 128) { + return false; + } + if (get<1>(args.problem_shape) <= 0) { + return false; + } + if (args.split_kv <= 0) { + return false; + } + return true; + } + + + CUTLASS_DEVICE void operator()(Params const& params, char* smem_raw) { + + TileScheduler tile_scheduler(params.tile_scheduler); + + int warp_idx = cutlass::canonical_warp_idx_sync(); + auto role = warp_idx_to_role(warp_idx); + uint32_t lane_predicate = cute::elect_one_sync(); + + uint32_t cta_rank_in_cluster = cute::block_rank_in_cluster(); + int cta_coord_v = cta_rank_in_cluster % size<0>(AtomThrShapeMNK{}); + bool is_mma_leader_cta = cta_coord_v == 0; + + if (role == WarpRole::kLoad && lane_predicate && ! kIsCpAsync) { + prefetch_tma_descriptor(params.mainloop_params.tma_load_q_latent.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_c_latent.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_q_rope.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_k_rope.get_tma_descriptor()); + prefetch_tma_descriptor(params.mainloop_params.tma_load_c_latent_transpose.get_tma_descriptor()); + } + SharedStorage& shared_storage = *reinterpret_cast(smem_raw); + + typename PipelineLoadQK::Params pipeline_load_qk_params; + if (role == WarpRole::kLoad) { + pipeline_load_qk_params.role = PipelineLoadQK::ThreadCategory::Producer; + } + if (role == WarpRole::kMma) { + pipeline_load_qk_params.role = PipelineLoadQK::ThreadCategory::Consumer; + } + if constexpr (kIsCpAsync) { + // we can make our life easier by unconditionally loading blocks + // since we know it'll always be legal + pipeline_load_qk_params.producer_arv_count = kNumLoadWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + } + else { + pipeline_load_qk_params.is_leader = lane_predicate && (role == WarpRole::kLoad) && is_mma_leader_cta; + pipeline_load_qk_params.transaction_bytes = kTransactionsBytesLoadQK; + } + pipeline_load_qk_params.initializing_warp = 0; + PipelineLoadQK pipeline_load_qk(shared_storage.pipelines.load_qk, pipeline_load_qk_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineS::Params pipeline_mma_s_params; + if (role == WarpRole::kMma) { + pipeline_mma_s_params.role = PipelineS::ThreadCategory::Producer; + } + if (role == WarpRole::kCompute) { + pipeline_mma_s_params.role = PipelineS::ThreadCategory::Consumer; + } + pipeline_mma_s_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_mma_s_params.initializing_warp = 1; + PipelineS pipeline_mma_s( + shared_storage.pipelines.mma_s, + pipeline_mma_s_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename 
PipelineP::Params pipeline_p_mma_params; + if (role == WarpRole::kMma) { + pipeline_p_mma_params.role = PipelineP::ThreadCategory::Consumer; + } + if (role == WarpRole::kCompute) { + pipeline_p_mma_params.role = PipelineP::ThreadCategory::Producer; + } + pipeline_p_mma_params.producer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_p_mma_params.consumer_arv_count = 1; + pipeline_p_mma_params.initializing_warp = 2; + PipelineP pipeline_p_mma( + shared_storage.pipelines.p_mma, + pipeline_p_mma_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelineO::Params pipeline_mma_o_params; + if (role == WarpRole::kMma) { + pipeline_mma_o_params.role = PipelineO::ThreadCategory::Producer; + } + if (role == WarpRole::kCompute) { + pipeline_mma_o_params.role = PipelineO::ThreadCategory::Consumer; + } + pipeline_mma_o_params.consumer_arv_count = kNumComputeWarps * cutlass::NumThreadsPerWarp * size(AtomThrShapeMNK{}); + pipeline_mma_o_params.initializing_warp = 3; + PipelineO pipeline_mma_o( + shared_storage.pipelines.mma_o, + pipeline_mma_o_params, + ClusterShape{}, /*barrier init*/ cute::true_type{}, /*mask calc*/cute::false_type{}); + + typename PipelinePT::Params pipeline_pt_params; + if (role == WarpRole::kLoad) { + pipeline_pt_params.role = PipelinePT::ThreadCategory::Consumer; + } + if (role == WarpRole::kLoadPageTable) { + pipeline_pt_params.role = PipelinePT::ThreadCategory::Producer; + } + pipeline_pt_params.consumer_arv_count = kNumLoadWarps * cutlass::NumThreadsPerWarp; + pipeline_pt_params.producer_arv_count = cutlass::NumThreadsPerWarp; + pipeline_pt_params.initializing_warp = 4; + PipelinePT pipeline_page_table( + shared_storage.pipelines.load_page_table, + pipeline_pt_params); + + TmemAllocator tmem_allocator; + + pipeline_init_arrive_relaxed(size(ClusterShape{})); + + pipeline_load_qk.init_masks(ClusterShape{}); // do we need an update here for 2Sm? 
+ pipeline_mma_s.init_masks(ClusterShape{}); + pipeline_p_mma.init_masks(ClusterShape{}); + pipeline_mma_o.init_masks(ClusterShape{}); + + typename PipelineLoadQK::PipelineState pipeline_load_qk_consumer_state; + typename PipelineLoadQK::PipelineState pipeline_load_qk_producer_state = cutlass::make_producer_start_state(); + + typename PipelineS::PipelineState pipeline_mma_s_consumer_state; + typename PipelineS::PipelineState pipeline_mma_s_producer_state = cutlass::make_producer_start_state(); + + typename PipelineP::PipelineState pipeline_p_mma_consumer_state; + typename PipelineP::PipelineState pipeline_p_mma_producer_state = cutlass::make_producer_start_state(); + + typename PipelineO::PipelineState pipeline_mma_o_consumer_state; + typename PipelineO::PipelineState pipeline_mma_o_producer_state = cutlass::make_producer_start_state(); + + typename PipelinePT::PipelineState pipeline_pt_consumer_state; + typename PipelinePT::PipelineState pipeline_pt_producer_state = cutlass::make_producer_start_state(); + + pipeline_init_wait(size(ClusterShape{})); + + if (role == WarpRole::kLoadPageTable) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_page_table( + blk_coord, + problem_shape, + params.mainloop, + shared_storage.tensors, + pipeline_page_table, pipeline_pt_producer_state, + local_split_kv + ); + } + } + else if (role == WarpRole::kLoad) { + if constexpr (kIsCpAsync) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_cpasync( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv, + /* must be shared pipe */ + pipeline_page_table, pipeline_pt_consumer_state + ); + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + else { + if (params.mainloop.ptr_page_table != nullptr) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_tma( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv + ); 
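+          // Per the kNamedBarrierEpilogue note above: the load warp waits for the compute warps here so that
+          // the next tile's Q load cannot overwrite smem still in use when the persistent scheduler advances.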
+ cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + else { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + load_tma( + blk_coord, + problem_shape, + params.mainloop, + params.mainloop_params, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_producer_state, + pipeline_load_qk, pipeline_load_qk_producer_state, + local_split_kv + ); + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive_and_wait(); + } + } + } + } + else if (role == WarpRole::kMma) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + __syncwarp(); + + if (is_mma_leader_cta) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto local_split_kv = params.split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + mma(blk_coord, + problem_shape, + shared_storage.tensors, + pipeline_load_qk, pipeline_load_qk_consumer_state, + pipeline_load_qk, pipeline_load_qk_consumer_state, + pipeline_mma_s, pipeline_mma_s_producer_state, + pipeline_p_mma, pipeline_p_mma_consumer_state, + pipeline_mma_o, pipeline_mma_o_producer_state, + local_split_kv + ); + } + } + + //cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive_and_wait(); + + //uint32_t free_stage_ptr = shared_storage.tmem_base_ptr; + //tmem_allocator.free(free_stage_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + else if (role == WarpRole::kCompute) { + CUTLASS_PRAGMA_NO_UNROLL + for (; tile_scheduler.is_valid(); ++tile_scheduler) { + auto blk_coord = tile_scheduler.get_block_coord(); + auto problem_shape = params.problem_shape; + auto split_kv = params.split_kv; + auto local_split_kv = split_kv; + if (params.mainloop.ptr_seq != nullptr) { + get<1>(problem_shape) = params.mainloop.ptr_seq[get<2>(blk_coord)]; + if (params.ptr_split_kv != nullptr) { + local_split_kv = params.ptr_split_kv[get<2>(blk_coord)]; + } + } + if (local_split_kv <= get<3>(blk_coord)) + continue; + compute( + blk_coord, + problem_shape, + params.mainloop, // for softmax_scale + params.epilogue, + shared_storage.tensors, // for smem_comm + pipeline_mma_s, pipeline_mma_s_consumer_state, + pipeline_p_mma, pipeline_p_mma_producer_state, + pipeline_mma_o, pipeline_mma_o_consumer_state, + local_split_kv + ); + } + + //cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive(); + } + + cute::cluster_sync(); + cutlass::arch::NamedBarrier((kNumComputeWarps + 1) * NumThreadsPerWarp, kNamedBarrierTmemDealloc).arrive(); + if (role == WarpRole::kMma) { + uint32_t free_stage_ptr = shared_storage.tmem_base_ptr; + 
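+      // Only the MMA warp (which allocated TMEM) frees it; the cluster_sync above ensures the compute
+      // warps have finished reading their accumulators before the capacity is released.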
tmem_allocator.free(free_stage_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + } + + template + CUTLASS_DEVICE void load_page_table( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + TensorStorage& shared_tensors, + PipelinePT& pipeline_page_table, + typename PipelinePT::PipelineState& pipeline_pt_producer_state, int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + int batch_coord = get<2>(blk_coord); + + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), + make_shape(mainloop_args.page_count, B), + mainloop_args.stride_page_table); + auto mPT = mPT_l(_, batch_coord); + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + auto page_size = Pow2{mainloop_args.page_size}; + auto pages_per_tile = Pow2{TileShapeS{} / page_size}; + int thread_idx = threadIdx.x % cutlass::NumThreadsPerWarp; + +#if 1 + for (; k_tile_count > 0; ++k_index, --k_tile_count) { + pipeline_page_table.producer_acquire(pipeline_pt_producer_state); + + // assume a single warp + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < TileShapeS{}; i += cutlass::NumThreadsPerWarp) { + int idx = i + thread_idx; + bool guard = idx < pages_per_tile; + int smem_idx = pipeline_pt_producer_state.index() * TileShapeS::value + idx; + int pt_idx = pages_per_tile * k_index + idx; + + cutlass::arch::cp_async_zfill( + &shared_tensors.smem_page_table[smem_idx], &mPT(pt_idx), guard + ); + } + + pipeline_page_table.producer_commit(pipeline_pt_producer_state, cutlass::arch::cpasync_barrier_arrive); + ++pipeline_pt_producer_state; + } +#endif + } + + + struct Gather { + int& page_table_stage; + Pow2 pages_per_tile; + const int * __restrict__ smem_page_table; + + CUTLASS_DEVICE int operator()(int idx) const { + return smem_page_table[page_table_stage * TileShapeS::value + idx % pages_per_tile]; + } + + CUTLASS_DEVICE friend void print(Gather const&) { + printf(""); + } + + }; + + + template + CUTLASS_DEVICE void load_cpasync( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load, + typename PipelineLoadQK::PipelineState& pipeline_load_producer_state, + int const& split_kv, + PipelinePT& pipeline_page_table, + typename PipelinePT::PipelineState& pipeline_pt_consumer_state) { + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + using X = Underscore; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + // partition all tensors + auto mQL = make_tensor(make_gmem_ptr(mainloop_args.ptr_q_latent), make_shape(H, D_latent, B), mainloop_args.stride_q_latent); + auto mQR = make_tensor(make_gmem_ptr(mainloop_args.ptr_q_rope), make_shape(H, D_rope, B), mainloop_args.stride_q_rope); + + int paged_B = mainloop_args.page_count; + auto paged_K = Pow2{mainloop_args.page_size}; + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), make_shape(paged_B, B), mainloop_args.stride_page_table); + + int 
batch_coord = get<2>(blk_coord); + auto mPT = mPT_l(_, batch_coord); + + auto gQL = local_tile(mQL, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gQR = local_tile(mQR, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + + ThrMMA cta_mma_qk = TiledMmaQK{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + ThrMMA cta_mma_pv = TiledMmaPV{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + + auto tSgQL = cta_mma_qk.partition_A(gQL); + auto tSgQR = cta_mma_qk.partition_A(gQR); + + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + + auto make_copy_for = [](auto sT) { + auto rT_a = sT.layout()(_, _, _, _0{}); + auto rT = make_ordered_layout(shape(rT_a), stride(rT_a)); + auto threads = Int{}; + auto values = Int{}; + return make_cotiled_copy( + Copy_Atom, Element>{}, + make_ordered_layout( + make_shape(threads, values), + make_stride(_1{}, _0{})), + rT); + }; + + // like cute::copy, but makes sure we do all page table lookups first + auto copy_split = [](auto atom, auto src, auto dst) { + auto src_v = group_modes<1, rank_v>(src); + auto dst_v = group_modes<1, rank_v>(dst); + + auto src_v_ptrs = make_tensor(size<1>(src_v)); + for (int i = 0; i < size<1>(src_v); i++) { + src_v_ptrs(i) = &src_v(_0{}, i); + } + + + for (int i = 0; i < size<1>(src_v); i++) { + auto src_v_i = make_tensor( + make_gmem_ptr(src_v_ptrs(i)), + make_shape(shape<0>(src_v)), + make_stride(make_stride(_1{}, _0{})) + ); + atom.call(src_v_i, dst_v(_, i)); + } + }; + + auto tiled_copy_q = make_copy_for(sQ); + auto tiled_copy_kc = make_copy_for(sKC); + auto tiled_copy_vc = make_copy_for(sVC); + + auto thr_copy_q = tiled_copy_q.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + auto thr_copy_kc = tiled_copy_kc.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + auto thr_copy_vc = tiled_copy_vc.get_thread_slice(threadIdx.x % (kNumLoadWarps * cutlass::NumThreadsPerWarp)); + + auto tQsQ = thr_copy_q.partition_D(sQ); + auto tQgQL = thr_copy_q.partition_S(tSgQL); + auto tQgQR = thr_copy_q.partition_S(tSgQR); + + auto tKCsKC = thr_copy_kc.partition_D(sKC); + auto tVCsVC = thr_copy_vc.partition_D(sVC); + + auto pipeline_pt_release_state = pipeline_pt_consumer_state; + + int page_table_stage = -1; + Pow2 pages_per_tile{TileShapeS{} / paged_K}; + const int * __restrict__ smem_page_table = shared_tensors.smem_page_table.begin(); + Gather gather{page_table_stage, pages_per_tile, smem_page_table}; + + auto mCL = make_tensor( + make_gmem_ptr(mainloop_args.ptr_c_latent), + ComposedLayout{ + make_layout( + make_shape(make_shape(paged_K, paged_B), _1{}), + make_stride(make_stride(get<0>(mainloop_args.stride_c_latent), example::CustomStride(gather, get<2>(mainloop_args.stride_c_latent))), get<1>(mainloop_args.stride_c_latent))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(paged_K * paged_B, D_latent))}); + + auto mKR = make_tensor( + make_gmem_ptr(mainloop_args.ptr_k_rope), + ComposedLayout{ + make_layout( + make_shape(make_shape(paged_K, paged_B), _1{}), + make_stride(make_stride(get<0>(mainloop_args.stride_k_rope), example::CustomStride(gather, get<2>(mainloop_args.stride_k_rope))), get<1>(mainloop_args.stride_k_rope))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(paged_K * paged_B, D_latent))}); + + auto mCLT = 
make_tensor( + make_gmem_ptr(mainloop_args.ptr_c_latent), + ComposedLayout{ + make_layout( + make_shape(_1{}, make_shape(paged_K, paged_B)), + make_stride(get<1>(mainloop_args.stride_c_latent), make_stride(get<0>(mainloop_args.stride_c_latent), example::CustomStride(gather, get<2>(mainloop_args.stride_c_latent))))), + make_coord(_0{}, _0{}), + make_identity_layout(make_shape(D_latent, paged_K * paged_B))}); + + auto gCL = local_tile(mCL, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gKR = local_tile(mKR, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gCLT = local_tile(mCLT, TileShapePV{}, make_coord(_,_,_), Step{}); + + auto tSgCL = cta_mma_qk.partition_B(gCL); + auto tSgKR = cta_mma_qk.partition_B(gKR); + auto tOgCLT = cta_mma_pv.partition_B(gCLT); + + auto tKCgCL = thr_copy_kc.partition_S(tSgCL); + auto tKCgKR = thr_copy_kc.partition_S(tSgKR); + auto tVCgCLT = thr_copy_vc.partition_S(tOgCLT); + + // latent is first in memory, so let's load it first always + // startup: alternate Q and K, set tx count appropriately, for k_idx = 0 + auto& pipeline_acquire_state = pipeline_load_producer_state; + auto pipeline_commit_state = pipeline_acquire_state; + int pipeline_offset = 0; + + for (int i = 0; i < StagesPV; i++) { + cutlass::arch::cp_async_fence(); + } + + auto load_stage = [&](auto fn) { + pipeline_load.producer_acquire(pipeline_acquire_state); + fn(pipeline_acquire_state.index()); + cutlass::arch::cp_async_fence(); + + ++pipeline_acquire_state; + ++pipeline_offset; + + if (pipeline_offset == StagesPV - 1) { + cutlass::arch::cp_async_wait(); + pipeline_load.producer_commit(pipeline_commit_state); + ++pipeline_commit_state; + --pipeline_offset; + } + }; + + pipeline_page_table.consumer_wait(pipeline_pt_consumer_state); + page_table_stage = pipeline_pt_consumer_state.index(); + ++pipeline_pt_consumer_state; + + // each Q/K tile consists of rope and latent + for (int i = 0; i < IterationsQKLatent; i++) { + load_stage([&](int index) { + cute::copy(tiled_copy_q, tQgQL(_, _, _, _, _0{}, i, batch_coord), tQsQ(_, _, _, _, i)); + copy_split(tiled_copy_kc, tKCgCL(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + for (int i = 0; i < IterationsQKRope; i++) { + load_stage([&](int index) { + cute::copy(tiled_copy_q, tQgQR(_, _, _, _, _0{}, i, batch_coord), tQsQ(_, _, _, _, IterationsQKLatent + i)); + copy_split(tiled_copy_kc, tKCgKR(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + k_index += 1; + k_tile_count -= 1; + + // assume k_tile_count >= 1 + // perform K+Q load here + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + pipeline_page_table.consumer_wait(pipeline_pt_consumer_state); + page_table_stage = pipeline_pt_consumer_state.index(); + ++pipeline_pt_consumer_state; + + for (int i = 0; i < IterationsQKLatent; i++) { + load_stage([&](int index) { + copy_split(tiled_copy_kc, tKCgCL(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + for (int i = 0; i < IterationsQKRope; i++) { + load_stage([&](int index) { + copy_split(tiled_copy_kc, tKCgKR(_, _, _, _, k_index, i), tKCsKC(_, _, _, _, index)); + }); + } + + page_table_stage = pipeline_pt_release_state.index(); + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + load_stage([&](int index) { + copy_split(tiled_copy_vc, tVCgCLT(_, _, _, _, j, IterationsPV_K * (k_index - 1) + i), tVCsVC(_, _, _, _, index)); + }); + } + } + + pipeline_page_table.consumer_release(pipeline_pt_release_state); + ++pipeline_pt_release_state; + + k_index += 1; + 
k_tile_count -= 1; + } + + page_table_stage = pipeline_pt_release_state.index(); + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + load_stage([&](int index) { + copy_split(tiled_copy_vc, tVCgCLT(_, _, _, _, j, IterationsPV_K * (k_index - 1) + i), tVCsVC(_, _, _, _, index)); + }); + } + } + + pipeline_page_table.consumer_release(pipeline_pt_release_state); + ++pipeline_pt_release_state; + + while (pipeline_offset > 0) { + cutlass::arch::cp_async_fence(); + + cutlass::arch::cp_async_wait(); + pipeline_load.producer_commit(pipeline_commit_state); + ++pipeline_commit_state; + --pipeline_offset; + } + + cutlass::arch::cp_async_wait<0>(); + + } + + + template + CUTLASS_DEVICE void load_tma( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + MainloopParams const& mainloop_params, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load_qk, + typename PipelineLoadQK::PipelineState& pipeline_load_qk_producer_state, + PipelineLoadPV& pipeline_load_pv, + typename PipelineLoadPV::PipelineState& pipeline_load_pv_producer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + using X = Underscore; + + // partition all tensors + auto mQL = mainloop_params.tma_load_q_latent.get_tma_tensor(make_shape(H, D_latent, B)); + auto mQR = mainloop_params.tma_load_q_rope.get_tma_tensor(make_shape(H, D_rope, B)); + + int paged_B = B; + int paged_K = K; + if constexpr (kIsPaged) { + paged_B = mainloop_args.page_count; + paged_K = mainloop_args.page_size; + } + auto mPT_l = make_tensor(make_gmem_ptr(mainloop_args.ptr_page_table), make_shape(paged_B, B), mainloop_args.stride_page_table); + + auto mCL = mainloop_params.tma_load_c_latent.get_tma_tensor(make_shape(paged_K, D_latent, paged_B)); + auto mKR = mainloop_params.tma_load_k_rope.get_tma_tensor(make_shape(paged_K, D_rope, paged_B)); + + auto mCLT = mainloop_params.tma_load_c_latent_transpose.get_tma_tensor(make_shape(D_latent, paged_K, paged_B)); + + auto gQL = local_tile(mQL, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + auto gQR = local_tile(mQR, TileShapeQK{}, make_coord(_,_,_), Step<_1, X, _1>{}); + + auto gCL = local_tile(mCL, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gKR = local_tile(mKR, TileShapeQK{}, make_coord(_,_,_), Step{}); + auto gCLT = local_tile(mCLT, TileShapePV{}, make_coord(_,_,_), Step{}); + + ThrMMA cta_mma_qk = TiledMmaQK{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + ThrMMA cta_mma_pv = TiledMmaPV{}.get_slice(get<0>(blk_coord) % size(AtomThrShapeMNK{})); + + auto tSgQL = cta_mma_qk.partition_A(gQL); + auto tSgQR = cta_mma_qk.partition_A(gQR); + + auto tSgCL = cta_mma_qk.partition_B(gCL); + auto tSgKR = cta_mma_qk.partition_B(gKR); + + auto tOgCLT = cta_mma_pv.partition_B(gCLT); + + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + + auto [tQLgQL_mkl, tQsQ] = tma_partition( + mainloop_params.tma_load_q_latent, _0{}, make_layout(_1{}), + 
group_modes<0,3>(sQ), group_modes<0,3>(tSgQL)); + + auto [tQRgQR_mkl, tQsQ_ignore] = tma_partition( + mainloop_params.tma_load_q_rope, _0{}, make_layout(_1{}), + group_modes<0,3>(sQ), group_modes<0,3>(tSgQR)); + + auto [tCLgCL_nkl, tKCsKC] = tma_partition( + mainloop_params.tma_load_c_latent, _0{}, make_layout(_1{}), + group_modes<0,3>(sKC), group_modes<0,3>(tSgCL)); + + auto [tKRgKR_nkl, tKCsKC_ignore] = tma_partition( + mainloop_params.tma_load_k_rope, _0{}, make_layout(_1{}), + group_modes<0,3>(sKC), group_modes<0,3>(tSgKR)); + + auto [tCLTgCLT_nkl, tVCsVC] = tma_partition( + mainloop_params.tma_load_c_latent_transpose, _0{}, make_layout(_1{}), + group_modes<0,3>(sVC), group_modes<0,3>(tOgCLT)); + + uint16_t mcast_mask = 0; + + int batch_coord = get<2>(blk_coord); + Tensor tQLgQL = tQLgQL_mkl(_, _, _, batch_coord); + Tensor tQRgQR = tQRgQR_mkl(_, _, _, batch_coord); + + auto mPT = mPT_l(_, batch_coord); + + Tensor tCLgCL = tCLgCL_nkl(_, _, _, _); + Tensor tKRgKR = tKRgKR_nkl(_, _, _, _); + + // careful: stage and k are swapped here! + Tensor tCLTgCLT = tCLTgCLT_nkl(_, _, _, _); + + // latent is first in memory, so let's load it first always + // startup: alternate Q and K, set tx count appropriately, for k_idx = 0 + + // each Q/K tile consists of rope and latent + for (int i = 0; i < IterationsQKLatent; i++) { + pipeline_load_qk.producer_expect_transaction(pipeline_load_qk_producer_state, kTransactionsBytesLoadExtraQ); + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // expect the extra bytes + // load_qk ql + cute::copy(mainloop_params.tma_load_q_latent.with(*tma_barrier, mcast_mask), tQLgQL(_, _0{}, i), tQsQ(_, i)); + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + for (int i = 0; i < IterationsQKRope; i++) { + pipeline_load_qk.producer_expect_transaction(pipeline_load_qk_producer_state, kTransactionsBytesLoadExtraQ); + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // expect the extra bytes + // load_qk ql + cute::copy(mainloop_params.tma_load_q_rope.with(*tma_barrier, mcast_mask), tQRgQR(_, _0{}, i), tQsQ(_, i + IterationsQKLatent)); + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + k_index += 1; + k_tile_count -= 1; + + // assume k_tile_count >= 1 + // perform K+Q load here + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + // perform K load + for (int i = 0; i < IterationsQKLatent; i++) { + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier 
= pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent.with(*tma_barrier, mcast_mask), + tCLgCL(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + for (int i = 0; i < IterationsQKRope; i++) { + pipeline_load_qk.producer_acquire(pipeline_load_qk_producer_state); + auto tma_barrier = pipeline_load_qk.producer_get_barrier(pipeline_load_qk_producer_state); + + if (cute::elect_one_sync()) { + // load_qk cl + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, _0{}, i, mPT(k_index)), + tKCsKC(_, pipeline_load_qk_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_k_rope.with(*tma_barrier, mcast_mask), + tKRgKR(_, k_index, i, batch_coord), + tKCsKC(_, pipeline_load_qk_producer_state.index())); + } + } + ++pipeline_load_qk_producer_state; + } + + // prefetch next K load to keep busy while we transpose-load from cache + const int kPrefetchDistance = 1; + for (int i = 0; i < IterationsQKLatent; i++) { + if (cute::elect_one_sync()) { + if constexpr (kIsPaged) { + if (k_tile_count > kPrefetchDistance) { + cute::prefetch( + mainloop_params.tma_load_c_latent, + tCLgCL(_, _0{}, i, mPT(k_index + kPrefetchDistance)) + ); + } + } + else { + cute::prefetch( + mainloop_params.tma_load_c_latent, + tCLgCL(_, k_index + kPrefetchDistance, i, batch_coord) + ); + } + } + } + + for (int i = 0; i < IterationsQKRope; i++) { + if (cute::elect_one_sync()) { + if constexpr (kIsPaged) { + if (k_tile_count > kPrefetchDistance) { + cute::prefetch( + mainloop_params.tma_load_k_rope, + tKRgKR(_, _0{}, i, mPT(k_index + kPrefetchDistance)) + ); + } + } + else { + cute::prefetch( + mainloop_params.tma_load_k_rope, + tKRgKR(_, k_index + kPrefetchDistance, i, batch_coord) + ); + } + } + } + + // perform V load (k_idx - 1) + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.producer_acquire(pipeline_load_pv_producer_state); + auto tma_barrier = pipeline_load_pv.producer_get_barrier(pipeline_load_pv_producer_state); + + if (cute::elect_one_sync()) { + // load_pv cl + // note the transpose in indices! 
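+            // The PV operand re-reads the latent KV cache through the transposed TMA
+            // descriptor (tma_load_c_latent_transpose), which is why the tile and
+            // column indices appear swapped relative to the K-side load above.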
+ // note we are off-by-one on k_index + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, i, mPT(k_index - 1)), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, IterationsPV_K * (k_index - 1) + i, batch_coord), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + } + ++pipeline_load_pv_producer_state; + } + } + + k_index += 1; + k_tile_count -= 1; + } + + for (int i = 0; i < IterationsPV_K; i++) { + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.producer_acquire(pipeline_load_pv_producer_state); + auto tma_barrier = pipeline_load_pv.producer_get_barrier(pipeline_load_pv_producer_state); + + if (cute::elect_one_sync()) { + // load_pv cl + // note the transpose in indices + // note we are off-by-one on k_index + + if constexpr (kIsPaged) { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, i, mPT(k_index - 1)), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + else { + cute::copy( + mainloop_params.tma_load_c_latent_transpose.with(*tma_barrier, mcast_mask, cute::TMA::CacheHintSm100::EVICT_FIRST), + tCLTgCLT(_, j, IterationsPV_K * (k_index - 1) + i, batch_coord), + tVCsVC(_, pipeline_load_pv_producer_state.index()) + ); + } + } + ++pipeline_load_pv_producer_state; + } + } + } + + template + CUTLASS_DEVICE void mma( + BlkCoord const& blk_coord, + ProblemShape const& problem_shape, + TensorStorage& shared_tensors, + PipelineLoadQK& pipeline_load_qk, + typename PipelineLoadQK::PipelineState& pipeline_load_qk_consumer_state, + PipelineLoadPV& pipeline_load_pv, + typename PipelineLoadPV::PipelineState& pipeline_load_pv_consumer_state, + PipelineS& pipeline_mma_s, + typename PipelineS::PipelineState& pipeline_mma_s_producer_state, + PipelineP& pipeline_p_mma, + typename PipelineP::PipelineState& pipeline_p_mma_consumer_state, + PipelineO& pipeline_mma_o, + typename PipelineO::PipelineState& pipeline_mma_o_producer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(blk_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + return; + } + + // mma init + Tensor sQ = make_tensor(make_smem_ptr(shared_tensors.smem_q.begin()), SmemLayoutQ{}); + Tensor sKC = make_tensor(make_smem_ptr(shared_tensors.smem_kc.begin()), SmemLayoutKC{}); + Tensor sVC = make_tensor(make_smem_ptr(shared_tensors.smem_vc.begin()), SmemLayoutVC{}); + Tensor sP = make_tensor(make_smem_ptr((Element*) shared_tensors.smem_p.begin()), SmemLayoutP{}); + + Tensor tSrQ = TiledMmaQK::make_fragment_A(sQ); + Tensor tSrKC = TiledMmaQK::make_fragment_B(sKC); + Tensor tOrP = TiledMmaPV::make_fragment_A(sP); + Tensor tOrVC = TiledMmaPV::make_fragment_B(sVC); + + TiledMmaQK tiled_mma_qk; + TiledMmaPV tiled_mma_pv; + + Tensor tStS = partition_fragment_C(tiled_mma_qk, select<0,1>(TileShapeQK{})); + Tensor tOtO = partition_fragment_C(tiled_mma_pv, select<0,1>(TileShapePV{})); + + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::Zero; + + 
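+    // MMA-warp schedule: the first S = Q*K^T tile is issued below, outside the
+    // steady-state loop, so that each loop iteration overlaps "S for tile k" with
+    // "O += P*V for tile k-1" (see the ownership diagram). The S accumulator
+    // ping-pongs between the two TMEM slots kS0/kS1 selected by the S-pipeline
+    // stage index.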
pipeline_mma_s.producer_acquire(pipeline_mma_s_producer_state); + + // Mma S0 S1 O0 S2 O1 ... Sn On-1 On + // S0 ownership -- ----- -- -- + // S1 ownership -- ----- ---- + // O ownership -- -- ---- -- + + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::Zero; + for (int i = 0; i < IterationsQK; i++) { + pipeline_load_qk.consumer_wait(pipeline_load_qk_consumer_state); + int read_stage = pipeline_load_qk_consumer_state.index(); + + tStS.data() = uint32_t(pipeline_mma_s_producer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1); + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSrQ); ++k_block) { + cute::gemm(tiled_mma_qk, + tSrQ(_,_,k_block,i), + tSrKC(_,_,k_block,read_stage), + tStS); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_qk.consumer_release(pipeline_load_qk_consumer_state); + ++pipeline_load_qk_consumer_state; + } + + pipeline_mma_s.producer_commit(pipeline_mma_s_producer_state); + ++pipeline_mma_s_producer_state; + + k_tile_count -= 1; + + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + + pipeline_mma_s.producer_acquire(pipeline_mma_s_producer_state); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::Zero; + for (int i = 0; i < IterationsQK; i++) { + pipeline_load_qk.consumer_wait(pipeline_load_qk_consumer_state); + int read_stage = pipeline_load_qk_consumer_state.index(); + + tStS.data() = uint32_t(pipeline_mma_s_producer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1); + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tSrQ); ++k_block) { + cute::gemm(tiled_mma_qk, + tSrQ(_,_,k_block,i), + tSrKC(_,_,k_block,read_stage), + tStS); + tiled_mma_qk.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_qk.consumer_release(pipeline_load_qk_consumer_state); + ++pipeline_load_qk_consumer_state; + } + + pipeline_mma_s.producer_commit(pipeline_mma_s_producer_state); + ++pipeline_mma_s_producer_state; + + pipeline_mma_o.producer_acquire(pipeline_mma_o_producer_state); + pipeline_p_mma.consumer_wait(pipeline_p_mma_consumer_state); + + for (int i = 0; i < IterationsPV_K; i++) { + auto acc_flag = tiled_mma_pv.accumulate_; + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.consumer_wait(pipeline_load_pv_consumer_state); + + int read_stage = pipeline_load_pv_consumer_state.index(); + + tOtO.data() = uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO); + tiled_mma_pv.accumulate_ = acc_flag; + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tOrP); ++k_block) { + cute::gemm(tiled_mma_pv, + tOrP(_,_,k_block, make_coord(i, pipeline_p_mma_consumer_state.index())), + tOrVC(_,_,k_block,read_stage), + tOtO); + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_pv.consumer_release(pipeline_load_pv_consumer_state); + ++pipeline_load_pv_consumer_state; + } + } + + pipeline_p_mma.consumer_release(pipeline_p_mma_consumer_state); + ++pipeline_p_mma_consumer_state; + pipeline_mma_o.producer_commit(pipeline_mma_o_producer_state); + ++pipeline_mma_o_producer_state; + + --k_tile_count; + } + + pipeline_mma_o.producer_acquire(pipeline_mma_o_producer_state); + pipeline_p_mma.consumer_wait(pipeline_p_mma_consumer_state); + + for (int i = 0; i < IterationsPV_K; i++) { + auto acc_flag = tiled_mma_pv.accumulate_; + for (int j = 0; j < IterationsPV_N; j++) { + pipeline_load_pv.consumer_wait(pipeline_load_pv_consumer_state); + + int read_stage = pipeline_load_pv_consumer_state.index(); + + tOtO.data() = uint32_t(TmemAllocation::kO0) + j * 
uint32_t(TmemAllocation::kSizeAccO); + tiled_mma_pv.accumulate_ = acc_flag; + + CUTLASS_PRAGMA_UNROLL + for (int k_block = 0; k_block < size<2>(tOrP); ++k_block) { + cute::gemm(tiled_mma_pv, + tOrP(_,_,k_block, make_coord(i, pipeline_p_mma_consumer_state.index())), + tOrVC(_,_,k_block,read_stage), + tOtO); + tiled_mma_pv.accumulate_ = UMMA::ScaleOut::One; + } + + pipeline_load_pv.consumer_release(pipeline_load_pv_consumer_state); + ++pipeline_load_pv_consumer_state; + } + } + + pipeline_p_mma.consumer_release(pipeline_p_mma_consumer_state); + ++pipeline_p_mma_consumer_state; + pipeline_mma_o.producer_commit(pipeline_mma_o_producer_state); + ++pipeline_mma_o_producer_state; + } + + + template + CUTLASS_DEVICE void softmax( + IsLastTile const& is_last_tile, + ElementAcc& row_max, + ElementAcc& row_sum, + ElementAcc& correction_factor, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + TensorStorage& shared_tensors, + int k_index, + uint32_t tmem_s, + int smem_p_index) { + + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + + TiledMmaQK tiled_mma_qk; + + Tensor tStS = partition_fragment_C(tiled_mma_qk, select<0,1>(TileShapeQK{})); + tStS.data() = tmem_s; + + CUTE_STATIC_ASSERT_V(shape<1>(tStS) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tStS) == _1{}); + Tensor tAcc = tStS(make_coord(_,_),_0{},_0{}); + + Tensor cS = make_identity_tensor(take<0,2>(CtaShapeQK{})); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_cS = thread_t2r.partition_D(cS); + Tensor tTR_rAcc = make_tensor(shape(tTR_cS)); + + Tensor tTR_rS_frag = make_tensor(shape(tTR_rAcc)); + const int AlignmentS = 4; + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + Tensor tTR_rAcc_vec = recast>(tTR_rAcc); + Tensor tTR_rS_vec = recast>(tTR_rS_frag); + + // load s + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + if (is_last_tile) { + for (int i = 0; i < size(tTR_rAcc); i++) { + if (get<1>(tTR_cS(i)) + TileShapeS{} * k_index >= get<1>(problem_shape)) { + tTR_rAcc(i) = -std::numeric_limits::infinity(); + } + } + } + + // max + ElementAcc row_max_new = row_max; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i += 1) { + row_max_new = ::fmax(row_max_new, tTR_rAcc(i)); + } + + // for 2x2 dp, reduce here + if constexpr (kWarpsInN > 1) { + shared_tensors.smem_exchange[threadIdx.x] = row_max_new; + cutlass::arch::NamedBarrier(kNumComputeWarps*NumThreadsPerWarp, kNamedBarrierExchange).sync(); + // (64, 2) shape + int peer_index = (threadIdx.x + 64) % 128; + row_max_new = cutlass::max(row_max_new, shared_tensors.smem_exchange[peer_index]); + } + +#ifndef B2B + // find correction factor + ElementAcc softmax_scale_log2 = mainloop_args.softmax_scale * static_cast(M_LOG2E); + correction_factor = ::exp2f(softmax_scale_log2 * (row_max - row_max_new)); + row_max = row_max_new; + + // softmax + ElementAcc row_max_scale_log2 = row_max * softmax_scale_log2; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rAcc(i) = ::exp2f(softmax_scale_log2 * tTR_rAcc(i) - row_max_scale_log2); + } +#endif + + // quantize + cutlass::NumericArrayConverter epilogue_op; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc_vec); i++) { + tTR_rS_vec(i) = epilogue_op(tTR_rAcc_vec(i)); + } + + Tensor sP = make_tensor(make_smem_ptr((Element*) shared_tensors.smem_p.begin()), SmemLayoutP{})(_, _, _, make_coord(_, smem_p_index)); + + Tensor tOcP = TiledMmaPV{}.get_slice(_0{}).partition_A(cS); 
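+    // The quantized P fragment is staged to shared memory next, in the layout the PV
+    // TiledMma expects for its A operand. After that the online-softmax running sum
+    // is updated (non-B2B path); in scalar form (illustrative):
+    //   row_sum = row_sum * correction_factor
+    //           + sum_j exp2f(softmax_scale_log2 * S_j - softmax_scale_log2 * row_max)
+    // with the exponentials already applied to tTR_rAcc above.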
+ + // have a mapping for each thread to coord + // find identical mapping to coords for the MMA + auto l = make_ordered_layout(make_shape(make_shape(_64{}, _2{}), make_shape(_16{}, TileShapeS{} / _32{})), make_stride(make_stride(_0{}, _3{}), make_stride(_1{}, _2{}))); + auto sP_ = as_position_independent_swizzle_tensor(sP); + copy_aligned(tTR_rS_frag, sP_.compose(l)(threadIdx.x, _)); + + // sum + row_sum *= correction_factor; + + static_assert(cute::is_same_v); + auto tTR_rAcc_float2 = recast(tTR_rAcc); + auto sums = make_tensor(_4{}); + static_assert(size(tTR_rAcc_float2) % size(sums) == 0); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(sums); i++) { + sums(i) = tTR_rAcc_float2(i); + } + CUTLASS_PRAGMA_UNROLL + for (int i = size(sums); i < size(tTR_rAcc_float2); i += size(sums)) { + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < size(sums); j++) { + cute::add(sums(j), sums(j), tTR_rAcc_float2(i + j)); + } + } + CUTLASS_PRAGMA_UNROLL + for (int i = 1; i < size(sums); i *= 2) { + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < size(sums); j += 2*i) { + cute::add(sums(j), sums(j), sums(j+i)); + } + } + row_sum += sums(0).x + sums(0).y; + } + + + CUTLASS_DEVICE void rescale( + ElementAcc correction_factor, + uint32_t tmem_o) { + + // for b2b gemm, do nothing +#ifndef B2B + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + auto store_op = TMEM::tmem_load_to_store(load_op); + + TiledMmaPV tiled_mma_pv; + + Tensor tOtO = partition_fragment_C(tiled_mma_pv, select<0,1>(TileShapePV{})); + tOtO.data() = tmem_o; + + CUTE_STATIC_ASSERT_V(shape<1>(tOtO) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tOtO) == _1{}); + Tensor tAcc = tOtO(make_coord(_,_),_0{},_0{}); + + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = make_tensor(make_gmem_ptr((ElementAcc*) nullptr), cta_tiler_pv, make_stride(0, 0)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto tiled_r2t = make_tmem_copy(store_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + auto thread_r2t = tiled_r2t.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + // load o + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + // multiply by correction factor + float2 correction_factor_vec = make_float2(correction_factor, correction_factor); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i += 2) { + float2 in = make_float2(tTR_rAcc(i + 0), tTR_rAcc(i + 1)); + float2 out; + cute::mul(out, in, correction_factor_vec); + tTR_rAcc(i + 0) = out.x; + tTR_rAcc(i + 1) = out.y; + } + + // store o + copy(tiled_r2t, tTR_rAcc, tTR_tAcc); +#endif + } + + + template + CUTLASS_DEVICE void epilogue( + ElementAcc& row_max, + ElementAcc& row_sum, + BlkCoord const& cta_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + EpilogueParams const& epilogue_args, + TensorStorage& shared_tensors, + uint32_t tmem_o, + int const& split_kv) { + + auto load_op = cute::SM100_TMEM_LOAD_32dp32b32x{}; + + TiledMmaPV tiled_mma_pv; + + Tensor tOtO = TiledMmaPV::make_fragment_C(partition_shape_C(TiledMmaPV{}, take<0, 2>(TileShapePV{}))); + tOtO.data() = tmem_o; + + CUTE_STATIC_ASSERT_V(shape<1>(tOtO) == _1{}); + CUTE_STATIC_ASSERT_V(shape<2>(tOtO) == _1{}); + Tensor tAcc = tOtO(make_coord(_,_),_0{},_0{}); + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + if (epilogue_args.ptr_o_acc != nullptr) { + using 
ElementOutAcc = ElementAcc; + constexpr auto AlignmentOutAcc = 128 / cute::sizeof_bits_v; + Tensor mO = make_tensor(make_gmem_ptr(epilogue_args.ptr_o_acc + get<3>(cta_coord) * D_latent), make_shape(H, D_latent, B), epilogue_args.stride_o_acc); + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = local_tile(mO, cta_tiler_pv, take<0,3>(cta_coord)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_rO_frag = make_tensor(shape(tTR_rAcc)); + Tensor tTR_rO_src = recast>(coalesce(tTR_rO_frag)); + Tensor tR2G_rO_dst = recast>(coalesce(tTR_gO)); + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + cutlass::epilogue::thread::LinearCombination epilogue_op({epilogue_args.output_scale / row_sum}); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rO_frag(i) = epilogue_op(tTR_rAcc(i)); + } + + copy(tTR_rO_src, tR2G_rO_dst); + +#ifndef B2B + + // compute LSE + ElementAcc lse = cutlass::fast_log(row_sum) + mainloop_args.softmax_scale * row_max; + + // store LSE + Tensor mLSE = make_tensor(make_gmem_ptr(epilogue_args.ptr_lse_acc + H * get<3>(cta_coord)), make_shape(H, B), epilogue_args.stride_lse_acc); + Tensor gLSE = local_tile(mLSE, append<3>(cta_tiler_pv, _1{}), take<0,3>(cta_coord), Step<_1, Underscore, _1>{}); + // for 2x2 dp, this must be conditional and the index is wrong + if (! kIs2Sm || (threadIdx.x < 64)) + { + gLSE(threadIdx.x) = lse; + } + #endif + } + else { + Tensor mO = make_tensor(make_gmem_ptr(epilogue_args.ptr_o), make_shape(H, D_latent, B), epilogue_args.stride_o); + auto cta_tiler_pv = take<0,2>(typename CollectiveMmaPV::CtaShape_MNK{}); + Tensor gO = local_tile(mO, cta_tiler_pv, take<0,3>(cta_coord)); + + auto tiled_t2r = make_tmem_copy(load_op, tAcc); + auto thread_idx = threadIdx.x % size(tiled_t2r); + + auto thread_t2r = tiled_t2r.get_slice(thread_idx); + Tensor tTR_gO = thread_t2r.partition_D(gO); + Tensor tTR_rAcc = make_tensor(shape(tTR_gO)); + + Tensor tTR_rO_frag = make_tensor(shape(tTR_rAcc)); + Tensor tTR_rO_src = recast>(coalesce(tTR_rO_frag)); + Tensor tR2G_rO_dst = recast>(coalesce(tTR_gO)); + Tensor tTR_tAcc = thread_t2r.partition_S(tAcc); + + copy(tiled_t2r, tTR_tAcc, tTR_rAcc); + + cutlass::epilogue::thread::LinearCombination epilogue_op({epilogue_args.output_scale / row_sum}); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tTR_rAcc); i++) { + tTR_rO_frag(i) = epilogue_op(tTR_rAcc(i)); + } + + copy(tTR_rO_src, tR2G_rO_dst); + +#ifndef B2B + if (epilogue_args.ptr_lse != nullptr) { + // compute LSE + ElementAcc lse = cutlass::fast_log(row_sum) + mainloop_args.softmax_scale * row_max; + + // store LSE + Tensor mLSE = make_tensor(make_gmem_ptr(epilogue_args.ptr_lse), make_shape(H, B), epilogue_args.stride_lse); + Tensor gLSE = local_tile(mLSE, append<3>(cta_tiler_pv, _1{}), take<0,3>(cta_coord), Step<_1, Underscore, _1>{}); + + // for 2x2 dp, this must be conditional and the index is wrong + if (! 
kIs2Sm || (threadIdx.x < 64)) + { + gLSE(threadIdx.x) = lse; + } + } +#endif + } + } + + + template + CUTLASS_DEVICE void compute( + CtaCoord const& cta_coord, + ProblemShape const& problem_shape, + MainloopArguments const& mainloop_args, + EpilogueParams const& epilogue_args, + TensorStorage& shared_tensors, + PipelineS& pipeline_mma_s, + typename PipelineS::PipelineState& pipeline_mma_s_consumer_state, + PipelineP& pipeline_p_mma, + typename PipelineP::PipelineState& pipeline_p_mma_producer_state, + PipelineO& pipeline_mma_o, + typename PipelineO::PipelineState& pipeline_mma_o_consumer_state, + int const& split_kv) { + + auto [H, K, D, B] = problem_shape; + + int k_tile_total = ceil_div(K, TileShapeS{}); + int k_tile_per_cta = ceil_div(k_tile_total, split_kv); + int k_index = get<3>(cta_coord) * k_tile_per_cta; // lower limit + int k_tile_count = max(0, min(k_tile_total, k_index + k_tile_per_cta) - k_index); + if (k_tile_count == 0) { + + // if we return early, we have to make sure we release the load warp + cutlass::arch::NamedBarrier( + (kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, + kNamedBarrierEpilogue + ).arrive(); + + return; + } + int k_index_final = k_tile_total - 1; + + ElementAcc row_max = -std::numeric_limits::infinity(); + ElementAcc row_sum = 0; + ElementAcc correction_factor = 1; + + pipeline_p_mma.producer_acquire(pipeline_p_mma_producer_state); + pipeline_mma_s.consumer_wait(pipeline_mma_s_consumer_state); + + auto dispatch_bool = [](bool b, auto fn) { + if (b) { + fn(cute::true_type{}); + } + else { + fn(cute::false_type{}); + } + }; + + // softmax s0 -> p0 + dispatch_bool(k_index == k_index_final, [&](auto is_last_tile) { + softmax( + is_last_tile, + row_max, row_sum, correction_factor, + problem_shape, mainloop_args, shared_tensors, k_index, + uint32_t(pipeline_mma_s_consumer_state.index() == 0 ? TmemAllocation::kS0 : TmemAllocation::kS1), + pipeline_p_mma_producer_state.index() + ); + }); + + k_index += 1; + + cutlass::arch::fence_view_async_tmem_load(); + cutlass::arch::fence_view_async_shared(); + pipeline_mma_s.consumer_release(pipeline_mma_s_consumer_state); + ++pipeline_mma_s_consumer_state; + pipeline_p_mma.producer_commit(pipeline_p_mma_producer_state); + ++pipeline_p_mma_producer_state; + + k_tile_count -= 1; + + CUTLASS_PRAGMA_NO_UNROLL + while (k_tile_count > 0) { + pipeline_p_mma.producer_acquire(pipeline_p_mma_producer_state); + pipeline_mma_s.consumer_wait(pipeline_mma_s_consumer_state); + + // softmax s1 -> p1 + dispatch_bool(k_index == k_index_final, [&](auto is_last_tile) { + softmax( + is_last_tile, + row_max, row_sum, correction_factor, + problem_shape, mainloop_args, shared_tensors, k_index, + uint32_t(pipeline_mma_s_consumer_state.index() == 0 ? 
TmemAllocation::kS0 : TmemAllocation::kS1), + pipeline_p_mma_producer_state.index() + ); + }); + + cutlass::arch::fence_view_async_tmem_load(); + cutlass::arch::fence_view_async_shared(); + pipeline_mma_s.consumer_release(pipeline_mma_s_consumer_state); + ++pipeline_mma_s_consumer_state; + pipeline_p_mma.producer_commit(pipeline_p_mma_producer_state); + ++pipeline_p_mma_producer_state; + + pipeline_mma_o.consumer_wait(pipeline_mma_o_consumer_state); + + // rescale + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < IterationsPV_N; j++) { + rescale(correction_factor, uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO)); + } + + cutlass::arch::fence_view_async_tmem_store(); + pipeline_mma_o.consumer_release(pipeline_mma_o_consumer_state); + ++pipeline_mma_o_consumer_state; + + --k_tile_count; + k_index += 1; + } + + pipeline_mma_o.consumer_wait(pipeline_mma_o_consumer_state); + +#ifdef B2B + row_sum = 1; +#else + if constexpr (kWarpsInN > 1) { + // reduce row_sum if needed (for 2x2 dp) + shared_tensors.smem_exchange[threadIdx.x] = row_sum; + cutlass::arch::NamedBarrier(kNumComputeWarps*NumThreadsPerWarp, kNamedBarrierExchange).sync(); + // (64, 2) shape + int peer_index = (threadIdx.x + 64) % 128; + row_sum += shared_tensors.smem_exchange[peer_index]; + } +#endif + + cutlass::arch::NamedBarrier((kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp, kNamedBarrierEpilogue).arrive(); + + // epilogue + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < IterationsPV_N; j++) { + epilogue( + row_max, row_sum, + replace<1>(cta_coord, j), problem_shape, + mainloop_args, epilogue_args, shared_tensors, + uint32_t(TmemAllocation::kO0) + j * uint32_t(TmemAllocation::kSizeAccO), split_kv + ); + } + + cutlass::arch::fence_view_async_tmem_load(); + pipeline_mma_o.consumer_release(pipeline_mma_o_consumer_state); + ++pipeline_mma_o_consumer_state; + } + +}; + +/////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::kernel diff --git a/examples/77_blackwell_fmha/kernel/sm100_mla_tile_scheduler.hpp b/examples/77_blackwell_fmha/kernel/sm100_mla_tile_scheduler.hpp new file mode 100644 index 0000000000..dbcc2ce8b8 --- /dev/null +++ b/examples/77_blackwell_fmha/kernel/sm100_mla_tile_scheduler.hpp @@ -0,0 +1,160 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/fast_math.h" +#include "cutlass/kernel_hardware_info.h" + +namespace cutlass::fmha::kernel { + +//////////////////////////////////////////////////////////////////////////////// + +struct Sm100MlaIndividualTileScheduler { + + struct Params { + dim3 grid; + }; + + bool valid_ = true; + + CUTLASS_DEVICE + Sm100MlaIndividualTileScheduler(Params const&) {} + + template + static Params to_underlying_arguments( + ProblemShape const& problem_shape, KernelHardwareInfo hw_info, + ClusterShape const& cluster_shape, int const& split_kv) { + using namespace cute; + dim3 grid(get<0>(cluster_shape), get<3>(problem_shape) /* Batch */, split_kv /*Maximum Split KV*/); + return Params{ grid }; + } + + static dim3 get_grid_shape(Params const& params) { + return params.grid; + } + + CUTLASS_DEVICE + bool is_valid() { + return valid_; + } + + CUTLASS_DEVICE + auto get_block_coord() { + using namespace cute; + return make_coord(blockIdx.x, _0{}, blockIdx.y, blockIdx.z); + } + + CUTLASS_DEVICE + Sm100MlaIndividualTileScheduler& operator++() { + valid_ = false; + return *this; + } +}; + +//////////////////////////////////////////////////////////////////////////////// + +struct Sm100MlaPersistentTileScheduler { + + struct Params { + int num_blocks; + FastDivmod divmod_m_block; + FastDivmod divmod_b; + FastDivmod divmod_split_kv; + KernelHardwareInfo hw_info; + }; + + int block_idx = 0; + Params params; + + CUTLASS_DEVICE + Sm100MlaPersistentTileScheduler(Params const& params) : block_idx(blockIdx.x), params(params) {} + + template + static Params to_underlying_arguments( + ProblemShape const& problem_shape, KernelHardwareInfo hw_info, + ClusterShape const& cluster_shape, int const& split_kv) { + using namespace cute; + // Get SM count if needed, otherwise use user supplied SM count + int sm_count = hw_info.sm_count; + if (sm_count <= 1 || sm_count % size<0>(cluster_shape) != 0) { + CUTLASS_TRACE_HOST(" WARNING: Arguments do not include a valid SM count.\n" + " For optimal performance, populate the arguments KernelHardwareInfo struct with the SM count."); + sm_count = KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + } + + CUTLASS_TRACE_HOST("to_underlying_arguments(): Setting persistent grid SM count to " << sm_count); + hw_info.sm_count = sm_count; + + int num_m_blocks = size<0>(cluster_shape); + int num_blocks = num_m_blocks * get<3>(problem_shape) /* Batch */; + num_blocks *= split_kv; /* Maximum Split KV*/ + + return Params { + num_blocks, + { num_m_blocks}, { get<3>(problem_shape) }, {split_kv}, + hw_info + }; + } + + static dim3 get_grid_shape(Params const& params) { + dim3 grid(std::min(params.num_blocks, params.hw_info.sm_count), 1, 1); + return grid; + } + + CUTLASS_DEVICE + bool is_valid() { + return block_idx < params.num_blocks; + } + + CUTLASS_DEVICE + auto get_block_coord() { + 
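+    // Decode the flat persistent block index into its coordinates by peeling off the
+    // m-block, then the batch index, then the split-KV index with the precomputed
+    // FastDivmods (fastest- to slowest-varying).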
using namespace cute; + int block_decode = block_idx; + int m_block, bidb, n_split_kv; + params.divmod_m_block(block_decode, m_block, block_decode); + params.divmod_b(block_decode, bidb, block_decode); + params.divmod_split_kv(block_decode, n_split_kv, block_decode); + return make_coord(m_block, _0{}, bidb, n_split_kv); + } + + CUTLASS_DEVICE + Sm100MlaPersistentTileScheduler& operator++() { + block_idx += gridDim.x; + return *this; + } +}; + +//////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::fmha::kernel + diff --git a/examples/77_blackwell_fmha/reference/fmha_bwd_reference.hpp b/examples/77_blackwell_fmha/reference/fmha_bwd_reference.hpp new file mode 100644 index 0000000000..bb8cfb348b --- /dev/null +++ b/examples/77_blackwell_fmha/reference/fmha_bwd_reference.hpp @@ -0,0 +1,311 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + + +#pragma once + +#include "cute/tensor.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + class TensorDQ, /* class TensorDK, class TensorDV, */ + class Fusion +> +void __global__ fmha_bwd_reference_dQ_kernel( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + TensorDQ mDQ, /* TensorDK mDK, TensorDV mDV, */ + Fusion fusion) { + + using namespace cute; + + using Element = typename TensorO::value_type; + using ElementAccumulator = typename TensorLSE::value_type; + + extern __shared__ char mS_mem[]; + Element* mS = reinterpret_cast(mS_mem); + + Element softmax_scale = static_cast(1.0 / sqrt(1.0 * size<1>(mO))); + + for (int idx_L = blockIdx.y; idx_L < size<2>(mDQ); idx_L += gridDim.y) { + for (int idx_Q = blockIdx.x; idx_Q < size<0>(mDQ); idx_Q += gridDim.x) { + for (int idx_K = threadIdx.x; idx_K < size<0>(mK); idx_K += blockDim.x) { + ElementAccumulator acc_qk = 0; + ElementAccumulator acc_dov = 0; + ElementAccumulator acc_doo = 0; + for (int idx_D0 = 0; idx_D0 < size<1>(mK); idx_D0++) { + acc_qk += mQ(idx_Q, idx_D0, idx_L) * mK(idx_K, idx_D0, idx_L); + acc_dov += mDO(idx_Q, idx_D0, idx_L) * mV(idx_K, idx_D0, idx_L); + acc_doo += mDO(idx_Q, idx_D0, idx_L) * mO(idx_Q, idx_D0, idx_L); + } // for idx_D0 + + auto id = make_identity_tensor(make_shape(1, 1)); + auto frag = make_tensor(Shape<_1, _1>{}); + frag(0) = acc_qk; + fusion.apply_mask(frag, make_tensor(id.data() + make_arithmetic_tuple(idx_Q, idx_K), id.layout()), problem_shape); + acc_qk = frag(0); + + mS[idx_K] = static_cast(exp(softmax_scale * acc_qk - mLSE(idx_Q, idx_L)) * softmax_scale * (acc_dov - acc_doo)); + } // for idx_K + + __syncthreads(); + + for (int idx_D = threadIdx.x; idx_D < size<1>(mDQ); idx_D += blockDim.x) { + ElementAccumulator acc = 0; + for (int idx_K = 0; idx_K < size<0>(mK); idx_K++) { + acc += mS[idx_K] * mK(idx_K, idx_D, idx_L); + } + mDQ(idx_Q, idx_D, idx_L) = static_cast(acc); + } // for idx_D + } + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + /* class TensorDQ, */ class TensorDK, /* class TensorDV, */ + class Fusion +> +void __global__ fmha_bwd_reference_dK_kernel( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + /* TensorDQ mDQ, */ TensorDK mDK, /* TensorDV mDV, */ + Fusion fusion) { + + using namespace cute; + + using Element = typename TensorO::value_type; + using ElementAccumulator = typename TensorLSE::value_type; + + extern __shared__ char mS_mem[]; + Element* mS = reinterpret_cast(mS_mem); + + Element softmax_scale = static_cast(1.0 / sqrt(1.0 * size<1>(mO))); + + for (int idx_L = blockIdx.y; idx_L < size<2>(mDK); idx_L += gridDim.y) { + for (int idx_K = blockIdx.x; idx_K < size<0>(mDK); idx_K += gridDim.x) { + for (int idx_Q = threadIdx.x; idx_Q < size<0>(mDO); idx_Q += blockDim.x) { + ElementAccumulator acc_qk = 0; + ElementAccumulator acc_dov = 0; + ElementAccumulator acc_doo = 0; + for (int idx_D0 = 0; idx_D0 < size<1>(mK); idx_D0++) { + acc_qk += mQ(idx_Q, idx_D0, idx_L) * mK(idx_K, 
idx_D0, idx_L); + acc_dov += mDO(idx_Q, idx_D0, idx_L) * mV(idx_K, idx_D0, idx_L); + acc_doo += mDO(idx_Q, idx_D0, idx_L) * mO(idx_Q, idx_D0, idx_L); + } // for idx_D0 + + auto id = make_identity_tensor(make_shape(1, 1)); + auto frag = make_tensor(Shape<_1, _1>{}); + frag(0) = acc_qk; + fusion.apply_mask(frag, make_tensor(id.data() + make_arithmetic_tuple(idx_Q, idx_K), id.layout()), problem_shape); + acc_qk = frag(0); + + mS[idx_Q] = static_cast(exp(softmax_scale * acc_qk - mLSE(idx_Q, idx_L)) * softmax_scale * (acc_dov - acc_doo)); + } // for idx_Q + + __syncthreads(); + + for (int idx_D = threadIdx.x; idx_D < size<1>(mDK); idx_D += blockDim.x) { + ElementAccumulator acc = 0; + for (int idx_Q = 0; idx_Q < size<0>(mDO); idx_Q++) { + acc += mS[idx_Q] * mQ(idx_Q, idx_D, idx_L); + } + mDK(idx_K, idx_D, idx_L) = static_cast(acc); + } // for idx_D + } // for idx_K + } // for idx_L +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + /* class TensorDQ, class TensorDK, */ class TensorDV, + class Fusion +> +void __global__ fmha_bwd_reference_dV_kernel( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + /* TensorDQ mDQ, TensorDK mDK, */ TensorDV mDV, + Fusion fusion) { + + using namespace cute; + + using Element = typename TensorO::value_type; + using ElementAcc = typename TensorLSE::value_type; + + extern __shared__ char mS_mem[]; + Element* mS = reinterpret_cast(mS_mem); + + ElementAcc softmax_scale = static_cast(1.0 / sqrt(1.0 * size<1>(mO))); + + for (int idx_L = blockIdx.y; idx_L < size<2>(mDV); idx_L += gridDim.y) { + for (int idx_K = blockIdx.x; idx_K < size<0>(mDV); idx_K += gridDim.x) { + for (int idx_Q = threadIdx.x; idx_Q < size<0>(mDO); idx_Q += blockDim.x) { + ElementAcc acc_qk = 0; + + for (int idx_D0 = 0; idx_D0 < size<1>(mK); idx_D0++) { + ElementAcc rQ = mQ(idx_Q, idx_D0, idx_L); + ElementAcc rK = mK(idx_K, idx_D0, idx_L); + acc_qk += rQ * rK; + } // for idx_D0 + + auto id = make_identity_tensor(make_shape(1, 1)); + auto frag = make_tensor(Shape<_1, _1>{}); + frag(0) = acc_qk; + fusion.apply_mask(frag, make_tensor(id.data() + make_arithmetic_tuple(idx_Q, idx_K), id.layout()), problem_shape); + acc_qk = frag(0); + + mS[idx_Q] = static_cast(exp(softmax_scale * acc_qk - mLSE(idx_Q, idx_L))); + } // for idx_Q + + __syncthreads(); + + for (int idx_D = threadIdx.x; idx_D < size<1>(mDV); idx_D += blockDim.x) { + ElementAcc acc = 0; + for (int idx_Q = 0; idx_Q < size<0>(mDO); idx_Q++) { + ElementAcc rS = mS[idx_Q]; + ElementAcc rDO = mDO(idx_Q, idx_D, idx_L); + acc += rS * rDO; + } + mDV(idx_K, idx_D, idx_L) = static_cast(acc); + } // for idx_D + } // for idx_K + } // for idx_L +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + /**/ class TensorDQ, /** / class TensorDK, / ** / class TensorDV, / **/ + class Fusion +> +void fmha_bwd_reference_dQ( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + /**/ TensorDQ mDQ, /** / TensorDK mDK, / ** / TensorDV mDV, / **/ + Fusion fusion) { + + using namespace cute; + + dim3 grid(size<0>(mDQ), size<2>(mDQ), 1); + dim3 block(256); + int shared_mem = size<0>(mK) * 
sizeof(typename TensorO::value_type); + fmha_bwd_reference_dQ_kernel<<>>(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDQ, fusion); +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + /** / class TensorDQ, / **/ class TensorDK, /** / class TensorDV, / **/ + class Fusion +> +void fmha_bwd_reference_dK( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + /** / TensorDQ mDQ, / **/ TensorDK mDK, /** / TensorDV mDV, / **/ + Fusion fusion) { + + using namespace cute; + + dim3 grid(size<0>(mDK), size<2>(mDK), 1); + dim3 block(256); + int shared_mem = size<0>(mDO) * sizeof(typename TensorO::value_type); + fmha_bwd_reference_dK_kernel<<>>(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDK, fusion); +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + /** / class TensorDQ, / ** / class TensorDK, / **/ class TensorDV, /**/ + class Fusion +> +void fmha_bwd_reference_dV( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + /** / TensorDQ mDQ, / ** / TensorDK mDK, / **/ TensorDV mDV, /**/ + Fusion fusion) { + + using namespace cute; + + dim3 grid(size<0>(mDV), size<2>(mDV), 1); + dim3 block(256); + int shared_mem = size<0>(mDO) * sizeof(typename TensorO::value_type); + fmha_bwd_reference_dV_kernel<<>>(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDV, fusion); +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorQ, class TensorK, class TensorV, + class TensorO, class TensorLSE, class TensorDO, + class TensorDQ, class TensorDK, class TensorDV, + class Fusion +> +void fmha_bwd_reference( + ProblemShape problem_shape, + TensorQ mQ, TensorK mK, TensorV mV, + TensorO mO, TensorLSE mLSE, TensorDO mDO, + TensorDQ mDQ, TensorDK mDK, TensorDV mDV, + Fusion fusion) { + + fmha_bwd_reference_dQ(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDQ, fusion); + fmha_bwd_reference_dK(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDK, fusion); + fmha_bwd_reference_dV(problem_shape, mQ, mK, mV, mO, mLSE, mDO, mDV, fusion); +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/77_blackwell_fmha/reference/fmha_fwd_reference.hpp b/examples/77_blackwell_fmha/reference/fmha_fwd_reference.hpp index 48d8110187..b7c6b412cb 100644 --- a/examples/77_blackwell_fmha/reference/fmha_fwd_reference.hpp +++ b/examples/77_blackwell_fmha/reference/fmha_fwd_reference.hpp @@ -128,7 +128,7 @@ void __global__ fmha_reference_kernel( } if (threadIdx.x == 0) { - mLSE(idx_Q + offset_Q, idx_L) = log(sum) + maxS; + mLSE(idx_Q + offset_Q, idx_L) = log(sum) + softmax_scale * maxS; } } diff --git a/examples/77_blackwell_fmha/reference/fmha_mla_reference.hpp b/examples/77_blackwell_fmha/reference/fmha_mla_reference.hpp new file mode 100644 index 0000000000..29db90746e --- /dev/null +++ b/examples/77_blackwell_fmha/reference/fmha_mla_reference.hpp @@ -0,0 +1,206 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & 
AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +#pragma once + +#include "cute/tensor.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorSeq, + class TensorPageTable, + class TensorQL, + class TensorQR, + class TensorCL, + class TensorKR, + class TensorO, + class TensorLSE, + class Scale +> +void __global__ fmha_mla_reference_kernel( + ProblemShape problem_shape, + TensorSeq mSeq, TensorPageTable mPT, + TensorQL mQL, TensorQR mQR, + TensorCL mCL, TensorKR mKR, + TensorO mO, TensorLSE mLSE, + Scale softmax_scale) { + + using namespace cute; + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + using Element = typename TensorO::value_type; + using ElementAcc = typename TensorLSE::value_type; + + extern __shared__ ElementAcc mS[]; + // ElementAcc* mS = reinterpret_cast(mS_mem); + + for (int idx_B = blockIdx.y; idx_B < B; idx_B += gridDim.y) { + if (mSeq.data() != nullptr) { + K = mSeq(idx_B); + } + + for (int idx_H = blockIdx.x; idx_H < H; idx_H += gridDim.x) { + + for (int idx_K = threadIdx.x; idx_K < K; idx_K += blockDim.x) { + ElementAcc acc = 0; + + for (int idx_D = 0; idx_D < D_latent; idx_D++) { + int page_idx_K = idx_K; + int page_idx_B = idx_B; + if (mPT.data() != nullptr) { + page_idx_B = mPT(idx_K / size<0>(mCL), idx_B); + page_idx_K = idx_K % size<0>(mCL); + } + ElementAcc eQ = mQL(idx_H, idx_D, idx_B); + ElementAcc eK = mCL(page_idx_K, idx_D, page_idx_B); + acc += eQ * eK; + } + + for (int idx_D = 0; idx_D < D_rope; idx_D++) { + int page_idx_K = idx_K; + int page_idx_B = idx_B; + if (mPT.data() != nullptr) { + page_idx_B = mPT(idx_K / size<0>(mCL), idx_B); + page_idx_K = idx_K % size<0>(mCL); + } + ElementAcc eQ = mQR(idx_H, idx_D, idx_B); + ElementAcc eK = mKR(page_idx_K, idx_D, page_idx_B); + acc += eQ * eK; + } + mS[idx_K] = acc; 
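+        // mS[idx_K] now holds the raw Q·K logit for this key position (latent plus RoPE
+        // contributions); the softmax scale and max-subtraction are applied only after the
+        // full scan over keys below.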
+ } + + __syncthreads(); + + ElementAcc maxS = -std::numeric_limits::infinity(); + for (int idx_K = 0; idx_K < K; idx_K++) { + maxS = std::max(maxS, mS[idx_K]); + } + if (maxS == -std::numeric_limits::infinity()) maxS = 0; + + __syncthreads(); + +#ifndef B2B + for (int idx_K = threadIdx.x; idx_K < K; idx_K += blockDim.x) { + mS[idx_K] = expf(softmax_scale * (mS[idx_K] - maxS)); + } +#endif + + __syncthreads(); + + ElementAcc sum = 0; + for (int idx_K = 0; idx_K < K; idx_K++) { + sum += mS[idx_K]; + } + + ElementAcc o_scale = 1.0f / sum; +#ifdef B2B + o_scale = 1.0; +#endif + + for (int idx_D = threadIdx.x; idx_D < D_latent; idx_D += blockDim.x) { + ElementAcc acc = 0; + for (int idx_K = 0; idx_K < K; idx_K++) { + int page_idx_K = idx_K; + int page_idx_B = idx_B; + if (mPT.data() != nullptr) { + page_idx_B = mPT(idx_K / size<0>(mCL), idx_B); + page_idx_K = idx_K % size<0>(mCL); + } + ElementAcc eV = mCL(page_idx_K, idx_D, page_idx_B); + ElementAcc eK = static_cast(mS[idx_K]); + acc += eK * eV; + } + mO(idx_H, idx_D, idx_B) = static_cast(acc * o_scale); + } + + if (threadIdx.x == 0) { + mLSE(idx_H, idx_B) = log(sum) + softmax_scale * maxS; + } + + } + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// + +template< + class ProblemShape, + class TensorSeq, + class TensorPageTable, + class TensorQL, + class TensorQR, + class TensorCL, + class TensorKR, + class TensorO, + class TensorLSE, + class Scale +> +void fmha_mla_reference( + ProblemShape problem_shape, + TensorSeq mSeq, TensorPageTable mPT, + TensorQL mQL, TensorQR mQR, + TensorCL mCL, TensorKR mKR, + TensorO mO, TensorLSE mLSE, + Scale scale) { + + using namespace cute; + + auto [H, K, D, B] = problem_shape; + auto [D_latent, D_rope] = D; + + dim3 grid(H, B, 1); + dim3 block(256); + int shared_mem = K * int(sizeof(typename TensorLSE::value_type)) + 16; + cudaError_t result; + if (shared_mem >= (48 << 10)) { + result = cudaFuncSetAttribute( + &fmha_mla_reference_kernel, + cudaFuncAttributeMaxDynamicSharedMemorySize, + shared_mem); + if (cudaSuccess != result) { + result = cudaGetLastError(); // to clear the error bit + throw std::runtime_error("couldn't perform smem optin"); + } + } + fmha_mla_reference_kernel<<>>( + problem_shape, mSeq, mPT, mQL, mQR, mCL, mKR, mO, mLSE, scale); + cudaDeviceSynchronize(); + result = cudaGetLastError(); + if (cudaSuccess != result) { + throw std::runtime_error("couldn't execute reference"); + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/77_blackwell_fmha/reference/reference_abs_error.hpp b/examples/77_blackwell_fmha/reference/reference_abs_error.hpp index e4a01c8216..6d833ad12a 100644 --- a/examples/77_blackwell_fmha/reference/reference_abs_error.hpp +++ b/examples/77_blackwell_fmha/reference/reference_abs_error.hpp @@ -178,3 +178,96 @@ void reference_abs_diff( max_diff = result_host[0]; mean_diff = result_host[1] / static_cast(data.size()); } + +template +__global__ void reference_rel_diff_kernel( + Element* data, Element* data_ref, size_t count, + double* max_diff, double* sum_diff, + bool print_diff ) { + + double thread_max_diff = 0; + double thread_sum_diff = 0; + + __shared__ double block_max_diff; + __shared__ double block_sum_diff; + + for (size_t i = threadIdx.x + blockIdx.x * blockDim.x; i < count; i += blockDim.x * gridDim.x) { + double diff = fabs(data[i] - data_ref[i]) / fabs(data_ref[i]); + if (print_diff) if (diff != diff || diff > 0.01f) 
printf("difference at %lld: %f ... %f vs %f\n", static_cast(i), diff, (double)data[i], (double)data_ref[i]); + thread_max_diff = fmax(diff, thread_max_diff); + thread_sum_diff += diff; + } + + for (int i = 0; i < blockDim.x; i++) { + if (i == threadIdx.x) { + if (i == 0) { + block_max_diff = thread_max_diff; + block_sum_diff = thread_sum_diff; + } + else { + block_max_diff = fmax(block_max_diff, thread_max_diff); + block_sum_diff += thread_sum_diff; + } + } + __syncthreads(); + } + + if (threadIdx.x == 0) { + atomicAdd(sum_diff, block_sum_diff); + + for (;;) { + unsigned long long prev = *reinterpret_cast(max_diff); + double prev_diff = reinterpret_cast(prev); + double new_max_diff = fmax(block_max_diff, prev_diff); + unsigned long long found = atomicCAS(reinterpret_cast(max_diff), prev, reinterpret_cast(new_max_diff)); + if (found == prev) break; + } + } +} + +template +void reference_rel_diff( + DeviceAllocation const& data, + DeviceAllocation const& data_ref, + double& max_diff, double& mean_diff) { + + static bool kPrintDiff = getenv("REF_PRINT_DIFF") && atoi(getenv("REF_PRINT_DIFF")) == 1; + + DeviceAllocation result; + result.reset(2); + assert(data.size() == data_ref.size()); + + cudaError_t err = cudaMemset(result.get(), 0, result.size() * sizeof(double)); + if (err != cudaSuccess) { + std::cerr << "Memset failed. Last CUDA error: " + << cudaGetErrorString(err) << std::endl; + max_diff = mean_diff = 1e20; + return; + } + + dim3 block(256, 1, 1); + dim3 grid(1024, 1, 1); + reference_rel_diff_kernel<<>>( + data.get(), data_ref.get(), data.size(), + result.get(), result.get() + 1, kPrintDiff); + + err = cudaDeviceSynchronize(); + if (err != cudaSuccess) { + std::cerr << "Difference kernel failed. Last CUDA error: " + << cudaGetErrorString(err) << std::endl; + max_diff = mean_diff = 1e20; + return; + } + + double result_host[2]; + err = cudaMemcpy(result_host, result.get(), result.size() * sizeof(double), cudaMemcpyDefault); + if (err != cudaSuccess) { + std::cerr << "Copy failed. Last CUDA error: " + << cudaGetErrorString(err) << std::endl; + max_diff = mean_diff = 1e20; + return; + } + + max_diff = result_host[0]; + mean_diff = result_host[1] / static_cast(data.size()); +} diff --git a/examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu b/examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu new file mode 100644 index 0000000000..d36bf4dd74 --- /dev/null +++ b/examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu @@ -0,0 +1,927 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. 
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+
+/*! \file
+    \brief Grouped GEMM example using CUTLASS 3x APIs for the NVIDIA Blackwell SM120 architecture.
+
+    This example demonstrates an implementation of Grouped GEMM using a TMA + Blackwell SM120 TensorOp-based warp-specialized kernel
+    for narrow precisions (FP4) with input Scale Factors.
+    For this example, all scheduling work is performed on the device, using device-side modification of TMA descriptors
+    to move between groups/problem_count (represented by groups).
+    https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device
+
+    To run this example:
+
+    $ ./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm --m=2048 --n=2048 --k=2048 --groups=10
+
+    The command above sizes all 10 groups at the given m, n, and k extents.
+    Omitting any of the problem dimensions randomizes that dimension across the different groups,
+    and the same applies to alpha and beta, which are randomized per group when not specified.
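+
+    For example, fixing only n and k randomizes m (and, when not given, alpha and beta) independently for each group:
+
+    $ ./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm --n=4096 --k=4096 --groups=8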
+ + To run this example for a set of problems using the benchmark option: + + $ ./examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm --benchmark=./test_benchmark.txt + + Where the test_benchmark.txt may look as such: + 0 256x512x128 + 1 256x512x512 + 2 512x256x128 + 3 256x256x128 + 4 256x512x1024 + 5 1024x512x128 and so on +*/ + +#include +#include +#include +#include +#include +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/collective/default_epilogue.hpp" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/group_array_problem_shape.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "helper.h" +using namespace cute; + +using ProblemShape = cutlass::gemm::GroupProblemShape>; // per group +using ElementInput = cutlass::float_e2m1_t; // Element type for Input matrix operands + +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +// A matrix configuration +using ElementA = cutlass::nv_float4_t; // Element type for A matrix operand +using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 32; // Alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::nv_float4_t; // Element type for B matrix operand +using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 32; // Alignment of A matrix in units of elements (up to 16 bytes) + +// C/D matrix configuration +using ElementD = float_e2m1_t; // Element type for D matrix operands +using ElementSFD = cutlass::float_ue4m3_t; // Element type for SF Output operands +using ElementC = cutlass::half_t; // Element type for C matrix operands +using LayoutCTag = cutlass::layout::RowMajor; // Layout type for C and D matrix operands +using LayoutDTag = cutlass::layout::RowMajor; // Layout type for C and D matrix operands +using LayoutSFDTag = LayoutDTag; // Layout type for SFD should be same as D matrix operand + +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Alignment of C matrix in units of elements (up to 16 bytes) +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Alignment of D matrix in units of elements (up to 16 bytes) +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ElementCompute = float; // Element type for internal computation +using ArchTag = cutlass::arch::Sm120; // Tag 
indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; // Epilogue Operator class tag + +// Kernel Perf config +// Cluster Shape fixed to 1x1x1 +using ThreadBlockShape = Shape<_128,_128,_128>; +using ClusterShape = Shape<_1,_1,_1>; +constexpr int OutputSFVectorSize = 16; + +// D = alpha * acc + beta * C +// With BlockScaleFactor generation. +using FusionOperation = cutlass::epilogue::fusion::LinCombBlockScaleFactor< + OutputSFVectorSize, + ElementD, + ElementCompute, + ElementSFD, LayoutCTag, + ElementC>; + +// Cooperative kernel schedule +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ThreadBlockShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutCTag *, AlignmentC, + ElementD, LayoutCTag *, AlignmentD, + cutlass::epilogue::collective::EpilogueScheduleAuto, + FusionOperation +>::CollectiveOp; + +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutATag *, AlignmentA, + ElementB, LayoutBTag *, AlignmentB, + ElementAccumulator, + ThreadBlockShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::collective::KernelScheduleAuto // Auto schedule defaults to cooperative schedule +>::CollectiveOp; +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue +>; +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + + +// Pingpong kernel schedule +using CollectiveMainloopPingpong = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutATag *, AlignmentA, + ElementB, LayoutBTag *, AlignmentB, + ElementAccumulator, + ThreadBlockShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpong +>::CollectiveOp; + +using GemmKernelPingpong = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloopPingpong, + CollectiveEpilogue +>; + +using GemmPingpong = cutlass::gemm::device::GemmUniversalAdapter; + +using StrideA = typename Gemm::GemmKernel::InternalStrideA; +using StrideB = typename Gemm::GemmKernel::InternalStrideB; +using StrideC = typename Gemm::GemmKernel::InternalStrideC; +using StrideD = typename Gemm::GemmKernel::InternalStrideD; + +using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::InternalLayoutSFA; +using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::InternalLayoutSFB; +using Sm1xxBlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; +using Sm1xxBlockScaledOutputConfig= cutlass::detail::Sm1xxBlockScaledOutputConfig< + OutputSFVectorSize, + cute::is_same_v ? 
cute::UMMA::Major::K : cute::UMMA::Major::MN + >; +using OutputSFAtom = typename Sm1xxBlockScaledOutputConfig::SfAtom; +using LayoutSFD = typename Sm1xxBlockScaledOutputConfig::LayoutSF; + +// Host-side allocations +std::vector stride_A_host; +std::vector stride_B_host; +std::vector layout_SFA_host; +std::vector layout_SFB_host; +std::vector stride_C_host; +std::vector stride_D_host; + +std::vector alpha_host; +std::vector beta_host; + +using HostTensorA = cutlass::HostTensor; +using HostTensorB = cutlass::HostTensor; +using HostTensorSF = cutlass::HostTensor; +using HostTensorC = cutlass::HostTensor; +using HostTensorD = cutlass::HostTensor; +std::vector block_A; +std::vector block_B; +std::vector block_SFA; +std::vector block_SFB; +std::vector block_C; +std::vector block_D; +std::vector block_SFD; +std::vector block_ref_D; +std::vector block_ref_SFD; + +// Device-side allocations +cutlass::DeviceAllocation problem_sizes; + +cutlass::DeviceAllocation ptr_A; +cutlass::DeviceAllocation ptr_B; +cutlass::DeviceAllocation ptr_SFA; +cutlass::DeviceAllocation ptr_SFB; +cutlass::DeviceAllocation ptr_C; +cutlass::DeviceAllocation ptr_D; +cutlass::DeviceAllocation ptr_SFD; +cutlass::DeviceAllocation ptr_ref_D; + +cutlass::DeviceAllocation stride_A; +cutlass::DeviceAllocation stride_B; +cutlass::DeviceAllocation layout_SFA; +cutlass::DeviceAllocation layout_SFB; +cutlass::DeviceAllocation stride_C; +cutlass::DeviceAllocation stride_D; + +// Note, this is an array of pointers to alpha and beta scaling values per group +cutlass::DeviceAllocation alpha_device; +cutlass::DeviceAllocation beta_device; +cutlass::DeviceAllocation block_alpha; +cutlass::DeviceAllocation block_beta; +// A matrix wide constant value to scale the output matrix +// Avoids generating small FP4 values. 
+// NormConst is a single device-side constant value, its not per-batch or per-group +cutlass::DeviceAllocation norm_constant_device; + +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + +template +auto make_iterator(T* ptr) { + using namespace cute; + if constexpr (cute::is_subbyte_v) { + return subbyte_iterator(ptr); + } + else { + return ptr; + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm100GroupParams::RasterOrderOptions; +// Command line options parsing +struct Options { + + bool help = false; + bool verification = true; + bool use_pdl = false; + + float alpha = std::numeric_limits::max(); + float beta = std::numeric_limits::max(); + float norm_constant = 1.0; + int iterations = 10; + int m = 1024, n = 2048, k = 512, groups = 10; + RasterOrderOptions raster_order = RasterOrderOptions::AlongN; + int max_sm_count = INT_MAX; + std::string benchmark_path; + std::vector problem_sizes_host; + int const tma_alignment_bits = 128; + int const alignment = tma_alignment_bits / cutlass::sizeof_bits::value; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + if (cmd.check_cmd_line_flag("no_verif")) { + verification = false; + } + if (cmd.check_cmd_line_flag("use_pdl")) { + use_pdl = true; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("groups", groups); + cmd.get_cmd_line_argument("alpha", alpha, std::numeric_limits::max()); + cmd.get_cmd_line_argument("beta", beta, std::numeric_limits::max()); + cmd.get_cmd_line_argument("norm_constant", norm_constant, float(1.0)); + cmd.get_cmd_line_argument("iterations", iterations); + cmd.get_cmd_line_argument("benchmark", benchmark_path); + cmd.get_cmd_line_argument("max_sm_count", max_sm_count, INT_MAX); + + // Decide how to initialize the problems + if (!benchmark_path.empty()) { + if (!benchmark_problems()) { + problem_sizes_host.clear(); + return; + } + } + else { + randomize_problems(cmd); + } + + char raster_char; + cmd.get_cmd_line_argument("raster", raster_char); + + if (raster_char == 'N' || raster_char == 'n') { + raster_order = RasterOrderOptions::AlongN; + } + else if (raster_char == 'M' || raster_char == 'm') { + raster_order = RasterOrderOptions::AlongM; + } + } + + void randomize_problems(cutlass::CommandLine &cmd) { + int cmd_line_m = -1, cmd_line_n = -1, cmd_line_k = -1; + cmd.get_cmd_line_argument("m", cmd_line_m); + cmd.get_cmd_line_argument("n", cmd_line_n); + cmd.get_cmd_line_argument("k", cmd_line_k); + + problem_sizes_host.reserve(groups); + + for (int i = groups; i > 0; i--) { + int m = cmd_line_m; + int n = cmd_line_n; + int k = cmd_line_k; + if (m < 1) { + m = alignment * ((rand() % 64) + 1); + } + if (n < 1) { + n = alignment * ((rand() % 64) + 1); + } + if (k < 1) { + k = alignment * ((rand() % 64) + 1); + } + problem_sizes_host.push_back({m, n, k}); + } + } + + /// Load a benchmark + bool benchmark_problems() { + std::ifstream file(benchmark_path); + if (!file.good()) { + return false; + } + + while (file.good()) { + + int idx = -1; + std::string extent_str; + + file >> idx >> extent_str; + + if (idx < 0 || 
extent_str.empty()) { + break; + } + + cutlass::gemm::GemmCoord extent; + std::vector tokens; + + cutlass::CommandLine::tokenize(tokens, extent_str, 'x'); + + for (int i = 0; i < int(tokens.size()); ++i) { + int x = std::atoi(tokens.at(i).c_str()); + + // round up + if (x % alignment) { + x += (alignment - (x % alignment)); + } + + extent.at(i) = x; + } + + if (extent.product()) { + problem_sizes_host.push_back({extent.m(), extent.n(), extent.k()}); + } + } + groups = static_cast(problem_sizes_host.size()); + + return true; + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "79d_blackwell_geforce_nvfp4_grouped_gemm\n\n" + << " Blackwell Block Scaled Narrow Precision Grouped GEMM using a Warp Specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM for all groups\n" + << " --n= Sets the N extent of the GEMM for all groups\n" + << " --k= Sets the K extent of the GEMM for all groups\n" + << " --groups= Sets the number of individual GEMM problems for Grouped GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --norm_constant= Epilogue scalar normalization constant for the output matrix\n\n" + << " --raster= CTA Rasterization direction (N for along N, M for along M)\n\n" + << " --iterations= Number of profiling iterations to perform\n\n" + << " --benchmark= Executes a benchmark problem size\n" + << " --max_sm_count= Run kernels using only these number of SMs\n" + << " --no_verif Do not run (host-side) verification kernels\n" + << " --use_pdl Launch kernel with PDL (Programmatic Dependent Launch) enabled\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "79d_blackwell_geforce_nvfp4_grouped_gemm" << " --m=1024 --n=512 --k=1024 --groups=10 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s, std::vector problem_sizes_host) const + { + // Number of real-valued multiply-adds + uint64_t fmas = uint64_t(); + + for (auto const & problem : problem_sizes_host) { + fmas += static_cast(get<0>(problem)) * + static_cast(get<1>(problem)) * + static_cast(get<2>(problem)); + } + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * uint64_t(fmas); + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms = 0.0; + double gflops = 0.0; + cutlass::Status status = cutlass::Status::kSuccess; + cudaError_t error = cudaSuccess; + bool passed = false; +}; + +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_block( + cutlass::TensorView view, + uint64_t seed) { + + double scope_max, scope_min; + constexpr int bits_input = cutlass::sizeof_bits::value; + + if constexpr (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if constexpr (bits_input <= 6) { + scope_max = 2; + scope_min = -2; + } + else if constexpr (bits_input <= 8) { + if constexpr (cute::is_same_v) { + scope_max = 4; + scope_min = 1; + } + else { + scope_max = 1; + scope_min = -1; + } + } + else{ + scope_max = 4; + scope_min = -4; + } + cutlass::reference::host::TensorFillRandomUniform( + 
view, seed, scope_max, scope_min, 0); + + return true; +} + +/// Allocates device-side data +void allocate(const Options &options) { + for (int32_t i = 0; i < options.groups; ++i) { + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + auto stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}); + auto stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}); + auto stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1}); + auto stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1}); + + auto layout_A = make_layout(make_shape(M, K, 1), stride_A); + auto layout_B = make_layout(make_shape(N, K, 1), stride_B); + auto layout_C = make_layout(make_shape(M, N, 1), stride_C); + auto layout_D = make_layout(make_shape(M, N, 1), stride_D); + auto layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, 1)); + auto layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(M, N, K, 1)); + auto layout_SFD = Sm1xxBlockScaledOutputConfig::tile_atom_to_shape_SFD(cute::make_shape(M, N, K, 1)); + + stride_A_host.push_back(stride_A); + stride_B_host.push_back(stride_B); + layout_SFA_host.push_back(layout_SFA); + layout_SFB_host.push_back(layout_SFB); + stride_C_host.push_back(stride_C); + stride_D_host.push_back(stride_D); + + block_A.push_back(HostTensorA(cutlass::make_Coord(size(layout_A)))); + block_B.push_back(HostTensorB(cutlass::make_Coord(size(layout_B)))); + block_SFA.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFA))))); + block_SFB.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFB))))); + block_C.push_back(HostTensorC(cutlass::make_Coord(size(layout_C)))); + block_D.push_back(HostTensorD(cutlass::make_Coord(size(layout_D)))); + block_SFD.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFD))))); + block_ref_D.push_back(HostTensorD(cutlass::make_Coord(size(layout_D)))); + block_ref_SFD.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFD))))); + } + block_alpha.reset(options.groups); + block_beta.reset(options.groups); +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(const Options &options) { + uint64_t seed = 2020; + problem_sizes.reset(options.groups); + problem_sizes.copy_from_host(options.problem_sizes_host.data()); + + // + // Assign pointers + // + + std::vector ptr_A_host(options.groups); + std::vector ptr_B_host(options.groups); + std::vector ptr_SFA_host(options.groups); + std::vector ptr_SFB_host(options.groups); + std::vector ptr_C_host(options.groups); + std::vector ptr_D_host(options.groups); + std::vector ptr_SFD_host(options.groups); + std::vector ptr_alpha_host(options.groups); + std::vector ptr_beta_host(options.groups); + + for (int32_t i = 0; i < options.groups; ++i) { + + initialize_block(block_A.at(i).host_view(), seed + 2021); + initialize_block(block_B.at(i).host_view(), seed + 2022); + initialize_block(block_C.at(i).host_view(), seed + 2023); + initialize_block(block_SFA.at(i).host_view(), seed + 2024); + initialize_block(block_SFB.at(i).host_view(), seed + 2025); + + block_A.at(i).sync_device(); + block_B.at(i).sync_device(); + block_C.at(i).sync_device(); + block_SFA.at(i).sync_device(); + block_SFB.at(i).sync_device(); + + ptr_A_host.at(i) = block_A.at(i).device_data(); + ptr_B_host.at(i) = block_B.at(i).device_data(); + ptr_SFA_host.at(i) = block_SFA.at(i).device_data(); + 
ptr_SFB_host.at(i) = block_SFB.at(i).device_data(); + ptr_C_host.at(i) = block_C.at(i).device_data(); + ptr_D_host.at(i) = block_D.at(i).device_data(); + ptr_SFD_host.at(i) = block_SFD.at(i).device_data(); + + alpha_host.push_back((options.alpha == std::numeric_limits::max()) ? static_cast((rand() % 5) + 1) : options.alpha); + beta_host.push_back((options.beta == std::numeric_limits::max()) ? static_cast(rand() % 5) : options.beta); + ptr_alpha_host.at(i) = block_alpha.get() + i; + ptr_beta_host.at(i) = block_beta.get() + i; + } + + ptr_A.reset(options.groups); + ptr_A.copy_from_host(ptr_A_host.data()); + + ptr_B.reset(options.groups); + ptr_B.copy_from_host(ptr_B_host.data()); + + ptr_SFA.reset(options.groups); + ptr_SFA.copy_from_host(ptr_SFA_host.data()); + + ptr_SFB.reset(options.groups); + ptr_SFB.copy_from_host(ptr_SFB_host.data()); + + ptr_C.reset(options.groups); + ptr_C.copy_from_host(ptr_C_host.data()); + + ptr_D.reset(options.groups); + ptr_D.copy_from_host(ptr_D_host.data()); + + ptr_SFD.reset(options.groups); + ptr_SFD.copy_from_host(ptr_SFD_host.data()); + + stride_A.reset(options.groups); + stride_A.copy_from_host(stride_A_host.data()); + + stride_B.reset(options.groups); + stride_B.copy_from_host(stride_B_host.data()); + + layout_SFA.reset(options.groups); + layout_SFA.copy_from_host(layout_SFA_host.data()); + + layout_SFB.reset(options.groups); + layout_SFB.copy_from_host(layout_SFB_host.data()); + + stride_C.reset(options.groups); + stride_C.copy_from_host(stride_C_host.data()); + + stride_D.reset(options.groups); + stride_D.copy_from_host(stride_D_host.data()); + + alpha_device.reset(options.groups); + alpha_device.copy_from_host(ptr_alpha_host.data()); + beta_device.reset(options.groups); + beta_device.copy_from_host(ptr_beta_host.data()); + + block_alpha.copy_from_host(alpha_host.data()); + block_beta.copy_from_host(beta_host.data()); + + norm_constant_device.reset(1); + norm_constant_device.copy_from_host(&options.norm_constant); +} + +/// Populates a Gemm::Arguments structure from the given commandline options +template +typename Gemm::Arguments args_from_options(Options &options, bool host_problem_shapes_available = true) +{ + cutlass::KernelHardwareInfo hw_info; + // Change device_id to another value if you are running on a machine with multiple GPUs and wish + // to use a GPU other than that with device ID 0. + hw_info.device_id = 0; + hw_info.sm_count = min(cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id), options.max_sm_count); + + typename Gemm::Arguments arguments; + decltype(arguments.epilogue.thread) fusion_args; + fusion_args.alpha_ptr = nullptr; + fusion_args.beta_ptr = nullptr; + + // If alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches. + // If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups. 
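+  // In both cases, dAlpha/dBeta express how the scalar varies across (M, N, group):
+  // a group stride of 0 broadcasts the single command-line value to every group, while a
+  // group stride of 1 steps through alpha_ptr_array / beta_ptr_array so that each group
+  // reads its own per-group scaling value.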
+ if (options.alpha != std::numeric_limits::max()){ + // Single alpha for all groups + fusion_args.alpha = options.alpha; + fusion_args.alpha_ptr_array = nullptr; + fusion_args.dAlpha = {_0{}, _0{}, 0}; + } + else { + fusion_args.alpha = 0; + fusion_args.alpha_ptr_array = alpha_device.get(); + // Only one alpha per each group + fusion_args.dAlpha = {_0{}, _0{}, 1}; + } + if (options.beta != std::numeric_limits::max()) { + // Single beta for all groups + fusion_args.beta = options.beta; + fusion_args.beta_ptr_array = nullptr; + fusion_args.dBeta = {_0{}, _0{}, 0}; + } + else { + fusion_args.beta = 0; + fusion_args.beta_ptr_array = beta_device.get(); + // Only one beta per each group + fusion_args.dBeta = {_0{}, _0{}, 1}; + } + + // Output Block SF + fusion_args.block_scale_factor_ptr = ptr_SFD.get(); // Enable for SF Output + fusion_args.norm_constant_ptr = norm_constant_device.get(); // Enable for SF Output + + typename Gemm::GemmKernel::TileSchedulerArguments scheduler; + scheduler.raster_order = options.raster_order; + + if (host_problem_shapes_available) { + arguments = typename Gemm::Arguments { + cutlass::gemm::GemmUniversalMode::kGrouped, + {options.groups, problem_sizes.get(), options.problem_sizes_host.data()}, + {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(), + ptr_SFA.get(), layout_SFA.get(), ptr_SFB.get(), layout_SFB.get()}, + {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, + hw_info, scheduler + }; + } + else { + arguments = typename Gemm::Arguments { + cutlass::gemm::GemmUniversalMode::kGrouped, + {options.groups, problem_sizes.get(), nullptr}, + {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(), + ptr_SFA.get(), layout_SFA.get(), ptr_SFB.get(), layout_SFB.get()}, + {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, + hw_info, scheduler + }; + } + + return arguments; +} + +bool verify(const Options &options) { + using namespace cute; + bool passed = true; + for (int32_t i = 0; i < options.groups; ++i) { + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + auto stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}); + auto stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}); + auto stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1}); + auto stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1}); + auto layout_A = make_layout(make_shape(M, K, 1), stride_A); + auto layout_B = make_layout(make_shape(N, K, 1), stride_B); + auto layout_C = make_layout(make_shape(M, N, 1), stride_C); + auto layout_D = make_layout(make_shape(M, N, 1), stride_D); + auto layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, 1)); + auto layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(M, N, K, 1)); + auto layout_SFD = Sm1xxBlockScaledOutputConfig::tile_atom_to_shape_SFD(cute::make_shape(M, N, K, 1)); + + // Create the arguments for host reference implementation + Tensor tensor_A = make_tensor(make_iterator(block_A.at(i).host_data()), layout_A); + Tensor tensor_SFA = make_tensor(block_SFA.at(i).host_data(), layout_SFA); + Tensor tensor_B = make_tensor(make_iterator(block_B.at(i).host_data()), layout_B); + Tensor tensor_SFB = make_tensor(block_SFB.at(i).host_data(), layout_SFB); + cutlass::reference::host::GettBlockScalingMainloopParams + mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB}; + + auto tensor_C = 
cute::make_tensor(make_iterator(block_C.at(i).host_data()), layout_C); + auto tensor_ref_D = cute::make_tensor(make_iterator(block_ref_D.at(i).host_data()), layout_D); + auto tensor_ref_SFD = cute::make_tensor(make_iterator(block_ref_SFD.at(i).host_data()), layout_SFD); + + cutlass::reference::host::GettBlockScalingEpilogueParams< + ElementCompute, // ElementScalar + ElementAccumulator, // ElementAccumulator + ElementCompute, // ElementCompute + decltype(tensor_C), // TensorC + decltype(tensor_ref_D), // TensorD + decltype(tensor_ref_SFD), // TensorSfD + cute::Int, + cutlass::reference::host::SfStrategy::SfDGen + > epilogue_params {alpha_host.at(i), beta_host.at(i), tensor_C, tensor_ref_D, tensor_ref_SFD, options.norm_constant}; + + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + // Comparison + block_D.at(i).sync_host(); + block_SFD.at(i).sync_host(); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + passed &= cutlass::reference::host::TensorEquals(block_ref_D.at(i).host_view(), block_D.at(i).host_view()); + passed &= cutlass::reference::host::TensorEquals(block_ref_SFD.at(i).host_view(), block_SFD.at(i).host_view()); + // Check that the tensors have non-zero norms + passed &= (cutlass::reference::host::TensorNorm(block_ref_D.at(i).host_view()) > 0); + passed &= (cutlass::reference::host::TensorNorm(block_D.at(i).host_view()) > 0); + passed &= (cutlass::reference::host::TensorNorm(block_ref_SFD.at(i).host_view()) > 0); + passed &= (cutlass::reference::host::TensorNorm(block_SFD.at(i).host_view()) > 0); + } + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options, bool host_problem_shapes_available = true) +{ + std::cout << " Problem Sizes, Alpha, Beta " << std::endl; + for (int32_t i = 0; i < options.groups; ++i) { + std::cout << " " << options.problem_sizes_host.at(i); + std::cout << ", " << alpha_host.at(i) << ", " << beta_host.at(i) << std::endl; + } + std::cout << " Groups : " << options.groups << std::endl; + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options, host_problem_shapes_available); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run(/* stream = */ nullptr, /* cuda_adapter = */ nullptr, /* launch_with_pdl = */ options.use_pdl)); + + cudaDeviceSynchronize(); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + if (options.verification) { + std::cout << " Host-side verification is now running - may be very slow for large cases." << std::endl; + result.passed = verify(options); + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + if (!result.passed) { + exit(-1); + } + } + else { + std::cout << " Verfication is turned off for this run." 
<< std::endl; + } + + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run(/* stream = */ nullptr, /* cuda_adapter = */ nullptr, /* launch_with_pdl = */ options.use_pdl)); + } + timer.stop(); + + // Compute average setup and runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0, options.problem_sizes_host); + + std::cout << " Avg runtime : " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS : " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.8 Toolkit to run this example + if (__CUDACC_VER_MAJOR__ < 12 || + ((__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8) + ) + ) { + std::cerr << "This example requires CUDA 12.8 or newer.\n"; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (!(props.major == 12 && props.minor == 0)) { + std::cerr + << "This example requires a GPU of NVIDIA's Blackwell Architecture (compute capability 120a).\n"; + return 0; + } + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + allocate(options); + initialize(options); + + // + // Evaluate CUTLASS kernels + // + + std::cout << "Running kernel with Cooperative kernel schedule:" << std::endl; + run(options, false /*host_problem_shapes_available*/); + std::cout << "Running kernel with Pingpong kernel schedule:" << std::endl; + run(options, false /*host_problem_shapes_available*/); +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/79_blackwell_geforce_gemm/CMakeLists.txt b/examples/79_blackwell_geforce_gemm/CMakeLists.txt index cb7e3e97c0..b689c85e7e 100644 --- a/examples/79_blackwell_geforce_gemm/CMakeLists.txt +++ b/examples/79_blackwell_geforce_gemm/CMakeLists.txt @@ -28,6 +28,24 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
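+# Each TEST_* variable below holds one set of command-line options for the new example;
+# they are passed to cutlass_example_add_executable() via TEST_COMMAND_OPTIONS further down,
+# so each entry is exercised as a separate test invocation of 79d_blackwell_geforce_nvfp4_grouped_gemm.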
+set(TEST_RANDOM --iterations=0) # Random problem sizes +set(TEST_RANDOM_LARGE_GROUP --groups=50 --iterations=0) # Random problem sizes + +set(TEST_EPILOGUE --alpha=0.5 --beta=0.5 --iterations=0) # Random problem sizes +set(TEST_EPILOGUE_LARGE_GROUP --alpha=1.5 --beta=2.0 --groups=50 --iterations=0) # Random problem sizes + +set(TEST_EPILOGUE_OP --beta=0.5 --iterations=1) # Random problem sizes +set(TEST_EPILOGUE_OP_LARGE_GROUP --alpha=1.5 --iterations=1) # Random problem sizes + +set(TEST_FIXED --m=2048 --n=5120 --k=8192 --iterations=0) # Fixed problem sizes +set(TEST_FIXED_LARGE_GROUP --m=2048 --n=512 --k=512 --groups=51 --iterations=0) # Fixed problem sizes + +set(TEST_SMALL --m=256 --n=128 --iterations=0) # Small problem sizes +set(TEST_SMALL_LARGE_GROUP --m=128 --n=128 --groups=50 --iterations=0) # Small problem sizes + +set(TEST_RANDOM_PERF --iterations=10) # Random problem sizes +set(TEST_RANDOM_PERF_LARGE_GROUP --groups=50 --iterations=10) # Random problem sizes + if (CUTLASS_NVCC_ARCHS MATCHES 120a) cutlass_example_add_executable( 79a_blackwell_geforce_nvfp4_bf16_gemm @@ -44,4 +62,22 @@ cutlass_example_add_executable( 79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu ) +cutlass_example_add_executable( + 79d_blackwell_geforce_nvfp4_grouped_gemm + 79d_blackwell_geforce_nvfp4_grouped_gemm.cu + TEST_COMMAND_OPTIONS + TEST_RANDOM + TEST_RANDOM_LARGE_GROUP + TEST_EPILOGUE + TEST_EPILOGUE_LARGE_GROUP + TEST_EPILOGUE_OP + TEST_EPILOGUE_OP_LARGE_GROUP + TEST_FIXED + TEST_FIXED_LARGE_GROUP + TEST_SMALL + TEST_SMALL_LARGE_GROUP + TEST_RANDOM_PERF + TEST_RANDOM_PERF_LARGE_GROUP +) + endif() diff --git a/examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu b/examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu new file mode 100644 index 0000000000..32df1146ae --- /dev/null +++ b/examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu @@ -0,0 +1,554 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file + \brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM120 architecture. + + This example demonstrates a simple way to instantiate and run a narrow precision blockscaled sparse GEMM on the NVIDIA Blackwell SM120 architecture. + This kernel is optimized for the GeForce RTX 50 series GPUs. + + The Blackwell SM120 CUTLASS kernel uses the new Block Scaled Sparse Tensor Core MMA Instructions: + * mma.sync.aligned.kind::mxf8f6f4.sp::ordered_metadata.block_scale. + Please see more detail in https://docs.nvidia.com/cuda/parallel-thread-execution. + + The kernel leverages: + 1. Warp-Specialized persistent kernel design that supports cooperative scheduler introduced in Hopper. + 2. The new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution). + 3. Block Scaled Sparse Tensor Core MMA Instructions + + Note that GeForce RTX 50 series GPUs do not support: + 1. Multicast feature of TMA load. Cluster shape has to be 1x1x1. + 2. Dynamic datatypes. + + Usage: + $ ./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm --m=2048 --n=2048 --k=2048 +*/ +#include +#include "cutlass/cutlass.h" +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/detail/sm100_blockscaled_layout.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp" +#include "cutlass/transform/device/transform_universal_adapter.hpp" + +#include "helper.h" +using namespace cute; +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +// A matrix configuration +using ElementA = cutlass::mx_float8_t; // Element type for A matrix operand +using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int 
AlignmentA = 32; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) +// B matrix configuration +using ElementB = cutlass::mx_float8_t; // Element type for B matrix operand +using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 16; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) +// C/D matrix configuration +using ElementD = cutlass::bfloat16_t; // Element type for D matrix operand +using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand +using LayoutCTag = cutlass::layout::RowMajor; // Layout type for C matrix operand +using LayoutDTag = cutlass::layout::RowMajor; // Layout type for D matrix operand +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +// E matrix configuration. Note, E is used to represent metadata tensor. +using ElementE = uint8_t; // Element type for E matrix operand +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ArchTag = cutlass::arch::Sm120; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassBlockScaledSparseTensorOp; // Operator class tag +using KernelScheduleType = cutlass::gemm::KernelSparseTmaWarpSpecializedMxf8f6f4Acc2x4Sm120; // Kernel schedule policy +// Kernel Perf config +using ThreadBlockShape = Shape<_128,_128,_256>; // Threadblock's tile size +using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ThreadBlockShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutCTag, AlignmentC, + ElementD, LayoutDTag, AlignmentD, + cutlass::epilogue::collective::EpilogueScheduleAuto // Epilogue schedule policy + >::CollectiveOp; +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutATag, AlignmentA, + ElementB, LayoutBTag, AlignmentB, + ElementAccumulator, + ThreadBlockShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, + KernelScheduleType // Mainloop schedule policy + >::CollectiveOp; +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape, // Indicates ProblemShape + CollectiveMainloop, + CollectiveEpilogue, + void>; +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; +// Reference device GEMM implementation type +using StrideA = typename Gemm::GemmKernel::StrideA; +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using LayoutB = decltype(cute::make_layout(make_shape(0,0,0), StrideB{})); +using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using LayoutC = decltype(cute::make_layout(make_shape(0,0,0), StrideC{})); +using StrideD = typename Gemm::GemmKernel::StrideD; +using LayoutD = decltype(cute::make_layout(make_shape(0,0,0), StrideD{})); 
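+// Unlike the A/B/C/D layouts above, the metadata (E) layout is not built from a packed stride;
+// it is taken directly from the collective mainloop, which owns the sparse metadata format.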
+using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +// +// Data members +// +/// Initialization +StrideA stride_A; +LayoutA layout_A; +LayoutSFA layout_SFA; +StrideB stride_B; +LayoutB layout_B; +LayoutSFB layout_SFB; +StrideC stride_C; +LayoutC layout_C; +StrideD stride_D; +LayoutD layout_D; +LayoutE layout_E; +uint64_t seed; +// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device +// Use cute::Tensor and cute::Layout for iterating thru the matrix elements +cutlass::HostTensor block_A; +cutlass::HostTensor block_A_Decompressed; +cutlass::HostTensor block_E; +cutlass::HostTensor block_SFA; +cutlass::HostTensor block_B; +cutlass::HostTensor block_SFB; +cutlass::HostTensor block_C; +// Output Tensor +cutlass::HostTensor block_D; +// Reference Output Tensor +cutlass::HostTensor block_reference_D; +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +template +auto make_iterator(T* ptr) { + using namespace cute; + if constexpr (cute::is_subbyte_v) { + return subbyte_iterator(ptr); + } + else { + return ptr; + } +} +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// +// Command line options parsing +struct Options { + bool help; + float alpha, beta; + int iterations; + int m, n, k; + Options(): + help(false), + m(1024), n(1024), k(1024), + alpha(1.f), beta(0.f), + iterations(10) + { } + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + /// Prints the usage statement. 
+ std::ostream & print_usage(std::ostream &out) const { + out << "80a_blackwell_geforce_mxfp8_bf16_sparse_gemm\n\n" + << " Blackwell MXFP8 Sparse GEMM is a warp specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + out << "\n\nExamples:\n\n" + << "$ " << "./examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + return out; + } + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} +}; +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Helper to initialize a block of device data +template +bool initialize_block( + cutlass::TensorView view, + uint64_t seed) { + double scope_max, scope_min; + constexpr int bits_input = cutlass::sizeof_bits::value; + if constexpr (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if constexpr (bits_input <= 6) { + scope_max = 2; + scope_min = -2; + } + else if constexpr (bits_input <= 8) { + if constexpr (cute::is_same_v) { + scope_max = 4; + scope_min = 1; + } + else { + scope_max = 1; + scope_min = -1; + } + } + else{ + scope_max = 4; + scope_min = -4; + } + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + + return true; +} +/// Initialize blocks that released to sparse Matrix A and its metadata E +bool initialize_sparse_blocks(const Options &options) { + auto workload = make_shape(options.m, + options.n, + options.k, + 1); + stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1}); + /// Alias SparseConfig and Compressor + using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + cute::Shape, + ElementA::DataType, + LayoutATag, + SparseConfig>; + using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + cute::Shape, + ElementA::DataType, + LayoutATag, + SparseConfig, + cutlass::arch::Sm120>; + using Compressor = cutlass::transform::device::TransformUniversalAdapter; + /// Declare compressor_utility to randomly fill zero in Matrix A to match sparsity needs + CompressorUtility compressor_utility(workload, stride_A); + // Aligned M K dimension size for A and E + int aligned_m_e = compressor_utility.get_metadata_m_physical(); + int aligned_k_e = compressor_utility.get_metadata_k_physical(); + int aligned_m_a = 
compressor_utility.get_tensorA_m_physical(); + int aligned_k_a = compressor_utility.get_tensorA_k_physical(); + /// Layout A and E + layout_A = SparseConfig::fill_layoutA(workload); + layout_E = SparseConfig::fill_layoutE(workload); + + block_A.reset(cutlass::make_Coord(aligned_m_a * aligned_k_a)); + block_E.reset(cutlass::make_Coord(aligned_m_e * aligned_k_e)); + block_A_Decompressed.reset(cutlass::make_Coord(options.m * options.k)); + initialize_block(block_A_Decompressed.host_view(), seed + 2020); + compressor_utility.structure_sparse_zero_mask_fill( + block_A_Decompressed.host_data(), static_cast(seed + 2021)); + block_A_Decompressed.sync_device(); + + /// Use compressor kernel to generate compressed Matrix A and E + cutlass::Status status { cutlass::Status::kSuccess }; + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments{ + {options.m, options.n, options.k, 1}, + {block_A_Decompressed.device_data(), + stride_A, + block_A.device_data(), + block_E.device_data()}, + {hw_info} + }; + + // Compress A and E + Compressor compressor_op; + size_t workspace_size = Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + status = compressor_op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.run(); + auto result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + return false; + } + + block_A.sync_host(); + block_E.sync_host(); + return true; +} +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(const Options &options) { + using namespace cute; + + // Initial A, E(metadata) and A_compressed blocks + if(!initialize_sparse_blocks(options)) return false; + + // Define B, C and D blocks + using Sm1xxBlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; + stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1}); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1}); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1}); + layout_B = make_layout(make_shape(options.n, options.k, 1), stride_B); + layout_C = make_layout(make_shape(options.m, options.n, 1), stride_C); + layout_D = make_layout(make_shape(options.m, options.n, 1), stride_D); + // Define SFA and SFB tensors layouts + layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(options.m, options.n, options.k, 1)); + layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(options.m, options.n, options.k, 1)); + block_B.reset(cutlass::make_Coord(size(layout_B))); + block_C.reset(cutlass::make_Coord(size(layout_C))); + block_D.reset(cutlass::make_Coord(size(layout_D))); + block_reference_D.reset(cutlass::make_Coord(size(layout_D))); + block_SFA.reset(cutlass::make_Coord(size(filter_zeros(layout_SFA)))); + block_SFB.reset(cutlass::make_Coord(size(filter_zeros(layout_SFB)))); + initialize_block(block_B.host_view(), seed + 2022); + initialize_block(block_C.host_view(), seed + 2023); + initialize_block(block_SFA.host_view(), seed + 2024); + initialize_block(block_SFB.host_view(), seed + 2025); + block_B.sync_device(); + block_C.sync_device(); + 
block_SFA.sync_device();
+  block_SFB.sync_device();
+  return true;
+}
+// Populates a Gemm::Arguments structure from the given commandline options
+typename Gemm::Arguments args_from_options(const Options &options)
+{
+  typename Gemm::Arguments arguments {
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.m, options.n, options.k, 1},
+    { // Mainloop arguments
+      block_A.device_data(), layout_A,
+      block_B.device_data(), stride_B,
+      block_E.device_data(), layout_E,
+      block_SFA.device_data(), layout_SFA,
+      block_SFB.device_data(), layout_SFB
+    },
+    { // Epilogue arguments
+      {options.alpha, options.beta},
+      block_C.device_data(), stride_C,
+      block_D.device_data(), stride_D
+    }
+  };
+  return arguments;
+}
+bool verify(const Options &options) {
+  using namespace cute;
+  // Create the arguments for host reference implementation
+  Tensor tensor_A = make_tensor(make_iterator(block_A_Decompressed.host_data()), layout_A);
+  Tensor tensor_SFA = make_tensor(block_SFA.host_data(), layout_SFA);
+  Tensor tensor_B = make_tensor(make_iterator(block_B.host_data()), layout_B);
+  Tensor tensor_SFB = make_tensor(block_SFB.host_data(), layout_SFB);
+  Tensor tensor_E = make_tensor(make_iterator(block_E.host_data()), layout_E);
+
+  cutlass::reference::host::GettBlockScalingMainloopParams<
+      ElementAccumulator,   // ElementAccumulator
+      decltype(tensor_A),   // TensorA
+      decltype(tensor_SFA), // TensorSfA
+      decltype(tensor_B),   // TensorB
+      decltype(tensor_SFB)  // TensorSfB
+    > mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB};
+  auto tensor_C = cute::make_tensor(make_iterator(block_C.host_data()), layout_C);
+  auto tensor_D = cute::make_tensor(make_iterator(block_reference_D.host_data()), layout_D);
+
+  cutlass::reference::host::GettBlockScalingEpilogueParams<
+      ElementAccumulator,   // ElementScalar
+      ElementAccumulator,   // ElementAccumulator
+      ElementAccumulator,   // ElementCompute
+      decltype(tensor_C),   // TensorC
+      decltype(tensor_D)    // TensorD
+    > epilogue_params{options.alpha, options.beta, tensor_C, tensor_D};
+  cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
+  // Compare the CUTLASS kernel result (block_D) against the host reference (block_reference_D)
+  block_D.sync_host();
+
+  bool passed = cutlass::reference::host::TensorEquals(block_reference_D.host_view(), block_D.host_view());
+  passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0);
+  passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0);
+  return passed;
+}
+/// Execute a given example GEMM computation
+template
+int run(Options &options)
+{
+  // Initialization
+  if(!initialize(options))
+  {
+    std::cerr << " Initialization failed! 
" << std::endl; + exit(-1); + } + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + cudaDeviceSynchronize(); + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + if (!result.passed) { + exit(-1); + } + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + // Compute average runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + return 0; +} +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +/////////////////////////////////////////////////////////////////////////////////////////////////// +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.8 or higher Toolkit to run this example + // and must have compute capability at least 120. + if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) { + std::cerr << "This example requires CUDA 12.8 or newer." << std::endl; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + + if (!(props.major == 12 && props.minor == 0)) { + std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 120)." 
<< std::endl; + return 0; + } + // + // Parse options + // + Options options; + options.parse(argc, args); + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + // + // Evaluate CUTLASS kernels + // +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + run(options); +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + return 0; +} +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu b/examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu new file mode 100644 index 0000000000..f3441b5630 --- /dev/null +++ b/examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu @@ -0,0 +1,578 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file + \brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM120 architecture. + + This example demonstrates a simple way to instantiate and run a narrow precision blockscaled sparse GEMM on the NVIDIA Blackwell SM120 architecture. + This kernel is optimized for the GeForce RTX 50 series GPUs. + + The Blackwell SM120 CUTLASS kernel uses the new Block Scaled Sparse Tensor Core MMA Instructions: + * mma.sync.aligned.kind::mxf4nvf4.sp::ordered_metadata.block_scale. + Please see more detail in https://docs.nvidia.com/cuda/parallel-thread-execution. + + The kernel leverages: + 1. Warp-Specialized persistent kernel design that supports cooperative scheduler introduced in Hopper. + 2. The new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution). + 3. 
Block Scaled Sparse Tensor Core MMA Instructions + + Note that GeForce RTX 50 series GPUs do not support: + 1. Multicast feature of TMA load. Cluster shape has to be 1x1x1. + 2. Dynamic datatypes. + + Usage: + $ ./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm --m=2048 --n=2048 --k=2048 +*/ +#include +#include "cutlass/cutlass.h" +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/detail/sm100_blockscaled_layout.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp" +#include "cutlass/transform/device/transform_universal_adapter.hpp" + +#include "helper.h" +using namespace cute; +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +// A matrix configuration +using ElementA = cutlass::nv_float4_t; // Element type for A matrix operand +using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 64; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) +// B matrix configuration +using ElementB = cutlass::nv_float4_t; // Element type for B matrix operand +using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 32; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) +// C/D matrix configuration +using ElementD = cutlass::float_e2m1_t; // Element type for D matrix operand +using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand +using LayoutCTag = cutlass::layout::ColumnMajor; // Layout type for C matrix operand +using LayoutDTag = cutlass::layout::ColumnMajor; // Layout type for D matrix operand +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +constexpr int outputVectorSize = 32; // Vector size for D matrix +using outputScaleFactor = cutlass::float_ue4m3_t; // Scale factor type for D matrix +// E matrix configuration. Note, E is used to represent metadata tensor. 
+using ElementE = uint8_t; // Element type for E matrix operand +// Kernel functional config +using ElementCompute = float; // Element type for computation inside mainloop and epilogue +using ElementAccumulator = float; // Element type for internal accumulation +using ArchTag = cutlass::arch::Sm120; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassBlockScaledSparseTensorOp; // Operator class tag +using KernelScheduleType = cutlass::gemm::KernelSparseTmaWarpSpecializedNvf4Sm120; // Kernel schedule policy +// Kernel Perf config +using ThreadBlockShape = Shape<_128,_128,_256>; // Threadblock's tile size +using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ThreadBlockShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutCTag, AlignmentC, + ElementD, LayoutDTag, AlignmentD, + cutlass::epilogue::SparseTmaWarpSpecializedCooperativeSm120, // Epilogue schedule policy + cutlass::epilogue::fusion::LinCombBlockScaleFactor< // Epilogue fusion to generate nvfp4 output + outputVectorSize, ElementD, ElementAccumulator, outputScaleFactor, LayoutDTag, ElementC> + >::CollectiveOp; +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutATag, AlignmentA, + ElementB, LayoutBTag, AlignmentB, + ElementAccumulator, + ThreadBlockShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, + KernelScheduleType // Mainloop schedule policy + >::CollectiveOp; +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape, // Indicates ProblemShape + CollectiveMainloop, + CollectiveEpilogue, + void>; +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; +// Reference device GEMM implementation type +using StrideA = typename Gemm::GemmKernel::StrideA; +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using LayoutB = decltype(cute::make_layout(make_shape(0,0,0), StrideB{})); +using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using LayoutC = decltype(cute::make_layout(make_shape(0,0,0), StrideC{})); +using StrideD = typename Gemm::GemmKernel::StrideD; +using LayoutD = decltype(cute::make_layout(make_shape(0,0,0), StrideD{})); +using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +using SfdOutputCfg = cutlass::detail::Sm1xxBlockScaledOutputConfig; +using LayoutSFD = typename SfdOutputCfg::LayoutSF; +// +// Data members +// +/// Initialization +StrideA stride_A; +LayoutA layout_A; +LayoutSFA layout_SFA; +StrideB stride_B; +LayoutB layout_B; +LayoutSFB layout_SFB; +StrideC stride_C; +LayoutC layout_C; +StrideD stride_D; +LayoutD layout_D; +LayoutSFD layout_SFD; +LayoutE layout_E; +uint64_t seed; +// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device +// Use cute::Tensor and cute::Layout for iterating thru the matrix elements +cutlass::HostTensor block_A; +cutlass::HostTensor block_A_Decompressed; +cutlass::HostTensor block_E; +cutlass::HostTensor block_SFA; +cutlass::HostTensor block_B; 
+cutlass::HostTensor block_SFB; +cutlass::HostTensor block_C; +// Output Tensor +cutlass::HostTensor block_D; +cutlass::HostTensor block_SFD; +// Reference Output Tensor +cutlass::HostTensor block_reference_D; +cutlass::HostTensor block_reference_SFD; +cutlass::HostTensor block_Normconst; +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +template +auto make_iterator(T* ptr) { + using namespace cute; + if constexpr (cute::is_subbyte_v) { + return subbyte_iterator(ptr); + } + else { + return ptr; + } +} +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// +// Command line options parsing +struct Options { + bool help; + float alpha, beta; + int iterations; + int m, n, k; + Options(): + help(false), + m(1024), n(1024), k(1024), + alpha(1.f), beta(0.f), + iterations(10) + { } + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + out << "80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm\n\n" + << " Blackwell MXFP8 Sparse GEMM is a warp specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + out << "\n\nExamples:\n\n" + << "$ " << "./examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + return out; + } + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} +}; +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Helper to initialize a block of device data +template +bool initialize_block( + cutlass::TensorView view, + uint64_t seed) { + double scope_max, scope_min; + constexpr int bits_input = cutlass::sizeof_bits::value; + if constexpr (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if constexpr (bits_input <= 6) { + scope_max = 2; + scope_min = -2; + } + else if constexpr (bits_input <= 8) { + if constexpr 
(cute::is_same_v) { + scope_max = 4; + scope_min = 1; + } + else { + scope_max = 1; + scope_min = -1; + } + } + else{ + scope_max = 4; + scope_min = -4; + } + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + + return true; +} +/// Initialize blocks that released to sparse Matrix A and its metadata E +bool initialize_sparse_blocks(const Options &options) { + auto workload = make_shape(options.m, + options.n, + options.k, + 1); + stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1}); + /// Alias SparseConfig and Compressor + using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + cute::Shape, + ElementA::DataType, + LayoutATag, + SparseConfig>; + using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + cute::Shape, + ElementA::DataType, + LayoutATag, + SparseConfig, + cutlass::arch::Sm120>; + using Compressor = cutlass::transform::device::TransformUniversalAdapter; + /// Declare compressor_utility to randomly fill zero in Matrix A to match sparsity needs + CompressorUtility compressor_utility(workload, stride_A); + // Aligned M K dimension size for A and E + int aligned_m_e = compressor_utility.get_metadata_m_physical(); + int aligned_k_e = compressor_utility.get_metadata_k_physical(); + int aligned_m_a = compressor_utility.get_tensorA_m_physical(); + int aligned_k_a = compressor_utility.get_tensorA_k_physical(); + /// Layout A and E + layout_A = SparseConfig::fill_layoutA(workload); + layout_E = SparseConfig::fill_layoutE(workload); + + block_A.reset(cutlass::make_Coord(aligned_m_a * aligned_k_a)); + block_E.reset(cutlass::make_Coord(aligned_m_e * aligned_k_e)); + block_A_Decompressed.reset(cutlass::make_Coord(options.m * options.k)); + initialize_block(block_A_Decompressed.host_view(), seed + 2020); + compressor_utility.structure_sparse_zero_mask_fill( + block_A_Decompressed.host_data(), static_cast(seed + 2021)); + block_A_Decompressed.sync_device(); + + /// Use compressor kernel to generate compressed Matrix A and E + cutlass::Status status { cutlass::Status::kSuccess }; + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments{ + {options.m, options.n, options.k, 1}, + {block_A_Decompressed.device_data(), + stride_A, + block_A.device_data(), + block_E.device_data()}, + {hw_info} + }; + + // Compress A and E + Compressor compressor_op; + size_t workspace_size = Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + status = compressor_op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.run(); + auto result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + return false; + } + + block_A.sync_host(); + block_E.sync_host(); + return true; +} +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(const Options &options) { + using namespace cute; + + // Initial A, E(metadata) and A_compressed blocks + if(!initialize_sparse_blocks(options)) return false; + + // Define B, C and D blocks + using Sm1xxBlkScaledConfig = typename 
Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; + stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1}); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1}); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1}); + layout_B = make_layout(make_shape(options.n, options.k, 1), stride_B); + layout_C = make_layout(make_shape(options.m, options.n, 1), stride_C); + layout_D = make_layout(make_shape(options.m, options.n, 1), stride_D); + layout_SFD = SfdOutputCfg::tile_atom_to_shape_SFD(cute::make_shape(options.m, options.n, options.k, 1)); + // Define SFA and SFB tensors layouts + layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(options.m, options.n, options.k, 1)); + layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(options.m, options.n, options.k, 1)); + block_B.reset(cutlass::make_Coord(size(layout_B))); + block_C.reset(cutlass::make_Coord(size(layout_C))); + block_D.reset(cutlass::make_Coord(size(layout_D))); + block_SFD.reset(cutlass::make_Coord(size(filter_zeros(layout_SFD)))); + block_reference_D.reset(cutlass::make_Coord(size(layout_D))); + block_reference_SFD.reset(cutlass::make_Coord(size(filter_zeros(layout_SFD)))); + block_Normconst.reset(cutlass::make_Coord(1)); + block_SFA.reset(cutlass::make_Coord(size(filter_zeros(layout_SFA)))); + block_SFB.reset(cutlass::make_Coord(size(filter_zeros(layout_SFB)))); + initialize_block(block_B.host_view(), seed + 2022); + initialize_block(block_C.host_view(), seed + 2023); + initialize_block(block_SFA.host_view(), seed + 2024); + initialize_block(block_SFB.host_view(), seed + 2025); + block_Normconst.at(cutlass::make_Coord(0)) = 2; + block_B.sync_device(); + block_C.sync_device(); + block_SFA.sync_device(); + block_SFB.sync_device(); + block_SFD.sync_device(); + block_Normconst.sync_device(); + return true; +} +// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) +{ + typename Gemm::Arguments arguments { + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, 1}, + { // Mainloop arguments + block_A.device_data(), layout_A, + block_B.device_data(), stride_B, + block_E.device_data(), layout_E, + block_SFA.device_data(), layout_SFA, + block_SFB.device_data(), layout_SFB + }, + { // Epilogue arguments + {options.alpha, options.beta}, + block_C.device_data(), stride_C, + block_D.device_data(), stride_D + } + }; + arguments.epilogue.thread.block_scale_factor_ptr = block_SFD.device_data(); + arguments.epilogue.thread.norm_constant_ptr = block_Normconst.device_data(); + return arguments; +} +bool verify(const Options &options) { + using namespace cute; + // Create the arguments for host reference implementation + Tensor tensor_A = make_tensor(make_iterator(block_A_Decompressed.host_data()), layout_A); + Tensor tensor_SFA = make_tensor(block_SFA.host_data(), layout_SFA); + Tensor tensor_B = make_tensor(make_iterator(block_B.host_data()), layout_B); + Tensor tensor_SFB = make_tensor(block_SFB.host_data(), layout_SFB); + Tensor tensor_E = make_tensor(make_iterator(block_E.host_data()), layout_E); + + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, // ElementAccumulator + decltype(tensor_A), // TensorA + decltype(tensor_SFA), // TensorSfA + decltype(tensor_B), // TensorB + decltype(tensor_SFB) // TensorSfB + > mainloop_params{tensor_A, tensor_SFA, tensor_B, 
tensor_SFB};
+  auto tensor_C = cute::make_tensor(make_iterator(block_C.host_data()), layout_C);
+  auto tensor_D = cute::make_tensor(make_iterator(block_reference_D.host_data()), layout_D);
+  auto tensor_SFD = cute::make_tensor(block_reference_SFD.host_data(), layout_SFD);
+
+  cutlass::reference::host::GettBlockScalingEpilogueParams<
+      ElementAccumulator,   // ElementScalar
+      ElementAccumulator,   // ElementAccumulator
+      ElementAccumulator,   // ElementCompute
+      decltype(tensor_C),   // TensorC
+      decltype(tensor_D),   // TensorD
+      decltype(tensor_SFD), // TensorSfD
+      cute::Int,
+      cutlass::reference::host::SfStrategy::SfDGen
+    > epilogue_params{options.alpha, options.beta, tensor_C, tensor_D, tensor_SFD, block_Normconst.at(cutlass::make_Coord(0))};
+  cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
+  // Compare the CUTLASS kernel result (block_D) against the host reference (block_reference_D)
+  block_D.sync_host();
+
+  bool passed = cutlass::reference::host::TensorEquals(block_reference_D.host_view(), block_D.host_view());
+  passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0);
+  passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0);
+  return passed;
+}
+/// Execute a given example GEMM computation
+template
+int run(Options &options)
+{
+  // Initialization
+  if(!initialize(options))
+  {
+    std::cerr << " Initialization failed! " << std::endl;
+    exit(-1);
+  }
+
+  // Instantiate CUTLASS kernel depending on templates
+  Gemm gemm;
+  // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
+  auto arguments = args_from_options(options);
+  // Using the arguments, query for extra workspace required for matrix multiplication computation
+  size_t workspace_size = Gemm::get_workspace_size(arguments);
+  // Allocate workspace memory
+  cutlass::device_memory::allocation workspace(workspace_size);
+  // Check if the problem size is supported or not
+  CUTLASS_CHECK(gemm.can_implement(arguments));
+  // Initialize CUTLASS kernel with arguments and workspace pointer
+  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+  // Correctness / Warmup iteration
+  CUTLASS_CHECK(gemm.run());
+  cudaDeviceSynchronize();
+  // Check if output from CUTLASS kernel and reference kernel are equal or not
+  Result result;
+  result.passed = verify(options);
+  std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
+  if (!result.passed) {
+    exit(-1);
+  }
+  // Run profiling loop
+  if (options.iterations > 0)
+  {
+    GpuTimer timer;
+    timer.start();
+    for (int iter = 0; iter < options.iterations; ++iter) {
+      CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+      CUTLASS_CHECK(gemm.run());
+    }
+    timer.stop();
+    // Compute average runtime and GFLOPs.
+    float elapsed_ms = timer.elapsed_millis();
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
+    std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
+    std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << " GFLOPS: " << result.gflops << std::endl;
+  }
+  return 0;
+}
+#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED)
+///////////////////////////////////////////////////////////////////////////////////////////////////
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with CUDA 12.8 or higher Toolkit to run this example
+  // and must have compute capability at least 120.
+ if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) { + std::cerr << "This example requires CUDA 12.8 or newer." << std::endl; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + + if (!(props.major == 12 && props.minor == 0)) { + std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 120)." << std::endl; + return 0; + } + // + // Parse options + // + Options options; + options.parse(argc, args); + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + // + // Evaluate CUTLASS kernels + // +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + run(options); +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + return 0; +} +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/80_blackwell_geforce_sparse_gemm/CMakeLists.txt b/examples/80_blackwell_geforce_sparse_gemm/CMakeLists.txt new file mode 100644 index 0000000000..6a94fb0d90 --- /dev/null +++ b/examples/80_blackwell_geforce_sparse_gemm/CMakeLists.txt @@ -0,0 +1,41 @@ +# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# 1. Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# 2. Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# 3. Neither the name of the copyright holder nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +if (CUTLASS_NVCC_ARCHS MATCHES 120a) +cutlass_example_add_executable( + 80a_blackwell_geforce_mxfp8_bf16_sparse_gemm + 80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu +) + +cutlass_example_add_executable( + 80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm + 80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu +) + +endif() diff --git a/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu b/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu index 3148d2aac2..10cfe89d3c 100644 --- a/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu +++ b/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu @@ -30,11 +30,9 @@ **************************************************************************************************/ /*! \file - \brief A FP8 blockwise scaled GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. + \brief An FP8 blockwise scaled GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. */ - - #include #include "cutlass/cutlass.h" @@ -115,7 +113,7 @@ using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBui ElementAccumulator, ElementCompute, ElementC, LayoutC, AlignmentC, ElementD, LayoutC, AlignmentD, - cutlass::epilogue::TmaWarpSpecialized1Sm + cutlass::epilogue::collective::EpilogueScheduleAuto >::CollectiveOp; using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< @@ -125,7 +123,7 @@ using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder ElementAccumulator, MmaTileShape_MNK, ClusterShape_MNK, cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, - cutlass::gemm::KernelTmaWarpSpecializedBlockwise1SmSm100 // Note: Groupwise and Blockwise only support 1 SM MMA at this moment + cutlass::gemm::KernelScheduleSm100Blockwise >::CollectiveOp; using GemmKernel = cutlass::gemm::kernel::GemmUniversal< @@ -222,8 +220,7 @@ struct Options { } /// Compute performance in GFLOP/s - double gflops(double runtime_s) const - { + double gflops(double runtime_s) const { // Two flops per multiply-add uint64_t flop = uint64_t(2) * m * n * k; double gflop = double(flop) / double(1.0e9); @@ -232,8 +229,7 @@ struct Options { }; /// Result structure -struct Result -{ +struct Result { double avg_runtime_ms; double gflops; cutlass::Status status; @@ -273,13 +269,16 @@ bool initialize_tensor( if (bits_input == 1) { scope_max = 2; scope_min = 0; - } else if (bits_input <= 8) { + } + else if (bits_input <= 8) { scope_max = 2; scope_min = -2; - } else if (bits_output == 16) { + } + else if (bits_output == 16) { scope_max = 5; scope_min = -5; - } else { + } + else { scope_max = 8; scope_min = -8; } @@ -392,8 +391,7 @@ void initialize(const Options &options) { } /// Populates a Gemm::Arguments structure from the given commandline options -typename Gemm::Arguments args_from_options(const Options &options) -{ +typename Gemm::Arguments args_from_options(const Options &options) { typename Gemm::Arguments arguments{ cutlass::gemm::GemmUniversalMode::kGemm, {options.m, options.n, options.k, options.l}, @@ -468,8 +466,7 @@ bool verify(const Options &options) { /// Execute a given example GEMM computation template -int run(Options &options) -{ +int run(Options &options) { initialize(options); @@ -510,8 +507,7 @@ int run(Options &options) } // Run profiling loop - if (options.iterations > 0) - { + if (options.iterations > 0) { GpuTimer timer; timer.start(); for (int iter = 0; iter < options.iterations; ++iter) { 
diff --git a/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu b/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu index 11083e0981..6d8d1de019 100644 --- a/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu +++ b/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu @@ -30,7 +30,7 @@ **************************************************************************************************/ /*! \file - \brief A FP8 groupwise scaled GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. + \brief An FP8 groupwise scaled GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. */ #include @@ -96,9 +96,9 @@ using ElementCompute = float; // MMA and Cluster Tile Shapes // Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 -using MmaTileShape_MNK = Shape<_128,_128,_128>; +using MmaTileShape_MNK = Shape<_256,_128,_128>; // Shape of the threadblocks in a cluster -using ClusterShape_MNK = Shape<_1,_1,_1>; +using ClusterShape_MNK = Shape<_2,_1,_1>; constexpr int ScaleGranularityM = 1; constexpr int ScaleGranularityN = 128; @@ -120,7 +120,7 @@ using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBui ElementAccumulator, ElementCompute, ElementC, LayoutC, AlignmentC, ElementD, LayoutC, AlignmentD, - cutlass::epilogue::TmaWarpSpecialized1Sm + cutlass::epilogue::collective::EpilogueScheduleAuto >::CollectiveOp; using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< @@ -130,7 +130,7 @@ using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder ElementAccumulator, MmaTileShape_MNK, ClusterShape_MNK, cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, - cutlass::gemm::KernelTmaWarpSpecializedBlockwise1SmSm100 // Note: Groupwise and Blockwise only support 1 SM MMA at this moment + cutlass::gemm::KernelScheduleSm100Blockwise >::CollectiveOp; using GemmKernel = cutlass::gemm::kernel::GemmUniversal< @@ -227,8 +227,7 @@ struct Options { } /// Compute performance in GFLOP/s - double gflops(double runtime_s) const - { + double gflops(double runtime_s) const { // Two flops per multiply-add uint64_t flop = uint64_t(2) * m * n * k; double gflop = double(flop) / double(1.0e9); @@ -237,8 +236,7 @@ struct Options { }; /// Result structure -struct Result -{ +struct Result { double avg_runtime_ms; double gflops; cutlass::Status status; @@ -278,13 +276,16 @@ bool initialize_tensor( if (bits_input == 1) { scope_max = 2; scope_min = 0; - } else if (bits_input <= 8) { + } + else if (bits_input <= 8) { scope_max = 2; scope_min = -2; - } else if (bits_output == 16) { + } + else if (bits_output == 16) { scope_max = 5; scope_min = -5; - } else { + } + else { scope_max = 8; scope_min = -8; } @@ -397,9 +398,8 @@ void initialize(const Options &options) { } /// Populates a Gemm::Arguments structure from the given commandline options -typename Gemm::Arguments args_from_options(const Options &options) -{ - typename Gemm::Arguments arguments{ +typename Gemm::Arguments args_from_options(const Options &options) { + typename Gemm::Arguments arguments { cutlass::gemm::GemmUniversalMode::kGemm, {options.m, options.n, options.k, options.l}, {tensor_A.device_data(), stride_A, @@ -473,8 +473,7 @@ bool verify(const Options &options) { /// Execute a given example GEMM computation template -int run(Options &options) -{ +int run(Options &options) { initialize(options); @@ -515,8 +514,7 @@ int 
run(Options &options) } // Run profiling loop - if (options.iterations > 0) - { + if (options.iterations > 0) { GpuTimer timer; timer.start(); for (int iter = 0; iter < options.iterations; ++iter) { diff --git a/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu b/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu new file mode 100644 index 0000000000..b43869e7f1 --- /dev/null +++ b/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu @@ -0,0 +1,754 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief An FP8 blockwise-scaled grouped GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. + In this example M, N, and K are fixed across groups. 
+*/ + +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/activation.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/dispatch_policy.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_norm.h" + +#include "cutlass/util/reference/host/gett.hpp" + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +using ProblemShape = cutlass::gemm::GroupProblemShape>; // per group + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +// A matrix configuration +using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand +using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// C/D matrix configuration +using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands +using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +using ElementD = ElementC; +using LayoutD = LayoutC; +constexpr int AlignmentD = AlignmentC; + +// MMA type +using ElementAccumulator = float; +using ElementCompute = float; + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 +using MmaTileShape_MNK = Shape<_128,_128,_128>; +// Shape of the threadblocks in a cluster +using ClusterShape_MNK = Shape<_1,_1,_1>; +// Shape of the tile computed by each SM + +using ScaleConfig = decltype(cutlass::detail::sm100_trivial_blockwise_scale_config(MmaTileShape_MNK{})); + +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix operand + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementCompute, + ElementC, LayoutC *, AlignmentC, + 
ElementD, LayoutC *, AlignmentD, + cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm + >::CollectiveOp; + +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, + ElementAccumulator, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::KernelPtrArrayTmaWarpSpecializedBlockwise1SmSm100 + >::CollectiveOp; + +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +using StrideA = typename Gemm::GemmKernel::InternalStrideA; +using StrideB = typename Gemm::GemmKernel::InternalStrideB; +using StrideC = typename Gemm::GemmKernel::InternalStrideC; +using StrideD = typename Gemm::GemmKernel::InternalStrideD; + +static_assert(cute::is_same_v); +static_assert(cute::is_same_v); + +/// Initialization +uint64_t seed; + +// Host-side allocations +std::vector offset_A; +std::vector offset_B; +std::vector offset_C; +std::vector offset_D; +std::vector offset_SFA; +std::vector offset_SFB; + +std::vector stride_A_host; +std::vector stride_B_host; +std::vector stride_C_host; +std::vector stride_D_host; +std::vector layout_SFA_host; +std::vector layout_SFB_host; + +std::vector ptr_ref_D_host; + +std::vector ptr_A_host; +std::vector ptr_B_host; +std::vector ptr_C_host; +std::vector ptr_D_host; +std::vector ptr_SFA_host; +std::vector ptr_SFB_host; + +// Shared Allocations + +cutlass::HostTensor block_A; +cutlass::HostTensor block_B; +cutlass::HostTensor block_C; +cutlass::HostTensor block_D; +cutlass::HostTensor block_ref_D; +cutlass::HostTensor block_SFA; +cutlass::HostTensor block_SFB; + +// Device-side allocations +cutlass::DeviceAllocation problem_sizes; + +cutlass::DeviceAllocation ptr_A; +cutlass::DeviceAllocation ptr_B; +cutlass::DeviceAllocation ptr_C; +cutlass::DeviceAllocation ptr_D; +cutlass::DeviceAllocation ptr_SFA; +cutlass::DeviceAllocation ptr_SFB; + +cutlass::DeviceAllocation stride_A; +cutlass::DeviceAllocation stride_B; +cutlass::DeviceAllocation stride_C; +cutlass::DeviceAllocation stride_D; +cutlass::DeviceAllocation layout_SFA; +cutlass::DeviceAllocation layout_SFB; + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help = false; + bool skip_verification = false; + + float alpha = 1.f, beta = 0.f; + int iterations = 1000; + int m = 1024, n = 2048, k = 512, groups = 10; + std::vector problem_sizes_host; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + if (cmd.check_cmd_line_flag("skip-verification")) { + skip_verification = true; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("groups", groups); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + 
cmd.get_cmd_line_argument("iterations", iterations); + + for (int i = 0; i < groups; ++i) { + problem_sizes_host.push_back({m, n, k}); + } + + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "81_blackwell_grouped_gemm_blockwise\n\n" + << " Blackwell FP8 GEMM with Blockwise Scaling using a Warp Specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --groups= Sets the number of individual GEMM problems for Grouped GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --iterations= Number of profiling iterations to perform.\n\n" + << " --skip-verification Skip verification.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "81_blackwell_grouped_gemm_blockwise" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * groups; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result { + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_tensor( + cutlass::TensorView view, + cutlass::Distribution::Kind dist_kind, + uint64_t seed) { + + if (dist_kind == cutlass::Distribution::Uniform) { + + double scope_max, scope_min; + int bits_input = cutlass::sizeof_bits::value; + int bits_output = cutlass::sizeof_bits::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 8) { + scope_max = 2; + scope_min = -2; + } + else if (bits_output == 16) { + scope_max = 5; + scope_min = -5; + } + else { + scope_max = 8; + scope_min = -8; + } + + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + } + else if (dist_kind == cutlass::Distribution::AllZeros) { + cutlass::reference::host::TensorFill(view); + } + else if (dist_kind == cutlass::Distribution::Identity) { + + cutlass::reference::host::TensorFillIdentity(view); + } + else if (dist_kind == cutlass::Distribution::Gaussian) { + + cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5); + } + else if (dist_kind == cutlass::Distribution::Sequential) { + cutlass::reference::host::BlockFillSequential(view.data(), view.capacity()); + } + else if (dist_kind == cutlass::Distribution::AllOnes) { + cutlass::reference::host::TensorFill(view, Element(1)); + } + else { + throw std::runtime_error("Not implementated."); + } + + return true; +} + +/// Helper to initialize a block of device data (scale_tensors) +template +bool initialize_scale_tensor( + cutlass::TensorView view, + 
cutlass::Distribution::Kind dist_kind, + uint64_t seed) { + + if (dist_kind == cutlass::Distribution::Uniform) { + + double scope_max, scope_min; + + scope_min = -1; + scope_max = 1; + + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + } + else if (dist_kind == cutlass::Distribution::AllZeros) { + cutlass::reference::host::TensorFill(view); + } + else if (dist_kind == cutlass::Distribution::Identity) { + + cutlass::reference::host::TensorFillIdentity(view); + } + else if (dist_kind == cutlass::Distribution::Gaussian) { + + cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5); + } + else if (dist_kind == cutlass::Distribution::Sequential) { + cutlass::reference::host::BlockFillSequential(view.data(), view.capacity()); + } + else if (dist_kind == cutlass::Distribution::AllOnes) { + cutlass::reference::host::TensorFill(view, Element(1)); + } + else { + throw std::runtime_error("Not implementated."); + } + + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(Options const& options) { + int32_t total_elements_A = 0; + int32_t total_elements_B = 0; + int32_t total_elements_C = 0; + int32_t total_elements_D = 0; + int32_t total_elements_SFA = 0; + int32_t total_elements_SFB = 0; + + for (int32_t i = 0; i < options.groups; ++i) { + + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + offset_A.push_back(total_elements_A); + offset_B.push_back(total_elements_B); + offset_C.push_back(total_elements_C); + offset_D.push_back(total_elements_D); + offset_SFA.push_back(total_elements_SFA); + offset_SFB.push_back(total_elements_SFB); + + int32_t elements_A = M * K; + int32_t elements_B = K * N; + int32_t elements_C = M * N; + int32_t elements_D = M * N; + + auto gemm_layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(M, N, K, 1)); + auto gemm_layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(M, N, K, 1)); + + int32_t elements_SFA = cosize(gemm_layout_SFA); + int32_t elements_SFB = cosize(gemm_layout_SFB); + + total_elements_A += elements_A; + total_elements_B += elements_B; + total_elements_C += elements_C; + total_elements_D += elements_D; + total_elements_SFA += elements_SFA; + total_elements_SFB += elements_SFB; + + stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1})); + stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1})); + stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1})); + stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1})); + layout_SFA_host.push_back(gemm_layout_SFA); + layout_SFB_host.push_back(gemm_layout_SFB); + } + + block_A.resize(cutlass::make_Coord(total_elements_A)); + block_B.resize(cutlass::make_Coord(total_elements_B)); + block_C.resize(cutlass::make_Coord(total_elements_C)); + block_D.resize(cutlass::make_Coord(total_elements_D)); + block_ref_D.resize(cutlass::make_Coord(total_elements_D)); + block_SFA.resize(cutlass::make_Coord(total_elements_SFA)); + block_SFB.resize(cutlass::make_Coord(total_elements_SFB)); + + initialize_tensor(block_A.host_view(), cutlass::Distribution::Uniform, seed + 2022); + initialize_tensor(block_B.host_view(), cutlass::Distribution::Uniform, seed + 2023); + initialize_tensor(block_C.host_view(), cutlass::Distribution::Uniform, seed + 2024); + initialize_scale_tensor(block_SFA.host_view(), cutlass::Distribution::Uniform, seed + 2026); 
+ initialize_scale_tensor(block_SFB.host_view(), cutlass::Distribution::Uniform, seed + 2027); + + block_A.sync_device(); + block_B.sync_device(); + block_C.sync_device(); + block_SFA.sync_device(); + block_SFB.sync_device(); + + // copy problem sizes + problem_sizes.reset(options.groups); + problem_sizes.copy_from_host(options.problem_sizes_host.data()); + + std::vector device_ptr_A_host(options.groups); + std::vector device_ptr_B_host(options.groups); + std::vector device_ptr_C_host(options.groups); + std::vector device_ptr_D_host(options.groups); + std::vector device_ptr_SFA_host(options.groups); + std::vector device_ptr_SFB_host(options.groups); + + ptr_A_host = std::vector(options.groups); + ptr_B_host = std::vector(options.groups); + ptr_C_host = std::vector(options.groups); + ptr_D_host = std::vector(options.groups); + ptr_SFA_host = std::vector(options.groups); + ptr_SFB_host = std::vector(options.groups); + ptr_ref_D_host = std::vector(options.groups); + + for (int32_t i = 0; i < options.groups; ++i) { + // Ptrs for A + ptr_A_host.at(i) = block_A.host_data() + offset_A.at(i); + device_ptr_A_host.at(i) = block_A.device_data() + offset_A.at(i); + + // Ptrs for B + ptr_B_host.at(i) = block_B.host_data() + offset_B.at(i); + device_ptr_B_host.at(i) = block_B.device_data() + offset_B.at(i); + + // Ptrs for C + ptr_C_host.at(i) = block_C.host_data() + offset_C.at(i); + device_ptr_C_host.at(i) = block_C.device_data() + offset_C.at(i); + + // Ptrs for D + ptr_D_host.at(i) = block_D.host_data() + offset_D.at(i); + device_ptr_D_host.at(i) = block_D.device_data() + offset_D.at(i); + ptr_ref_D_host.at(i) = block_ref_D.host_data() + offset_D.at(i); + + // Ptrs for SFA + ptr_SFA_host.at(i) = block_SFA.host_data() + offset_SFA.at(i); + device_ptr_SFA_host.at(i) = block_SFA.device_data() + offset_SFA.at(i); + + // Ptrs for SFB + ptr_SFB_host.at(i) = block_SFB.host_data() + offset_SFB.at(i); + device_ptr_SFB_host.at(i) = block_SFB.device_data() + offset_SFB.at(i); + } + + ptr_A.reset(options.groups); + ptr_A.copy_from_host(device_ptr_A_host.data()); + + ptr_B.reset(options.groups); + ptr_B.copy_from_host(device_ptr_B_host.data()); + + ptr_C.reset(options.groups); + ptr_C.copy_from_host(device_ptr_C_host.data()); + + ptr_D.reset(options.groups); + ptr_D.copy_from_host(device_ptr_D_host.data()); + + ptr_SFA.reset(options.groups); + ptr_SFA.copy_from_host(device_ptr_SFA_host.data()); + + ptr_SFB.reset(options.groups); + ptr_SFB.copy_from_host(device_ptr_SFB_host.data()); + + stride_A.reset(options.groups); + stride_A.copy_from_host(stride_A_host.data()); + + stride_B.reset(options.groups); + stride_B.copy_from_host(stride_B_host.data()); + + stride_C.reset(options.groups); + stride_C.copy_from_host(stride_C_host.data()); + + stride_D.reset(options.groups); + stride_D.copy_from_host(stride_D_host.data()); + + layout_SFA.reset(options.groups); + layout_SFA.copy_from_host(layout_SFA_host.data()); + + layout_SFB.reset(options.groups); + layout_SFB.copy_from_host(layout_SFB_host.data()); + +} + +/// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) { + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGrouped, + {options.groups, problem_sizes.get(), options.problem_sizes_host.data()}, + {ptr_A.get(), stride_A.get(), + 
ptr_B.get(), stride_B.get(), + ptr_SFA.get(), layout_SFA.get(), + ptr_SFB.get(), layout_SFB.get() + }, + { + {}, // epilogue.thread + ptr_C.get(), stride_C.get(), + ptr_D.get(), stride_D.get() + }, + hw_info + }; + + auto &fusion_args = arguments.epilogue.thread; + fusion_args.alpha = options.alpha; + fusion_args.beta = options.beta; + + return arguments; +} + +bool verify(const Options &options) { + // + // Compute reference output + // + + block_D.sync_host(); + + for (int i = 0; i < options.groups; ++i) { + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + // Create instantiation for device reference gemm kernel + auto A = cute::make_tensor(ptr_A_host.at(i), + cute::make_layout(cute::make_shape(M, K, 1), stride_A_host.at(i))); + auto B = cute::make_tensor(ptr_B_host.at(i), + cute::make_layout(cute::make_shape(N, K, 1), stride_B_host.at(i))); + auto C = cute::make_tensor(ptr_C_host.at(i), + cute::make_layout(cute::make_shape(M, N, 1), stride_C_host.at(i))); + auto D = cute::make_tensor(ptr_ref_D_host.at(i), + cute::make_layout(cute::make_shape(M, N, 1), stride_D_host.at(i))); + + auto SFA = cute::make_tensor(ptr_SFA_host.at(i), layout_SFA_host.at(i)); + auto SFB = cute::make_tensor(ptr_SFB_host.at(i), layout_SFB_host.at(i)); + + using unused_t = decltype(D); + + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, + decltype(A), + decltype(SFA), + decltype(B), + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; + + cutlass::reference::host::GettEpilogueParams< + ElementAccumulator, + ElementAccumulator, + ElementAccumulator, + ElementCompute, + decltype(C), + decltype(D) + > epilogue_params; + + epilogue_params.C = C; + epilogue_params.D = D; + epilogue_params.alpha = options.alpha; + epilogue_params.beta = options.beta; + + // get reference result + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + } + + bool passed = cutlass::reference::host::TensorEquals(block_ref_D.host_view(), block_D.host_view()); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) { + initialize(options); + + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + Result result; + if (!options.skip_verification) { + // Check if output from CUTLASS kernel and reference kernel are equal or not + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + + if (!result.passed) { + exit(-1); + } + } + + // Run profiling loop + if (options.iterations > 0) { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. 
+      float elapsed_ms = timer.elapsed_millis();
+      result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+      result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
+
+      std::cout << "  Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.groups << std::endl;
+      std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+      std::cout << "  GFLOPS: " << result.gflops << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.0 Toolkit to run this example,
+  // and the GPU must have compute capability at least sm100a.
+
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its actions are a no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (props.major != 10 || props.minor != 0) {
+    std::cerr << "This example requires a GPU with compute capability 100a." << std::endl;
+    return 0;
+  }
+
+  //
+  // Parse options
+  //
+
+  Options options;
+
+  options.parse(argc, args);
+
+  if (options.help) {
+    options.print_usage(std::cout) << std::endl;
+    return 0;
+  }
+
+  //
+  // Run
+  //
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+  run<Gemm>(options);
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+  return 0;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu b/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu
new file mode 100644
index 0000000000..60667cda29
--- /dev/null
+++ b/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu
@@ -0,0 +1,761 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief An FP8 blockwise-scaled grouped GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS. + In this example M, N, and K are fixed across groups. +*/ + +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/activation.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/dispatch_policy.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_norm.h" + +#include "cutlass/util/reference/host/gett.hpp" + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +using ProblemShape = cutlass::gemm::GroupProblemShape>; // per group + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +// A matrix configuration +using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand +using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// C/D matrix configuration +using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands +using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +using ElementD = ElementC; +using LayoutD = LayoutC; +constexpr int AlignmentD = AlignmentC; + +// MMA type +using ElementAccumulator = float; +using ElementCompute = float; + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 
2 SMs if Cluster Shape %2 == 0 +using MmaTileShape_MNK = Shape<_256,_128,_128>; +// Shape of the threadblocks in a cluster +using ClusterShape_MNK = Shape<_2,_1,_1>; +// Shape of the threadblocks participating in a tcgen05 MMA. <1, 1, 1> for cta_group = 1, <2, 1, 1> for cta_group = 2 + +constexpr int ScaleGranularityM = 1; +constexpr int ScaleGranularityN = 128; +constexpr int ScaleGranularityK = 128; +using ScaleConfig = cutlass::detail::Sm100BlockwiseScaleConfig; + +// Note when we have multiple scale factors per tile (in this case 128 scales in M per tile), we will restrict up to a +// 16B alignment if possible (i.e., we have at least 16B of scales in M). +// In this case the smallest M that can be executed is 16. To avoid this for smaller M, you can swap A and B +// and transpose A, B, C, and scales since B^T A^T = C^T. +using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA()); // Layout type for SFA matrix operand +using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB()); // Layout type for SFB matrix operand + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementCompute, + ElementC, LayoutC *, AlignmentC, + ElementD, LayoutC *, AlignmentD, + cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm + >::CollectiveOp; + +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, + ElementA, cute::tuple, AlignmentA, + ElementB, cute::tuple, AlignmentB, + ElementAccumulator, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::KernelPtrArrayTmaWarpSpecializedBlockwise2SmSm100 + >::CollectiveOp; + +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +using StrideA = typename Gemm::GemmKernel::InternalStrideA; +using StrideB = typename Gemm::GemmKernel::InternalStrideB; +using StrideC = typename Gemm::GemmKernel::InternalStrideC; +using StrideD = typename Gemm::GemmKernel::InternalStrideD; + +static_assert(cute::is_same_v); +static_assert(cute::is_same_v); + +/// Initialization +uint64_t seed; + +// Host-side allocations +std::vector offset_A; +std::vector offset_B; +std::vector offset_C; +std::vector offset_D; +std::vector offset_SFA; +std::vector offset_SFB; + +std::vector stride_A_host; +std::vector stride_B_host; +std::vector stride_C_host; +std::vector stride_D_host; +std::vector layout_SFA_host; +std::vector layout_SFB_host; + +std::vector ptr_ref_D_host; + +std::vector ptr_A_host; +std::vector ptr_B_host; +std::vector ptr_C_host; +std::vector ptr_D_host; +std::vector ptr_SFA_host; +std::vector ptr_SFB_host; + +// Shared Allocations + +cutlass::HostTensor block_A; +cutlass::HostTensor block_B; +cutlass::HostTensor block_C; +cutlass::HostTensor block_D; +cutlass::HostTensor block_ref_D; +cutlass::HostTensor block_SFA; +cutlass::HostTensor block_SFB; + +// Device-side allocations +cutlass::DeviceAllocation problem_sizes; + +cutlass::DeviceAllocation ptr_A; +cutlass::DeviceAllocation ptr_B; +cutlass::DeviceAllocation ptr_C; +cutlass::DeviceAllocation ptr_D; +cutlass::DeviceAllocation ptr_SFA; 
+cutlass::DeviceAllocation ptr_SFB; + +cutlass::DeviceAllocation stride_A; +cutlass::DeviceAllocation stride_B; +cutlass::DeviceAllocation stride_C; +cutlass::DeviceAllocation stride_D; +cutlass::DeviceAllocation layout_SFA; +cutlass::DeviceAllocation layout_SFB; + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help = false; + bool skip_verification = false; + + float alpha = 1.f, beta = 0.f; + int iterations = 1000; + int m = 1024, n = 2048, k = 512, groups = 10; + std::vector problem_sizes_host; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + if (cmd.check_cmd_line_flag("skip-verification")) { + skip_verification = true; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("groups", groups); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + + for (int i = 0; i < groups; ++i) { + problem_sizes_host.push_back({m, n, k}); + } + + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "81_blackwell_grouped_gemm_groupwise\n\n" + << " Blackwell FP8 GEMM with Groupwise Scaling using a Warp Specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --groups= Sets the number of individual GEMM problems for Grouped GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --iterations= Number of profiling iterations to perform.\n\n" + << " --skip-verification Skip verification.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "81_blackwell_grouped_gemm_groupwise" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * groups; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result { + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_tensor( + cutlass::TensorView view, + cutlass::Distribution::Kind dist_kind, + uint64_t seed) { + + if (dist_kind == cutlass::Distribution::Uniform) { + + double 
scope_max, scope_min; + int bits_input = cutlass::sizeof_bits::value; + int bits_output = cutlass::sizeof_bits::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 8) { + scope_max = 2; + scope_min = -2; + } + else if (bits_output == 16) { + scope_max = 5; + scope_min = -5; + } + else { + scope_max = 8; + scope_min = -8; + } + + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + } + else if (dist_kind == cutlass::Distribution::AllZeros) { + cutlass::reference::host::TensorFill(view); + } + else if (dist_kind == cutlass::Distribution::Identity) { + + cutlass::reference::host::TensorFillIdentity(view); + } + else if (dist_kind == cutlass::Distribution::Gaussian) { + + cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5); + } + else if (dist_kind == cutlass::Distribution::Sequential) { + cutlass::reference::host::BlockFillSequential(view.data(), view.capacity()); + } + else if (dist_kind == cutlass::Distribution::AllOnes) { + cutlass::reference::host::TensorFill(view, Element(1)); + } + else { + throw std::runtime_error("Not implementated."); + } + + return true; +} + +/// Helper to initialize a block of device data (scale_tensors) +template +bool initialize_scale_tensor( + cutlass::TensorView view, + cutlass::Distribution::Kind dist_kind, + uint64_t seed) { + + if (dist_kind == cutlass::Distribution::Uniform) { + + double scope_max, scope_min; + + scope_min = -1; + scope_max = 1; + + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + } + else if (dist_kind == cutlass::Distribution::AllZeros) { + cutlass::reference::host::TensorFill(view); + } + else if (dist_kind == cutlass::Distribution::Identity) { + + cutlass::reference::host::TensorFillIdentity(view); + } + else if (dist_kind == cutlass::Distribution::Gaussian) { + + cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5); + } + else if (dist_kind == cutlass::Distribution::Sequential) { + cutlass::reference::host::BlockFillSequential(view.data(), view.capacity()); + } + else if (dist_kind == cutlass::Distribution::AllOnes) { + cutlass::reference::host::TensorFill(view, Element(1)); + } + else { + throw std::runtime_error("Not implementated."); + } + + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(const Options &options) { + int32_t total_elements_A = 0; + int32_t total_elements_B = 0; + int32_t total_elements_C = 0; + int32_t total_elements_D = 0; + int32_t total_elements_SFA = 0; + int32_t total_elements_SFB = 0; + + for (int32_t i = 0; i < options.groups; ++i) { + + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + offset_A.push_back(total_elements_A); + offset_B.push_back(total_elements_B); + offset_C.push_back(total_elements_C); + offset_D.push_back(total_elements_D); + offset_SFA.push_back(total_elements_SFA); + offset_SFB.push_back(total_elements_SFB); + + int32_t elements_A = M * K; + int32_t elements_B = K * N; + int32_t elements_C = M * N; + int32_t elements_D = M * N; + + auto gemm_layout_SFA = ScaleConfig::tile_atom_to_shape_SFA(make_shape(M, N, K, 1)); + auto gemm_layout_SFB = ScaleConfig::tile_atom_to_shape_SFB(make_shape(M, N, K, 1)); + + int32_t elements_SFA = cosize(gemm_layout_SFA); + int32_t elements_SFB = cosize(gemm_layout_SFB); + + total_elements_A += elements_A; + total_elements_B += elements_B; + total_elements_C += 
elements_C; + total_elements_D += elements_D; + total_elements_SFA += elements_SFA; + total_elements_SFB += elements_SFB; + + stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1})); + stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1})); + stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1})); + stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1})); + layout_SFA_host.push_back(gemm_layout_SFA); + layout_SFB_host.push_back(gemm_layout_SFB); + } + + block_A.resize(cutlass::make_Coord(total_elements_A)); + block_B.resize(cutlass::make_Coord(total_elements_B)); + block_C.resize(cutlass::make_Coord(total_elements_C)); + block_D.resize(cutlass::make_Coord(total_elements_D)); + block_ref_D.resize(cutlass::make_Coord(total_elements_D)); + block_SFA.resize(cutlass::make_Coord(total_elements_SFA)); + block_SFB.resize(cutlass::make_Coord(total_elements_SFB)); + + initialize_tensor(block_A.host_view(), cutlass::Distribution::Uniform, seed + 2022); + initialize_tensor(block_B.host_view(), cutlass::Distribution::Uniform, seed + 2023); + initialize_tensor(block_C.host_view(), cutlass::Distribution::Uniform, seed + 2024); + initialize_scale_tensor(block_SFA.host_view(), cutlass::Distribution::Uniform, seed + 2026); + initialize_scale_tensor(block_SFB.host_view(), cutlass::Distribution::Uniform, seed + 2027); + + block_A.sync_device(); + block_B.sync_device(); + block_C.sync_device(); + block_SFA.sync_device(); + block_SFB.sync_device(); + + // copy problem sizes + problem_sizes.reset(options.groups); + problem_sizes.copy_from_host(options.problem_sizes_host.data()); + + std::vector device_ptr_A_host(options.groups); + std::vector device_ptr_B_host(options.groups); + std::vector device_ptr_C_host(options.groups); + std::vector device_ptr_D_host(options.groups); + std::vector device_ptr_SFA_host(options.groups); + std::vector device_ptr_SFB_host(options.groups); + + ptr_A_host = std::vector(options.groups); + ptr_B_host = std::vector(options.groups); + ptr_C_host = std::vector(options.groups); + ptr_D_host = std::vector(options.groups); + ptr_SFA_host = std::vector(options.groups); + ptr_SFB_host = std::vector(options.groups); + ptr_ref_D_host = std::vector(options.groups); + + for (int32_t i = 0; i < options.groups; ++i) { + // Ptrs for A + ptr_A_host.at(i) = block_A.host_data() + offset_A.at(i); + device_ptr_A_host.at(i) = block_A.device_data() + offset_A.at(i); + + // Ptrs for B + ptr_B_host.at(i) = block_B.host_data() + offset_B.at(i); + device_ptr_B_host.at(i) = block_B.device_data() + offset_B.at(i); + + // Ptrs for C + ptr_C_host.at(i) = block_C.host_data() + offset_C.at(i); + device_ptr_C_host.at(i) = block_C.device_data() + offset_C.at(i); + + // Ptrs for D + ptr_D_host.at(i) = block_D.host_data() + offset_D.at(i); + device_ptr_D_host.at(i) = block_D.device_data() + offset_D.at(i); + ptr_ref_D_host.at(i) = block_ref_D.host_data() + offset_D.at(i); + + // Ptrs for SFA + ptr_SFA_host.at(i) = block_SFA.host_data() + offset_SFA.at(i); + device_ptr_SFA_host.at(i) = block_SFA.device_data() + offset_SFA.at(i); + + // Ptrs for SFB + ptr_SFB_host.at(i) = block_SFB.host_data() + offset_SFB.at(i); + device_ptr_SFB_host.at(i) = block_SFB.device_data() + offset_SFB.at(i); + } + + ptr_A.reset(options.groups); + ptr_A.copy_from_host(device_ptr_A_host.data()); + + ptr_B.reset(options.groups); + ptr_B.copy_from_host(device_ptr_B_host.data()); + + ptr_C.reset(options.groups); + 
ptr_C.copy_from_host(device_ptr_C_host.data()); + + ptr_D.reset(options.groups); + ptr_D.copy_from_host(device_ptr_D_host.data()); + + ptr_SFA.reset(options.groups); + ptr_SFA.copy_from_host(device_ptr_SFA_host.data()); + + ptr_SFB.reset(options.groups); + ptr_SFB.copy_from_host(device_ptr_SFB_host.data()); + + stride_A.reset(options.groups); + stride_A.copy_from_host(stride_A_host.data()); + + stride_B.reset(options.groups); + stride_B.copy_from_host(stride_B_host.data()); + + stride_C.reset(options.groups); + stride_C.copy_from_host(stride_C_host.data()); + + stride_D.reset(options.groups); + stride_D.copy_from_host(stride_D_host.data()); + + layout_SFA.reset(options.groups); + layout_SFA.copy_from_host(layout_SFA_host.data()); + + layout_SFB.reset(options.groups); + layout_SFB.copy_from_host(layout_SFB_host.data()); + +} + +/// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) { + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGrouped, + {options.groups, problem_sizes.get(), options.problem_sizes_host.data()}, + {ptr_A.get(), stride_A.get(), + ptr_B.get(), stride_B.get(), + ptr_SFA.get(), layout_SFA.get(), + ptr_SFB.get(), layout_SFB.get() + }, + { + {}, // epilogue.thread + ptr_C.get(), stride_C.get(), + ptr_D.get(), stride_D.get() + }, + hw_info + }; + + auto &fusion_args = arguments.epilogue.thread; + fusion_args.alpha = options.alpha; + fusion_args.beta = options.beta; + + return arguments; +} + +bool verify(const Options &options) { + // + // Compute reference output + // + + block_D.sync_host(); + + for (int i = 0; i < options.groups; ++i) { + auto problem = options.problem_sizes_host.at(i); + auto M = get<0>(problem); + auto N = get<1>(problem); + auto K = get<2>(problem); + + // Create instantiation for device reference gemm kernel + auto A = cute::make_tensor(ptr_A_host.at(i), + cute::make_layout(cute::make_shape(M, K, 1), stride_A_host.at(i))); + auto B = cute::make_tensor(ptr_B_host.at(i), + cute::make_layout(cute::make_shape(N, K, 1), stride_B_host.at(i))); + auto C = cute::make_tensor(ptr_C_host.at(i), + cute::make_layout(cute::make_shape(M, N, 1), stride_C_host.at(i))); + auto D = cute::make_tensor(ptr_ref_D_host.at(i), + cute::make_layout(cute::make_shape(M, N, 1), stride_D_host.at(i))); + + auto SFA = cute::make_tensor(ptr_SFA_host.at(i), layout_SFA_host.at(i)); + auto SFB = cute::make_tensor(ptr_SFB_host.at(i), layout_SFB_host.at(i)); + + using unused_t = decltype(D); + + cutlass::reference::host::GettBlockScalingMainloopParams< + ElementAccumulator, + decltype(A), + decltype(SFA), + decltype(B), + decltype(SFB) + > mainloop_params{A, SFA, B, SFB}; + + cutlass::reference::host::GettEpilogueParams< + ElementAccumulator, + ElementAccumulator, + ElementAccumulator, + ElementCompute, + decltype(C), + decltype(D) + > epilogue_params; + + epilogue_params.C = C; + epilogue_params.D = D; + epilogue_params.alpha = options.alpha; + epilogue_params.beta = options.beta; + + // get reference result + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + } + + bool passed = cutlass::reference::host::TensorEquals(block_ref_D.host_view(), block_D.host_view()); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) { + 
  initialize(options);
+
+  // Instantiate CUTLASS kernel depending on templates
+  Gemm gemm;
+
+  // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
+  auto arguments = args_from_options(options);
+
+  // Using the arguments, query for extra workspace required for matrix multiplication computation
+  size_t workspace_size = Gemm::get_workspace_size(arguments);
+
+  // Allocate workspace memory
+  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
+
+  // Check if the problem size is supported or not
+  CUTLASS_CHECK(gemm.can_implement(arguments));
+
+  // Initialize CUTLASS kernel with arguments and workspace pointer
+  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+
+  // Correctness / Warmup iteration
+  CUTLASS_CHECK(gemm.run());
+
+  Result result;
+  if (!options.skip_verification) {
+    // Check if output from CUTLASS kernel and reference kernel are equal or not
+    result.passed = verify(options);
+
+    std::cout << "  Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
+
+    if (!result.passed) {
+      exit(-1);
+    }
+  }
+
+  // Run profiling loop
+  if (options.iterations > 0) {
+    GpuTimer timer;
+    timer.start();
+    for (int iter = 0; iter < options.iterations; ++iter) {
+      CUTLASS_CHECK(gemm.run());
+    }
+    timer.stop();
+
+    // Compute average runtime and GFLOPs.
+    float elapsed_ms = timer.elapsed_millis();
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
+
+    std::cout << "  Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.groups << std::endl;
+    std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << "  GFLOPS: " << result.gflops << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.0 Toolkit to run this example,
+  // and the GPU must have compute capability at least sm100a.
+
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its actions are a no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (props.major != 10 || props.minor != 0) {
+    std::cerr << "This example requires a GPU with compute capability 100a."
<< std::endl; + return 0; + } + + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Run + // +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + run(options); +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/81_blackwell_gemm_blockwise/CMakeLists.txt b/examples/81_blackwell_gemm_blockwise/CMakeLists.txt index a4dc34d09e..8b98154627 100644 --- a/examples/81_blackwell_gemm_blockwise/CMakeLists.txt +++ b/examples/81_blackwell_gemm_blockwise/CMakeLists.txt @@ -54,4 +54,18 @@ cutlass_example_add_executable( TEST_SMALL ) +cutlass_example_add_executable( + 81_blackwell_grouped_gemm_blockwise + 81_blackwell_grouped_gemm_blockwise.cu + TEST_COMMAND_OPTIONS + TEST_SMALL +) + +cutlass_example_add_executable( + 81_blackwell_grouped_gemm_groupwise + 81_blackwell_grouped_gemm_groupwise.cu + TEST_COMMAND_OPTIONS + TEST_SMALL +) + endif() diff --git a/examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu b/examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu new file mode 100644 index 0000000000..f955b8e99b --- /dev/null +++ b/examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu @@ -0,0 +1,869 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Distributed GEMM (DistGEMM) for Blackwell. + + This example runs Tensor Parallel GEMMs using the (experimental) Distributed GEMM API in + CUTLASS. For more information, please refer to README.md. 
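+
+    At a high level, the example shards one large GEMM across TP = 8 GPUs: each GPU
+    is assigned a 1/TP share of the computation, with its local operand slices
+    defined by the DistSchedule selected further down, and the TFLOP/s figure
+    reported at the end is therefore a per-GPU number.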
+ + Note that Distributed GEMM assumes an any-to-any NVLink network topology. + To check whether your device is compatible, run: + + $ nvidia-smi topo -m + + and make sure there's an any-to-any NVLink topology. It would look like this: + + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 + GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 + GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 + GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 + GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 + GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 + GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 + GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 + GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X + + You should also additionally check if the driver enables peer to peer access: + + $ nvidia-smi topo -p2p r + + Output should be something like this: + + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 + GPU0 X OK OK OK OK OK OK OK + GPU1 OK X OK OK OK OK OK OK + GPU2 OK OK X OK OK OK OK OK + GPU3 OK OK OK X OK OK OK OK + GPU4 OK OK OK OK X OK OK OK + GPU5 OK OK OK OK OK X OK OK + GPU6 OK OK OK OK OK OK X OK + GPU7 OK OK OK OK OK OK OK X + + It is recommended to build this target with the following flag to enable + Grid Dependency Control instructions (GDC) in CUTLASS: + - CUTLASS_ENABLE_GDC_FOR_SM100 + + Example: + + $ mkdir build && cd build + + $ cmake .. -DCUTLASS_NVCC_ARCHS="100a" -DCUTLASS_ENABLE_GDC_FOR_SM100=1 + + $ cd examples/82_blackwell_distributed_gemm + + $ make + + $ ./82_blackwell_distributed_gemm +*/ + +#include + +#include "cutlass/cutlass.h" +#include "cutlass/numeric_types.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" + +#include "cutlass/epilogue/dispatch_policy.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/host/error_metrics.h" +#include "cutlass/util/reference/device/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_norm.h" + +// Distributed GEMM headers +#include "cutlass/experimental/distributed/device/dist_gemm_universal_wrapper.hpp" +#include "cutlass/experimental/distributed/kernel/dist_gemm_kernel_wrapper.hpp" +#include "cutlass/experimental/distributed/schedules/dist_gemm_1d_schedules.hpp" + +#include "helper.h" + +// Distributed GEMM helpers +#include "dist_gemm_helpers.h" + +using namespace cute; + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Distributed GEMM configuration +///////////////////////////////////////////////////////////////////////////////////////////////// + +// TP size (= number of processors/GPUs) +using TP = _8; +static constexpr int TP_ = TP{}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \ + (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4)) + +// Distributed GEMM tiling/sharding schedule +// Choices: +// +// * All Gather + GEMM: +// * AllGather1D_TilingCD_RotatingA +// * AllGather1D_TilingCD_RotatingB +// +// * GEMM + Reduce Scatter: +// * 
ReduceScatter1D_TilingA_RotatingC +// * ReduceScatter1D_TilingB_RotatingC + +using DistSchedule = cutlass::distributed::schedules::AllGather1D_TilingCD_RotatingA; + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand +using LayoutB = cutlass::layout::RowMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// C/D matrix configuration +using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands +using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +using ElementD = cutlass::float_e4m3_t; // Element type for C and D matrix operands +using LayoutD = cutlass::layout::RowMajor; // Layout type for C and D matrix operands +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes) + +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ElementCompute = float; // Element type for epilogue computation +using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 +using MmaTileShape_MNK = Shape<_256,_256,_128>; +// Shape of the threadblocks in a cluster +using ClusterShape_MNK = Shape<_2,_1,_1>; +// Shape of the tile computed by each SM +using PerSmTileShape_MNK = Shape<_128, _256, _128>; + +// Build the epilogue +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + PerSmTileShape_MNK, ClusterShape_MNK, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementCompute, + ElementC, LayoutC, AlignmentC, + ElementD, LayoutD, AlignmentD, + cutlass::epilogue::collective::EpilogueScheduleAuto + >::CollectiveOp; + +// Build the mainloop +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutA, AlignmentA, + ElementB, LayoutB, AlignmentB, + ElementAccumulator, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::gemm::collective::StageCountAutoCarveout(sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::KernelTmaWarpSpecialized2SmSm100 + >::CollectiveOp; + +// Compose into a kernel +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape, // Indicates ProblemShape + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + 
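+// The schedules listed above are interchangeable at this point in the file. As a
+// sketch (assuming the schedule templates take the TP count as a template
+// parameter), a GEMM + Reduce Scatter variant of this example would only change
+// the alias selected earlier, for instance:
+//
+//   using DistSchedule = cutlass::distributed::schedules::ReduceScatter1D_TilingA_RotatingC<TP>;
+//
+// The GemmKernel composition above and the DistGemm wrapper types below stay the same.
+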
+// We're going to use the single-device GEMM as reference +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// Instantiate Distributed GEMM kernel +using DistGemmKernel = cutlass::distributed::kernel::DistributedGemmKernelWrapper< + GemmKernel, + DistSchedule +>; +using DistGemm = cutlass::distributed::device::DistributedGemmUniversalAdapter; + +using StrideA = typename Gemm::GemmKernel::StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; + +/// Initialization +StrideA stride_A; +StrideB stride_B; +StrideC stride_C; +StrideD stride_D; +uint64_t seed; + +using HostTensorA = typename cutlass::HostTensor; +using HostTensorB = typename cutlass::HostTensor; +using HostTensorC = typename cutlass::HostTensor; +using HostTensorD = typename cutlass::HostTensor; + +// Reference GEMM tensors +HostTensorA tensor_A; +HostTensorB tensor_B; +HostTensorC tensor_C; +HostTensorD tensor_D; +HostTensorD tensor_ref_D; + +// DistGEMM tensors (multi-device) +HostTensorA tensor_A_arr[TP_]; +HostTensorB tensor_B_arr[TP_]; +HostTensorD tensor_C_arr[TP_]; +HostTensorD tensor_D_arr[TP_]; + +#endif // (defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && (__CUDACC_VER_MAJOR__ >= 12) && (__CUDACC_VER_MINOR__ >= 4)) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help = false; + + float alpha = 1.f, beta = 0.f; + int iterations = 100; + int warmup_iterations = 10; + int m = 16384, n = 106496, k = 16384, l = 1; + float eps = 0.f; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("alpha", alpha); + cmd.get_cmd_line_argument("beta", beta); + cmd.get_cmd_line_argument("iterations", iterations); + cmd.get_cmd_line_argument("warmup-iterations", warmup_iterations); + cmd.get_cmd_line_argument("eps", eps); + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "82_blackwell_distributed_gemm\n\n" + << " Blackwell Distributed GEMM (DistGEMM). 
\n" + << " For more details please refer to the source file.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the L extent (batch) of the GEMM (default: 1)\n" + << " --alpha= Epilogue scalar alpha (default: 1.0)\n" + << " --beta= Epilogue scalar beta (default: 0.0)\n" + << " --iterations= Number of profiling iterations to perform (default: 100)\n" + << " --warmup-iterations= Number of warmup iterations prior to profiling (default: 10)\n" + << " --eps= Threshold for error compared to reference " + << "GEMM (default: 0.0)\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "82_blackwell_distributed_gemm" << " --m=16384 --n=106496 --k=16384 \n\n"; + + return out; + } + + /// Compute performance in TFLOP/s + double tflops(double runtime_s) const { + + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * l / TP_; + double tflop = double(flop) / double(1.0e12); + return tflop / runtime_s; + } +}; + +/// Result structure +struct Result { + double avg_runtime_ms; + double tflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double tflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), tflops(tflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \ + (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4)) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_tensor( + cutlass::TensorView view, + uint64_t seed, + bool is_device_tensor = false) { + + double scope_max, scope_min; + int bits = cutlass::sizeof_bits::value; + + if (bits == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits <= 16) { + scope_max = 2; + scope_min = -2; + } + else { + scope_max = 8; + scope_min = -8; + } + + if (is_device_tensor) { + using Real = typename cutlass::RealType::Type; + cutlass::reference::device::TensorFillRandomUniform( + view, seed, static_cast(scope_max), static_cast(scope_min), 0); + cudaDeviceSynchronize(); + } else { + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); + } + + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(const Options &options) { + auto problem_shape = cute::make_tuple(options.m, options.n, options.k, options.l); + + // Setup (reference) GEMM tensors + auto shape_A = cute::select<0,2,3>(problem_shape); + auto shape_B = cute::select<1,2,3>(problem_shape); + auto shape_C = cute::select<0,1,3>(problem_shape); + auto shape_D = cute::select<0,1,3>(problem_shape); + + stride_A = cutlass::make_cute_packed_stride(StrideA{}, shape_A); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, shape_C); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, shape_D); + + auto a_coord = cutlass::make_Coord(size(shape_A), 1); + auto b_coord = cutlass::make_Coord(size(shape_B), 1); + auto c_coord = cutlass::make_Coord(size(shape_C), 1); + + 
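+  // The tensors resized below hold the full, un-sharded problem and are used for
+  // initialization and for the single-device reference GEMM. The per-device
+  // (sharded) tensors are sized further down from DistSchedule::get_local_*_shape,
+  // so each of the TP devices only holds its own slice of the operands.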
tensor_A.resize(a_coord); + tensor_B.resize(b_coord); + tensor_C.resize(c_coord); + tensor_D.resize(c_coord); + tensor_ref_D.resize(c_coord); + + initialize_tensor(tensor_A.device_view(), seed + 2022, /* is_device_tensor = */ true); + initialize_tensor(tensor_B.device_view(), seed + 2023, /* is_device_tensor = */ true); + initialize_tensor(tensor_C.device_view(), seed + 2024, /* is_device_tensor = */ true); + + tensor_A.sync_host(); + tensor_B.sync_host(); + tensor_C.sync_host(); + tensor_D.sync_host(); + tensor_ref_D.sync_host(); + + // Set up DistGEMM tensors + auto local_shape_A = DistSchedule::get_local_a_shape(problem_shape); + auto local_shape_B = DistSchedule::get_local_b_shape(problem_shape); + auto local_shape_C = DistSchedule::get_local_c_shape(problem_shape); + auto local_shape_D = DistSchedule::get_local_d_shape(problem_shape); + + auto a_coord_device = cutlass::make_Coord(size(local_shape_A), 1); + auto b_coord_device = cutlass::make_Coord(size(local_shape_B), 1); + auto c_coord_device = cutlass::make_Coord(size(local_shape_C), 1); + + int primary_device_idx; + CUDA_CHECK(cudaGetDevice(&primary_device_idx)); + + // Enable any-to-any access + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + int can_access; + CUDA_CHECK(cudaSetDevice(device_idx)); + for (int peer_idx = 0; peer_idx < TP_; ++peer_idx) { + if (peer_idx != device_idx) { + CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access, device_idx, peer_idx)); + if (not can_access) { + std::cerr << "FAILURE: Device " << device_idx << " can't access device " << peer_idx << "." << + std::endl; + exit(EXIT_FAILURE); + } + CUDA_CHECK(cudaDeviceEnablePeerAccess(peer_idx, 0)); + } + } + + tensor_A_arr[device_idx].resize(a_coord_device); + tensor_B_arr[device_idx].resize(b_coord_device); + tensor_C_arr[device_idx].resize(c_coord_device); + tensor_D_arr[device_idx].resize(c_coord_device); + } + CUDA_CHECK(cudaSetDevice(primary_device_idx)); +} + +/// Commandline options -> Gemm/DistGemm Arguments +using GemmArguments = typename Gemm::Arguments; +GemmArguments gemm_args_from_options(const Options &options) { + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, options.l}, + {tensor_A.device_data(), stride_A, tensor_B.device_data(), stride_B}, + { + {static_cast(options.alpha), static_cast(options.beta)}, + tensor_C.device_data(), stride_C, + tensor_ref_D.device_data(), stride_D + } + }; + + return arguments; +} + +using DistGemmArguments = typename DistGemm::Arguments; +DistGemmArguments dist_gemm_args_from_options( + const Options &options, + int device_idx, + cudaStream_t stream) { + + auto problem_shape = cute::make_tuple(options.m, options.n, options.k, options.l); + + auto global_A = cute::make_tensor(tensor_A.device_data(), + cute::make_layout(cute::make_shape(options.m, options.k, options.l), stride_A)); + auto global_B = cute::make_tensor(tensor_B.device_data(), + cute::make_layout(cute::make_shape(options.n, options.k, options.l), stride_B)); + auto global_C = cute::make_tensor(tensor_C.device_data(), + cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_C)); + + auto global_A_device_slice = DistSchedule::get_device_slice_A(global_A, device_idx); + auto global_B_device_slice = DistSchedule::get_device_slice_B(global_B, device_idx); + auto global_C_device_slice = DistSchedule::get_device_slice_C(global_C, device_idx); + + auto local_shape_A = DistSchedule::get_local_a_shape(problem_shape); + auto local_shape_B = 
DistSchedule::get_local_b_shape(problem_shape); + auto local_shape_C = DistSchedule::get_local_c_shape(problem_shape); + auto local_shape_D = DistSchedule::get_local_d_shape(problem_shape); + + auto local_stride_A = cutlass::make_cute_packed_stride(StrideA{}, local_shape_A); + auto local_stride_B = cutlass::make_cute_packed_stride(StrideB{}, local_shape_B); + auto local_stride_C = cutlass::make_cute_packed_stride(StrideC{}, local_shape_C); + auto local_stride_D = cutlass::make_cute_packed_stride(StrideD{}, local_shape_D); + + auto local_A = cute::make_tensor( + tensor_A_arr[device_idx].device_data(), + make_layout(local_shape_A, local_stride_A)); + auto local_B = cute::make_tensor( + tensor_B_arr[device_idx].device_data(), + make_layout(local_shape_B, local_stride_B)); + auto local_C = cute::make_tensor( + tensor_C_arr[device_idx].device_data(), + make_layout(local_shape_C, local_stride_C)); + auto local_D = cute::make_tensor( + tensor_D_arr[device_idx].device_data(), + make_layout(local_shape_D, local_stride_D)); + + // Copy over tensor tiles for the first iteration + cutlass::device_copy(global_A_device_slice, local_A, stream); + cutlass::device_copy(global_B_device_slice, local_B, stream); + cutlass::device_copy(global_C_device_slice, local_C, stream); + + DistGemmArguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, // mode + problem_shape, // problem shape + { + reinterpret_cast(local_A.data()), + local_A.stride(), + reinterpret_cast(local_B.data()), + local_B.stride() + }, // mainloop + { + { // epilogue.thread + static_cast(options.alpha), + static_cast(options.beta) + }, + reinterpret_cast(local_C.data()), + local_C.stride(), + reinterpret_cast(local_D.data()), + local_D.stride(), + }, // epilogue + {}, // hw_info + {} // scheduler + }; + + return arguments; +} + +// Gathers results, moves back to the original full-sized D tensor on the primary device. 
+void gather_results(const Options &options, int device_idx, cudaStream_t stream = nullptr) { + + auto problem_shape = cute::make_tuple(options.m, options.n, options.k, options.l); + + // Global dest + auto global_D = cute::make_tensor(tensor_D.device_data(), + cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_D)); + auto global_D_device_slice = DistSchedule::get_device_slice_D(global_D, device_idx); + + // Device_idx local dest + auto local_shape_D = DistSchedule::get_local_d_shape(problem_shape); + auto local_stride_D = cutlass::make_cute_packed_stride(StrideD{}, local_shape_D); + auto local_D = cute::make_tensor( + tensor_D_arr[device_idx].device_data(), + make_layout(local_shape_D, local_stride_D) + ); + + // Copy to global dest + cutlass::device_copy(local_D, global_D_device_slice, stream); +} + +bool verify(const Options &options) { + tensor_D.sync_host(); + tensor_ref_D.sync_host(); + + bool passed = false; + if (options.eps == 0.f) { + passed = cutlass::reference::host::TensorEquals(tensor_ref_D.host_view(), tensor_D.host_view()); + } else { + double err = cutlass::reference::host::TensorRelativeErrorMetric( + tensor_D.host_view(), + tensor_ref_D.host_view()); + passed = err < 1e-5; + } + + if (options.m <= 64 && options.n <= 64) { + std::cout << "GEMM output:\n" << tensor_D.host_view() << "\n\n"; + std::cout << "Reference output:\n" << tensor_ref_D.host_view() << "\n\n"; + } + + return passed; +} + +/// Execute a given example GEMM computation +int run(Options &options) { + + int primary_device_idx; + cudaError_t device_get_result = cudaGetDevice(&primary_device_idx); + if (device_get_result != cudaSuccess) { + throw std::runtime_error("cudaGetDevice() failed"); + } + + initialize(options); + + // Reference single-GPU GEMM + Gemm reference_gemm; + cutlass::device_memory::allocation reference_workspace; + + auto reference_arguments = gemm_args_from_options(options); + size_t reference_workspace_size = Gemm::get_workspace_size(reference_arguments); + reference_workspace = cutlass::device_memory::allocation(reference_workspace_size); + + CUTLASS_CHECK(reference_gemm.can_implement(reference_arguments)); + CUTLASS_CHECK(reference_gemm.initialize(reference_arguments, reference_workspace.get())); + CUTLASS_CHECK(reference_gemm.run()); + + using ElementBarrier = typename DistGemm::ElementBarrier; + using ElementFlag = typename DistGemmKernel::ElementFlag; + + // Set up per-device streams + cudaStream_t stream_arr[TP_]; + + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + + // Create stream + CUDA_CHECK(cudaStreamCreate(&stream_arr[device_idx])); + } + + // Instantiate DistGEMM + DistGemm dist_gemm_arr[TP_]; // Distributed GEMM array for multiple devices + + // Allocate workspace memory + cutlass::device_memory::allocation workspace_arr[TP_]; + cutlass::device_memory::allocation exclusive_workspace_arr[TP_]; + + // Cross-device workspace pointer array for gemm.initialize() + void * workspace_ptr_arr[TP_]; + void * exclusive_workspace_ptr_arr[TP_]; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + DistGemmArguments arguments_[TP_]; + + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + + arguments_[device_idx] = dist_gemm_args_from_options(options, device_idx, stream_arr[device_idx]); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = 
DistGemm::get_workspace_size(arguments_[device_idx]); + size_t exclusive_workspace_size = DistGemm::get_exclusive_workspace_size(); + + workspace_arr[device_idx] = cutlass::device_memory::allocation(workspace_size); + exclusive_workspace_arr[device_idx] = cutlass::device_memory::allocation(exclusive_workspace_size); + + // Throw workspace pointers into arrays for gemm.initialize() + workspace_ptr_arr[device_idx] = workspace_arr[device_idx].get(); + exclusive_workspace_ptr_arr[device_idx] = exclusive_workspace_arr[device_idx].get(); + + // Zero out exclusive workspace + cudaMemsetAsync(exclusive_workspace_ptr_arr[device_idx], 0, exclusive_workspace_size, stream_arr[device_idx]); + + cudaDeviceSynchronize(); + } + + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + + // Check if the problem size is supported or not + CUTLASS_CHECK(dist_gemm_arr[device_idx].can_implement(arguments_[device_idx])); + +#if defined(CUTLASS_ENABLE_GDC_FOR_SM100) + bool launch_with_pdl = true; +#else + bool launch_with_pdl = false; +#endif + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(dist_gemm_arr[device_idx].initialize( + arguments_, + workspace_ptr_arr, + exclusive_workspace_ptr_arr, + device_idx, + stream_arr[device_idx], + launch_with_pdl + )); + + cudaDeviceSynchronize(); + } + + // Correctness / Warmup iteration + std::cout << std::endl << " running DistGEMM..." << std::endl; + + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + CUTLASS_CHECK(dist_gemm_arr[device_idx].run(stream_arr[device_idx])); + } + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaStreamSynchronize(stream_arr[device_idx])); + CUDA_CHECK(cudaGetLastError()); + gather_results(options, device_idx); + } + + std::cout << " running DistGEMM finished without runtime errors" << std::endl; + + //// Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + + result.passed = verify(options); + + std::cout << std::endl << " Disposition (eps: " << options.eps << "): " << + (result.passed ? "Passed" : "Failed") << std::endl; + + if (!result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) { + float elapsed_ms = 0.f; + + // Warmup + std::cout << " Warming up for " << options.warmup_iterations << " iterations." << std::endl; + for (int warmup_iter = 0; warmup_iter < options.warmup_iterations; ++warmup_iter) { + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + CUTLASS_CHECK(dist_gemm_arr[device_idx].run(stream_arr[device_idx])); + } + } + + for (int device_idx = 0; device_idx < TP_; ++device_idx) { + CUDA_CHECK(cudaSetDevice(device_idx)); + CUDA_CHECK(cudaStreamSynchronize(stream_arr[device_idx])); + } + + CUDA_CHECK(cudaSetDevice(primary_device_idx)); + + // Benchmark + std::cout << " Profiling for " << options.iterations << " iterations." 
<< std::endl;
+    using AtomicBoolean = cuda::atomic<bool>;
+    AtomicBoolean* atomic_flag_ptr;
+    CUDA_CHECK(cudaHostAlloc(&atomic_flag_ptr, sizeof(AtomicBoolean), cudaHostAllocPortable));
+    atomic_flag_ptr->store(false);
+
+    cutlass::DistGpuTimer timer;
+
+    for (int device_idx = 0; device_idx < TP_; ++device_idx) {
+      CUDA_CHECK(cudaSetDevice(device_idx));
+      cutlass::delay_kernel<<<1, 1, 0, stream_arr[device_idx]>>>(atomic_flag_ptr);
+      CUDA_CHECK(cudaGetLastError());
+    }
+
+    for (int device_idx = 0; device_idx < TP_; ++device_idx) {
+      timer.start(device_idx, stream_arr[device_idx]);
+    }
+
+    atomic_flag_ptr->store(true);
+
+    for (int profile_iter = 0; profile_iter < options.iterations; ++profile_iter) {
+      for (int device_idx = 0; device_idx < TP_; ++device_idx) {
+        CUDA_CHECK(cudaSetDevice(device_idx));
+        CUTLASS_CHECK(dist_gemm_arr[device_idx].run(stream_arr[device_idx]));
+      }
+    }
+
+    for (int device_idx = 0; device_idx < TP_; ++device_idx) {
+      CUDA_CHECK(cudaSetDevice(device_idx));
+      timer.stop(device_idx, stream_arr[device_idx]);
+    }
+
+    CUDA_CHECK(cudaSetDevice(primary_device_idx));
+
+    for (int device_idx = 0; device_idx < TP_; ++device_idx) {
+      elapsed_ms = max(elapsed_ms, timer.elapsed_millis(device_idx));
+    }
+
+    // Compute average runtime and TFLOPs.
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    double avg_runtime_s = (double)(result.avg_runtime_ms / 1000.0);
+    result.tflops = options.tflops(avg_runtime_s);
+
+    auto [local_M, local_N, local_K, local_L] = DistSchedule::get_local_gemm_shape(
+      cute::make_tuple(options.m, options.n, options.k, options.l));
+
+    std::cout << std::endl;
+    std::cout << "  TP: " << TP::value << std::endl;
+    std::cout << "  Problem Size: " <<
+      options.m << " x " <<
+      options.n << " x " <<
+      options.k << " x " <<
+      options.l << std::endl;
+    std::cout << "  Local GEMM Problem Size: " <<
+      local_M << " x " <<
+      local_N << " x " <<
+      local_K << " x " <<
+      local_L << std::endl;
+    std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << "  TFLOPS: " << result.tflops << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4))
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with CUDA Toolkit 12.4 or newer to run this example,
+  // and the GPU must have compute capability 100 (Blackwell SM100).
+  // Some necessary cuda graph APIs were only introduced in CUDA 12.4.
+  if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 4)) {
+    std::cerr << "This example requires CUDA 12.4 or newer." << std::endl;
+    // Returning zero so this test passes on older Toolkits. Its actions are no-op.
+    return 0;
+  }
+
+  int num_devices;
+  CUDA_CHECK(cudaGetDeviceCount(&num_devices));
+  if (num_devices < TP_) {
+    std::cerr << "Distributed GEMM is compiled with TP = " << TP::value << ", but " <<
+      "found only " << num_devices << " devices." <<
+      std::endl;
+    // Returning zero so this test passes when fewer than TP devices are available. Its actions are no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (props.major != 10 || props.minor != 0) {
+    std::cerr
+      << "This example requires a GPU of NVIDIA's Blackwell Architecture "
+      << "(compute capability 100), "
+      << "got compute capability " << props.major * 10 + props.minor << "."
+      << std::endl;
+    return 0;
+  }
+
+  //
+  // Parse options
+  //
+
+  Options options;
+
+  options.parse(argc, args);
+
+  if (options.help) {
+    options.print_usage(std::cout) << std::endl;
+    return 0;
+  }
+
+  //
+  // Evaluate CUTLASS kernels
+  //
+
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4))
+  run(options);
+#endif
+
+  return 0;
+}
diff --git a/examples/82_blackwell_distributed_gemm/CMakeLists.txt b/examples/82_blackwell_distributed_gemm/CMakeLists.txt
new file mode 100644
index 0000000000..fa8fe9adee
--- /dev/null
+++ b/examples/82_blackwell_distributed_gemm/CMakeLists.txt
@@ -0,0 +1,32 @@
+# Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+cutlass_example_add_executable(
+  82_blackwell_distributed_gemm
+  82_blackwell_distributed_gemm.cu
+  )
diff --git a/examples/82_blackwell_distributed_gemm/README.md b/examples/82_blackwell_distributed_gemm/README.md
new file mode 100644
index 0000000000..6f6c19b867
--- /dev/null
+++ b/examples/82_blackwell_distributed_gemm/README.md
@@ -0,0 +1,37 @@
+# Blackwell Distributed GEMM
+
+This example implements Tensor Parallel GEMMs for the Blackwell architecture with the experimental
+[Distributed GEMM](../../include/cutlass/experimental/distributed) API in CUTLASS.
+
+This example requires Blackwell GPUs with an any-to-any NVLink network.
+Please refer to [REQUIREMENTS.md](REQUIREMENTS.md) for more information.
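+
+At a high level, the example wraps a regular single-device Blackwell GEMM kernel with a
+tensor-parallel schedule. A rough sketch of that composition, adapted from
+[82_blackwell_distributed_gemm.cu](82_blackwell_distributed_gemm.cu) (`GemmKernel` and
+`DistSchedule` are defined earlier in that file; the comments here are illustrative):
+
+```cpp
+// Single-device GEMM, also used as the verification reference.
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Wrap the base kernel with a distribution schedule to obtain the multi-GPU DistGEMM.
+using DistGemmKernel = cutlass::distributed::kernel::DistributedGemmKernelWrapper<
+  GemmKernel,
+  DistSchedule
+>;
+using DistGemm = cutlass::distributed::device::DistributedGemmUniversalAdapter<DistGemmKernel>;
+```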
+ +By default, the example assumes 8 GPUs (TP=8) and runs an All Gather + GEMM operation, which rotates +operand A. To run with a different number of GPUs or schedule, please refer to +[82_blackwell_distributed_gemm.cu](82_blackwell_distributed_gemm.cu). + + +## Getting started + +Command line arguments are mostly similar to other examples: + +``` +--m= Sets the M extent of the GEMM +--n= Sets the N extent of the GEMM +--k= Sets the K extent of the GEMM +--l= Sets the L extent (batch) of the GEMM (default: 1) +--alpha= Epilogue scalar alpha (default: 1.0) +--beta= Epilogue scalar beta (default: 0.0) +--iterations= Number of profiling iterations to perform (default: 100) +--warmup-iterations= Number of warmup iterations prior to profiling (default: 10) +--eps= Threshold for error compared to reference GEMM (default: 0.0) +``` + +Sample run command: + +```bash +./82_blackwell_distributed_gemm --m=16384 --n=106496 --k=16384 --warmup-iterations=10 --iterations=100 +``` + +This example follows the [Hopper example](../65_distributed_gemm/) very closely, and only differs in the base GEMM kernel. For +more information you can refer to [that example](../65_distributed_gemm/README.md). diff --git a/examples/82_blackwell_distributed_gemm/REQUIREMENTS.md b/examples/82_blackwell_distributed_gemm/REQUIREMENTS.md new file mode 100644 index 0000000000..3943716b2c --- /dev/null +++ b/examples/82_blackwell_distributed_gemm/REQUIREMENTS.md @@ -0,0 +1,86 @@ +# Blackwell Distributed GEMM + +## Requirements + +### Build +Make sure to set up CUTLASS with +support for [Programmatic Dependent Launch (PDL)](../../media/docs/dependent_kernel_launch.md), +that is with the `CUTLASS_ENABLE_GDC_FOR_SM100` flag. + +```bash +cmake $PATH -DCUTLASS_NVCC_ARCHS="100a" -DCUTLASS_ENABLE_GDC_FOR_SM100=1 +``` + +### Minimum software + +Like all other CUTLASS examples, the NVIDIA driver, runtime, and CUDA Toolkit are required. +This example specifically requires CUDA Toolkit 12.6 or newer, due to some of the necessary +CUDA graph APIs. + +### Hardware / driver settings + +This example requires Blackwell GPUs with NVLink network. 
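+
+In addition to the `nvidia-smi` checks below, peer (P2P) access can be verified programmatically
+with the CUDA runtime; the example itself performs a similar check at startup. A minimal
+standalone sketch (the file name `p2p_check.cu` is just a suggestion):
+
+```cpp
+// p2p_check.cu: report whether every pair of visible GPUs can access each other's memory.
+#include <cstdio>
+#include <cuda_runtime.h>
+
+int main() {
+  int num_devices = 0;
+  cudaGetDeviceCount(&num_devices);
+  bool all_ok = true;
+  for (int i = 0; i < num_devices; ++i) {
+    for (int j = 0; j < num_devices; ++j) {
+      if (i == j) continue;
+      int can_access = 0;
+      cudaDeviceCanAccessPeer(&can_access, i, j);
+      if (!can_access) {
+        std::printf("Device %d cannot access device %d\n", i, j);
+        all_ok = false;
+      }
+    }
+  }
+  std::printf(all_ok ? "Peer access OK between all device pairs\n"
+                     : "Peer access NOT available between all device pairs\n");
+  return all_ok ? 0 : 1;
+}
+```
+
+Build it with `nvcc p2p_check.cu -o p2p_check` and run it on the target node.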
+ +If you're not sure, first run the following command and make sure your GPU +compute capability is 10.0: + +```bash +nvidia-smi --query-gpu=name,compute_cap --format=csv +``` + +Sample output: + +``` +name, compute_cap +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +NVIDIA B200, 10.0 +``` + + +Then you should make sure there is an NVLink network by checking the GPU network topology, +and making sure there's `NV*` links between every pair of GPUs: + +```bash +nvidia-smi topo -m +``` + +Sample output: + +``` + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 +GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 +GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 +GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 +GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 +GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 +GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 +GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 +GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X +``` + +Finally, check if the driver enables peer to peer access, which should usually be the case, +but it's good to check anyway: + +```bash +nvidia-smi topo -p2p r +``` + +Sample output: + +``` + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 +GPU0 X OK OK OK OK OK OK OK +GPU1 OK X OK OK OK OK OK OK +GPU2 OK OK X OK OK OK OK OK +GPU3 OK OK OK X OK OK OK OK +GPU4 OK OK OK OK X OK OK OK +GPU5 OK OK OK OK OK X OK OK +GPU6 OK OK OK OK OK OK X OK +GPU7 OK OK OK OK OK OK OK X +``` diff --git a/examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu b/examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu new file mode 100644 index 0000000000..d428047219 --- /dev/null +++ b/examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu @@ -0,0 +1,607 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief An FP16 sparse GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS.
+
+    The Blackwell SM100 CUTLASS kernel uses the following Blackwell SM100 features:
+
+    1. A new series of Tensor Core MMA Instructions (tcgen05) introduced on the Blackwell architecture (sm100a)
+    which have 2x throughput compared to Hopper Tensor Core MMA instructions (WGMMA).
+
+    Note that Hopper WGMMA Tensor Core MMA instructions are not compatible with Blackwell (See https://docs.nvidia.com/cuda/parallel-thread-execution).
+
+    2. A new per-SM memory called Tensor Memory (TMEM) introduced on the Blackwell architecture (sm100a).
+    Blackwell SM100 Tensor Core MMA instructions store their accumulation results in TMEM instead of the
+    Register File. (Please refer to CUDA 12.8 docs on https://docs.nvidia.com/cuda/).
+
+    3. An extended flavor of the warp-specialized kernel design introduced in Hopper, enabled by use of TMEM,
+    which allows us to decouple the execution of MMA and epilogue into separate warps.
+
+    4. A new SW-controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution).
+
+    Usage:
+      $ ./examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm --m=8192 --n=8192 --k=8192
+*/
+
+#include <iostream>
+
+#include "cutlass/cutlass.h"
+
+#include "cute/tensor.hpp"
+#include "cutlass/tensor_ref.h"
+#include "cutlass/epilogue/thread/linear_combination.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/gemm/kernel/tile_scheduler_params.h"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/host_tensor.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/tensor_view_io.h"
+#include "cutlass/util/reference/device/gemm.h"
+#include "cutlass/util/reference/device/tensor_compare.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp"
+#include "cutlass/transform/device/transform_universal_adapter.hpp"
+
+#include "helper.h"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// A matrix configuration
+using         ElementA    = half_t;                                          // Element type for A matrix operand
+using         LayoutTagA  = cutlass::layout::RowMajor;                       // Layout type for A matrix operand
+constexpr int AlignmentA  = 2 * 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes), 2x for compress along k
+
+// E matrix config
+using         ElementE    = cute::uint8_t;
+
+// B matrix configuration
+using         ElementB    = half_t;                                          // Element type for B matrix operand
+using         LayoutTagB  = cutlass::layout::ColumnMajor;                    // Layout type for B matrix operand
+constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<ElementB>::value;     // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
+
+// C/D matrix configuration
+using         ElementD    = float;                                           // Element type for D matrix operand
+using         ElementC
= float; // Element type for C matrix operand +using LayoutTagC = cutlass::layout::ColumnMajor; // Layout type for C matrix operand +using LayoutTagD = cutlass::layout::ColumnMajor; // Layout type for D matrix operand +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassSparseTensorOp; // Operator class tag + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 +using MmaTileShape_MNK = Shape<_256,_128,_64>; +// Shape of the threadblocks in a cluster +using ClusterShape_MNK = Shape<_2,_1,_1>; + +// Build the epilogue +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutTagC, AlignmentC, + ElementD, LayoutTagD, AlignmentD, + cutlass::epilogue::TmaWarpSpecialized2Sm + >::CollectiveOp; + +// Build the mainloop +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutTagA, AlignmentA, + ElementB, LayoutTagB, AlignmentB, + ElementAccumulator, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::gemm::collective::StageCountAutoCarveoutEpi, + cutlass::gemm::KernelSparseTmaWarpSpecialized2SmSm100 + >::CollectiveOp; + +using ProblemShape = Shape; + +// Compose into a kernel +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// Reference device GEMM implementation type +using DeviceGemmReference = cutlass::reference::device::Gemm< + ElementA, + LayoutTagA, + ElementB, + LayoutTagB, + ElementC, + LayoutTagC, + ElementAccumulator, + ElementAccumulator>; + +// Layouts +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +using StrideA = cutlass::gemm::TagToStrideA_t; +using StrideE = StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; + +// +// Compressor +// +using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + +using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig>; + +using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig, + ArchTag>; + +using Compressor = cutlass::transform::device::TransformUniversalAdapter; + +// +// Data members +// + +/// Initialization +LayoutA layout_A; +LayoutE layout_E; +StrideA stride_A; +StrideA stride_A_compressed; +StrideE stride_E; +StrideB stride_B; +StrideC stride_C; +StrideD stride_D; + +uint64_t seed; + +ProblemShape 
problem_shape; + +cutlass::DeviceAllocation block_A; +cutlass::DeviceAllocation block_A_compressed; +cutlass::DeviceAllocation block_E; +cutlass::DeviceAllocation block_B; +cutlass::DeviceAllocation block_C; +cutlass::DeviceAllocation block_D; +cutlass::DeviceAllocation block_ref_D; + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help; + + float alpha, beta; + int iterations; + int m, n, k, l; + + Options(): + help(false), + m(8192), n(8192), k(8192), l(1), + alpha(1.f), beta(0.f), + iterations(10) + { } + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "83_blackwell_sparse_gemm\n\n" + << " Blackwell FP16 Sparse GEMM example.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the L extent of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "83_blackwell_sparse_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_block( + cutlass::DeviceAllocation& block, + uint64_t seed=2023) { + + Element scope_max, scope_min; + constexpr int bits_input = cutlass::sizeof_bits::value; + + if constexpr (bits_input == 1) { + scope_max = Element(2); + scope_min = Element(0); + } + else if constexpr (bits_input <= 8) { + scope_max = Element(2); + scope_min = Element(-2); + } + else { + scope_max = Element(8); + scope_min = Element(-8); + } + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), 
seed, scope_max, scope_min, 0); + return true; +} + +/// Make A structured sparse by replacing elements with 0 and compress it +bool sparsify_and_compress() +{ + auto [M, N, K, L] = problem_shape; + CompressorUtility compressor_utility(problem_shape, stride_A); + + // TensorE + // In unit of ElementE (uint8_t), after alignment requirement + // M-dim: TensorEAtom_M alignment + // K-dim: TensorEAtom_K alignment + int KAlignedE = compressor_utility.get_metadata_k_physical(); + int MAlignedE = compressor_utility.get_metadata_m_physical(); + + // TensorA Compressed + // In unit of ElementARaw, after alignment requirement + // M-dim: TMA alignment + // K-dim: TMA alignment + int KAlignedAC = compressor_utility.get_tensorA_k_physical(); + int MAlignedAC = compressor_utility.get_tensorA_m_physical(); + + block_A_compressed.reset(M * KAlignedAC * L); + block_E.reset(MAlignedE * KAlignedE * L); + + stride_A_compressed = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(M, KAlignedAC, L)); + stride_E = cutlass::make_cute_packed_stride(StrideE{}, cute::make_shape(MAlignedE, KAlignedE, L)); + + // Random 50% fill zero is performed on host + std::vector block_A_host(block_A.size()); + cutlass::device_memory::copy_to_host(block_A_host.data(), block_A.get(), block_A.size()); + compressor_utility.structure_sparse_zero_mask_fill(block_A_host.data(), static_cast(seed + 2024)); + cutlass::device_memory::copy_to_device(block_A.get(), block_A_host.data(), block_A.size()); + + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments { + problem_shape, + { block_A.get(), + stride_A, + block_A_compressed.get(), + block_E.get() }, + {hw_info} }; + + Compressor compressor_op; + size_t workspace_size = Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + + cutlass::Status status {cutlass::Status::kSuccess }; + status = compressor_op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.run(); + if (status != cutlass::Status::kSuccess) { + return false; + } + + auto result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + return false; + } + + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(const Options &options) { + + stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1}); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1}); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1}); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1}); + + block_A.reset(options.m * options.k); + block_B.reset(options.k * options.n); + block_C.reset(options.m * options.n); + block_D.reset(options.m * options.n); + block_ref_D.reset(options.m * options.n); + + initialize_block(block_A, seed + 2023); + initialize_block(block_B, seed + 2022); + initialize_block(block_C, seed + 2021); + + // Compress row A and get A_compress and E + problem_shape = make_tuple(options.m, options.n, options.k, options.l); + if (not sparsify_and_compress()) { + return false; + }; + + // Build the compressed/metadata layouts + layout_A = 
SparseConfig::fill_layoutA(problem_shape); + layout_E = SparseConfig::fill_layoutE(problem_shape); + + return true; +} + +/// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) +{ + typename Gemm::Arguments arguments { + cutlass::gemm::GemmUniversalMode::kGemm, + problem_shape, + { block_A_compressed.get(), layout_A, block_B.get(), stride_B, block_E.get(), layout_E }, + {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D} + }; + + return arguments; +} + +bool verify(const Options &options) { + cutlass::TensorRef ref_A(block_A.get(), Gemm::LayoutA::packed({options.m, options.k})); + cutlass::TensorRef ref_B(block_B.get(), Gemm::LayoutB::packed({options.k, options.n})); + cutlass::TensorRef ref_C(block_C.get(), Gemm::LayoutC::packed({options.m, options.n})); + cutlass::TensorRef ref_D(block_ref_D.get(), Gemm::LayoutD::packed({options.m, options.n})); + + // + // Compute reference output + // + + // Create instantiation for device reference gemm kernel + DeviceGemmReference gemm_reference; + + // Launch device reference gemm kernel + gemm_reference( + {options.m, options.n, options.k}, + ElementAccumulator(options.alpha), + ref_A, + ref_B, + ElementAccumulator(options.beta), + ref_C, + ref_D); + + // Wait for kernel to finish + CUDA_CHECK(cudaDeviceSynchronize()); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size()); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) +{ + auto init_pass = initialize(options); + if (not init_pass) { + std::cout << "Initialization failure" << std::endl; + exit(EXIT_FAILURE); + } + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + cudaDeviceSynchronize(); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + + if (not result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. 
+    float elapsed_ms = timer.elapsed_millis();
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
+
+    std::cout << "  Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
+    std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << "  GFLOPS: " << result.gflops << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with CUDA Toolkit 12.8 or newer to run this example
+  // and must have compute capability at least 100.
+  if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
+    std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
+    // Returning zero so this test passes on older Toolkits. Its actions are no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (not (props.major == 10 && props.minor == 0)) {
+    std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
+    return 0;
+  }
+
+  //
+  // Parse options
+  //
+
+  Options options;
+
+  options.parse(argc, args);
+
+  if (options.help) {
+    options.print_usage(std::cout) << std::endl;
+    return 0;
+  }
+
+  //
+  // Evaluate CUTLASS kernels
+  //
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+  run(options);
+#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
+
+  return 0;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/83_blackwell_sparse_gemm/CMakeLists.txt b/examples/83_blackwell_sparse_gemm/CMakeLists.txt
new file mode 100644
index 0000000000..765ef4c4ad
--- /dev/null
+++ b/examples/83_blackwell_sparse_gemm/CMakeLists.txt
@@ -0,0 +1,38 @@
+
+# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +if (CUTLASS_NVCC_ARCHS MATCHES 100a) + +cutlass_example_add_executable( + 83_blackwell_sparse_gemm + 83_blackwell_sparse_gemm.cu +) + +endif() diff --git a/examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu b/examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu new file mode 100644 index 0000000000..d2d87c4697 --- /dev/null +++ b/examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu @@ -0,0 +1,693 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief A Narrow Precision Sparse GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture. + + This example demonstrates a simple way to instantiate and run a blockscaled NVFP4 Sparse GEMM on the NVIDIA Blackwell SM100 architecture. 
+ + The Blackwell SM100 CUTLASS kernel uses the new Block Scaled Tensor Core MMA Instructions (tcgen05.mma.blockscaled) introduced + on the Blackwell architecture (sm100a) which have 2x throughput compared to fp8 Tensor Core MMA instructions (tcgen05.mma) + and 4x throughput compared to fp8 Hopper Tensor Core MMA Instructions (WGMMA) (See https://docs.nvidia.com/cuda/parallel-thread-execution). + + Similar to 83_blackwell_sparse_gemm, this kernel leverages: + 1. Per-SM memory called Tensor Memory (TMEM) (Please refer to CUDA 12.8 docs on https://docs.nvidia.com/cuda/). + + 2. The extended warp-specialized kernel design introduced in Hopper enabled by use of TMEM + which allows us to decouple the execution of MMA and epilogue into separate warps. + + 3. A new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution). + + Usage: + $ ./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm --m=2048 --n=2048 --k=2048 +*/ + +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/detail/sm100_blockscaled_layout.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp" +#include "cutlass/transform/device/transform_universal_adapter.hpp" + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::float_e2m1_t; +using ElementAPair = cutlass::nv_float4_t; // Element type for A matrix operand +using LayoutTagA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 64; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes), 2x for compress along k + +// E matrix config +using ElementE = cute::uint8_t; +using LayoutTagE = LayoutTagA; + +// B matrix configuration +using ElementB = cutlass::float_e2m1_t; +using ElementBPair = cutlass::nv_float4_t; // Element type for B matrix operand +using LayoutTagB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 32; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// SF +using ElementSF = typename ElementAPair::ScaleFactorType; + +// C/D matrix configuration 
+using ElementD = cutlass::bfloat16_t; // Element type for D matrix operand +using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand +using LayoutTagC = cutlass::layout::RowMajor; // Layout type for C matrix operand +using LayoutTagD = cutlass::layout::RowMajor; // Layout type for D matrix operand +constexpr int AlignmentD = (16 * 8) / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) +constexpr int AlignmentC = (16 * 8) / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassBlockScaledSparseTensorOp; // Operator class tag + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 +using MmaTileShape = Shape<_256,_128,_256>; +// Shape of the threadblocks in a cluster +using ClusterShape = Shape<_2,_1,_1>; + +// Build the epilogue +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + MmaTileShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutTagC, AlignmentC, + ElementD, LayoutTagD, AlignmentD, + cutlass::epilogue::TmaWarpSpecialized2SmNvf4 + >::CollectiveOp; + +// Build the mainloop +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementAPair, LayoutTagA, AlignmentA, + ElementBPair, LayoutTagB, AlignmentB, + ElementAccumulator, + MmaTileShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveoutEpi, + cutlass::gemm::KernelSparseTmaWarpSpecialized2SmNvf4Sm100 + >::CollectiveOp; + +using ProblemShape = Shape; + +// Compose into a kernel +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// +// Blockscale +// +using Sm1xxBlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; +using Blk_MN = typename Sm1xxBlkScaledConfig::Blk_MN; +using Blk_SF = typename Sm1xxBlkScaledConfig::Blk_SF; +using SfAtom = typename Sm1xxBlkScaledConfig::SfAtom; + +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +using StrideA = cutlass::gemm::TagToStrideA_t; +using StrideE = StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; + +using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride. +using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride. 
+ +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; + +// +// Compressor +// +using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + +using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig>; + +using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig, + ArchTag>; + +using Compressor = cutlass::transform::device::TransformUniversalAdapter; + +// +// Data members +// + +/// Initialization +StrideA stride_A; +StrideA stride_A_compressed; +StrideE stride_E; +StrideB stride_B; +StrideC stride_C; +StrideD stride_D; + +LayoutA layout_A; +LayoutE layout_E; +LayoutSFA layout_SFA; +LayoutSFB layout_SFB; + +typename LayoutTagA::Stride stride_factor_A; +typename LayoutTagB::Stride stride_factor_B; +typename LayoutTagE::Stride stride_factor_E; +typename LayoutTagC::Stride stride_factor_C; +typename LayoutTagD::Stride stride_factor_D; + +uint64_t seed; + +ProblemShape problem_shape; + +// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device +// Use cute::Tensor and cute::Layout for iterating thru the matrix elements +cutlass::HostTensor tensor_A; +cutlass::HostTensor tensor_A_compressed; +cutlass::HostTensor tensor_E; +cutlass::HostTensor tensor_B; +cutlass::HostTensor tensor_C; +cutlass::HostTensor tensor_SFA; +cutlass::HostTensor tensor_SFB; +cutlass::HostTensor tensor_D; +cutlass::HostTensor reference_D; + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +template +auto make_iterator(T* ptr) { + using namespace cute; + if constexpr (cute::is_subbyte_v) { + return subbyte_iterator(ptr); + } + else { + return ptr; + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help; + + float alpha, beta; + int iterations; + int m, n, k, l; + + Options(): + help(false), + m(1024), n(1024), k(1024), l(1), + alpha(1.f), beta(0.f), + iterations(10) + { } + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + + /// Prints the usage statement. 
+ std::ostream & print_usage(std::ostream &out) const { + + out << "84a_blackwell_nvfp4_bf16_sparse_gemm\n\n" + << " Blackwell NVFP4 GEMM using a Warp Specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the L extent of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + + out << "\n\nExamples:\n\n" + << "$ " << "./examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +void initialize_tensor( + cutlass::TensorView view, + uint64_t seed) { + + double scope_max, scope_min; + int bits_input = cutlass::sizeof_bits::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 6) { + scope_max = 2; + scope_min = -2; + } + else if (bits_input <= 8) { + if constexpr (cute::is_same_v){ + scope_max = 4; + scope_min = 1; + } + else { + scope_max = 1; + scope_min = -1; + } + } + else{ + scope_max = 4; + scope_min = -4; + } + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); +} + +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(const Options &options) { + + problem_shape = make_tuple(options.m, options.n, options.k, options.l); + + // * Get A B C D size + stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1}); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1}); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1}); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1}); + layout_A = SparseConfig::fill_layoutA(problem_shape); + layout_E = SparseConfig::fill_layoutE(problem_shape); + layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(problem_shape); + layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(problem_shape); + + // * Get ACompress & E size + CompressorUtility compressor_utility(problem_shape, stride_A); + + // TensorE + // In unit of ElementE (uint8_t), after alignment requirement + // M-dim: TensorEAtom_M alignment + // K-dim: TensorEAtom_K alignment + int KAlignedE = compressor_utility.get_metadata_k_physical(); + int MAlignedE = 
compressor_utility.get_metadata_m_physical(); + + // TensorA Compressed + // In unit of ElementARaw, after alignment requirement + // M-dim: TMA alignment + // K-dim: TMA alignment + int KAlignedAC = compressor_utility.get_tensorA_k_physical(); + int MAlignedAC = compressor_utility.get_tensorA_m_physical(); + + stride_A_compressed = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, KAlignedAC, options.l)); + stride_E = cutlass::make_cute_packed_stride(StrideE{}, cute::make_shape(MAlignedE, KAlignedE, options.l)); + + // * Get SFA & SFB size + auto k_blks = cutlass::ceil_div(options.k, cute::size<1>(shape(SfAtom{}))); + auto m_blks = cutlass::ceil_div(options.m, Blk_MN{}); + auto n_blks = cutlass::ceil_div(options.n, Blk_MN{}); + + // * Allocate Tensor + auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); + auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); + auto e_coord = cutlass::make_Coord(MAlignedE * options.l, KAlignedE); + auto a_comp_coord = cutlass::make_Coord(MAlignedAC * options.l, KAlignedAC); + auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); + auto d_coord = cutlass::make_Coord(options.m * options.l, options.n); + auto sfa_coord = cutlass::make_Coord(m_blks * Blk_MN{} * options.l, k_blks * Blk_SF{}); + auto sfb_coord = cutlass::make_Coord(n_blks * Blk_MN{} * options.l, k_blks * Blk_SF{}); + + tensor_A.resize(a_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(a_coord, stride_factor_A)); + tensor_A_compressed.resize(a_comp_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(a_comp_coord, stride_factor_A)); + tensor_B.resize(b_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(b_coord, stride_factor_B)); + tensor_E.resize(e_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(e_coord, stride_factor_E)); + tensor_C.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(c_coord, stride_factor_C)); + tensor_D.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(d_coord, stride_factor_D)); + reference_D.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(d_coord, stride_factor_D), false); + tensor_SFA.resize(sfa_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(sfa_coord, stride_factor_A)); + tensor_SFB.resize(sfb_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(sfb_coord, stride_factor_B)); + + // * Random init + initialize_tensor(tensor_A.host_view(), seed + 2021); + initialize_tensor(tensor_B.host_view(), seed + 2022); + initialize_tensor(tensor_C.host_view(), seed + 2023); + initialize_tensor(tensor_SFA.host_view(), seed + 2024); + initialize_tensor(tensor_SFB.host_view(), seed + 2025); + cutlass::reference::host::TensorCopy(reference_D.host_view(), tensor_C.host_view()); + + // * Random fill 50% A with zero + compressor_utility.structure_sparse_zero_mask_fill(tensor_A.host_data(), static_cast(seed + 2023)); + + tensor_A.sync_device(); + tensor_B.sync_device(); + tensor_C.sync_device(); + tensor_SFA.sync_device(); + tensor_SFB.sync_device(); + + // * Compress + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments{ + problem_shape, + {tensor_A.device_data(), + stride_A, + tensor_A_compressed.device_data(), + tensor_E.device_data()}, + {hw_info} + }; + + Compressor compressor_op; + size_t workspace_size = 
Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + + cutlass::Status status {cutlass::Status::kSuccess }; + status = compressor_op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.run(); + if (status != cutlass::Status::kSuccess) { + return false; + } + + auto result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + return false; + } + + tensor_E.sync_host(); + tensor_A_compressed.sync_host(); + + return true; +} + +// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) +{ + using ArrayElementA = typename Gemm::GemmKernel::CollectiveMainloop::ArrayElementA; + using ArrayElementB = typename Gemm::GemmKernel::CollectiveMainloop::ArrayElementB; + + typename Gemm::Arguments arguments { + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, 1}, + { + reinterpret_cast(tensor_A_compressed.device_data()), layout_A, + reinterpret_cast(tensor_B.device_data()), stride_B, + tensor_E.device_data(), layout_E, + tensor_SFA.device_data(), layout_SFA, + tensor_SFB.device_data(), layout_SFB + }, + { + {options.alpha, options.beta}, + tensor_C.device_data(), stride_C, + tensor_D.device_data(), stride_D + } + }; + + return arguments; +} + +bool verify(const Options &options) { + using namespace cute; + + // Create the arguments for host reference implementation + auto A = make_tensor(make_iterator(tensor_A.host_data()), layout_A); + auto SFA = make_tensor(tensor_SFA.host_data(), layout_SFA); + auto B = make_tensor(make_iterator(tensor_B.host_data()), + make_layout(make_shape(options.n, options.k, options.l), stride_B)); + auto SFB = make_tensor(tensor_SFB.host_data(), layout_SFB); + + cutlass::reference::host::GettMainloopParams< + ElementAccumulator, + decltype(A), + decltype(B), + decltype(SFA), + decltype(SFB)> mainloop_params{A, SFA, B, SFB}; + + auto C = make_tensor(make_iterator(tensor_C.host_data()), + make_layout(make_shape(options.m, options.n, options.l), stride_C)); + auto D = make_tensor(make_iterator(reference_D.host_data()), + make_layout(make_shape(options.m, options.n, options.l), stride_D)); + + cutlass::reference::host::GettBlockScalingEpilogueParams< + ElementAccumulator, // ElementScalar + ElementAccumulator, // ElementAccumulator + ElementAccumulator, // ElementCompute + decltype(C), // TensorC + decltype(D) // TensorD + > epilogue_params{ + options.alpha, + options.beta, + C, + D}; + + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + // Comparison + tensor_D.sync_host(); + bool passed = cutlass::reference::host::TensorEquals(reference_D.host_view(), tensor_D.host_view()); + passed &= (cutlass::reference::host::TensorNorm(reference_D.host_view()) > 0); + passed &= (cutlass::reference::host::TensorNorm(tensor_D.host_view()) > 0); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) +{ + auto init_pass = initialize(options); + if (not init_pass) { + std::cout << "Initialization failure" << std::endl; + exit(EXIT_FAILURE); + } + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // 
Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + cudaDeviceSynchronize(); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + + if (not result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + + + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.8 or higher Toolkit to run this example + // and must have compute capability at least 100. + if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) { + std::cerr << "This example requires CUDA 12.8 or newer." << std::endl; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (not (props.major == 10 && props.minor == 0)) { + std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." 
<< std::endl; + return 0; + } + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + run(options); +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu b/examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu new file mode 100644 index 0000000000..a23af1581d --- /dev/null +++ b/examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu @@ -0,0 +1,695 @@ +/*************************************************************************************************** + * Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief A Narrow Precision Sparse GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture. + + This example demonstrates a simple way to instantiate and run a blockscaled MXFP8 Sparse GEMM on the NVIDIA Blackwell SM100 architecture. + + The Blackwell SM100 CUTLASS kernel uses the new Block Scaled Tensor Core MMA Instructions (tcgen05.mma.blockscaled) introduced + on the Blackwell architecture (sm100a) which have 2x throughput compared to fp8 Tensor Core MMA instructions (tcgen05.mma) + and 4x throughput compared to fp8 Hopper Tensor Core MMA Instructions (WGMMA) (See https://docs.nvidia.com/cuda/parallel-thread-execution). + + Similar to 83_blackwell_sparse_gemm, this kernel leverages: + 1. 
Per-SM memory called Tensor Memory (TMEM) (Please refer to CUDA 12.8 docs on https://docs.nvidia.com/cuda/). + + 2. The extended warp-specialized kernel design introduced in Hopper enabled by use of TMEM + which allows us to decouple the execution of MMA and epilogue into separate warps. + + 3. A new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution). + + Usage: + $ ./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm --m=2048 --n=2048 --k=2048 +*/ + +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/detail/sm100_blockscaled_layout.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/gemm/kernel/tile_scheduler_params.h" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/gett.hpp" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp" +#include "cutlass/transform/device/transform_universal_adapter.hpp" + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::float_e4m3_t; +using ElementAPair = cutlass::mx_float8_t; // Element type for A matrix operand +using LayoutTagA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 64; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes), 2x for compress along k + +// E matrix config +using ElementE = cute::uint8_t; +using LayoutTagE = LayoutTagA; + +// B matrix configuration +using ElementB = cutlass::float_e2m1_t; +using ElementBPair = cutlass::mx_float4_t; // Element type for B matrix operand +using LayoutTagB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// SF +using ElementSF = typename ElementAPair::ScaleFactorType; + +// C/D matrix configuration +using ElementD = cutlass::bfloat16_t; // Element type for D matrix operand +using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand +using LayoutTagC = cutlass::layout::RowMajor; // Layout type for C matrix operand +using LayoutTagD = cutlass::layout::RowMajor; // Layout type for D matrix operand +constexpr int AlignmentD = (16 * 8) / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 
16 bytes) +constexpr int AlignmentC = (16 * 8) / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// Kernel functional config +using ElementAccumulator = float; // Element type for internal accumulation +using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassBlockScaledSparseTensorOp; // Operator class tag + +// MMA and Cluster Tile Shapes +// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0 +using MmaTileShape_MNK = Shape<_256,_128,_256>; +// Shape of the threadblocks in a cluster +using ClusterShape_MNK = Shape<_2,_1,_1>; + +// Build the epilogue +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutTagC, AlignmentC, + ElementD, LayoutTagD, AlignmentD, + cutlass::epilogue::TmaWarpSpecialized2SmMxf8f6f4 + >::CollectiveOp; + +// Build the mainloop +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementAPair, LayoutTagA, AlignmentA, + ElementBPair, LayoutTagB, AlignmentB, + ElementAccumulator, + MmaTileShape_MNK, ClusterShape_MNK, + cutlass::gemm::collective::StageCountAutoCarveoutEpi, + cutlass::gemm::KernelSparseTmaWarpSpecialized2SmMxf8f6f4Sm100 + >::CollectiveOp; + +using ProblemShape = Shape; + +// Compose into a kernel +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue, + void>; // Default to ClusterLaunchControl (CLC) based tile scheduler + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// +// Blockscale +// +using Sm1xxBlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; +using Blk_MN = typename Sm1xxBlkScaledConfig::Blk_MN; +using Blk_SF = typename Sm1xxBlkScaledConfig::Blk_SF; +using SfAtom = typename Sm1xxBlkScaledConfig::SfAtom; + +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +using StrideA = cutlass::gemm::TagToStrideA_t; +using StrideE = StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; + +using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride. +using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride. 
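+Conceptually, the block-scaled mainloop built above computes an ordinary GEMM in which every fixed-size chunk of A and B along K shares one scale factor, so each accumulated product is `(sfa * a) * (sfb * b)` in `ElementAccumulator`. The host sketch below illustrates that contraction for a single output element. It is an assumption-laden illustration, not the CUTLASS reference: the 32-element block size follows the MX convention, and the scales are kept as plain `float` here, whereas the kernel stores them in the narrow `ElementSF` (`ScaleFactorType`) declared above.
+
+```cpp
+// Illustrative host sketch of a block-scaled dot product: one shared scale factor per
+// VecSize-wide block along K, applied to A and B before accumulation in float.
+// Assumptions: VecSize = 32 (MX convention) and float scales instead of the kernel's
+// narrow ScaleFactorType.
+#include <cstdio>
+#include <vector>
+
+float blockscaled_dot(std::vector<float> const& a,   std::vector<float> const& b,
+                      std::vector<float> const& sfa, std::vector<float> const& sfb,
+                      int K, int VecSize = 32) {
+  float acc = 0.f;
+  for (int k = 0; k < K; ++k) {
+    int blk = k / VecSize;                        // scale-factor block this k falls into
+    acc += (sfa[blk] * a[k]) * (sfb[blk] * b[k]);
+  }
+  return acc;
+}
+
+int main() {
+  int K = 64;
+  std::vector<float> a(K, 1.f), b(K, 2.f);
+  std::vector<float> sfa = {0.5f, 2.0f};          // one scale per 32-wide block of A
+  std::vector<float> sfb = {1.0f, 0.25f};         // one scale per 32-wide block of B
+  std::printf("dot = %f\n", blockscaled_dot(a, b, sfa, sfb, K));  // 32*1 + 32*1 = 64
+  return 0;
+}
+```
+
+The epilogue then applies `alpha`/`beta` against C and converts to `ElementD`, which is what the `verify()` routines in these examples reproduce with the host `Gemm3x` reference.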
+ +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; + +// +// Compressor +// +using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + +using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig>; + +using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig, + ArchTag>; + +using Compressor = cutlass::transform::device::TransformUniversalAdapter; + +// +// Data members +// + +/// Initialization +StrideA stride_A; +StrideA stride_A_compressed; +StrideE stride_E; +StrideB stride_B; +StrideC stride_C; +StrideD stride_D; + +LayoutA layout_A; +LayoutE layout_E; +LayoutSFA layout_SFA; +LayoutSFB layout_SFB; + +typename LayoutTagA::Stride stride_factor_A; +typename LayoutTagB::Stride stride_factor_B; +typename LayoutTagE::Stride stride_factor_E; +typename LayoutTagC::Stride stride_factor_C; +typename LayoutTagD::Stride stride_factor_D; + +uint64_t seed; + +ProblemShape problem_shape; + +// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device +// Use cute::Tensor and cute::Layout for iterating thru the matrix elements +cutlass::HostTensor tensor_A; +cutlass::HostTensor tensor_A_compressed; +cutlass::HostTensor tensor_E; +cutlass::HostTensor tensor_B; +cutlass::HostTensor tensor_C; +cutlass::HostTensor tensor_SFA; +cutlass::HostTensor tensor_SFB; +cutlass::HostTensor tensor_D; +cutlass::HostTensor reference_D; + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +template +auto make_iterator(T* ptr) { + using namespace cute; + if constexpr (cute::is_subbyte_v) { + return subbyte_iterator(ptr); + } + else { + return ptr; + } +} + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help; + + float alpha, beta; + int iterations; + int m, n, k, l; + + Options(): + help(false), + m(1024), n(1024), k(1024), l(1), + alpha(1.f), beta(0.f), + iterations(10) + { } + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + + /// Prints the usage statement. 
+ std::ostream & print_usage(std::ostream &out) const { + + out << "84b_blackwell_mixed_mxfp8_bf16_sparse_gemm\n\n" + << " Blackwell NVFP4 GEMM using a Warp Specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the L extent of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + + out << "\n\nExamples:\n\n" + << "$ " << "./examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +void initialize_tensor( + cutlass::TensorView view, + uint64_t seed) { + + double scope_max, scope_min; + int bits_input = cutlass::sizeof_bits::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 6) { + scope_max = 2; + scope_min = -2; + } + else if (bits_input <= 8) { + if constexpr (cute::is_same_v){ + scope_max = 4; + scope_min = 1; + } + else { + scope_max = 1; + scope_min = -1; + } + } + else{ + scope_max = 4; + scope_min = -4; + } + cutlass::reference::host::TensorFillRandomUniform( + view, seed, scope_max, scope_min, 0); +} + +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(const Options &options) { + + problem_shape = make_tuple(options.m, options.n, options.k, options.l); + + // * Get A B C D size + stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1}); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1}); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1}); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1}); + layout_A = SparseConfig::fill_layoutA(problem_shape); + layout_E = SparseConfig::fill_layoutE(problem_shape); + layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA(problem_shape); + layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB(problem_shape); + + // * Get ACompress & E size + CompressorUtility compressor_utility(problem_shape, stride_A); + + // TensorE + // In unit of ElementE (uint8_t), after alignment requirement + // M-dim: TensorEAtom_M alignment + // K-dim: TensorEAtom_K alignment + int KAlignedE = compressor_utility.get_metadata_k_physical(); + int MAlignedE = 
compressor_utility.get_metadata_m_physical(); + + // TensorA Compressed + // In unit of ElementARaw, after alignment requirement + // M-dim: TMA alignment + // K-dim: TMA alignment + int KAlignedAC = compressor_utility.get_tensorA_k_physical(); + int MAlignedAC = compressor_utility.get_tensorA_m_physical(); + + stride_A_compressed = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, KAlignedAC, options.l)); + stride_E = cutlass::make_cute_packed_stride(StrideE{}, cute::make_shape(MAlignedE, KAlignedE, options.l)); + + // * Get SFA & SFB size + auto k_blks = cutlass::ceil_div(options.k, cute::size<1>(shape(SfAtom{}))); + auto m_blks = cutlass::ceil_div(options.m, Blk_MN{}); + auto n_blks = cutlass::ceil_div(options.n, Blk_MN{}); + + // * Allocate Tensor + auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); + auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); + auto e_coord = cutlass::make_Coord(MAlignedE * options.l, KAlignedE); + auto a_comp_coord = cutlass::make_Coord(MAlignedAC * options.l, KAlignedAC); + auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); + auto d_coord = cutlass::make_Coord(options.m * options.l, options.n); + auto sfa_coord = cutlass::make_Coord(m_blks * Blk_MN{} * options.l, k_blks * Blk_SF{}); + auto sfb_coord = cutlass::make_Coord(n_blks * Blk_MN{} * options.l, k_blks * Blk_SF{}); + + tensor_A.resize(a_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(a_coord, stride_factor_A)); + tensor_A_compressed.resize(a_comp_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(a_comp_coord, stride_factor_A)); + tensor_B.resize(b_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(b_coord, stride_factor_B)); + tensor_E.resize(e_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(e_coord, stride_factor_E)); + tensor_C.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(c_coord, stride_factor_C)); + tensor_D.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(d_coord, stride_factor_D)); + reference_D.resize(c_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(d_coord, stride_factor_D), false); + tensor_SFA.resize(sfa_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(sfa_coord, stride_factor_A)); + tensor_SFB.resize(sfb_coord, cutlass::layout::Affine2Layout_Factory::layout_factory(sfb_coord, stride_factor_B)); + + // * Random init + initialize_tensor(tensor_A.host_view(), seed + 2021); + initialize_tensor(tensor_B.host_view(), seed + 2022); + initialize_tensor(tensor_C.host_view(), seed + 2023); + initialize_tensor(tensor_SFA.host_view(), seed + 2024); + initialize_tensor(tensor_SFB.host_view(), seed + 2025); + cutlass::reference::host::TensorCopy(reference_D.host_view(), tensor_C.host_view()); + + // * Random fill 50% A with zero + compressor_utility.structure_sparse_zero_mask_fill(tensor_A.host_data(), static_cast(seed + 2023)); + + tensor_A.sync_device(); + tensor_B.sync_device(); + tensor_C.sync_device(); + tensor_SFA.sync_device(); + tensor_SFB.sync_device(); + + // * Compress + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments{ + problem_shape, + {tensor_A.device_data(), + stride_A, + tensor_A_compressed.device_data(), + tensor_E.device_data()}, + {hw_info} + }; + + Compressor compressor_op; + size_t workspace_size = 
Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + + cutlass::Status status {cutlass::Status::kSuccess }; + status = compressor_op.can_implement(arguments); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.initialize(arguments, workspace.get()); + if (status != cutlass::Status::kSuccess) { + return false; + } + + status = compressor_op.run(); + if (status != cutlass::Status::kSuccess) { + return false; + } + + auto result = cudaDeviceSynchronize(); + if (result != cudaSuccess) { + return false; + } + + tensor_E.sync_host(); + tensor_A_compressed.sync_host(); + + return true; +} + +// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) +{ + using ArrayElementA = typename Gemm::GemmKernel::CollectiveMainloop::ArrayElementA; + using ArrayElementB = typename Gemm::GemmKernel::CollectiveMainloop::ArrayElementB; + + typename Gemm::Arguments arguments { + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, 1}, + { + reinterpret_cast(tensor_A_compressed.device_data()), layout_A, + reinterpret_cast(tensor_B.device_data()), stride_B, + tensor_E.device_data(), layout_E, + tensor_SFA.device_data(), layout_SFA, + tensor_SFB.device_data(), layout_SFB + }, + { + {options.alpha, options.beta}, + tensor_C.device_data(), stride_C, + tensor_D.device_data(), stride_D + } + }; + + return arguments; +} + +bool verify(const Options &options) { + using namespace cute; + + // Create the arguments for host reference implementation + auto A = make_tensor(make_iterator(tensor_A.host_data()), layout_A); + auto SFA = make_tensor(tensor_SFA.host_data(), layout_SFA); + auto B = make_tensor(make_iterator(tensor_B.host_data()), + make_layout(make_shape(options.n, options.k, options.l), stride_B)); + auto SFB = make_tensor(tensor_SFB.host_data(), layout_SFB); + + cutlass::reference::host::GettMainloopParams< + ElementAccumulator, + decltype(A), + decltype(B), + decltype(SFA), + decltype(SFB)> mainloop_params{A, SFA, B, SFB}; + + auto C = make_tensor(make_iterator(tensor_C.host_data()), + make_layout(make_shape(options.m, options.n, options.l), stride_C)); + auto D = make_tensor(make_iterator(reference_D.host_data()), + make_layout(make_shape(options.m, options.n, options.l), stride_D)); + + cutlass::reference::host::GettEpilogueParams< + ElementAccumulator, // ElementScalar + ElementAccumulator, // ElementScalingFactor + ElementAccumulator, // ElementAccumulator + ElementAccumulator, // ElementCompute + decltype(C), // TensorC + decltype(D) // TensorD + > epilogue_params{}; + + epilogue_params.C = C; + epilogue_params.D = D; + epilogue_params.alpha = options.alpha; + epilogue_params.beta = options.beta; + + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + // Comparison + tensor_D.sync_host(); + bool passed = cutlass::reference::host::TensorEquals(reference_D.host_view(), tensor_D.host_view()); + passed &= (cutlass::reference::host::TensorNorm(reference_D.host_view()) > 0); + passed &= (cutlass::reference::host::TensorNorm(tensor_D.host_view()) > 0); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) +{ + auto init_pass = initialize(options); + if (not init_pass) { + std::cout << "Initialization failure" << std::endl; + exit(EXIT_FAILURE); + } + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a 
structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + cudaDeviceSynchronize(); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl; + + if (not result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + + + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.8 or higher Toolkit to run this example + // and must have compute capability at least 100. + if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) { + std::cerr << "This example requires CUDA 12.8 or newer." << std::endl; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (not (props.major == 10 && props.minor == 0)) { + std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." 
<< std::endl; + return 0; + } + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // +#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + run(options); +#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/84_blackwell_narrow_precision_sparse_gemm/CMakeLists.txt b/examples/84_blackwell_narrow_precision_sparse_gemm/CMakeLists.txt new file mode 100644 index 0000000000..751590b702 --- /dev/null +++ b/examples/84_blackwell_narrow_precision_sparse_gemm/CMakeLists.txt @@ -0,0 +1,41 @@ + +# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# 1. Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# 2. Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# 3. Neither the name of the copyright holder nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
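+Both sparse examples above zero half of A in a structured pattern (`structure_sparse_zero_mask_fill`) and then run the `StructuredSparseCompressor` to produce `tensor_A_compressed` (half the K extent) plus the metadata tensor `tensor_E`. The sketch below shows the underlying 2:4 idea on the host, using a simplified one-byte-per-group metadata encoding of my own; the real `ElementE` layout produced by the compressor is a packed, hardware-specific format, so treat this purely as an illustration of the concept.
+
+```cpp
+// Host-side illustration of 2:4 structured sparsity: of every 4 consecutive values along K,
+// at most 2 are nonzero; the compressed tensor keeps those 2 values (K shrinks by half) and
+// the metadata records which of the 4 positions they came from. Simplified encoding: one
+// byte per group holding two 2-bit position indices. Assumes exactly 2 nonzeros per group.
+#include <array>
+#include <cstdint>
+#include <cstdio>
+#include <vector>
+
+void compress_2of4(std::vector<float> const& A, int M, int K,
+                   std::vector<float>& A_comp, std::vector<uint8_t>& meta) {
+  A_comp.assign(M * K / 2, 0.f);
+  meta.assign(M * K / 4, 0);
+  for (int m = 0; m < M; ++m) {
+    for (int g = 0; g < K / 4; ++g) {
+      std::array<int, 2> kept{0, 1};               // positions of the survivors in the group
+      int n_kept = 0;
+      for (int j = 0; j < 4 && n_kept < 2; ++j) {
+        if (A[m * K + 4 * g + j] != 0.f) { kept[n_kept++] = j; }
+      }
+      for (int i = 0; i < 2; ++i) {
+        A_comp[m * (K / 2) + 2 * g + i] = A[m * K + 4 * g + kept[i]];
+      }
+      meta[m * (K / 4) + g] = uint8_t(kept[0] | (kept[1] << 2));  // two 2-bit indices
+    }
+  }
+}
+
+int main() {
+  int M = 1, K = 8;
+  std::vector<float> A = {0, 3, 0, 5,  7, 0, 2, 0};  // 2:4 sparse along K
+  std::vector<float> A_comp;
+  std::vector<uint8_t> meta;
+  compress_2of4(A, M, K, A_comp, meta);
+  std::printf("compressed: %g %g %g %g\n", A_comp[0], A_comp[1], A_comp[2], A_comp[3]);  // 3 5 7 2
+  std::printf("metadata:   0x%02x 0x%02x\n", unsigned(meta[0]), unsigned(meta[1]));      // 0x0d (positions 1,3), 0x08 (positions 0,2)
+  return 0;
+}
+```
+
+Roughly speaking, this is the split the device mainloop consumes: `tensor_A_compressed` feeds the sparse MMA while `tensor_E` tells it which K positions the kept values came from.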
+ + +if (CUTLASS_NVCC_ARCHS MATCHES 100a) +cutlass_example_add_executable( + 84a_blackwell_nvfp4_bf16_sparse_gemm + 84a_blackwell_nvfp4_bf16_sparse_gemm.cu + ) + +cutlass_example_add_executable( + 84b_blackwell_mixed_mxfp8_bf16_sparse_gemm + 84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu + ) +endif() diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index 84fc931118..f041869cc7 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -159,17 +159,21 @@ foreach(EXAMPLE 67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling 68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling 69_hopper_mixed_dtype_grouped_gemm - 70_blackwell_gemm - 71_blackwell_gemm_with_collective_builder - 72_blackwell_narrow_precision_gemm - 73_blackwell_gemm_preferred_cluster - 74_blackwell_gemm_streamk - 75_blackwell_grouped_gemm - 76_blackwell_conv - 77_blackwell_fmha - 78_blackwell_emulated_bf16x9_gemm + 70_blackwell_gemm + 71_blackwell_gemm_with_collective_builder + 72_blackwell_narrow_precision_gemm + 73_blackwell_gemm_preferred_cluster + 74_blackwell_gemm_streamk + 75_blackwell_grouped_gemm + 76_blackwell_conv + 77_blackwell_fmha + 78_blackwell_emulated_bf16x9_gemm 79_blackwell_geforce_gemm + 80_blackwell_geforce_sparse_gemm 81_blackwell_gemm_blockwise + 82_blackwell_distributed_gemm + 83_blackwell_sparse_gemm + 84_blackwell_narrow_precision_sparse_gemm ) add_subdirectory(${EXAMPLE}) endforeach() diff --git a/examples/README.md b/examples/README.md index 150115db1a..5bed6853d7 100644 --- a/examples/README.md +++ b/examples/README.md @@ -286,6 +286,18 @@ Blackwell SM120 MMA kernel targeting GeForce RTX 50 series CUDA Cores +* [80_blackwell_geforce_sparse_gemm](80_blackwell_geforce_sparse_gemm/) + + Blackwell SM120 sparse MMA kernel targeting GeForce RTX 50 series CUDA Cores + +* [83_blackwell_sparse_gemm](83_blackwell_sparse_gemm) + + Blackwell SM100 Sparse Gemm kernel + +* [84_blackwell_narrow_precision_sparse_gemm](84_blackwell_narrow_precision_sparse_gemm) + + Blackwell Block Scaled SM100 Sparse Gemm kernel + # CUTLASS SYCL - Programming Examples * [00_pvc_gemm](./sycl/00_pvc_gemm) diff --git a/examples/65_distributed_gemm/util/benchmark.h b/examples/common/dist_gemm_helpers.h similarity index 69% rename from examples/65_distributed_gemm/util/benchmark.h rename to examples/common/dist_gemm_helpers.h index 66a0dbb50d..ef258e6922 100644 --- a/examples/65_distributed_gemm/util/benchmark.h +++ b/examples/common/dist_gemm_helpers.h @@ -44,6 +44,11 @@ #include #include +#include "cute/layout.hpp" +#include "cute/tensor.hpp" +#include "cutlass/cutlass.h" +#include "cutlass/cuda_host_adapter.hpp" + namespace cutlass { @@ -115,4 +120,46 @@ struct DistGpuTimer { } }; +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Generic device-to-device data movement kernel based for CuTe tensors. +/// +/// NOTE: this kernel assigns one element copy to every thread, and is by no means +/// an efficient way of copying tensors. It should only be used for convenience in +/// reference checks. 
+///////////////////////////////////////////////////////////////////////////////////////////////// + +template +void device_copy(TensorSource tensor_source, + TensorDestination tensor_destination, + cudaStream_t stream); + + +template +__global__ void device_copy_kernel(TensorSource const tensor_source, + TensorDestination tensor_destination) { + auto linear_idx = blockIdx.x * blockDim.x + threadIdx.x; + using ElementSrc = typename TensorSource::value_type; + using ElementDst = typename TensorDestination::value_type; + NumericConverter converter; + if (linear_idx < size(tensor_source)) { + tensor_destination(linear_idx) = converter(tensor_source(linear_idx)); + } +} + +template +void device_copy(TensorSource tensor_source, + TensorDestination tensor_destination, + cudaStream_t stream) { + + assert(tensor_source.size() == tensor_destination.size()); + + auto numel = tensor_source.size(); + static constexpr int NumThreads = 128; + auto grid_size = cute::ceil_div(numel, NumThreads); + + dim3 grid(grid_size); + dim3 block(NumThreads); + device_copy_kernel<<>>(tensor_source, tensor_destination); +} + } //namespace cutlass diff --git a/examples/cute/tutorial/blackwell/01_mma_sm100.cu b/examples/cute/tutorial/blackwell/01_mma_sm100.cu index 3f73140a01..a11fb17c05 100644 --- a/examples/cute/tutorial/blackwell/01_mma_sm100.cu +++ b/examples/cute/tutorial/blackwell/01_mma_sm100.cu @@ -61,7 +61,8 @@ #include // CuTe tensor implementation #include // CuTe functions for querying the details of cluster launched #include // Compile time in constants such as _1, _256 etc. -#include +#include // Auto vectorized copy operation +#include // TMEM allocator for SM100 // Tutorial helpers #include "example_utils.hpp" @@ -122,7 +123,9 @@ struct SharedStorage alignas(128) cute::ArrayEngine> A; alignas(128) cute::ArrayEngine> B; - alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM + alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM + + alignas(16) cute::uint32_t tmem_base_ptr; // Base pointer for TMEM allocation CUTE_DEVICE constexpr auto tensor_sA() { return make_tensor(make_smem_ptr(A.begin()), ASmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sB() { return make_tensor(make_smem_ptr(B.begin()), BSmemLayout{}); } @@ -225,6 +228,18 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // ThrMma's make_fragment_C() creates a TMEM tensor with the appropriate layout for the accumulator. 
Tensor tCtAcc = cta_mma.make_fragment_C(tCgC); // (MmaC, NumMma_M, NumMma_N) + uint32_t elect_one_thr = cute::elect_one_sync(); + uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + + using TmemAllocator = cute::TMEM::Allocator1Sm; + TmemAllocator tmem_allocator{}; + + if (elect_one_warp) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + } + __syncthreads(); // Wait for all threads until warp0 allocates TMEM + tCtAcc.data() = shared_storage.tmem_base_ptr; + if (thread0()) { print("tCsA:\t"); print(tCsA); print("\n"); // tCsA: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_A) o ((_128,_16),_1,_4):((_64,_1),_0,_16) print("tCsB:\t"); print(tCsB); print("\n"); // tCsB: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_B) o ((_256,_16),_1,_4):((_64,_1),_0,_16) @@ -233,10 +248,8 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) print("tCtAcc:\t"); print(tCtAcc); print("\n"); // tCtAcc: tmem_[32b](TMEM_ADDR) o ((_128,_256),_1,_1):((_65536,_1),_0,_0) } __syncthreads(); - // Barrier Initialization - uint32_t elect_one_thr = cute::elect_one_sync(); - uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + // Barrier Initialization // Barriers in SMEM initialized by a single thread. if (elect_one_warp && elect_one_thr) { cute::initialize_barrier(shared_storage.mma_barrier, /* num_ctas */ 1); @@ -306,6 +319,15 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) axpby(alpha, tDrAcc, beta, tDrC); // Store RMEM -> GMEM copy(tDrC, tDgD); + + __syncthreads(); + + // Release the right to allocate before deallocations so that the next CTA can rasterize + // Then deallocate TMEM + if (elect_one_warp) { + tmem_allocator.release_allocation_lock(); + tmem_allocator.free(shared_storage.tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } } template // CuTe tensor implementation #include // CuTe functions for querying the details of cluster launched #include // Compile time in constants such as _1, _256 etc. -#include +#include // Auto vectorized copy operation +#include // TMEM allocator for SM100 // Tutorial helpers #include "example_utils.hpp" @@ -124,6 +125,8 @@ struct SharedStorage alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM alignas(16) cute::uint64_t tma_barrier; // Barrier to track TMA data transfers to SMEM + alignas(16) cute::uint32_t tmem_base_ptr; // Base pointer for TMEM allocation + CUTE_DEVICE constexpr auto tensor_sA() { return make_tensor(make_smem_ptr(A.begin()), ASmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sB() { return make_tensor(make_smem_ptr(B.begin()), BSmemLayout{}); } }; @@ -228,6 +231,18 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // ThrMma's make_fragment_C() creates a TMEM tensor with the appropriate layout for the accumulator. 
Tensor tCtAcc = cta_mma.make_fragment_C(tCgC); // (MmaC, NumMma_M, NumMma_N) + uint32_t elect_one_thr = cute::elect_one_sync(); + uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + + using TmemAllocator = cute::TMEM::Allocator1Sm; + TmemAllocator tmem_allocator{}; + + if (elect_one_warp) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + } + __syncthreads(); // Wait for all threads until warp0 allocates TMEM + tCtAcc.data() = shared_storage.tmem_base_ptr; + if (thread0()) { print("tCsA:\t"); print(tCsA); print("\n"); // tCsA: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_A) o ((_128,_16),_1,_4):((_64,_1),_0,_16) print("tCsB:\t"); print(tCsB); print("\n"); // tCsB: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_B) o ((_256,_16),_1,_4):((_64,_1),_0,_16) @@ -269,9 +284,6 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) } __syncthreads(); // Barrier Initialization - uint32_t elect_one_thr = cute::elect_one_sync(); - uint32_t elect_one_warp = (threadIdx.x / 32 == 0); - // Barriers in SMEM initialized by a single thread. if (elect_one_warp && elect_one_thr) { cute::initialize_barrier(shared_storage.mma_barrier, /* num_ctas */ 1); @@ -346,6 +358,15 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) axpby(alpha, tDrAcc, beta, tDrC); // Store RMEM -> GMEM copy(tDrC, tDgD); + + __syncthreads(); + + // Release the right to allocate before deallocations so that the next CTA can rasterize + // Then deallocate TMEM + if (elect_one_warp) { + tmem_allocator.release_allocation_lock(); + tmem_allocator.free(shared_storage.tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } } template // CuTe tensor implementation #include // CuTe functions for querying the details of cluster launched #include // Compile time in constants such as _1, _256 etc. -#include +#include // Auto vectorized copy operation +#include // TMEM allocator for SM100 // Tutorial helpers #include "example_utils.hpp" @@ -129,6 +130,8 @@ struct SharedStorage alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM alignas(16) cute::uint64_t tma_barrier; // Barrier to track TMA data transfers to SMEM + alignas(16) cute::uint32_t tmem_base_ptr; // Base pointer for TMEM allocation + CUTE_DEVICE constexpr auto tensor_sA() { return make_tensor(make_smem_ptr(A.begin()), ASmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sB() { return make_tensor(make_smem_ptr(B.begin()), BSmemLayout{}); } }; @@ -231,6 +234,18 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // ThrMma's make_fragment_C() creates a TMEM tensor with the appropriate layout for the accumulator. 
Tensor tCtAcc = cta_mma.make_fragment_C(tCgC); // (MmaC, NumMma_M, NumMma_N) + uint32_t elect_one_thr = cute::elect_one_sync(); + uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + + using TmemAllocator = cute::TMEM::Allocator1Sm; + TmemAllocator tmem_allocator{}; + + if (elect_one_warp) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + } + __syncthreads(); // Wait for all threads until warp0 allocates TMEM + tCtAcc.data() = shared_storage.tmem_base_ptr; + if (thread0()) { print("tCsA:\t"); print(tCsA); print("\n"); // tCsA: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_A) o ((_128,_16),_1,_4):((_64,_1),_0,_16) print("tCsB:\t"); print(tCsB); print("\n"); // tCsB: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_B) o ((_256,_16),_1,_4):((_64,_1),_0,_16) @@ -305,10 +320,6 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) } __syncthreads(); // Barrier Initialization - - uint32_t elect_one_thr = cute::elect_one_sync(); - uint32_t elect_one_warp = (threadIdx.x / 32 == 0); - // Barriers in SMEM initialized by a single thread. if (elect_one_warp && elect_one_thr) { // The number of CTAs that participates in multicast operation with this CTA (for both A and B matrices) @@ -385,6 +396,15 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) axpby(alpha, tDrAcc, beta, tDrC); // Store RMEM -> GMEM copy(tDrC, tDgD); + + __syncthreads(); + + // Release the right to allocate before deallocations so that the next CTA can rasterize + // Then deallocate TMEM + if (elect_one_warp) { + tmem_allocator.release_allocation_lock(); + tmem_allocator.free(shared_storage.tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } } template // CuTe tensor implementation #include // CuTe functions for querying the details of cluster launched #include // Compile time in constants such as _1, _256 etc. -#include +#include // Auto vectorized copy operation +#include // TMEM allocator for SM100 // Tutorial helpers #include "example_utils.hpp" @@ -132,6 +133,8 @@ struct SharedStorage alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM alignas(16) cute::uint64_t tma_barrier; // Barrier to track TMA data transfers to SMEM + alignas(16) cute::uint32_t tmem_base_ptr; // Base pointer for TMEM allocation + CUTE_DEVICE constexpr auto tensor_sA() { return make_tensor(make_smem_ptr(A.begin()), ASmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sB() { return make_tensor(make_smem_ptr(B.begin()), BSmemLayout{}); } }; @@ -234,6 +237,18 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // ThrMma's make_fragment_C() creates a TMEM tensor with the appropriate layout for the accumulator. 
Tensor tCtAcc = cta_mma.make_fragment_C(tCgC); // (MmaC, NumMma_M, NumMma_N) + uint32_t elect_one_thr = cute::elect_one_sync(); + uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + + using TmemAllocator = cute::TMEM::Allocator2Sm; + TmemAllocator tmem_allocator{}; + + if (elect_one_warp) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + } + __syncthreads(); // Wait for all threads until warp0 allocates TMEM + tCtAcc.data() = shared_storage.tmem_base_ptr; + if (thread0()) { print("tCsA:\t"); print(tCsA); print("\n"); // tCsA: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_A) o ((_128,_16),_1,_4):((_64,_1),_0,_16) print("tCsB:\t"); print(tCsB); print("\n"); // tCsB: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_B) o ((_256,_16),_1,_4):((_64,_1),_0,_16) @@ -262,6 +277,7 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // Construct the CTA-in-Cluster coordinate for multicasting auto cta_in_cluster_coord_vmnk = cluster_layout_vmnk.get_flat_coord(int(cute::block_rank_in_cluster())); + auto elect_one_cta = get<0>(cta_in_cluster_coord_vmnk) == Int<0>{}; // Project the cluster_layout for tma_A along the N-modes auto [tAgA, tAsA] = tma_partition(tma_atom_A, @@ -299,10 +315,6 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) } __syncthreads(); // Barrier Initialization - auto elect_one_thr = cute::elect_one_sync(); - auto elect_one_warp = (threadIdx.x / 32 == 0); - auto elect_one_cta = get<0>(cta_in_cluster_coord_vmnk) == Int<0>{}; - // Barriers in SMEM should be initialized by a single thread. if (elect_one_warp && elect_one_thr) { // The number of CTAs that participates in multicast operation with this CTA (for both A and B matrices) @@ -386,6 +398,15 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) axpby(alpha, tDrAcc, beta, tDrC); // Store RMEM -> GMEM copy(tDrC, tDgD); + + __syncthreads(); + + // Release the right to allocate before deallocations so that the next CTA can rasterize + // Then deallocate TMEM + if (elect_one_warp) { + tmem_allocator.release_allocation_lock(); + tmem_allocator.free(shared_storage.tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } } template // CuTe tensor implementation #include // CuTe functions for querying the details of cluster launched #include // Compile time in constants such as _1, _256 etc. -#include +#include // Auto vectorized copy operation +#include // TMEM allocator for SM100 // Tutorial helpers #include "example_utils.hpp" @@ -140,6 +141,8 @@ struct SharedStorage alignas(16) cute::uint64_t mma_barrier; // Barrier to track MMA computation on SMEM alignas(16) cute::uint64_t tma_barrier; // Barrier to track TMA data transfers to SMEM + alignas(16) cute::uint32_t tmem_base_ptr; // Base pointer for TMEM allocation + CUTE_DEVICE constexpr auto tensor_sA() { return make_tensor(make_smem_ptr(tensors.mainloop.A.begin()), ASmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sB() { return make_tensor(make_smem_ptr(tensors.mainloop.B.begin()), BSmemLayout{}); } CUTE_DEVICE constexpr auto tensor_sC() { return make_tensor(make_smem_ptr(tensors.C.begin()), CSmemLayout{}); } @@ -247,6 +250,18 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // ThrMma's make_fragment_C() creates a TMEM tensor with the appropriate layout for the accumulator. 
Tensor tCtAcc = cta_mma.make_fragment_C(tCgC); // (MmaC, NumMma_M, NumMma_N) + uint32_t elect_one_thr = cute::elect_one_sync(); + uint32_t elect_one_warp = (threadIdx.x / 32 == 0); + + using TmemAllocator = cute::TMEM::Allocator2Sm; + TmemAllocator tmem_allocator{}; + + if (elect_one_warp) { + tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); + } + __syncthreads(); // Wait for all threads until warp0 allocates TMEM + tCtAcc.data() = shared_storage.tmem_base_ptr; + if (thread0()) { print("tCsA:\t"); print(tCsA); print("\n"); // tCsA: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_A) o ((_128,_16),_1,_4):((_64,_1),_0,_16) print("tCsB:\t"); print(tCsB); print("\n"); // tCsB: Sw<3,4,3>_smem_ptr[16b](SMEM_ADDR_B) o ((_256,_16),_1,_4):((_64,_1),_0,_16) @@ -275,6 +290,7 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) // Construct the CTA-in-Cluster coordinate for multicasting auto cta_in_cluster_coord_vmnk = cluster_layout_vmnk.get_flat_coord(int(cute::block_rank_in_cluster())); + auto elect_one_cta = get<0>(cta_in_cluster_coord_vmnk) == Int<0>{}; // Project the cluster_layout for tma_A along the N-modes auto [tAgA, tAsA] = tma_partition(tma_atom_A, @@ -312,10 +328,6 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) } __syncthreads(); // Barrier Initialization - auto elect_one_thr = cute::elect_one_sync(); - auto elect_one_warp = (threadIdx.x / 32 == 0); - auto elect_one_cta = get<0>(cta_in_cluster_coord_vmnk) == Int<0>{}; - // Barriers in SMEM should be initialized by a single thread. if (elect_one_warp && elect_one_thr) { // The number of CTAs that participates in multicast operation with this CTA (for both A and B matrices) @@ -441,6 +453,14 @@ gemm_device(ATensor mA, // (Gemm_M, Gemm_K) } __syncthreads(); // All threads sync with issuing thread } + __syncthreads(); + + // Release the right to allocate before deallocations so that the next CTA can rasterize + // Then deallocate TMEM + if (elect_one_warp) { + tmem_allocator.release_allocation_lock(); + tmem_allocator.free(shared_storage.tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } } template ::RasterOrderOptions; + using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerXeGroup::RasterOrderOptions; // Per-GEMM problem shape info may only exist on the device. 
if (host_problem_shapes_available) { diff --git a/include/cute/algorithm/cooperative_gemm.hpp b/include/cute/algorithm/cooperative_gemm.hpp index b1490b02db..f0a993593d 100644 --- a/include/cute/algorithm/cooperative_gemm.hpp +++ b/include/cute/algorithm/cooperative_gemm.hpp @@ -98,19 +98,23 @@ epilogue_predication(ThrMMA const& thr_mma, } } -template + class SmemCopyLdOpC, class SmemCopyStOpC> CUTE_HOST_DEVICE void -epilogue_no_predication(Alpha const& alpha, +epilogue_no_predication(uint32_t thread_idx, + ThrMMA const& thr_mma, + Alpha const& alpha, Tensor & tCrC, Beta const& beta, - Tensor & tCsC, + Tensor & sC, CLoadTransformOp const& sC_load_op, // transforms C values before use in GEMM CStoreTransformOp const& sC_store_op, // transforms results before they are stored to C - SmemCopyOpC const& sC_copy_op) + SmemCopyLdOpC const& sC_copy_ld_op, + SmemCopyStOpC const& sC_copy_st_op) { using InputTypeC = typename TSC::value_type; using ComputeTypeC = typename TRC::value_type; @@ -125,10 +129,18 @@ epilogue_no_predication(Alpha const& alpha, CUTE_GCC_UNREACHABLE; } (); - Tensor tCrDi = make_fragment_like(tCsC); Tensor tCrD = make_fragment_like(tCrC); + Tensor tCrDi = make_fragment_like(tCrD); + if(!isBetaZero) { - copy(sC_copy_op, tCsC, tCrDi); + auto smem_tiled_copy_C = make_tiled_copy_C(Copy_Atom{}, thr_mma); + auto smem_thr_copy_C = smem_tiled_copy_C.get_thread_slice(thread_idx); + Tensor tCsC = smem_thr_copy_C.partition_S(sC); + Tensor tCrDi_copy_view = smem_thr_copy_C.retile_D(tCrDi); + CUTE_STATIC_ASSERT_V(size<1>(tCsC) == size<1>(tCrDi_copy_view)); // CPY_M + CUTE_STATIC_ASSERT_V(size<2>(tCsC) == size<2>(tCrDi_copy_view)); // CPY_N + copy(smem_tiled_copy_C, tCsC, tCrDi_copy_view); + // Transform C on/after load cute::transform(tCrDi, tCrD, sC_load_op); } @@ -136,7 +148,14 @@ epilogue_no_predication(Alpha const& alpha, axpby(alpha, tCrC, beta, tCrD); // Transform C before/on store cute::transform(tCrD, tCrDi, sC_store_op); - copy(sC_copy_op, tCrDi, tCsC); + + auto smem_tiled_copy_C = make_tiled_copy_C(Copy_Atom{}, thr_mma); + auto smem_thr_copy_C = smem_tiled_copy_C.get_thread_slice(thread_idx); + Tensor tCsC = smem_thr_copy_C.partition_D(sC); + Tensor tCrDi_copy_view = smem_thr_copy_C.retile_S(tCrDi); + CUTE_STATIC_ASSERT_V(size<1>(tCsC) == size<1>(tCrDi_copy_view)); // CPY_M + CUTE_STATIC_ASSERT_V(size<2>(tCsC) == size<2>(tCrDi_copy_view)); // CPY_N + copy(smem_tiled_copy_C, tCrDi_copy_view, tCsC); } // Predicated Cooperative GEMM @@ -283,7 +302,9 @@ cooperative_gemm_no_predication(uint32_t thread_idx, // Create register tensors for the MMA to operate on Tensor tCrA = thr_mma.partition_fragment_A(sA); // (MMA,MMA_M,MMA_K) + Tensor tCrAi = make_fragment_like(tCrA); Tensor tCrB = thr_mma.partition_fragment_B(sB); // (MMA,MMA_N,MMA_K) + Tensor tCrBi = make_fragment_like(tCrB); using CopyOpAType = SmemCopyOpA; using CopyOpBType = SmemCopyOpB; @@ -291,7 +312,6 @@ cooperative_gemm_no_predication(uint32_t thread_idx, auto smem_tiled_copy_A = make_tiled_copy_A(Copy_Atom{}, thr_mma); auto smem_thr_copy_A = smem_tiled_copy_A.get_thread_slice(thread_idx); Tensor tCsA = smem_thr_copy_A.partition_S(sA); - Tensor tCrAi = make_fragment_like(tCsA); Tensor tCrAi_copy_view = smem_thr_copy_A.retile_D(tCrAi); CUTE_STATIC_ASSERT_V(size<1>(tCsA) == size<1>(tCrAi_copy_view)); // CPY_M CUTE_STATIC_ASSERT_V(size<2>(tCsA) == size<2>(tCrAi_copy_view)); // CPY_K @@ -299,7 +319,6 @@ cooperative_gemm_no_predication(uint32_t thread_idx, auto smem_tiled_copy_B = make_tiled_copy_B(Copy_Atom{}, thr_mma); auto 
smem_thr_copy_B = smem_tiled_copy_B.get_thread_slice(thread_idx); Tensor tCsB = smem_thr_copy_B.partition_S(sB); - Tensor tCrBi = make_fragment_like(tCsB); Tensor tCrBi_copy_view = smem_thr_copy_B.retile_D(tCrBi); CUTE_STATIC_ASSERT_V(size<1>(tCsB) == size<1>(tCrBi_copy_view)); // CPY_N CUTE_STATIC_ASSERT_V(size<2>(tCsB) == size<2>(tCrBi_copy_view)); // CPY_K @@ -346,7 +365,7 @@ template + class SmemCopyLdOpC = DefaultCopy, class SmemCopyStOpC = DefaultCopy> CUTE_HOST_DEVICE void cooperative_gemm(uint32_t thread_idx, @@ -356,13 +375,14 @@ cooperative_gemm(uint32_t thread_idx, Tensor const& sB, Beta const& beta, Tensor & sC, - ALoadTransformOp const& sA_load_op = {}, // transforms A values before use in GEMM - BLoadTransformOp const& sB_load_op = {}, // transforms B values before use in GEMM - CLoadTransformOp const& sC_load_op = {}, // transforms C values before use in GEMM - CStoreTransformOp const& sC_store_op = {}, // transforms results before they are stored to C - SmemCopyOpA const& sA_copy_op = {}, - SmemCopyOpB const& sB_copy_op = {}, - SmemCopyOpC const& sC_copy_op = {}) + ALoadTransformOp const& sA_load_op = {}, // transforms A values before use in GEMM + BLoadTransformOp const& sB_load_op = {}, // transforms B values before use in GEMM + CLoadTransformOp const& sC_load_op = {}, // transforms C values before use in GEMM + CStoreTransformOp const& sC_store_op = {}, // transforms results before they are stored to C + SmemCopyOpA const& sA_copy_op = {}, + SmemCopyOpB const& sB_copy_op = {}, + SmemCopyLdOpC const& sC_copy_ld_op = {}, + SmemCopyStOpC const& sC_copy_st_op = {}) { CUTE_STATIC_ASSERT_V(rank(sA) == Int<2>{}); CUTE_STATIC_ASSERT_V(rank(sB) == Int<2>{}); @@ -394,7 +414,7 @@ cooperative_gemm(uint32_t thread_idx, thread_idx, thr_mma, sA, sB, tCrC, sA_load_op, sB_load_op, sA_copy_op, sB_copy_op ); detail::epilogue_no_predication( - alpha, tCrC, beta, tCsC, sC_load_op, sC_store_op, sC_copy_op + thread_idx, thr_mma,alpha, tCrC, beta, sC, sC_load_op, sC_store_op, sC_copy_ld_op, sC_copy_st_op ); } else { detail::cooperative_gemm_predication( @@ -466,7 +486,7 @@ template + class SmemCopyLdOpC = DefaultCopy, class SmemCopyStOpC = DefaultCopy> CUTE_HOST_DEVICE void cooperative_gemm(uint32_t thread_idx, @@ -476,17 +496,18 @@ cooperative_gemm(uint32_t thread_idx, Tensor const& sB, Beta const& beta, Tensor && sC, - ALoadTransformOp const& sA_load_op = {}, // transforms A values before use in GEMM - BLoadTransformOp const& sB_load_op = {}, // transforms B values before use in GEMM - CLoadTransformOp const& sC_load_op = {}, // transforms C values before use in GEMM - CStoreTransformOp const& sC_store_op = {}, // transforms results before they are stored to C - SmemCopyOpA const& sA_copy_op = {}, - SmemCopyOpB const& sB_copy_op = {}, - SmemCopyOpC const& sC_copy_op = {}) + ALoadTransformOp const& sA_load_op = {}, // transforms A values before use in GEMM + BLoadTransformOp const& sB_load_op = {}, // transforms B values before use in GEMM + CLoadTransformOp const& sC_load_op = {}, // transforms C values before use in GEMM + CStoreTransformOp const& sC_store_op = {}, // transforms results before they are stored to C + SmemCopyOpA const& sA_copy_op = {}, + SmemCopyOpB const& sB_copy_op = {}, + SmemCopyLdOpC const& sC_copy_ld_op = {}, + SmemCopyStOpC const& sC_copy_st_op = {}) { cooperative_gemm(thread_idx, tiled_mma, alpha, sA, sB, beta, sC, sA_load_op, sB_load_op, sC_load_op, sC_store_op, - sA_copy_op, sB_copy_op, sC_copy_op); + sA_copy_op, sB_copy_op, sC_copy_ld_op, sC_copy_st_op); } // 
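The cooperative_gemm changes above split the single SmemCopyOpC parameter into SmemCopyLdOpC and SmemCopyStOpC, so the epilogue can load and store C through shared memory with different copy atoms. A hedged sketch of a call site under the updated signature (the wrapper function and the DefaultCopy/identity choices are illustrative, not taken from this patch):

```cpp
// Illustrative call under the updated cooperative_gemm signature, which now
// accepts separate copy ops for loading and storing C through shared memory.
#include <cute/tensor.hpp>
#include <cute/algorithm/cooperative_gemm.hpp>

template <class TiledMma,
          class EngineA, class LayoutA,
          class EngineB, class LayoutB,
          class EngineC, class LayoutC>
CUTE_DEVICE void
gemm_with_split_c_copy(uint32_t thread_idx, TiledMma const& tiled_mma,
                       cute::Tensor<EngineA, LayoutA> const& sA,   // (M,K) in SMEM
                       cute::Tensor<EngineB, LayoutB> const& sB,   // (N,K) in SMEM
                       cute::Tensor<EngineC, LayoutC>      & sC)   // (M,N) in SMEM
{
  using namespace cute;
  float alpha = 1.0f;
  float beta  = 1.0f;
  cooperative_gemm(thread_idx, tiled_mma, alpha, sA, sB, beta, sC,
                   identity{}, identity{},   // sA_load_op,  sB_load_op
                   identity{}, identity{},   // sC_load_op,  sC_store_op
                   DefaultCopy{},            // sA_copy_op
                   DefaultCopy{},            // sB_copy_op
                   DefaultCopy{},            // sC_copy_ld_op (new)
                   DefaultCopy{});           // sC_copy_st_op (new)
}
```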
Legacy overload of cute::gemm for backwards-compatibility diff --git a/include/cute/algorithm/tuple_algorithms.hpp b/include/cute/algorithm/tuple_algorithms.hpp index 5055605315..cec86c4d6d 100644 --- a/include/cute/algorithm/tuple_algorithms.hpp +++ b/include/cute/algorithm/tuple_algorithms.hpp @@ -33,6 +33,7 @@ #include #include +#include #include #include #include @@ -283,34 +284,13 @@ transform_leaf(T0 const& t0, T1 const& t1, F&& f) // find and find_if // -namespace detail { - -template -CUTE_HOST_DEVICE constexpr -auto -find_if(T const& t, F&& f, seq) -{ - if constexpr (decltype(f(get(t)))::value) { - return cute::C{}; - } else - if constexpr (sizeof...(Is) == 0) { - return cute::C{}; - } else { - return find_if(t, f, seq{}); - } - - CUTE_GCC_UNREACHABLE; -} - -} // end namespace detail - template CUTE_HOST_DEVICE constexpr auto find_if(T const& t, F&& f) { if constexpr (is_tuple::value) { - return detail::find_if(t, f, tuple_seq{}); + return detail::tapply(t, f, [] (auto... a) { return cute::C>{}; }, tuple_seq{}); } else { return cute::C{}; } @@ -332,7 +312,7 @@ auto any_of(T const& t, F&& f) { if constexpr (is_tuple::value) { - return detail::apply(cute::transform(t, f), [&] (auto const&... a) { return (false_type{} || ... || a); }, tuple_seq{}); + return detail::tapply(t, f, [] (auto... a) { return (false_type{} || ... || a); }, tuple_seq{}); } else { return f(t); } @@ -346,7 +326,7 @@ auto all_of(T const& t, F&& f) { if constexpr (is_tuple::value) { - return detail::apply(cute::transform(t, f), [&] (auto const&... a) { return (true_type{} && ... && a); }, tuple_seq{}); + return detail::tapply(t, f, [] (auto... a) { return (true_type{} && ... && a); }, tuple_seq{}); } else { return f(t); } diff --git a/include/cute/arch/cluster_sm90.hpp b/include/cute/arch/cluster_sm90.hpp index ba22ef1ca5..524a47efb1 100644 --- a/include/cute/arch/cluster_sm90.hpp +++ b/include/cute/arch/cluster_sm90.hpp @@ -31,6 +31,7 @@ #pragma once #include +#include // Config #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && \ diff --git a/include/cute/arch/config.hpp b/include/cute/arch/config.hpp index 9158953886..2383b4e6c6 100644 --- a/include/cute/arch/config.hpp +++ b/include/cute/arch/config.hpp @@ -72,6 +72,27 @@ # define CUTE_ARCH_TCGEN05_F16BF16_MMA_SCALED_ENABLED #endif +#if (defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101F_ENABLED)) +# define CUTE_ARCH_TMA_SM90_ENABLED +# define CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED +# define CUTE_ARCH_STSM_SM90_ENABLED +# define CUTE_ARCH_TCGEN05_TF32_MMA_ENABLED +# define CUTE_ARCH_TCGEN05_F16F32_MMA_ENABLED +# define CUTE_ARCH_TCGEN05_MXF8F6F4_MMA_ENABLED +# define CUTE_ARCH_TCGEN05_MXF4_MMA_ENABLED +# define CUTE_ARCH_TCGEN05_MXF4NVF4_MMA_ENABLED +#endif + +#if defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) +# define CUTE_ARCH_TCGEN05_F16BF16_MMA_SCALED_ENABLED +#endif + +#if (defined(CUTLASS_ARCH_MMA_SM120F_ENABLED)) +# define CUTE_ARCH_TMA_SM90_ENABLED +# define CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED +# define CUTE_ARCH_STSM_SM90_ENABLED +#endif + #if (defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101A_ENABLED)) # define CUTE_ARCH_TCGEN05_S8_MMA_ENABLED #endif @@ -91,8 +112,11 @@ #endif // {add, mul, fma}.f32x2 PTX -#if (defined(CUTLASS_ARCH_MMA_SM100A_ENABLED)) - #define CUTE_ARCH_FLOAT2_MATH_ENABLED +#if defined(CUTLASS_ARCH_MMA_SM100_ENABLED) || defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) + // Enable CuTe MMA Atoms +# define CUTE_ARCH_FFMA2_SM100_ENABLED + // Enable f32x2 PTX generation +# 
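The tuple_algorithms.hpp hunk above reroutes find_if, any_of, and all_of through detail::tapply without changing what they return. A small compile-time sketch of those preserved semantics (the tuple and predicate are made up for illustration):

```cpp
// Compile-time behavior preserved by the find_if/any_of/all_of refactor above.
#include <cute/container/tuple.hpp>
#include <cute/numeric/integral_constant.hpp>
#include <cute/algorithm/tuple_algorithms.hpp>

inline void tuple_algorithms_demo()
{
  using namespace cute;
  auto gt4 = [](auto v) { return C<(decltype(v)::value > 4)>{}; };  // static predicate
  auto t   = cute::make_tuple(Int<2>{}, Int<8>{}, Int<3>{});

  auto idx = find_if(t, gt4);   // cute::C<1>: Int<8> is the first element > 4
  auto any = any_of(t, gt4);    // statically true
  auto all = all_of(t, gt4);    // statically false

  static_assert(decltype(idx)::value == 1);
  static_assert(decltype(any)::value == true);
  static_assert(decltype(all)::value == false);
  (void)idx; (void)any; (void)all;
}
```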
define CUTE_ARCH_FLOAT2_MATH_ENABLED #endif #if defined(CUTLASS_ARCH_MMA_SM120_ENABLED) || defined(CUTLASS_ARCH_MMA_SM120A_ENABLED) @@ -109,3 +133,37 @@ # endif #endif +#if defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) +# define CUTE_ARCH_LDSM_SM100A_ENABLED +# define CUTE_ARCH_STSM_SM100A_ENABLED +# define CUTE_ARCH_TCGEN05_TMEM_ENABLED +# define CUTE_ARCH_TMA_SM100_ENABLED +# define CUTE_ARCH_FLOAT2_MATH_ENABLED +#endif + +#if defined(CUTLASS_ARCH_MMA_SM101F_ENABLED) +# define CUTE_ARCH_LDSM_SM100A_ENABLED +# define CUTE_ARCH_STSM_SM100A_ENABLED +# define CUTE_ARCH_TCGEN05_TMEM_ENABLED +# define CUTE_ARCH_TMA_SM100_ENABLED +#endif + +#if defined(CUTLASS_ARCH_MMA_SM120F_ENABLED) +# define CUTE_ARCH_LDSM_SM100A_ENABLED +# define CUTE_ARCH_STSM_SM100A_ENABLED +#endif + +#if (defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) ||\ + defined(CUTLASS_ARCH_MMA_SM101A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101F_ENABLED) ||\ + defined(CUTLASS_ARCH_MMA_SM120A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM120F_ENABLED)) +# if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)) +# define CUTE_ARCH_LOAD256_SM100A_ENABLED +# define CUTE_ARCH_STORE256_SM100A_ENABLED +# endif +#endif + +// {add, mul, fma}.f32x2 PTX +#if defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) + #define CUTE_ARCH_FLOAT2_MATH_ENABLED +#endif + diff --git a/include/cute/arch/copy_sm100.hpp b/include/cute/arch/copy_sm100.hpp index 19b13841a1..aa969afe9b 100644 --- a/include/cute/arch/copy_sm100.hpp +++ b/include/cute/arch/copy_sm100.hpp @@ -28,10 +28,6 @@ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * **************************************************************************************************/ - -// - -// #pragma once #include @@ -316,17 +312,14 @@ struct SM100_U8x16_STSM_T } }; -//////////////////////////////////////////////////////////////////////////////////////////////////// - -} // namespace cute - //////////////////////////////////////////////////////////////////////////////////////////////////// // // UTCCP PTX definitions // //////////////////////////////////////////////////////////////////////////////////////////////////// -namespace cute { +namespace SM100::TMEM::UTCCP { + // 128 data path lanes, 256-bit pattern, 1cta mode struct SM100_UTCCP_128dp256bit_1cta { @@ -558,21 +551,19 @@ struct SM100_UTCCP_2x64dp128bitlw0123_2cta } }; -//////////////////////////////////////////////////////////////////////////////////////////////////// - -} // namespace cute +} // end namespace SM100::TMEM::UTCCP //////////////////////////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////////////////////////// -namespace cute { +namespace SM100::TMEM::LOAD { //////////////////////////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////////////////////////// // -// TMEM_LOAD PTX definitions +// TMEM LOAD PTX definitions // //////////////////////////////////////////////////////////////////////////////////////////////////// @@ -3945,7 +3936,6 @@ struct SM100_TMEM_LOAD_32dp32b128x } }; - //////////////////////////////////////////////////////////////////////////////////////////////////// // 32 data path lanes, 32-bit pattern, repeated 128 times, packed 16b read @@ -4065,9 +4055,21 @@ struct SM100_TMEM_LOAD_32dp32b128x_16b 
//////////////////////////////////////////////////////////////////////////////////////////////////// +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace SM100::TMEM::LOAD + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +namespace SM100::TMEM::STORE { + +//////////////////////////////////////////////////////////////////////////////////////////////////// + //////////////////////////////////////////////////////////////////////////////////////////////////// // -// TMEM_STORE PTX definitions +// TMEM STORE PTX definitions // //////////////////////////////////////////////////////////////////////////////////////////////////// @@ -4086,8 +4088,8 @@ struct SM100_TMEM_STORE_16dp256b1x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x256b.x1.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4110,8 +4112,8 @@ struct SM100_TMEM_STORE_16dp256b1x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x256b.x1.unpack::16b.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4136,8 +4138,8 @@ struct SM100_TMEM_STORE_16dp256b2x asm volatile ("tcgen05.st.sync.aligned.16x256b.x2.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -4163,8 +4165,8 @@ struct SM100_TMEM_STORE_16dp256b2x_16b asm volatile ("tcgen05.st.sync.aligned.16x256b.x2.unpack::16b.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -4194,8 +4196,8 @@ struct SM100_TMEM_STORE_16dp256b4x "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4227,8 +4229,8 @@ struct SM100_TMEM_STORE_16dp256b4x_16b "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4268,8 +4270,8 @@ struct SM100_TMEM_STORE_16dp256b8x "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4313,8 +4315,8 @@ struct SM100_TMEM_STORE_16dp256b8x_16b "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), 
"r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4374,8 +4376,8 @@ struct SM100_TMEM_STORE_16dp256b16x "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4443,8 +4445,8 @@ struct SM100_TMEM_STORE_16dp256b16x_16b "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4544,8 +4546,8 @@ struct SM100_TMEM_STORE_16dp256b32x "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -4661,8 +4663,8 @@ struct SM100_TMEM_STORE_16dp256b32x_16b "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -4716,8 +4718,8 @@ struct SM100_TMEM_STORE_16dp128b1x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x128b.x1.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4740,8 +4742,8 @@ struct SM100_TMEM_STORE_16dp128b1x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x128b.x1.unpack::16b.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4764,8 +4766,8 @@ struct SM100_TMEM_STORE_16dp128b2x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x128b.x2.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4788,8 +4790,8 @@ struct SM100_TMEM_STORE_16dp128b2x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x128b.x2.unpack::16b.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -4814,8 +4816,8 @@ struct SM100_TMEM_STORE_16dp128b4x asm volatile ("tcgen05.st.sync.aligned.16x128b.x4.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -4841,8 +4843,8 @@ struct SM100_TMEM_STORE_16dp128b4x_16b asm volatile ("tcgen05.st.sync.aligned.16x128b.x4.unpack::16b.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, 
%8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -4872,8 +4874,8 @@ struct SM100_TMEM_STORE_16dp128b8x "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4905,8 +4907,8 @@ struct SM100_TMEM_STORE_16dp128b8x_16b "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4946,8 +4948,8 @@ struct SM100_TMEM_STORE_16dp128b16x "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -4991,8 +4993,8 @@ struct SM100_TMEM_STORE_16dp128b16x_16b "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5052,8 +5054,8 @@ struct SM100_TMEM_STORE_16dp128b32x "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5121,8 +5123,8 @@ struct SM100_TMEM_STORE_16dp128b32x_16b "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5222,8 +5224,8 @@ struct SM100_TMEM_STORE_16dp128b64x "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -5339,8 +5341,8 @@ struct SM100_TMEM_STORE_16dp128b64x_16b "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -5394,8 +5396,8 @@ struct SM100_TMEM_STORE_16dp64b1x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x1.b32" "[%0]," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5418,8 +5420,8 @@ struct SM100_TMEM_STORE_16dp64b1x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x1.unpack::16b.b32" "[%0]," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else 
CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5442,8 +5444,8 @@ struct SM100_TMEM_STORE_16dp64b2x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x2.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5466,8 +5468,8 @@ struct SM100_TMEM_STORE_16dp64b2x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x2.unpack::16b.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5490,8 +5492,8 @@ struct SM100_TMEM_STORE_16dp64b4x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x4.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5514,8 +5516,8 @@ struct SM100_TMEM_STORE_16dp64b4x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x64b.x4.unpack::16b.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -5540,8 +5542,8 @@ struct SM100_TMEM_STORE_16dp64b8x asm volatile ("tcgen05.st.sync.aligned.16x64b.x8.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -5567,8 +5569,8 @@ struct SM100_TMEM_STORE_16dp64b8x_16b asm volatile ("tcgen05.st.sync.aligned.16x64b.x8.unpack::16b.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -5598,8 +5600,8 @@ struct SM100_TMEM_STORE_16dp64b16x "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5631,8 +5633,8 @@ struct SM100_TMEM_STORE_16dp64b16x_16b "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5672,8 +5674,8 @@ struct SM100_TMEM_STORE_16dp64b32x "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5717,8 +5719,8 @@ struct SM100_TMEM_STORE_16dp64b32x_16b "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), 
"r"(src10), "r"(src11), @@ -5778,8 +5780,8 @@ struct SM100_TMEM_STORE_16dp64b64x "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5847,8 +5849,8 @@ struct SM100_TMEM_STORE_16dp64b64x_16b "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -5948,8 +5950,8 @@ struct SM100_TMEM_STORE_16dp64b128x "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -6065,8 +6067,8 @@ struct SM100_TMEM_STORE_16dp64b128x_16b "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -6120,8 +6122,8 @@ struct SM100_TMEM_STORE_16dp32b1x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x1.b32" "[%0] , 1," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6144,8 +6146,8 @@ struct SM100_TMEM_STORE_16dp32b1x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x1.unpack::16b.b32" "[%0] , 2," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6168,8 +6170,8 @@ struct SM100_TMEM_STORE_16dp32b2x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x2.b32" "[%0] , 2," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6192,8 +6194,8 @@ struct SM100_TMEM_STORE_16dp32b2x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x2.unpack::16b.b32" "[%0] , 4," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6216,8 +6218,8 @@ struct SM100_TMEM_STORE_16dp32b4x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x4.b32" "[%0] , 4," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6240,8 +6242,8 @@ struct SM100_TMEM_STORE_16dp32b4x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x4.unpack::16b.b32" "[%0] , 8," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), 
"r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6266,8 +6268,8 @@ struct SM100_TMEM_STORE_16dp32b8x asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x8.b32" "[%0] , 8," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -6293,8 +6295,8 @@ struct SM100_TMEM_STORE_16dp32b8x_16b asm volatile ("tcgen05.st.sync.aligned.16x32bx2.x8.unpack::16b.b32" "[%0] , 16," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -6324,8 +6326,8 @@ struct SM100_TMEM_STORE_16dp32b16x "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6357,8 +6359,8 @@ struct SM100_TMEM_STORE_16dp32b16x_16b "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6398,8 +6400,8 @@ struct SM100_TMEM_STORE_16dp32b32x "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6443,8 +6445,8 @@ struct SM100_TMEM_STORE_16dp32b32x_16b "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6504,8 +6506,8 @@ struct SM100_TMEM_STORE_16dp32b64x "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6573,8 +6575,8 @@ struct SM100_TMEM_STORE_16dp32b64x_16b "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -6674,8 +6676,8 @@ struct SM100_TMEM_STORE_16dp32b128x "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -6791,8 +6793,8 @@ struct SM100_TMEM_STORE_16dp32b128x_16b "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), 
"r"(src009), "r"(src010), "r"(src011), @@ -6846,8 +6848,8 @@ struct SM100_TMEM_STORE_32dp32b1x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x1.b32" "[%0]," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6870,8 +6872,8 @@ struct SM100_TMEM_STORE_32dp32b1x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x1.unpack::16b.b32" "[%0]," - "{%1};\n" - : + "{%1};\n" + : : "r"(dst_addr), "r"(src0) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6894,8 +6896,8 @@ struct SM100_TMEM_STORE_32dp32b2x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x2.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6918,8 +6920,8 @@ struct SM100_TMEM_STORE_32dp32b2x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x2.unpack::16b.b32" "[%0]," - "{%1, %2};\n" - : + "{%1, %2};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6942,8 +6944,8 @@ struct SM100_TMEM_STORE_32dp32b4x #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x4.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6966,8 +6968,8 @@ struct SM100_TMEM_STORE_32dp32b4x_16b #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) asm volatile ("tcgen05.st.sync.aligned.32x32b.x4.unpack::16b.b32" "[%0]," - "{%1, %2, %3, %4};\n" - : + "{%1, %2, %3, %4};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3) ); #else CUTE_INVALID_CONTROL_PATH("Trying to use TMEM_STORE without CUTE_ARCH_TCGEN05_TMEM_ENABLED."); @@ -6992,8 +6994,8 @@ struct SM100_TMEM_STORE_32dp32b8x asm volatile ("tcgen05.st.sync.aligned.32x32b.x8.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -7019,8 +7021,8 @@ struct SM100_TMEM_STORE_32dp32b8x_16b asm volatile ("tcgen05.st.sync.aligned.32x32b.x8.unpack::16b.b32" "[%0]," "{%1, %2, %3, %4," - "%5, %6, %7, %8};\n" - : + "%5, %6, %7, %8};\n" + : : "r"(dst_addr), "r"(src0), "r"(src1), "r"(src2), "r"(src3), "r"(src4), "r"(src5), "r"(src6), "r"(src7) ); #else @@ -7050,8 +7052,8 @@ struct SM100_TMEM_STORE_32dp32b16x "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7083,8 +7085,8 @@ struct SM100_TMEM_STORE_32dp32b16x_16b "{%1, %2, %3, %4," "%5, %6, %7, %8," "%9, %10, %11, %12," - "%13, %14, %15, %16};\n" - : + "%13, %14, %15, %16};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7124,8 +7126,8 @@ struct SM100_TMEM_STORE_32dp32b32x 
"%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7169,8 +7171,8 @@ struct SM100_TMEM_STORE_32dp32b32x_16b "%17, %18, %19, %20," "%21, %22, %23, %24," "%25, %26, %27, %28," - "%29, %30, %31, %32};\n" - : + "%29, %30, %31, %32};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7230,8 +7232,8 @@ struct SM100_TMEM_STORE_32dp32b64x "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7299,8 +7301,8 @@ struct SM100_TMEM_STORE_32dp32b64x_16b "%49, %50, %51, %52," "%53, %54, %55, %56," "%57, %58, %59, %60," - "%61, %62, %63, %64};\n" - : + "%61, %62, %63, %64};\n" + : : "r"(dst_addr), "r"(src00), "r"(src01), "r"(src02), "r"(src03), "r"(src04), "r"(src05), "r"(src06), "r"(src07), "r"(src08), "r"(src09), "r"(src10), "r"(src11), @@ -7400,8 +7402,8 @@ struct SM100_TMEM_STORE_32dp32b128x "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -7517,8 +7519,8 @@ struct SM100_TMEM_STORE_32dp32b128x_16b "%113, %114, %115, %116," "%117, %118, %119, %120," "%121, %122, %123, %124," - "%125, %126, %127, %128};\n" - : + "%125, %126, %127, %128};\n" + : : "r"(dst_addr), "r"(src000), "r"(src001), "r"(src002), "r"(src003), "r"(src004), "r"(src005), "r"(src006), "r"(src007), "r"(src008), "r"(src009), "r"(src010), "r"(src011), @@ -7561,7 +7563,8 @@ struct SM100_TMEM_STORE_32dp32b128x_16b //////////////////////////////////////////////////////////////////////////////////////////////////// -} // namespace cute +} // namespace SM100::TMEM::STORE //////////////////////////////////////////////////////////////////////////////////////////////////// +} // end namespace cute diff --git a/include/cute/arch/mma_sm100.hpp b/include/cute/arch/mma_sm100.hpp index 2fa532d2ef..749da8167e 100644 --- a/include/cute/arch/mma_sm100.hpp +++ b/include/cute/arch/mma_sm100.hpp @@ -29,7 +29,6 @@ * **************************************************************************************************/ // - // #pragma once @@ -37,6 +36,48 @@ #include #include +#include + namespace cute { +struct SM100_2x1x1_F32F32F32F32 { + using DRegisters = float2[1]; + using ARegisters = float2[1]; + using BRegisters = float[1]; + using CRegisters = float2[1]; + + CUTE_HOST_DEVICE static void + fma(float2 & d01, + float2 const& a01, + float const& b0, + float2 const& c01) + { +#if defined(CUTE_ARCH_FFMA2_SM100_ENABLED) + cute::fma(d01, a01, make_float2(b0, b0), c01); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM100_2x1x1_F32F32F32F32 without CUTE_ARCH_FLOAT2_MATH_ENABLED"); +#endif + } +}; + +struct SM100_1x2x1_F32F32F32F32 { + using DRegisters = float2[1]; + using ARegisters = float[1]; + using BRegisters = float2[1]; + using CRegisters = float2[1]; + + CUTE_HOST_DEVICE static void + 
fma(float2 & d01, + float const& a0, + float2 const& b01, + float2 const& c01) + { +#if defined(CUTE_ARCH_FFMA2_SM100_ENABLED) + cute::fma(d01, make_float2(a0, a0), b01, c01); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM100_1x2x1_F32F32F32F32 without CUTE_ARCH_FFMA2_SM100_ENABLED"); +#endif + } +}; + } // namespace cute diff --git a/include/cute/arch/mma_sm120.hpp b/include/cute/arch/mma_sm120.hpp index 84c09b8b93..1433a2c8d0 100644 --- a/include/cute/arch/mma_sm120.hpp +++ b/include/cute/arch/mma_sm120.hpp @@ -3245,7 +3245,7 @@ rr_blockscaled_op_selector_sm120() { if constexpr (UseF8F6F4) { return SM120::BLOCKSCALED::SM120_16x8x32_TN_VS{}; - } + } else{ return SM120::BLOCKSCALED::SM120_16x8x64_TN_VS{}; } diff --git a/include/cute/arch/mma_sm89.hpp b/include/cute/arch/mma_sm89.hpp new file mode 100644 index 0000000000..85d7bb64ae --- /dev/null +++ b/include/cute/arch/mma_sm89.hpp @@ -0,0 +1,180 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + +// + +// +#pragma once + +#include +#include + +//////////////////////////////////////////////////////////////////////////////// + +#if (__CUDACC_VER_MAJOR__ > 12) || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4) +# define CUTE_ARCH_MMA_F32_SM89_SUPPORTED +#endif + +#if (__CUDACC_VER_MAJOR__ > 12) || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 8) +# define CUTE_ARCH_MMA_F16_SM89_SUPPORTED +#endif + +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 890) +# if defined(CUTE_ARCH_MMA_F32_SM89_SUPPORTED) +# define CUTE_ARCH_MMA_F32_SM89_ENABLED +# endif + +# if defined(CUTE_ARCH_MMA_F16_SM89_SUPPORTED) +# define CUTE_ARCH_MMA_F16_SM89_ENABLED +# endif +#endif + +//////////////////////////////////////////////////////////////////////////////// + +namespace cute { +// MMA 16x8x32 TN +struct SM89_16x8x32_F32E4M3E4M3F32_TN +{ + using DRegisters = float[4]; + using ARegisters = uint32_t[4]; + using BRegisters = uint32_t[2]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(float & d0, float & d1, float & d2, float & d3, + uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint32_t const& b0, uint32_t const& b1, + float const& c0, float const& c1, float const& c2, float const& c3) + { +#if defined(CUTE_ARCH_MMA_F32_SM89_ENABLED) + asm( + "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3) + : + "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "r"(b0), "r"(b1), + "f"(c0), "f"(c1), "f"(c2), "f"(c3) + ); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM89_16x8x32_F32E4M3E4M3F32_TN without CUTE_ARCH_MMA_F32_SM89_ENABLED"); +#endif + } +}; + +struct SM89_16x8x32_F32E4M3E5M2F32_TN +{ + using DRegisters = float[4]; + using ARegisters = uint32_t[4]; + using BRegisters = uint32_t[2]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(float & d0, float & d1, float & d2, float & d3, + uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint32_t const& b0, uint32_t const& b1, + float const& c0, float const& c1, float const& c2, float const& c3) + { +#if defined(CUTE_ARCH_MMA_F32_SM89_ENABLED) + asm( + "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e5m2.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3) + : + "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "r"(b0), "r"(b1), + "f"(c0), "f"(c1), "f"(c2), "f"(c3) + ); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM89_16x8x32_F32E4M3E5M2F32_TN without CUTE_ARCH_MMA_F32_SM89_ENABLED"); +#endif + } +}; + +struct SM89_16x8x32_F32E5M2E5M2F32_TN +{ + using DRegisters = float[4]; + using ARegisters = uint32_t[4]; + using BRegisters = uint32_t[2]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(float & d0, float & d1, float & d2, float & d3, + uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint32_t const& b0, uint32_t const& b1, + float const& c0, float const& c1, float const& c2, float const& c3) + { +#if defined(CUTE_ARCH_MMA_F32_SM89_ENABLED) + asm( + "mma.sync.aligned.m16n8k32.row.col.f32.e5m2.e5m2.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3) + : + "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "r"(b0), "r"(b1), + "f"(c0), "f"(c1), "f"(c2), "f"(c3) + ); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM89_16x8x32_F32E5M2E5M2F32_TN without CUTE_ARCH_MMA_F32_SM89_ENABLED"); +#endif + } +}; + +struct SM89_16x8x32_F32E5M2E4M3F32_TN +{ + using DRegisters = float[4]; + using ARegisters = uint32_t[4]; + using BRegisters = uint32_t[2]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(float & d0, float & d1, float & d2, float & d3, + uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint32_t const& b0, uint32_t const& b1, + float const& c0, float const& c1, float const& c2, float const& c3) + { +#if defined(CUTE_ARCH_MMA_F32_SM89_ENABLED) + asm( + "mma.sync.aligned.m16n8k32.row.col.f32.e5m2.e4m3.f32 " + "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n" + : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3) + : + "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "r"(b0), "r"(b1), + "f"(c0), "f"(c1), "f"(c2), "f"(c3) + ); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM89_16x8x32_F32E5M2E4M3F32_TN without CUTE_ARCH_MMA_F32_SM89_ENABLED"); +#endif + } +}; + +} // namespace cute diff --git a/include/cute/arch/simd_sm100.hpp b/include/cute/arch/simd_sm100.hpp index 1c07a31e6d..58d8810e47 100644 --- a/include/cute/arch/simd_sm100.hpp +++ b/include/cute/arch/simd_sm100.hpp @@ -37,7 +37,6 @@ #include #include #include - namespace cute { CUTE_HOST_DEVICE diff --git a/include/cute/arch/tmem_allocator_sm100.hpp b/include/cute/arch/tmem_allocator_sm100.hpp index 6cd9223b76..347a619508 100644 --- a/include/cute/arch/tmem_allocator_sm100.hpp +++ b/include/cute/arch/tmem_allocator_sm100.hpp @@ -28,19 +28,34 @@ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * **************************************************************************************************/ -// -// #pragma once #include -#include -#include - -#include +#include +#include +#include namespace cute::TMEM { +// +// TMEM Addressing Constants +// + +// 128 DP x 512 COL x uint32_t-addressing +using MAX_CAPACITY_BITS = Int<128*512*32>; + +// TMEM DP stride in bit-addressing (shift by 5 for conversion from uint32_t) +using DP_b = cute::constant; + +// TMEM DP stride in type-T addressing +template +using DP = cute::constant::OffsetShift)>; + +// +// TMEM Allocators +// + // All operations of this class require that only a single warp uniformly participates class Allocator1Sm { public: @@ -57,7 +72,7 @@ class Allocator1Sm { * @pre Must never be issued by more than one warp at the same time. * @pre For repeated allocations, the same warp must be used to issue all allocations. **/ - CUTLASS_DEVICE void + CUTE_HOST_DEVICE void allocate(int num_columns, uint32_t* dst_ptr) { #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) uint32_t dst_intptr = cute::cast_smem_ptr_to_uint(dst_ptr); @@ -77,8 +92,8 @@ class Allocator1Sm { asm volatile( "{\n\t" "tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1; \n\t" - "}" - : + "}" + : : "r"(tmem_ptr), "r"(num_columns)); #else CUTE_INVALID_CONTROL_PATH("Attempting to use TMEM allocation PTX without CUTE_ARCH_TCGEN05_TMEM_ENABLED"); @@ -116,7 +131,7 @@ class Allocator2Sm { * @pre For repeated allocations, the same warp must be used to issue all allocations. * @pre The 2 warps from participating CTAs have the same logical warp ID. 
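The new mma_sm89.hpp above exposes the Ada FP8 mma.sync.m16n8k32 instructions as plain CuTe arch structs. A minimal device-side sketch that drives one of them directly through its fma() interface (the register contents are placeholders; a real kernel would obtain packed e4m3 fragments from a CuTe partitioning and would normally go through an MMA atom rather than calling the op by hand):

```cpp
// Direct use of one of the SM89 FP8 MMA ops added above (requires sm_89 and,
// per the guards in this patch, CUDA 12.4+ for the f32-accumulator variants).
#include <cute/arch/mma_sm89.hpp>

__global__ void sm89_fp8_mma_demo(float* out)
{
  uint32_t a[4] = {0u, 0u, 0u, 0u};   // 16 e4m3 values per thread (4 per uint32_t)
  uint32_t b[2] = {0u, 0u};           //  8 e4m3 values per thread
  float    c[4] = {0.f, 0.f, 0.f, 0.f};
  float    d[4];

  cute::SM89_16x8x32_F32E4M3E4M3F32_TN::fma(
      d[0], d[1], d[2], d[3],
      a[0], a[1], a[2], a[3],
      b[0], b[1],
      c[0], c[1], c[2], c[3]);

  for (int i = 0; i < 4; ++i) {
    out[threadIdx.x * 4 + i] = d[i];
  }
}
```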
**/ - CUTLASS_DEVICE void + CUTE_HOST_DEVICE void allocate(int num_columns, uint32_t* dst_ptr) { #if defined(CUTE_ARCH_TCGEN05_TMEM_ENABLED) uint32_t dst_intptr = cute::cast_smem_ptr_to_uint(dst_ptr); @@ -130,7 +145,7 @@ class Allocator2Sm { } /** - * Frees the TMEM corresponding to the pointer and slice count provided. + * Frees the TMEM corresponding to the pointer and slice count provided. * Release the TMEM after checking that the CTA issuing the free does indeed own the corresponding slices. * @param tmem_ptr Base address of the TMEM address space being freed. * @param num_columns Number of columns being freed. Must be 32 <= num_columns <= 512 and power of 2. @@ -146,8 +161,8 @@ class Allocator2Sm { asm volatile( "{\n\t" "tcgen05.dealloc.cta_group::2.sync.aligned.b32 %0, %1; \n\t" - "}" - : + "}" + : : "r"(tmem_ptr), "r"(num_columns)); #else CUTE_INVALID_CONTROL_PATH("Attempting to use TMEM allocation PTX without CUTE_ARCH_TCGEN05_TMEM_ENABLED"); diff --git a/include/cute/arch/util.hpp b/include/cute/arch/util.hpp index b0899f7a83..c9ff9ef878 100644 --- a/include/cute/arch/util.hpp +++ b/include/cute/arch/util.hpp @@ -88,7 +88,7 @@ namespace cute { /// CUTE helper to cast SMEM pointer to unsigned -CUTE_DEVICE +CUTE_HOST_DEVICE uint32_t cast_smem_ptr_to_uint(void const* const ptr) { diff --git a/include/cute/atom/copy_traits_sm100.hpp b/include/cute/atom/copy_traits_sm100.hpp index 6a767ae3c0..594149d4fd 100644 --- a/include/cute/atom/copy_traits_sm100.hpp +++ b/include/cute/atom/copy_traits_sm100.hpp @@ -28,13 +28,11 @@ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * **************************************************************************************************/ -// - -// #pragma once #include +#include #include #include @@ -230,92 +228,11 @@ struct Copy_Traits using RefLayout = SrcLayout; }; -namespace TMEM { - using MAX_CAPACITY_BITS = Int<128*512*32>; // 128 DP x 512 COL x uint32_t-addressing - - template // TMEM DP stride in type-T addressing - using DP = cute::constant::OffsetShift)>; - - using DP_b = cute::constant; // TMEM DP stride in bit-addressing (shift by 5 for conversion from uint32_t) -} - -// TMEM_LOAD copy_unpack -template -struct TMEM_LOAD_Unpack -{ - template - CUTE_HOST_DEVICE friend constexpr void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_tmem::value, "Expected TMEM src."); - static_assert(is_rmem::value, "Expected RMEM dst."); - - using SrcType = typename TS::value_type; - CUTE_STATIC_ASSERT_V((coalesce(layout(src)) == coalesce(upcast::value>(typename Copy_Traits::ValID{}))), - "Expected src to have the specific TMEM layout required by CopyOp."); - - uint32_t tmem_addr = raw_pointer_cast(src.data()); - - using RegTypeDst = typename remove_extent::type; - Tensor rD = recast(dst); - - constexpr int RegNumDst = extent::value; - CUTE_STATIC_ASSERT_V(size(rD) == Int{}, - "In CopyAtom, dst layout doesn't vectorize into registers. This dst layout is incompatible with this CopyOp."); - - // thread idx <=> DP lane assert. - // ASSERT TMEM_LOAD thread attemping to access DP lane within sub-partition. 
-#if defined(__CUDA_ARCH__) && !defined(NDEBUG) - assert(((uint32_t(threadIdx.x) / 32) % 4) == (((tmem_addr >> 16) / 32) % 4)); -#endif - - detail::explode(CopyOp::copy, - &tmem_addr, seq<0>{}, - rD, make_seq{}); - } -}; - -// TMEM_STORE copy_unpack -template -struct TMEM_STORE_Unpack -{ - template - CUTE_HOST_DEVICE friend constexpr void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected RMEM src."); - static_assert(is_tmem::value, "Expected TMEM dst."); - - using RegTypeSrc = typename remove_extent::type; - Tensor rS = recast(src); - - constexpr int RegNumSrc = extent::value; - CUTE_STATIC_ASSERT_V(size(rS) == Int{}, - "In CopyAtom, src layout doesn't vectorize into registers. This src layout is incompatible with this tiled copy."); - - using DstType = typename TD::value_type; - CUTE_STATIC_ASSERT_V((coalesce(layout(dst)) == coalesce(upcast::value>(typename Copy_Traits::ValID{}))), - "Expected dst to have the specific TMEM layout required by CopyOp."); - - uint32_t tmem_addr = raw_pointer_cast(dst.data()); - - // thread idx <=> DP lane assert. - // ASSERT TMEM_LOAD thread attemping to access DP lane within sub-partition. -#if defined(__CUDA_ARCH__) && !defined(NDEBUG) - assert(((uint32_t(threadIdx.x) / 32) % 4) == (((tmem_addr >> 16) / 32) % 4)); -#endif - - detail::explode(CopyOp::copy, - rS, make_seq{}, - &tmem_addr, seq<0>{}); - } -}; +//////////////////////////////////////////////////////////////////////////////////////////////////// +// +// TMEM Traits and Utilities +// +//////////////////////////////////////////////////////////////////////////////////////////////////// template struct Copy_Atom; @@ -418,817 +335,162 @@ make_tmem_warp_partitioner(Tensor const& tmem) return make_tiler_impl(layout_tv, tiler); } -} // end namespace cute +namespace SM100::TMEM::LOAD { + +// +// Specialized copy_unpack implementation for SM100::TMEM::LOAD instructions +// + +template +CUTE_HOST_DEVICE constexpr +void +copy_unpack(Copy_Traits const& traits, + Tensor const& src, + Tensor & dst) +{ + static_assert(is_tmem::value, "Expected TMEM src."); + static_assert(is_rmem::value, "Expected RMEM dst."); + + using SrcType = typename TS::value_type; + CUTE_STATIC_ASSERT_V((coalesce(layout(src)) == coalesce(upcast::value>(typename Copy_Traits::ValID{}))), + "Expected src to have the specific TMEM layout required by CopyOp."); + + uint32_t tmem_addr = raw_pointer_cast(src.data()); + + using RegTypeDst = typename remove_extent::type; + Tensor rD = recast(dst); + + constexpr int RegNumDst = extent::value; + CUTE_STATIC_ASSERT_V(size(rD) == Int{}, + "In CopyAtom, dst layout doesn't vectorize into registers. This dst layout is incompatible with this CopyOp."); + + // thread idx <=> DP lane assert. + // ASSERT TMEM_LOAD thread attemping to access DP lane within sub-partition. 
+#if defined(__CUDA_ARCH__) && !defined(NDEBUG) + assert(((uint32_t(threadIdx.x) / 32) % 4) == (((tmem_addr >> 16) / 32) % 4)); +#endif + + detail::explode(CopyOp::copy, + &tmem_addr, seq<0>{}, + rD, make_seq{}); +} + +} // end namespace SM100::TMEM::LOAD + +namespace SM100::TMEM::STORE { + +// +// Specialized copy_unpack implementation for SM100::TMEM::STORE instructions +// + +template +CUTE_HOST_DEVICE constexpr +void +copy_unpack(Copy_Traits const& traits, + Tensor const& src, + Tensor & dst) +{ + static_assert(is_rmem::value, "Expected RMEM src."); + static_assert(is_tmem::value, "Expected TMEM dst."); + + using RegTypeSrc = typename remove_extent::type; + Tensor rS = recast(src); + + constexpr int RegNumSrc = extent::value; + CUTE_STATIC_ASSERT_V(size(rS) == Int{}, + "In CopyAtom, src layout doesn't vectorize into registers. This src layout is incompatible with this tiled copy."); + + using DstType = typename TD::value_type; + CUTE_STATIC_ASSERT_V((coalesce(layout(dst)) == coalesce(upcast::value>(typename Copy_Traits::ValID{}))), + "Expected dst to have the specific TMEM layout required by CopyOp."); + + uint32_t tmem_addr = raw_pointer_cast(dst.data()); + + // thread idx <=> DP lane assert. + // ASSERT TMEM_LOAD thread attemping to access DP lane within sub-partition. +#if defined(__CUDA_ARCH__) && !defined(NDEBUG) + assert(((uint32_t(threadIdx.x) / 32) % 4) == (((tmem_addr >> 16) / 32) % 4)); +#endif + + detail::explode(CopyOp::copy, + rS, make_seq{}, + &tmem_addr, seq<0>{}); +} + +} // end namespace SM100::TMEM::STORE + +//////////////////////////////////////////////////////////////////////////////////////////////////// +// +// TMEM_LOAD Copy Traits +// +//////////////////////////////////////////////////////////////////////////////////////////////////// + //////////////////////////////////////////////////////////////////////////////////////////////////// -namespace cute { +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b1x; + +template <> +struct Copy_Traits +{ + // Logical thread id to thread idx (warp) + using ThrID = Layout<_32>; + // Logical bit id to bit idx (address) + using ValID = Layout, + Stride< _1,TMEM::DP_b>>; + // Map from (src-thr,src-val) to bit + using SrcLayout = Layout, + Stride< _0, _1>>; + // Map from (dst-thr,dst-val) to bit + using DstLayout = Layout,Shape <_64, _2>>, + Stride,Stride< _1,_2048>>>; + // Reference map from (thr,val) to bit + using RefLayout = SrcLayout; +}; //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b1x_16b; + +template <> +struct Copy_Traits +{ + using ThrID = Layout<_32>; + using ValID = Layout, _16>, + Stride,TMEM::DP_b>>; + using SrcLayout = Layout, + Stride< _0, _1>>; + using DstLayout = Layout,Shape <_64, _2>>, + Stride,Stride< _1,_2048>>>; + using RefLayout = SrcLayout; +}; + //////////////////////////////////////////////////////////////////////////////////////////////////// -namespace TMEM { +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b2x; + +template <> +struct Copy_Traits +{ + using ThrID = Layout<_32>; + using ValID = Layout, + Stride< _1,TMEM::DP_b>>; + using SrcLayout = Layout, + Stride< _0, _1>>; + using DstLayout = Layout,Shape <_64, _2, _2>>, + Stride,Stride< _1,_4096,_256>>>; + using RefLayout = SrcLayout; +}; //////////////////////////////////////////////////////////////////////////////////////////////////// -// Given a 1x tmem copy op, returns the widest repeated variant that divides the specified bits in the 
N-mode -template -CUTE_HOST_DEVICE constexpr -auto -op_repeater() -{ - if constexpr (cute::is_same_v) { - if constexpr (bits_n % (256 * 32) == 0) { - return SM100_TMEM_LOAD_16dp256b32x{}; - } - else if constexpr (bits_n % (256 * 16) == 0) { - return SM100_TMEM_LOAD_16dp256b16x{}; - } - else if constexpr (bits_n % (256 * 8) == 0) { - return SM100_TMEM_LOAD_16dp256b8x{}; - } - else if constexpr (bits_n % (256 * 4) == 0) { - return SM100_TMEM_LOAD_16dp256b4x{}; - } - else if constexpr (bits_n % (256 * 2) == 0) { - return SM100_TMEM_LOAD_16dp256b2x{}; - } - else if constexpr (bits_n % (256 * 1) == 0) { - return SM100_TMEM_LOAD_16dp256b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (256 * 32) == 0) { - return SM100_TMEM_LOAD_16dp256b32x_16b{}; - } - else if constexpr (bits_n % (256 * 16) == 0) { - return SM100_TMEM_LOAD_16dp256b16x_16b{}; - } - else if constexpr (bits_n % (256 * 8) == 0) { - return SM100_TMEM_LOAD_16dp256b8x_16b{}; - } - else if constexpr (bits_n % (256 * 4) == 0) { - return SM100_TMEM_LOAD_16dp256b4x_16b{}; - } - else if constexpr (bits_n % (256 * 2) == 0) { - return SM100_TMEM_LOAD_16dp256b2x_16b{}; - } - else if constexpr (bits_n % (256 * 1) == 0) { - return SM100_TMEM_LOAD_16dp256b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (128 * 64) == 0) { - return SM100_TMEM_LOAD_16dp128b64x{}; - } - else if constexpr (bits_n % (128 * 32) == 0) { - return SM100_TMEM_LOAD_16dp128b32x{}; - } - else if constexpr (bits_n % (128 * 16) == 0) { - return SM100_TMEM_LOAD_16dp128b16x{}; - } - else if constexpr (bits_n % (128 * 8) == 0) { - return SM100_TMEM_LOAD_16dp128b8x{}; - } - else if constexpr (bits_n % (128 * 4) == 0) { - return SM100_TMEM_LOAD_16dp128b4x{}; - } - else if constexpr (bits_n % (128 * 2) == 0) { - return SM100_TMEM_LOAD_16dp128b2x{}; - } - else if constexpr (bits_n % (128 * 1) == 0) { - return SM100_TMEM_LOAD_16dp128b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (128 * 64) == 0) { - return SM100_TMEM_LOAD_16dp128b64x_16b{}; - } - else if constexpr (bits_n % (128 * 32) == 0) { - return SM100_TMEM_LOAD_16dp128b32x_16b{}; - } - else if constexpr (bits_n % (128 * 16) == 0) { - return SM100_TMEM_LOAD_16dp128b16x_16b{}; - } - else if constexpr (bits_n % (128 * 8) == 0) { - return SM100_TMEM_LOAD_16dp128b8x_16b{}; - } - else if constexpr (bits_n % (128 * 4) == 0) { - return SM100_TMEM_LOAD_16dp128b4x_16b{}; - } - else if constexpr (bits_n % (128 * 2) == 0) { - return SM100_TMEM_LOAD_16dp128b2x_16b{}; - } - else if constexpr (bits_n % (128 * 1) == 0) { - return SM100_TMEM_LOAD_16dp128b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_LOAD_16dp64b128x{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_LOAD_16dp64b64x{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_LOAD_16dp64b32x{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_LOAD_16dp64b16x{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_LOAD_16dp64b8x{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_LOAD_16dp64b4x{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_LOAD_16dp64b2x{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_LOAD_16dp64b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_LOAD_16dp64b128x_16b{}; - } - 
else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_LOAD_16dp64b64x_16b{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_LOAD_16dp64b32x_16b{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_LOAD_16dp64b16x_16b{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_LOAD_16dp64b8x_16b{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_LOAD_16dp64b4x_16b{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_LOAD_16dp64b2x_16b{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_LOAD_16dp64b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_LOAD_16dp32b128x{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_LOAD_16dp32b64x{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_LOAD_16dp32b32x{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_LOAD_16dp32b16x{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_LOAD_16dp32b8x{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_LOAD_16dp32b4x{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_LOAD_16dp32b2x{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_LOAD_16dp32b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_LOAD_16dp32b128x_16b{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_LOAD_16dp32b64x_16b{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_LOAD_16dp32b32x_16b{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_LOAD_16dp32b16x_16b{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_LOAD_16dp32b8x_16b{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_LOAD_16dp32b4x_16b{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_LOAD_16dp32b2x_16b{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_LOAD_16dp32b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (32 * 128) == 0) { - return SM100_TMEM_LOAD_32dp32b128x{}; - } - else if constexpr (bits_n % (32 * 64) == 0) { - return SM100_TMEM_LOAD_32dp32b64x{}; - } - else if constexpr (bits_n % (32 * 32) == 0) { - return SM100_TMEM_LOAD_32dp32b32x{}; - } - else if constexpr (bits_n % (32 * 16) == 0) { - return SM100_TMEM_LOAD_32dp32b16x{}; - } - else if constexpr (bits_n % (32 * 8) == 0) { - return SM100_TMEM_LOAD_32dp32b8x{}; - } - else if constexpr (bits_n % (32 * 4) == 0) { - return SM100_TMEM_LOAD_32dp32b4x{}; - } - else if constexpr (bits_n % (32 * 2) == 0) { - return SM100_TMEM_LOAD_32dp32b2x{}; - } - else if constexpr (bits_n % (32 * 1) == 0) { - return SM100_TMEM_LOAD_32dp32b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (32 * 128) == 0) { - return SM100_TMEM_LOAD_32dp32b128x_16b{}; - } - else if constexpr (bits_n % (32 * 64) == 0) { - return SM100_TMEM_LOAD_32dp32b64x_16b{}; - } - else if constexpr (bits_n % (32 * 32) == 0) { - return SM100_TMEM_LOAD_32dp32b32x_16b{}; - } - else if constexpr (bits_n % (32 * 16) == 0) { - return SM100_TMEM_LOAD_32dp32b16x_16b{}; - } - else if constexpr (bits_n % (32 * 8) == 0) { - return SM100_TMEM_LOAD_32dp32b8x_16b{}; - } - else if constexpr (bits_n % (32 * 4) == 0) { - return 
SM100_TMEM_LOAD_32dp32b4x_16b{}; - } - else if constexpr (bits_n % (32 * 2) == 0) { - return SM100_TMEM_LOAD_32dp32b2x_16b{}; - } - else if constexpr (bits_n % (32 * 1) == 0) { - return SM100_TMEM_LOAD_32dp32b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (256 * 32) == 0) { - return SM100_TMEM_STORE_16dp256b32x{}; - } - else if constexpr (bits_n % (256 * 16) == 0) { - return SM100_TMEM_STORE_16dp256b16x{}; - } - else if constexpr (bits_n % (256 * 8) == 0) { - return SM100_TMEM_STORE_16dp256b8x{}; - } - else if constexpr (bits_n % (256 * 4) == 0) { - return SM100_TMEM_STORE_16dp256b4x{}; - } - else if constexpr (bits_n % (256 * 2) == 0) { - return SM100_TMEM_STORE_16dp256b2x{}; - } - else if constexpr (bits_n % (256 * 1) == 0) { - return SM100_TMEM_STORE_16dp256b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (256 * 32) == 0) { - return SM100_TMEM_STORE_16dp256b32x_16b{}; - } - else if constexpr (bits_n % (256 * 16) == 0) { - return SM100_TMEM_STORE_16dp256b16x_16b{}; - } - else if constexpr (bits_n % (256 * 8) == 0) { - return SM100_TMEM_STORE_16dp256b8x_16b{}; - } - else if constexpr (bits_n % (256 * 4) == 0) { - return SM100_TMEM_STORE_16dp256b4x_16b{}; - } - else if constexpr (bits_n % (256 * 2) == 0) { - return SM100_TMEM_STORE_16dp256b2x_16b{}; - } - else if constexpr (bits_n % (256 * 1) == 0) { - return SM100_TMEM_STORE_16dp256b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (128 * 64) == 0) { - return SM100_TMEM_STORE_16dp128b64x{}; - } - else if constexpr (bits_n % (128 * 32) == 0) { - return SM100_TMEM_STORE_16dp128b32x{}; - } - else if constexpr (bits_n % (128 * 16) == 0) { - return SM100_TMEM_STORE_16dp128b16x{}; - } - else if constexpr (bits_n % (128 * 8) == 0) { - return SM100_TMEM_STORE_16dp128b8x{}; - } - else if constexpr (bits_n % (128 * 4) == 0) { - return SM100_TMEM_STORE_16dp128b4x{}; - } - else if constexpr (bits_n % (128 * 2) == 0) { - return SM100_TMEM_STORE_16dp128b2x{}; - } - else if constexpr (bits_n % (128 * 1) == 0) { - return SM100_TMEM_STORE_16dp128b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (128 * 64) == 0) { - return SM100_TMEM_STORE_16dp128b64x_16b{}; - } - else if constexpr (bits_n % (128 * 32) == 0) { - return SM100_TMEM_STORE_16dp128b32x_16b{}; - } - else if constexpr (bits_n % (128 * 16) == 0) { - return SM100_TMEM_STORE_16dp128b16x_16b{}; - } - else if constexpr (bits_n % (128 * 8) == 0) { - return SM100_TMEM_STORE_16dp128b8x_16b{}; - } - else if constexpr (bits_n % (128 * 4) == 0) { - return SM100_TMEM_STORE_16dp128b4x_16b{}; - } - else if constexpr (bits_n % (128 * 2) == 0) { - return SM100_TMEM_STORE_16dp128b2x_16b{}; - } - else if constexpr (bits_n % (128 * 1) == 0) { - return SM100_TMEM_STORE_16dp128b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_STORE_16dp64b128x{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_STORE_16dp64b64x{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_STORE_16dp64b32x{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_STORE_16dp64b16x{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_STORE_16dp64b8x{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_STORE_16dp64b4x{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_STORE_16dp64b2x{}; - } - else if constexpr (bits_n % 
(64 * 1) == 0) { - return SM100_TMEM_STORE_16dp64b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_STORE_16dp64b128x_16b{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_STORE_16dp64b64x_16b{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_STORE_16dp64b32x_16b{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_STORE_16dp64b16x_16b{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_STORE_16dp64b8x_16b{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_STORE_16dp64b4x_16b{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_STORE_16dp64b2x_16b{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_STORE_16dp64b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_STORE_16dp32b128x{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_STORE_16dp32b64x{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_STORE_16dp32b32x{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_STORE_16dp32b16x{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_STORE_16dp32b8x{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_STORE_16dp32b4x{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_STORE_16dp32b2x{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_STORE_16dp32b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (64 * 128) == 0) { - return SM100_TMEM_STORE_16dp32b128x_16b{}; - } - else if constexpr (bits_n % (64 * 64) == 0) { - return SM100_TMEM_STORE_16dp32b64x_16b{}; - } - else if constexpr (bits_n % (64 * 32) == 0) { - return SM100_TMEM_STORE_16dp32b32x_16b{}; - } - else if constexpr (bits_n % (64 * 16) == 0) { - return SM100_TMEM_STORE_16dp32b16x_16b{}; - } - else if constexpr (bits_n % (64 * 8) == 0) { - return SM100_TMEM_STORE_16dp32b8x_16b{}; - } - else if constexpr (bits_n % (64 * 4) == 0) { - return SM100_TMEM_STORE_16dp32b4x_16b{}; - } - else if constexpr (bits_n % (64 * 2) == 0) { - return SM100_TMEM_STORE_16dp32b2x_16b{}; - } - else if constexpr (bits_n % (64 * 1) == 0) { - return SM100_TMEM_STORE_16dp32b1x_16b{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (32 * 128) == 0) { - return SM100_TMEM_STORE_32dp32b128x{}; - } - else if constexpr (bits_n % (32 * 64) == 0) { - return SM100_TMEM_STORE_32dp32b64x{}; - } - else if constexpr (bits_n % (32 * 32) == 0) { - return SM100_TMEM_STORE_32dp32b32x{}; - } - else if constexpr (bits_n % (32 * 16) == 0) { - return SM100_TMEM_STORE_32dp32b16x{}; - } - else if constexpr (bits_n % (32 * 8) == 0) { - return SM100_TMEM_STORE_32dp32b8x{}; - } - else if constexpr (bits_n % (32 * 4) == 0) { - return SM100_TMEM_STORE_32dp32b4x{}; - } - else if constexpr (bits_n % (32 * 2) == 0) { - return SM100_TMEM_STORE_32dp32b2x{}; - } - else if constexpr (bits_n % (32 * 1) == 0) { - return SM100_TMEM_STORE_32dp32b1x{}; - } - } - else if constexpr (cute::is_same_v) { - if constexpr (bits_n % (32 * 128) == 0) { - return SM100_TMEM_STORE_32dp32b128x_16b{}; - } - else if constexpr (bits_n % (32 * 64) == 0) { - return SM100_TMEM_STORE_32dp32b64x_16b{}; - } - else if constexpr (bits_n % (32 * 32) == 0) { - return SM100_TMEM_STORE_32dp32b32x_16b{}; - } - else if constexpr 
(bits_n % (32 * 16) == 0) { - return SM100_TMEM_STORE_32dp32b16x_16b{}; - } - else if constexpr (bits_n % (32 * 8) == 0) { - return SM100_TMEM_STORE_32dp32b8x_16b{}; - } - else if constexpr (bits_n % (32 * 4) == 0) { - return SM100_TMEM_STORE_32dp32b4x_16b{}; - } - else if constexpr (bits_n % (32 * 2) == 0) { - return SM100_TMEM_STORE_32dp32b2x_16b{}; - } - else if constexpr (bits_n % (32 * 1) == 0) { - return SM100_TMEM_STORE_32dp32b1x_16b{}; - } - } - else { - static_assert(dependent_false, "Must pass 1x tmem copy operator"); - } -} - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// Select TMEM store corresponding to the provided TMEM load -template -CUTE_HOST_DEVICE constexpr auto -tmem_load_to_store(CopyOp) { - if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b1x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b1x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b2x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b2x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b4x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b4x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b8x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b8x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b16x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b16x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b32x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp256b32x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b1x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b1x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b2x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b2x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b4x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b4x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b8x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b8x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b16x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b16x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b32x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b32x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b64x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp128b64x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b1x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b1x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b2x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b2x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b4x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b4x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b8x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b8x_16b{}; - } - else if constexpr (is_same_v) { - return 
SM100_TMEM_STORE_16dp64b16x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b16x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b32x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b32x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b64x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b64x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b128x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp64b128x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b1x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b1x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b2x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b2x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b4x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b4x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b8x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b8x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b16x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b16x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b32x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b32x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b64x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b64x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b128x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_16dp32b128x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b1x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b1x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b2x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b2x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b4x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b4x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b8x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b8x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b16x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b16x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b32x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b32x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b64x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b64x_16b{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b128x{}; - } - else if constexpr (is_same_v) { - return SM100_TMEM_STORE_32dp32b128x_16b{}; - } - else { - static_assert(dependent_false, "No TMEM_STORE matching for provided TMEM_LOAD"); - } -} - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -} // namespace TMEM - -//////////////////////////////////////////////////////////////////////////////////////////////////// - 
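A worked instance of the selection rule in the op_repeater / tmem_load_to_store helpers deleted above (this patch removes them from this header; where they move is not visible in these hunks). The arithmetic is taken directly from the removed code:

// For a 16dp256b TMEM_LOAD covering 2048 bits of the N-mode, the widest repeat
// whose footprint divides 2048 is the 8x variant:
static_assert(2048 % (256 * 16) != 0, "16x footprint (4096b) does not divide 2048b");
static_assert(2048 % (256 *  8) == 0, "8x footprint (2048b) divides 2048b");
// so op_repeater<SM100_TMEM_LOAD_16dp256b1x, 2048>() returns SM100_TMEM_LOAD_16dp256b8x{},
// and tmem_load_to_store(SM100_TMEM_LOAD_16dp256b8x{}) pairs it with SM100_TMEM_STORE_16dp256b8x{}.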
-//////////////////////////////////////////////////////////////////////////////////////////////////// -// -// TMEM_LOAD Copy Traits -// -//////////////////////////////////////////////////////////////////////////////////////////////////// - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct Copy_Traits - : TMEM_LOAD_Unpack -{ - // Logical thread id to thread idx (warp) - using ThrID = Layout<_32>; - // Logical bit id to bit idx (address) - using ValID = Layout, - Stride< _1,TMEM::DP_b>>; - // Map from (src-thr,src-val) to bit - using SrcLayout = Layout, - Stride< _0, _1>>; - // Map from (dst-thr,dst-val) to bit - using DstLayout = Layout,Shape <_64, _2>>, - Stride,Stride< _1,_2048>>>; - // Reference map from (thr,val) to bit - using RefLayout = SrcLayout; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct Copy_Traits - : TMEM_LOAD_Unpack -{ - using ThrID = Layout<_32>; - using ValID = Layout, _16>, - Stride,TMEM::DP_b>>; - using SrcLayout = Layout, - Stride< _0, _1>>; - using DstLayout = Layout,Shape <_64, _2>>, - Stride,Stride< _1,_2048>>>; - using RefLayout = SrcLayout; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct Copy_Traits - : TMEM_LOAD_Unpack -{ - using ThrID = Layout<_32>; - using ValID = Layout, - Stride< _1,TMEM::DP_b>>; - using SrcLayout = Layout, - Stride< _0, _1>>; - using DstLayout = Layout,Shape <_64, _2, _2>>, - Stride,Stride< _1,_4096,_256>>>; - using RefLayout = SrcLayout; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct Copy_Traits - : TMEM_LOAD_Unpack +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b2x_16b; + +template <> +struct Copy_Traits { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1242,9 +504,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b4x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1258,9 +521,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b4x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1274,9 +538,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b8x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1290,9 +555,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b8x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1306,9 +572,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b16x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1322,9 +589,10 @@ struct Copy_Traits 
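A note on the restructuring visible in these hunks: copy_unpack now lives in the namespace of each CopyOp family (SM100::TMEM::LOAD / ::STORE) instead of being inherited from the TMEM_LOAD_Unpack / TMEM_STORE_Unpack base classes, so the generic copy path can presumably still reach it through argument-dependent lookup on the CopyOp template argument. A self-contained toy (names are placeholders, not CUTLASS APIs) showing that lookup mechanism:

#include <cstdio>

namespace lib {
  template <class Op> struct Traits {};          // stand-in for cute::Copy_Traits<Op>
}
namespace lib::ops {
  struct LoadOp {};                              // stand-in for an SM100::TMEM::LOAD op
  template <class Op>
  void copy_unpack(lib::Traits<Op> const&) {     // found by ADL via the Op template argument
    std::puts("ops::copy_unpack selected");
  }
}

int main() {
  lib::Traits<lib::ops::LoadOp> t;
  copy_unpack(t);                                // unqualified call; ADL finds lib::ops::copy_unpack
}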
//////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b16x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1338,9 +606,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b32x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1354,9 +623,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp256b32x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1370,9 +640,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b1x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1386,9 +657,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b1x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1402,9 +674,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b2x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1418,9 +691,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b2x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1434,9 +708,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b4x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1450,9 +725,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b4x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1466,9 +742,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b8x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1482,9 +759,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b8x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1498,9 +776,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b16x; + template <> struct 
Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1514,9 +793,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b16x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1530,9 +810,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b32x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1546,9 +827,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b32x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1562,9 +844,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b64x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1578,9 +861,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp128b64x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1594,9 +878,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b1x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1610,9 +895,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b1x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1626,9 +912,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b2x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1642,9 +929,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b2x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1658,9 +946,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b4x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1674,9 +963,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b4x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1690,9 +980,10 @@ struct Copy_Traits 
//////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b8x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1706,9 +997,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b8x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1722,9 +1014,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b16x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1738,9 +1031,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b16x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1754,9 +1048,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b32x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1770,9 +1065,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b32x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1786,9 +1082,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b64x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1802,9 +1099,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b64x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1818,9 +1116,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b128x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1834,9 +1133,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp64b128x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1850,9 +1150,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b1x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1866,9 +1167,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b1x_16b; + template <> struct 
Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1882,9 +1184,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b2x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1898,9 +1201,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b2x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1914,9 +1218,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b4x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1930,9 +1235,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b4x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1946,9 +1252,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b8x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1962,9 +1269,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b8x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -1978,9 +1286,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b16x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -1994,9 +1303,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b16x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -2010,9 +1320,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b32x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2026,9 +1337,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b32x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -2042,9 +1354,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b64x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2058,9 +1371,10 @@ struct Copy_Traits 
//////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b64x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -2074,9 +1388,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b128x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2090,9 +1405,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_16dp32b128x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _16>, @@ -2106,9 +1422,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b1x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2122,9 +1439,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b1x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2138,9 +1456,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b2x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2154,9 +1473,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b2x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2170,9 +1490,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b4x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2186,9 +1507,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b4x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2202,9 +1524,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b8x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2218,9 +1541,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b8x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2234,9 +1558,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b16x; + template <> struct 
Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2250,9 +1575,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b16x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2266,9 +1592,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b32x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2281,9 +1608,11 @@ struct Copy_Traits }; //////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b32x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2297,9 +1626,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2313,9 +1643,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2329,9 +1660,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b128x; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, @@ -2344,9 +1676,11 @@ struct Copy_Traits }; //////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b128x_16b; + template <> struct Copy_Traits - : TMEM_LOAD_Unpack { using ThrID = Layout<_32>; using ValID = Layout, _32>, @@ -2368,9 +1702,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b1x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2381,9 +1716,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b1x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2394,9 +1730,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b2x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2407,9 +1744,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b2x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = 
typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2420,9 +1758,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b4x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2433,9 +1772,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b4x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2446,9 +1786,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b8x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2459,9 +1800,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b8x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2472,9 +1814,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b16x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2485,9 +1828,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b16x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2498,9 +1842,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b32x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2511,9 +1856,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp256b32x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2524,9 +1870,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b1x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2537,9 +1884,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b1x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2550,9 +1898,10 @@ struct 
Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b2x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2563,9 +1912,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b2x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2576,9 +1926,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b4x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2589,9 +1940,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b4x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2602,9 +1954,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b8x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2615,9 +1968,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b8x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2628,9 +1982,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b16x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2641,9 +1996,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b16x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2654,9 +2010,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b32x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2667,9 +2024,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b32x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2680,9 +2038,10 @@ struct Copy_Traits 
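The TMEM_STORE specializations in these hunks show only their ThrID/ValID aliases (the template arguments were lost in this view). As a hedged reconstruction of the visible pattern, a store trait appears to reuse its matching load trait's maps with the source/destination roles exchanged, roughly:

// Hedged sketch; the SrcLayout/DstLayout/RefLayout members below are an assumption
// inferred from the LOAD traits above, not text shown in these hunks.
template <>
struct Copy_Traits<SM100_TMEM_STORE_16dp256b1x>
{
  using LoadTraits = Copy_Traits<SM100_TMEM_LOAD_16dp256b1x>;
  using ThrID      = typename LoadTraits::ThrID;       // these two aliases are visible in the hunk
  using ValID      = typename LoadTraits::ValID;
  using SrcLayout  = typename LoadTraits::DstLayout;   // assumed: roles swapped relative to the load
  using DstLayout  = typename LoadTraits::SrcLayout;
  using RefLayout  = SrcLayout;                        // assumed
};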
//////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b64x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2693,9 +2052,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp128b64x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2706,9 +2066,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b1x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2719,9 +2080,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b1x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2732,9 +2094,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b2x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2745,9 +2108,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b2x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2758,9 +2122,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b4x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2771,9 +2136,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b4x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2784,9 +2150,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b8x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2797,9 +2164,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b8x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2810,9 +2178,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using 
SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b16x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2823,9 +2192,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b16x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2836,9 +2206,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b32x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2849,9 +2220,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b32x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2862,9 +2234,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b64x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2875,9 +2248,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b64x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2888,9 +2262,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b128x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2901,9 +2276,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp64b128x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2914,9 +2290,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b1x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2927,9 +2304,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b1x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2940,9 +2318,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b2x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using 
ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2953,9 +2332,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b2x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2966,9 +2346,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b4x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2979,9 +2360,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b4x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -2992,9 +2374,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b8x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3005,9 +2388,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b8x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3018,9 +2402,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b16x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3031,9 +2416,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b16x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3044,9 +2430,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b32x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3057,9 +2444,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b32x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3070,9 +2458,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b64x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3083,9 +2472,10 @@ struct 
Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b64x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3096,9 +2486,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b128x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3109,9 +2500,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_16dp32b128x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3122,9 +2514,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b1x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3135,9 +2528,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b1x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3148,9 +2542,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b2x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3161,9 +2556,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b2x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3174,9 +2570,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b4x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3187,9 +2584,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b4x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3200,9 +2598,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b8x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3213,9 +2612,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using 
SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b8x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3226,9 +2626,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b16x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3239,9 +2640,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b16x_16b; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3252,9 +2654,10 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b32x; + template <> struct Copy_Traits - : TMEM_STORE_Unpack { using ThrID = typename Copy_Traits::ThrID; using ValID = typename Copy_Traits::ValID; @@ -3265,76 +2668,841 @@ struct Copy_Traits //////////////////////////////////////////////////////////////////////////////////////////////////// -template <> -struct Copy_Traits - : TMEM_STORE_Unpack -{ - using ThrID = typename Copy_Traits::ThrID; - using ValID = typename Copy_Traits::ValID; - using SrcLayout = typename Copy_Traits::DstLayout; - using DstLayout = typename Copy_Traits::SrcLayout; - using RefLayout = typename Copy_Traits::RefLayout; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b32x_16b; + +template <> +struct Copy_Traits +{ + using ThrID = typename Copy_Traits::ThrID; + using ValID = typename Copy_Traits::ValID; + using SrcLayout = typename Copy_Traits::DstLayout; + using DstLayout = typename Copy_Traits::SrcLayout; + using RefLayout = typename Copy_Traits::RefLayout; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b64x; + +template <> +struct Copy_Traits +{ + using ThrID = typename Copy_Traits::ThrID; + using ValID = typename Copy_Traits::ValID; + using SrcLayout = typename Copy_Traits::DstLayout; + using DstLayout = typename Copy_Traits::SrcLayout; + using RefLayout = typename Copy_Traits::RefLayout; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b64x_16b; + +template <> +struct Copy_Traits +{ + using ThrID = typename Copy_Traits::ThrID; + using ValID = typename Copy_Traits::ValID; + using SrcLayout = typename Copy_Traits::DstLayout; + using DstLayout = typename Copy_Traits::SrcLayout; + using RefLayout = typename Copy_Traits::RefLayout; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b128x; + +template <> +struct Copy_Traits +{ + using ThrID = typename Copy_Traits::ThrID; + using ValID = typename Copy_Traits::ValID; + using SrcLayout = typename Copy_Traits::DstLayout; + using DstLayout = typename Copy_Traits::SrcLayout; + using RefLayout = typename Copy_Traits::RefLayout; +}; + 
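Illustrative note, not part of the patch: every TMEM_STORE specialization in this hunk follows one pattern, which the flattened rendering obscures because the template arguments of Copy_Traits were lost. Each store trait reuses the ThrID and ValID of the matching TMEM_LOAD trait and swaps that trait's source and destination layouts, since a register-to-tmem store is the inverse direction of the corresponding tmem-to-register load. A minimal sketch of the pattern follows, with the op name and the LOAD-side namespace written out as assumptions rather than quoted from the patch:

    // Illustrative sketch only; mirrors the specializations above.
    template <>
    struct Copy_Traits<SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b32x>
    {
      // Reuse the traits of the paired LOAD op (assumed to live in SM100::TMEM::LOAD).
      using Load      = Copy_Traits<SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b32x>;
      using ThrID     = typename Load::ThrID;
      using ValID     = typename Load::ValID;
      using SrcLayout = typename Load::DstLayout;  // store source: register fragment
      using DstLayout = typename Load::SrcLayout;  // store destination: tmem
      using RefLayout = typename Load::RefLayout;
    };

In device code these atoms are typically wrapped into a TiledCopy by the tmem copy helpers defined elsewhere in this header and driven with cute::copy; those entry points are outside this hunk.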
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::STORE::SM100_TMEM_STORE_32dp32b128x_16b; + +template <> +struct Copy_Traits +{ + using ThrID = typename Copy_Traits::ThrID; + using ValID = typename Copy_Traits::ValID; + using SrcLayout = typename Copy_Traits::DstLayout; + using DstLayout = typename Copy_Traits::SrcLayout; + using RefLayout = typename Copy_Traits::RefLayout; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +namespace TMEM { + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// Given a 1x tmem copy op, returns the widest repeated variant that divides the specified bits in the N-mode +template +CUTE_HOST_DEVICE constexpr +auto +op_repeater() +{ + if constexpr (cute::is_same_v) { + if constexpr (bits_n % (256 * 32) == 0) { + return SM100_TMEM_LOAD_16dp256b32x{}; + } + else if constexpr (bits_n % (256 * 16) == 0) { + return SM100_TMEM_LOAD_16dp256b16x{}; + } + else if constexpr (bits_n % (256 * 8) == 0) { + return SM100_TMEM_LOAD_16dp256b8x{}; + } + else if constexpr (bits_n % (256 * 4) == 0) { + return SM100_TMEM_LOAD_16dp256b4x{}; + } + else if constexpr (bits_n % (256 * 2) == 0) { + return SM100_TMEM_LOAD_16dp256b2x{}; + } + else if constexpr (bits_n % (256 * 1) == 0) { + return SM100_TMEM_LOAD_16dp256b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (256 * 32) == 0) { + return SM100_TMEM_LOAD_16dp256b32x_16b{}; + } + else if constexpr (bits_n % (256 * 16) == 0) { + return SM100_TMEM_LOAD_16dp256b16x_16b{}; + } + else if constexpr (bits_n % (256 * 8) == 0) { + return SM100_TMEM_LOAD_16dp256b8x_16b{}; + } + else if constexpr (bits_n % (256 * 4) == 0) { + return SM100_TMEM_LOAD_16dp256b4x_16b{}; + } + else if constexpr (bits_n % (256 * 2) == 0) { + return SM100_TMEM_LOAD_16dp256b2x_16b{}; + } + else if constexpr (bits_n % (256 * 1) == 0) { + return SM100_TMEM_LOAD_16dp256b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (128 * 64) == 0) { + return SM100_TMEM_LOAD_16dp128b64x{}; + } + else if constexpr (bits_n % (128 * 32) == 0) { + return SM100_TMEM_LOAD_16dp128b32x{}; + } + else if constexpr (bits_n % (128 * 16) == 0) { + return SM100_TMEM_LOAD_16dp128b16x{}; + } + else if constexpr (bits_n % (128 * 8) == 0) { + return SM100_TMEM_LOAD_16dp128b8x{}; + } + else if constexpr (bits_n % (128 * 4) == 0) { + return SM100_TMEM_LOAD_16dp128b4x{}; + } + else if constexpr (bits_n % (128 * 2) == 0) { + return SM100_TMEM_LOAD_16dp128b2x{}; + } + else if constexpr (bits_n % (128 * 1) == 0) { + return SM100_TMEM_LOAD_16dp128b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (128 * 64) == 0) { + return SM100_TMEM_LOAD_16dp128b64x_16b{}; + } + else if constexpr (bits_n % (128 * 32) == 0) { + return SM100_TMEM_LOAD_16dp128b32x_16b{}; + } + else if constexpr (bits_n % (128 * 16) == 0) { + return SM100_TMEM_LOAD_16dp128b16x_16b{}; + } + else if constexpr (bits_n % (128 * 8) == 0) { + return SM100_TMEM_LOAD_16dp128b8x_16b{}; + } + else if constexpr (bits_n % (128 * 4) == 0) { + return SM100_TMEM_LOAD_16dp128b4x_16b{}; + } + else if constexpr (bits_n % (128 * 2) == 0) { + return SM100_TMEM_LOAD_16dp128b2x_16b{}; + } + else if constexpr (bits_n % (128 * 1) == 0) { + return SM100_TMEM_LOAD_16dp128b1x_16b{}; + } + } + 
else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_LOAD_16dp64b128x{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_LOAD_16dp64b64x{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_LOAD_16dp64b32x{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_LOAD_16dp64b16x{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_LOAD_16dp64b8x{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_LOAD_16dp64b4x{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_LOAD_16dp64b2x{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_LOAD_16dp64b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_LOAD_16dp64b128x_16b{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_LOAD_16dp64b64x_16b{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_LOAD_16dp64b32x_16b{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_LOAD_16dp64b16x_16b{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_LOAD_16dp64b8x_16b{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_LOAD_16dp64b4x_16b{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_LOAD_16dp64b2x_16b{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_LOAD_16dp64b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_LOAD_16dp32b128x{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_LOAD_16dp32b64x{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_LOAD_16dp32b32x{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_LOAD_16dp32b16x{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_LOAD_16dp32b8x{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_LOAD_16dp32b4x{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_LOAD_16dp32b2x{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_LOAD_16dp32b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_LOAD_16dp32b128x_16b{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_LOAD_16dp32b64x_16b{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_LOAD_16dp32b32x_16b{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_LOAD_16dp32b16x_16b{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_LOAD_16dp32b8x_16b{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_LOAD_16dp32b4x_16b{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_LOAD_16dp32b2x_16b{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_LOAD_16dp32b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (32 * 128) == 0) { + return SM100_TMEM_LOAD_32dp32b128x{}; + } + else if constexpr (bits_n % (32 * 64) == 0) { + return SM100_TMEM_LOAD_32dp32b64x{}; + } + else if constexpr (bits_n % (32 * 32) == 0) { + return SM100_TMEM_LOAD_32dp32b32x{}; + } + else if constexpr (bits_n % (32 * 16) == 0) { + return SM100_TMEM_LOAD_32dp32b16x{}; + } + else if constexpr (bits_n % (32 * 8) == 0) { 
+ return SM100_TMEM_LOAD_32dp32b8x{}; + } + else if constexpr (bits_n % (32 * 4) == 0) { + return SM100_TMEM_LOAD_32dp32b4x{}; + } + else if constexpr (bits_n % (32 * 2) == 0) { + return SM100_TMEM_LOAD_32dp32b2x{}; + } + else if constexpr (bits_n % (32 * 1) == 0) { + return SM100_TMEM_LOAD_32dp32b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (32 * 128) == 0) { + return SM100_TMEM_LOAD_32dp32b128x_16b{}; + } + else if constexpr (bits_n % (32 * 64) == 0) { + return SM100_TMEM_LOAD_32dp32b64x_16b{}; + } + else if constexpr (bits_n % (32 * 32) == 0) { + return SM100_TMEM_LOAD_32dp32b32x_16b{}; + } + else if constexpr (bits_n % (32 * 16) == 0) { + return SM100_TMEM_LOAD_32dp32b16x_16b{}; + } + else if constexpr (bits_n % (32 * 8) == 0) { + return SM100_TMEM_LOAD_32dp32b8x_16b{}; + } + else if constexpr (bits_n % (32 * 4) == 0) { + return SM100_TMEM_LOAD_32dp32b4x_16b{}; + } + else if constexpr (bits_n % (32 * 2) == 0) { + return SM100_TMEM_LOAD_32dp32b2x_16b{}; + } + else if constexpr (bits_n % (32 * 1) == 0) { + return SM100_TMEM_LOAD_32dp32b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (256 * 32) == 0) { + return SM100_TMEM_STORE_16dp256b32x{}; + } + else if constexpr (bits_n % (256 * 16) == 0) { + return SM100_TMEM_STORE_16dp256b16x{}; + } + else if constexpr (bits_n % (256 * 8) == 0) { + return SM100_TMEM_STORE_16dp256b8x{}; + } + else if constexpr (bits_n % (256 * 4) == 0) { + return SM100_TMEM_STORE_16dp256b4x{}; + } + else if constexpr (bits_n % (256 * 2) == 0) { + return SM100_TMEM_STORE_16dp256b2x{}; + } + else if constexpr (bits_n % (256 * 1) == 0) { + return SM100_TMEM_STORE_16dp256b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (256 * 32) == 0) { + return SM100_TMEM_STORE_16dp256b32x_16b{}; + } + else if constexpr (bits_n % (256 * 16) == 0) { + return SM100_TMEM_STORE_16dp256b16x_16b{}; + } + else if constexpr (bits_n % (256 * 8) == 0) { + return SM100_TMEM_STORE_16dp256b8x_16b{}; + } + else if constexpr (bits_n % (256 * 4) == 0) { + return SM100_TMEM_STORE_16dp256b4x_16b{}; + } + else if constexpr (bits_n % (256 * 2) == 0) { + return SM100_TMEM_STORE_16dp256b2x_16b{}; + } + else if constexpr (bits_n % (256 * 1) == 0) { + return SM100_TMEM_STORE_16dp256b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (128 * 64) == 0) { + return SM100_TMEM_STORE_16dp128b64x{}; + } + else if constexpr (bits_n % (128 * 32) == 0) { + return SM100_TMEM_STORE_16dp128b32x{}; + } + else if constexpr (bits_n % (128 * 16) == 0) { + return SM100_TMEM_STORE_16dp128b16x{}; + } + else if constexpr (bits_n % (128 * 8) == 0) { + return SM100_TMEM_STORE_16dp128b8x{}; + } + else if constexpr (bits_n % (128 * 4) == 0) { + return SM100_TMEM_STORE_16dp128b4x{}; + } + else if constexpr (bits_n % (128 * 2) == 0) { + return SM100_TMEM_STORE_16dp128b2x{}; + } + else if constexpr (bits_n % (128 * 1) == 0) { + return SM100_TMEM_STORE_16dp128b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (128 * 64) == 0) { + return SM100_TMEM_STORE_16dp128b64x_16b{}; + } + else if constexpr (bits_n % (128 * 32) == 0) { + return SM100_TMEM_STORE_16dp128b32x_16b{}; + } + else if constexpr (bits_n % (128 * 16) == 0) { + return SM100_TMEM_STORE_16dp128b16x_16b{}; + } + else if constexpr (bits_n % (128 * 8) == 0) { + return SM100_TMEM_STORE_16dp128b8x_16b{}; + } + else if constexpr (bits_n % (128 * 4) == 0) { + return SM100_TMEM_STORE_16dp128b4x_16b{}; + } + else if constexpr 
(bits_n % (128 * 2) == 0) { + return SM100_TMEM_STORE_16dp128b2x_16b{}; + } + else if constexpr (bits_n % (128 * 1) == 0) { + return SM100_TMEM_STORE_16dp128b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_STORE_16dp64b128x{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_STORE_16dp64b64x{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_STORE_16dp64b32x{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_STORE_16dp64b16x{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_STORE_16dp64b8x{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_STORE_16dp64b4x{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_STORE_16dp64b2x{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_STORE_16dp64b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_STORE_16dp64b128x_16b{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_STORE_16dp64b64x_16b{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_STORE_16dp64b32x_16b{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_STORE_16dp64b16x_16b{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_STORE_16dp64b8x_16b{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_STORE_16dp64b4x_16b{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_STORE_16dp64b2x_16b{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_STORE_16dp64b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_STORE_16dp32b128x{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_STORE_16dp32b64x{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_STORE_16dp32b32x{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_STORE_16dp32b16x{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_STORE_16dp32b8x{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_STORE_16dp32b4x{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_STORE_16dp32b2x{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_STORE_16dp32b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (64 * 128) == 0) { + return SM100_TMEM_STORE_16dp32b128x_16b{}; + } + else if constexpr (bits_n % (64 * 64) == 0) { + return SM100_TMEM_STORE_16dp32b64x_16b{}; + } + else if constexpr (bits_n % (64 * 32) == 0) { + return SM100_TMEM_STORE_16dp32b32x_16b{}; + } + else if constexpr (bits_n % (64 * 16) == 0) { + return SM100_TMEM_STORE_16dp32b16x_16b{}; + } + else if constexpr (bits_n % (64 * 8) == 0) { + return SM100_TMEM_STORE_16dp32b8x_16b{}; + } + else if constexpr (bits_n % (64 * 4) == 0) { + return SM100_TMEM_STORE_16dp32b4x_16b{}; + } + else if constexpr (bits_n % (64 * 2) == 0) { + return SM100_TMEM_STORE_16dp32b2x_16b{}; + } + else if constexpr (bits_n % (64 * 1) == 0) { + return SM100_TMEM_STORE_16dp32b1x_16b{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (32 * 128) == 0) { + return SM100_TMEM_STORE_32dp32b128x{}; + } + else if constexpr (bits_n % (32 * 64) == 0) { + return SM100_TMEM_STORE_32dp32b64x{}; + } + else if 
constexpr (bits_n % (32 * 32) == 0) { + return SM100_TMEM_STORE_32dp32b32x{}; + } + else if constexpr (bits_n % (32 * 16) == 0) { + return SM100_TMEM_STORE_32dp32b16x{}; + } + else if constexpr (bits_n % (32 * 8) == 0) { + return SM100_TMEM_STORE_32dp32b8x{}; + } + else if constexpr (bits_n % (32 * 4) == 0) { + return SM100_TMEM_STORE_32dp32b4x{}; + } + else if constexpr (bits_n % (32 * 2) == 0) { + return SM100_TMEM_STORE_32dp32b2x{}; + } + else if constexpr (bits_n % (32 * 1) == 0) { + return SM100_TMEM_STORE_32dp32b1x{}; + } + } + else if constexpr (cute::is_same_v) { + if constexpr (bits_n % (32 * 128) == 0) { + return SM100_TMEM_STORE_32dp32b128x_16b{}; + } + else if constexpr (bits_n % (32 * 64) == 0) { + return SM100_TMEM_STORE_32dp32b64x_16b{}; + } + else if constexpr (bits_n % (32 * 32) == 0) { + return SM100_TMEM_STORE_32dp32b32x_16b{}; + } + else if constexpr (bits_n % (32 * 16) == 0) { + return SM100_TMEM_STORE_32dp32b16x_16b{}; + } + else if constexpr (bits_n % (32 * 8) == 0) { + return SM100_TMEM_STORE_32dp32b8x_16b{}; + } + else if constexpr (bits_n % (32 * 4) == 0) { + return SM100_TMEM_STORE_32dp32b4x_16b{}; + } + else if constexpr (bits_n % (32 * 2) == 0) { + return SM100_TMEM_STORE_32dp32b2x_16b{}; + } + else if constexpr (bits_n % (32 * 1) == 0) { + return SM100_TMEM_STORE_32dp32b1x_16b{}; + } + } + else { + static_assert(dependent_false, "Must pass 1x tmem copy operator"); + } +} + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// Select TMEM store corresponding to the provided TMEM load +template +CUTE_HOST_DEVICE constexpr auto +tmem_load_to_store(CopyOp) { + if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b1x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b1x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b2x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b2x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b4x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b4x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b8x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b8x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b16x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b16x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b32x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp256b32x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b1x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b1x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b2x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b2x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b4x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b4x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b8x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b8x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b16x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b16x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b32x{}; + } + else if constexpr (is_same_v) { + return 
SM100_TMEM_STORE_16dp128b32x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b64x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp128b64x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b1x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b1x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b2x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b2x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b4x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b4x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b8x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b8x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b16x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b16x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b32x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b32x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b64x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b64x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b128x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp64b128x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b1x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b1x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b2x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b2x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b4x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b4x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b8x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b8x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b16x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b16x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b32x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b32x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b64x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b64x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b128x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_16dp32b128x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b1x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b1x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b2x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b2x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b4x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b4x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b8x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b8x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b16x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b16x_16b{}; + } + else if 
constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b32x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b32x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b64x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b64x_16b{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b128x{}; + } + else if constexpr (is_same_v) { + return SM100_TMEM_STORE_32dp32b128x_16b{}; + } + else { + static_assert(dependent_false, "No TMEM_STORE matching for provided TMEM_LOAD"); + } +} -template <> -struct Copy_Traits - : TMEM_STORE_Unpack -{ - using ThrID = typename Copy_Traits::ThrID; - using ValID = typename Copy_Traits::ValID; - using SrcLayout = typename Copy_Traits::DstLayout; - using DstLayout = typename Copy_Traits::SrcLayout; - using RefLayout = typename Copy_Traits::RefLayout; -}; +} // namespace TMEM //////////////////////////////////////////////////////////////////////////////////////////////////// -template <> -struct Copy_Traits - : TMEM_STORE_Unpack -{ - using ThrID = typename Copy_Traits::ThrID; - using ValID = typename Copy_Traits::ValID; - using SrcLayout = typename Copy_Traits::DstLayout; - using DstLayout = typename Copy_Traits::SrcLayout; - using RefLayout = typename Copy_Traits::RefLayout; -}; - //////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct Copy_Traits - : TMEM_STORE_Unpack -{ - using ThrID = typename Copy_Traits::ThrID; - using ValID = typename Copy_Traits::ValID; - using SrcLayout = typename Copy_Traits::DstLayout; - using DstLayout = typename Copy_Traits::SrcLayout; - using RefLayout = typename Copy_Traits::RefLayout; -}; - +// +// UTCCP Copy Traits +// //////////////////////////////////////////////////////////////////////////////////////////////////// -template <> -struct Copy_Traits - : TMEM_STORE_Unpack -{ - using ThrID = typename Copy_Traits::ThrID; - using ValID = typename Copy_Traits::ValID; - using SrcLayout = typename Copy_Traits::DstLayout; - using DstLayout = typename Copy_Traits::SrcLayout; - using RefLayout = typename Copy_Traits::RefLayout; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// +namespace SM100::TMEM::UTCCP { -//////////////////////////////////////////////////////////////////////////////////////////////////// // -// UTCCP Copy Traits +// Specialized copy_unpack implementation for SM100::TMEM::UTCCP instructions // -//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +CUTE_HOST_DEVICE constexpr +void +copy_unpack(Copy_Traits const&, + Tensor const& src, + Tensor & dst) +{ + static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); + static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); + CopyOp::copy(src[0], raw_pointer_cast(dst.data())); +} + +} // end namespace SM100::TMEM::UTCCP // In the following UTCCP traits, the ValID is representing: // logical_bit_idx -> tmem_addr_offset. @@ -3344,131 +3512,76 @@ struct Copy_Traits // The last two modes provide boradcast transformation for 4x32DP and 2x64DP. // With above, the strides of first two modes are neccessary to be TMEM::DP_b and 1. // And the stride of the third mode in the SrcLayout must be zero. 
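Illustrative note, not part of the patch: the two helpers added above are compile-time selectors. op_repeater widens a 1x TMEM copy op to the largest repeat count whose footprint divides the accumulator's N-extent in bits (for example, a 2048-bit N extent selects the 8x variant of a 16dp256b op, since 2048 = 256 * 8), and tmem_load_to_store maps a TMEM_LOAD op to its register-to-tmem counterpart. A short sketch of the pairing, assuming code placed inside namespace cute with this header's using-declarations for the LOAD ops in scope:

    // Given the LOAD op used to read accumulators out of tmem, derive the matching STORE op.
    using T2R = SM100_TMEM_LOAD_16dp256b8x;                   // tmem -> registers
    using R2T = decltype(TMEM::tmem_load_to_store(T2R{}));    // registers -> tmem
    static_assert(cute::is_same_v<R2T, SM100_TMEM_STORE_16dp256b8x>,
                  "tmem_load_to_store pairs each LOAD with the STORE of the same shape");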
+ +//////////////////////////////////////////////////////////////////////////////////////////////////// + +using SM100::TMEM::UTCCP::SM100_UTCCP_128dp256bit_1cta; + template <> struct Copy_Traits { using ThrID = Layout<_1>; - // logical bit_idx -> tmem_addr using ValID = Layout, Stride>; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout, Stride<_0, _1>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0,_1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_128dp256bit_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_128dp256bit_2cta; template <> struct Copy_Traits { using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout, Stride<_0, _1>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0, _1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_128dp256bit_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_128dp128bit_1cta; template <> struct Copy_Traits { using ThrID = Layout<_1>; - // logical bit_idx -> tmem_addr using ValID = Layout, Stride>; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout, Stride<_0, _1>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0,_1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_128dp128bit_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_128dp128bit_2cta; template <> struct Copy_Traits { using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout, Stride<_0, _1>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0, _1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - 
static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_128dp128bit_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_4dp256bit_1cta; template <> struct Copy_Traits @@ -3485,65 +3598,34 @@ struct Copy_Traits */ using ThrID = Layout<_1>; - // logical bit_idx -> tmem_addr using ValID = Layout, Stride>; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_32,_128>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout>, Stride<_0,Stride<_32,_128>>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_4dp256bit_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_4dp256bit_2cta; template <> struct Copy_Traits { - using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_32,_128>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout>, Stride<_0,Stride<_32,_128>>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_4dp256bit_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_4x32dp128bit_1cta; template <> struct Copy_Traits @@ -3556,63 +3638,32 @@ struct Copy_Traits // [core_matrix_strided, core_matrix_leading, broadcast] using ValID = Layout, Stride<_DP,_1, _DPx32>>; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _32, _0>>>; - - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0,_1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_4x32dp128bit_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_4x32dp128bit_2cta; template <> struct Copy_Traits { - using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _32, _0>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0,_1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - 
template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_4x32dp128bit_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_2x64dp128bitlw0213_1cta; template <> struct Copy_Traits @@ -3625,62 +3676,33 @@ struct Copy_Traits // [core_matrix_strided, core_matrix_leading, broadcast] using ValID = Layout, Stride<_DP,_1, _DPx64>>; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _64, _0>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0, _1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_2x64dp128bitlw0213_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_2x64dp128bitlw0213_2cta; template <> struct Copy_Traits { - using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _64, _0>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0, _1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_2x64dp128bitlw0213_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_2x64dp128bitlw0123_1cta; template <> struct Copy_Traits @@ -3695,62 +3717,31 @@ struct Copy_Traits using ValID = Layout, Stride<_DP,_1 ,_DPx64,_DPx32>>; - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _32,_4096,_0>>>; - // Map from (dst-thr,dst-val) to bit using DstLayout = Layout, Stride<_0, _1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; +}; +//////////////////////////////////////////////////////////////////////////////////////////////////// - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_2x64dp128bitlw0123_1cta::copy(src[0], raw_pointer_cast(dst.data())); - } -}; +using SM100::TMEM::UTCCP::SM100_UTCCP_2x64dp128bitlw0123_2cta; template <> struct Copy_Traits { - using ThrID = Layout<_2>; - // logical bit_idx -> tmem_addr using ValID = typename Copy_Traits::ValID; - - // Map from (src-thr,src-val) to bit using SrcLayout = Layout>, Stride<_0,Stride<_1, _32, _4096,_0>>>; - // Map from (dst-thr,dst-val) to bit using 
DstLayout = Layout, Stride<_0,_1>>; - // Reference map from (thr,val) to bit using RefLayout = DstLayout; - - - template - CUTE_HOST_DEVICE friend constexpr - void - copy_unpack(Copy_Traits const& traits, - Tensor const& src, - Tensor & dst) - { - static_assert(is_rmem::value, "Expected smem_desc src for SM100_UTCCP"); - static_assert(is_tmem::value, "Expected tmem dst for SM100_UTCCP"); - SM100_UTCCP_2x64dp128bitlw0123_2cta::copy(src[0], raw_pointer_cast(dst.data())); - } }; +//////////////////////////////////////////////////////////////////////////////////////////////////// + template CUTE_HOST_DEVICE constexpr @@ -3775,4 +3766,3 @@ make_utccp_copy(CopyOp const&, } // namespace cute -//////////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cute/atom/copy_traits_sm90_im2col.hpp b/include/cute/atom/copy_traits_sm90_im2col.hpp index beefa63f6c..e4d1e3ffff 100644 --- a/include/cute/atom/copy_traits_sm90_im2col.hpp +++ b/include/cute/atom/copy_traits_sm90_im2col.hpp @@ -647,7 +647,7 @@ make_tma_atom_im2col(CopyOp, gtensor_cwhdn, range_c, range_whdn, - detail::get_swizzle_portion(slayout), + get_swizzle_portion(slayout), tma_layout_vt, lower_corner_whd, upper_corner_whd, diff --git a/include/cute/atom/mma_atom.hpp b/include/cute/atom/mma_atom.hpp index a96291e138..08141a0920 100644 --- a/include/cute/atom/mma_atom.hpp +++ b/include/cute/atom/mma_atom.hpp @@ -454,8 +454,6 @@ struct TiledMMA : MMA_Atom { // (M,K) -> (M,K) auto ref_A = make_layout(make_shape(tile_size_mnk<0>(), tile_size_mnk<2>())); - // (athrid,val) -> (M,K) - auto layoutA_TV = thrfrg_A(ref_A); // (ThrV,(ThrM,ThrK)) -> (ThrV,(ThrM,ThrN,ThrK)) auto atile = make_tile(_, @@ -493,8 +491,6 @@ struct TiledMMA : MMA_Atom { // (N,K) -> (N,K) auto ref_B = make_layout(make_shape(tile_size_mnk<1>(), tile_size_mnk<2>())); - // (bthrid,val) -> (N,K) - auto layoutB_TV = thrfrg_B(ref_B); // (ThrV,(ThrN,ThrK)) -> (ThrV,(ThrM,ThrN,ThrK)) auto btile = make_tile(_, @@ -1192,6 +1188,7 @@ print_svg(TiledMMA const &mma) { #include #include #include +#include #include #include #include diff --git a/include/cute/atom/mma_traits_sm100.hpp b/include/cute/atom/mma_traits_sm100.hpp index f336eff215..820dc103e1 100644 --- a/include/cute/atom/mma_traits_sm100.hpp +++ b/include/cute/atom/mma_traits_sm100.hpp @@ -37,10 +37,13 @@ #include #include #include -#include // cute::TMEM:: +#include // cute::TMEM:: + #include #include // cute::GMMA:: #include // cute::GMMA:: +#include // UTCCP smem desc + #include // Check that aggregate initialization in .with() initializes all fields @@ -417,6 +420,9 @@ constexpr auto get_utccp_smem_desc_tensor(Tensor const& smem_u namespace UMMA { +// Import TMEM constants +namespace TMEM = cute::TMEM; + enum class TmemAllocMode { // Default allocation mode. 
// If a TMEM Atom uses a half-subpartition (16DPs), then multiple atoms can be @@ -3053,7 +3059,7 @@ struct MMA_Traits <= 8 && cute::sizeof_bits_v <= 8, "SM100_MMA_F8F6F4_2x1SM_SS supports types with leq 8bit types"); static_assert(M == 128 || M == 256, "SM100_MMA_F8F6F4_2x1SM_SS M-mode size should be 64 or 128 for 1 CTA cluster MMA."); static_assert((N % 32 == 0) && (32 <= N) && (N <= 256), "SM100_MMA_F8F6F4_2x1SM_SS N-mode size should be a multiple of 32 between 32 and 256."); - + using FrgTypeA = UMMA::smem_desc; using FrgTypeB = UMMA::smem_desc; using FrgTypeC = UMMA::tmem_frg_2sm; diff --git a/include/cute/atom/mma_traits_sm89.hpp b/include/cute/atom/mma_traits_sm89.hpp new file mode 100644 index 0000000000..35ad436e22 --- /dev/null +++ b/include/cute/atom/mma_traits_sm89.hpp @@ -0,0 +1,96 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + +// + +// +#pragma once + +#include +#include +#include +#include + +namespace cute +{ + +namespace { + +// (T32,V4) -> (M16,N8) +using SM80_16x8_Row = Layout,Shape < _2,_2>>, + Stride,Stride<_16,_8>>>; + +} + +template <> +struct MMA_Traits { + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using Shape_MNK = Shape<_16,_8,_32>; + using ThrID = Layout<_32>; + using ALayout = Layout,Shape < _4,_2, _2>>, + Stride,Stride<_16,_8,_256>>>; + using BLayout = Layout,Shape <_4, _2>>, + Stride,Stride<_8,_128>>>; + using CLayout = SM80_16x8_Row; +}; + +template <> +struct MMA_Traits : +MMA_Traits { + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; +}; + +template <> +struct MMA_Traits : +MMA_Traits { + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; +}; + +template <> +struct MMA_Traits : +MMA_Traits { + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; +}; + +} // end namespace cute diff --git a/include/cute/atom/mma_traits_sm90_gmma.hpp b/include/cute/atom/mma_traits_sm90_gmma.hpp index e688a7e6a8..e1c3bb4034 100644 --- a/include/cute/atom/mma_traits_sm90_gmma.hpp +++ b/include/cute/atom/mma_traits_sm90_gmma.hpp @@ -322,7 +322,11 @@ struct DescriptorIterator CUTE_HOST_DEVICE constexpr DescriptorIterator operator+(Index const& offset) const { - return { GmmaDescriptor{desc_ + uint64_t(offset)} }; + // Use 32bit calculation rather than 64 bit calculation as we only update the part of desc + GmmaDescriptor ret; + ret.reg32_[0] = desc_.reg32_[0] + uint32_t(offset); + ret.reg32_[1] = desc_.reg32_[1]; + return { ret }; } }; diff --git a/include/cute/config.hpp b/include/cute/config.hpp index 3ac4c1024f..f3d72f257f 100644 --- a/include/cute/config.hpp +++ b/include/cute/config.hpp @@ -151,6 +151,16 @@ # include #endif +// +// Type +// + +#if defined(__CUDACC_RTC__) +# include +#else +# include +#endif + // // Debugging utilities // diff --git a/include/cute/container/tuple.hpp b/include/cute/container/tuple.hpp index ed4f8c8c23..9a13e951be 100644 --- a/include/cute/container/tuple.hpp +++ b/include/cute/container/tuple.hpp @@ -53,8 +53,8 @@ // but do _not_ include references like int& or float&. // (See std::tie for an example of a tuple of references.) // -// Standard-layout types preserve ABI across host-device boundaries. -// They are safe to use as device kernel parameters. +// Standard-layout types preserve ABI across host-device boundaries. They are safe to use as device kernel parameters. +// The standard-layout requirement prevents a more common EBO-based implemented of cute::tuple. // // The cute::tuple is also simplified over the implementations in std::, cuda::std::, and thrust:: by ignoring much of // the conversion SFINAE, special overloading, and avoiding cvref template types. @@ -64,12 +64,15 @@ namespace cute { -namespace detail +template +struct tuple; + +namespace eso { // ESO stands for "empty structure optimization." -// We use this technique to ensure that cute::tuple -// doesn't waste space storing template arguments that have no data (like integral_constant). +// We use this technique to ensure that cute::tuple doesn't waste space +// storing template arguments that have no data (like integral_constant). 
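The surrounding comments describe cute::tuple's "empty structure optimization" (ESO): elements of empty type are never stored, and `get` simply materializes a fresh instance when asked. A minimal standalone sketch of that idea, assuming nothing about the real cute:: internals (the `slot` name is purely illustrative):

```cpp
#include <type_traits>

// One element "slot": stores the value only when T actually carries data.
template <class T, bool IsEmpty = std::is_empty<T>::value>
struct slot {                      // non-empty case: real storage
  T value_;
  T const& get() const { return value_; }
};

template <class T>
struct slot<T, true> {             // empty case: no member, build on demand
  T get() const { return T{}; }
};

struct Empty {};                   // stand-in for integral_constant-like types

int main() {
  slot<int>   a{42};
  slot<Empty> b;
  static_assert(sizeof(slot<Empty>) == 1, "no data members stored");
  (void)b.get();                   // returns a freshly constructed Empty{}
  return a.get() == 42 ? 0 : 1;
}
```

Because the empty case holds no members at all, the aggregate stays standard-layout without leaning on empty-base optimization, which is the constraint the surrounding comments call out.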
// Empty types in the template argument list are not even constructed, // and do not have unique element addresses. Calling `get` // constructs and returns an instance of an empty type on demand. @@ -133,88 +136,92 @@ struct ESO { }; // Get Nth value from ESO -template -CUTE_HOST_DEVICE constexpr -cute::enable_if_t>>::value, - cute::tuple_element_t>> -getv(ESO const&) -{ - return {}; -} - -template +template CUTE_HOST_DEVICE constexpr -cute::enable_if_t>>::value, - cute::tuple_element_t> const&> -getv(ESO const& s) +R +getr(S&& s) noexcept { if constexpr (N == 0) { - return static_cast(s.first_); + return static_cast(s).first_; } else { - return getv(s.rest_); + return getr(static_cast(s).rest_); } + CUTE_GCC_UNREACHABLE; } -template +// Compilers disagree on decltype(auto), so these implementations avoid it at cost +template CUTE_HOST_DEVICE constexpr -cute::enable_if_t>>::value, - cute::tuple_element_t> &> -getv(ESO& s) +cute::conditional_t>>::value, + cute::tuple_element_t>, + cute::tuple_element_t> const&> +getv_cr(ESO const& s) noexcept { - if constexpr (N == 0) { - return static_cast(s.first_); + if constexpr (cute::is_empty>>::value) { + return {}; } else { - return getv(s.rest_); + return getr> const&, N>(s); } + CUTE_GCC_UNREACHABLE; } -template +template CUTE_HOST_DEVICE constexpr -cute::enable_if_t>>::value, - cute::tuple_element_t> &&> -getv(ESO&& s) +cute::conditional_t>>::value, + cute::tuple_element_t>, + cute::tuple_element_t> &> +getv_r(ESO& s) noexcept { - if constexpr (N == 0) { - return static_cast(s.first_); + if constexpr (cute::is_empty>>::value) { + return {}; } else { - return getv(static_cast&&>(s.rest_)); + return getr> &, N>(s); } + CUTE_GCC_UNREACHABLE; } -template +template CUTE_HOST_DEVICE constexpr -auto -findt(ESO const& t) noexcept -{ - if constexpr (cute::is_same_v) { - return C{}; - } else - if constexpr (sizeof...(Rest) == 0) { - return C{}; - } else - if constexpr (IsRestEmpty) { - return cute::detail::findt(ESO_t{}); +cute::conditional_t>>::value, + cute::tuple_element_t>, + cute::tuple_element_t> &&> +getv_rr(ESO&& s) noexcept +{ + if constexpr (cute::is_empty>>::value) { + return {}; } else { - return cute::detail::findt(t.rest_); + return getr> &&, N>(static_cast&&>(s)); } + CUTE_GCC_UNREACHABLE; } -} // end namespace detail +} // end namespace eso template -struct tuple : detail::ESO_t +struct tuple : eso::ESO_t { CUTE_HOST_DEVICE constexpr tuple() {} CUTE_HOST_DEVICE constexpr - tuple(T const&... t) : detail::ESO_t(t...) {} + tuple(T const&... t) : eso::ESO_t(t...) {} }; template <> struct tuple<> {}; +// +// make_tuple (value-based implementation) +// + +template +CUTE_HOST_DEVICE constexpr +tuple +make_tuple(T const&... t) +{ + return {t...}; +} + // Returns the element in the ith position of the tuple template CUTE_HOST_DEVICE constexpr @@ -222,7 +229,7 @@ decltype(auto) get(tuple const& t) noexcept { static_assert(I < sizeof...(T), "Index out of range"); - return detail::getv(t); + return eso::getv_cr(t); } template @@ -231,7 +238,7 @@ decltype(auto) get(tuple& t) noexcept { static_assert(I < sizeof...(T), "Index out of range"); - return detail::getv(t); + return eso::getv_r(t); } template @@ -240,22 +247,22 @@ decltype(auto) get(tuple&& t) noexcept { static_assert(I < sizeof...(T), "Index out of range"); - return detail::getv(static_cast&&>(t)); + return eso::getv_rr(static_cast&&>(t)); } -// Returns the position of type X (as a static integer) in the tuple -// type's argument list. X must be unique in the argument list. 
+// Returns the first position of type X (as a static integer) in the tuple +// type's argument list. template CUTE_HOST_DEVICE constexpr auto -find(tuple const& t) noexcept +find(tuple const&) noexcept { - return detail::findt(t); + return cute::C...>>{}; } // // Custom is_tuple trait simply checks the existence of tuple_size -// and assumes std::get(.), std::tuple_element +// and assumes get(.), tuple_element // namespace detail { @@ -269,19 +276,7 @@ template struct is_tuple : decltype(detail::has_tuple_size((T*)0)) {}; template -constexpr bool is_tuple_v = cute::is_tuple::value; - -// -// make_tuple (value-based implementation) -// - -template -CUTE_HOST_DEVICE constexpr -tuple -make_tuple(T const&... t) -{ - return {t...}; -} +static constexpr bool is_tuple_v = cute::is_tuple::value; // // tuple_cat concatenates multiple cute::tuple into a single cute::tuple, diff --git a/include/cute/container/type_list.hpp b/include/cute/container/type_list.hpp index b8ac5f0de5..dfffbe251f 100644 --- a/include/cute/container/type_list.hpp +++ b/include/cute/container/type_list.hpp @@ -31,6 +31,7 @@ #pragma once #include // CUTE_HOST_DEVICE, CUTE_STL_NAMESPACE +#include namespace cute { @@ -39,11 +40,35 @@ template struct type_list {}; // get for type_list -// requires tuple_element_t> to have std::is_default_constructible +// Get an instance of the Ith type in the pack T... +// Requires tuple_element_t> to have std::is_default_constructible template CUTE_HOST_DEVICE constexpr CUTE_STL_NAMESPACE::tuple_element_t> -get(type_list const& t) noexcept { +get(type_list const&) noexcept { + return {}; +} + +// Find the index of the first true in the pack B... +template +struct find_true { + CUTE_HOST_DEVICE static constexpr size_t find() { + size_t i = 0; + (void) ((B ? true : (++i, false)) || ...); + return i; + } + static constexpr size_t value = find(); +}; + +template +static constexpr size_t find_true_v = find_true::value; + +// find for type_list +// Finds the first position of type X (as a static integer) in the T... 
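The `find_true` helper added above counts leading `false` values with a short-circuiting logical-or fold. A standalone restatement of the same trick, including how it turns a first-matching-type search into a bool pack (the helper names here are illustrative, not the cute:: API):

```cpp
#include <cstddef>
#include <type_traits>

// Index of the first true in the pack, or sizeof...(B) if none is true.
template <bool... B>
constexpr std::size_t first_true() {
  std::size_t i = 0;
  // The fold short-circuits on the first true; each false bumps the counter.
  (void)((B ? true : (++i, false)) || ...);
  return i;
}

static_assert(first_true<false, false, true, true>() == 2, "");
static_assert(first_true<false, false>() == 2, "not found => pack size");

// First position of X in T..., by mapping the type pack onto a bool pack.
template <class X, class... T>
constexpr std::size_t find_first_v = first_true<std::is_same<X, T>::value...>();

static_assert(find_first_v<int, char, int, float> == 1, "");
```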
pack +template +CUTE_HOST_DEVICE constexpr +CUTE_STL_NAMESPACE::integral_constant...>> +find(type_list const&) noexcept { return {}; } @@ -69,9 +94,8 @@ struct tuple_size> template struct tuple_element> -{ - using type = typename CUTE_STL_NAMESPACE::tuple_element>::type; -}; + : CUTE_STL_NAMESPACE::tuple_element> +{}; } // end namespace std @@ -94,9 +118,8 @@ struct tuple_size> template struct tuple_element> -{ - using type = typename CUTE_STL_NAMESPACE::tuple_element>::type; -}; + : CUTE_STL_NAMESPACE::tuple_element> +{}; } // end namespace std #endif // CUTE_STL_NAMESPACE_IS_CUDA_STD diff --git a/include/cute/layout.hpp b/include/cute/layout.hpp index 4ee901ada0..3f02a41d44 100644 --- a/include/cute/layout.hpp +++ b/include/cute/layout.hpp @@ -834,6 +834,8 @@ coalesce_x(Layout const& layout) } else { return detail::bw_coalesce(flat_shape, flat_stride, get(flat_shape), get(flat_stride)); } + + CUTE_GCC_UNREACHABLE; } // Apply coalesce_x at the terminals of trg_profile @@ -903,6 +905,8 @@ coalesce(Shape const& shape) } else { return append(init, a); // Can't coalesce, so append } + + CUTE_GCC_UNREACHABLE; }); } @@ -1026,7 +1030,7 @@ template CUTE_HOST_DEVICE constexpr auto -composition_impl(LShape const& lhs_shape, LStride const& lhs_stride, +composition_impl(LShape const& lhs_shape, [[maybe_unused]] LStride const& lhs_stride, RShape const& rhs_shape, RStride const& rhs_stride) { if constexpr (is_tuple::value) { // Right-distributivity of Layout composition for RHS tuple @@ -1063,7 +1067,7 @@ composition_impl(LShape const& lhs_shape, LStride const& lhs_stride, auto rest_stride = get<3>(init); auto curr_shape = get(lhs_shape); - auto curr_stride = get(lhs_stride); + [[maybe_unused]] auto curr_stride = get(lhs_stride); // Strong divisibility condition -- requires composition to be statically verifiable. 
//CUTE_STATIC_ASSERT_V(((rest_stride % curr_shape) == Int<0>{}) or (rest_stride < curr_shape), "Stride Divisibility Condition"); @@ -1105,6 +1109,8 @@ composition_impl(LShape const& lhs_shape, LStride const& lhs_stride, rest_shape / new_shape, next_stride); } + + CUTE_GCC_UNREACHABLE; }); if constexpr (tuple_size::value == 0) { @@ -1289,6 +1295,8 @@ right_inverse(Layout const& layout) } else { return init; } + + CUTE_GCC_UNREACHABLE; }); return coalesce(make_layout(result_shape, result_stride)); @@ -1344,9 +1352,11 @@ left_inverse(Layout const& layout) return make_tuple(append(result_shape, istride / size(result_shape)), append(result_stride, get(preprod_shape))); } + + CUTE_GCC_UNREACHABLE; }); - return coalesce(make_layout(append(result_shape, get(lshape)), + return coalesce(make_layout(append(result_shape, get(lshape)), result_stride)); } @@ -1499,7 +1509,7 @@ nullspace(Layout const& layout) { auto flat_layout = flatten(layout); - auto iseq = detail::nullspace_seq<0>(flat_layout.stride(), seq<>{}); + [[maybe_unused]] auto iseq = detail::nullspace_seq<0>(flat_layout.stride(), seq<>{}); if constexpr (iseq.size() == 0) { return Layout<_1,_0>{}; // Empty case, nothing found diff --git a/include/cute/numeric/arithmetic_tuple.hpp b/include/cute/numeric/arithmetic_tuple.hpp index 33076378ea..60a4ff4abc 100644 --- a/include/cute/numeric/arithmetic_tuple.hpp +++ b/include/cute/numeric/arithmetic_tuple.hpp @@ -84,6 +84,8 @@ as_arithmetic_tuple(T const& t) { } else { return t; } + + CUTE_GCC_UNREACHABLE; } // diff --git a/include/cute/numeric/math.hpp b/include/cute/numeric/math.hpp index 83dcd4e6e5..147458b85d 100644 --- a/include/cute/numeric/math.hpp +++ b/include/cute/numeric/math.hpp @@ -57,7 +57,7 @@ template >(t) < static_cast>(u) ? t : u; } template ,Offset,Layout> const& layout) // Utilities // -namespace detail { - // Get just the Swizzle part of a composed layout. 
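Several utilities in this hunk gain a trailing `CUTE_GCC_UNREACHABLE;` after an exhaustive `if constexpr` chain. A hedged sketch of the pattern, assuming the macro expands to `__builtin_unreachable()` on GCC/Clang and to nothing elsewhere:

```cpp
#include <type_traits>

// Every branch of the chain returns, but some compilers still diagnose the
// syntactic fall-through path in templated code; marking it unreachable keeps
// the warning away without changing behavior.
template <class T>
constexpr auto classify(T) {
  if constexpr (std::is_integral_v<T>) {
    return 'i';
  } else if constexpr (std::is_floating_point_v<T>) {
    return 'f';
  } else {
    return '?';
  }
#if defined(__GNUC__)
  __builtin_unreachable();   // never executed; placates missing-return checks
#endif
}

static_assert(classify(1) == 'i', "");
static_assert(classify(1.0) == 'f', "");
```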
template CUTE_HOST_DEVICE constexpr @@ -167,8 +165,6 @@ get_nonswizzle_portion(Layout const& slayout) return slayout; } -} // namespace detail - // // Slice a Swizzled ComposedLayout // diff --git a/include/cute/tensor_impl.hpp b/include/cute/tensor_impl.hpp index 9c1a0b4420..0d9144884b 100644 --- a/include/cute/tensor_impl.hpp +++ b/include/cute/tensor_impl.hpp @@ -381,6 +381,8 @@ struct MakeTensor return Tensor(); } } + + CUTE_GCC_UNREACHABLE; } }; diff --git a/include/cutlass/arch/arch.h b/include/cutlass/arch/arch.h index 74ab834f49..c1032f0b0c 100644 --- a/include/cutlass/arch/arch.h +++ b/include/cutlass/arch/arch.h @@ -43,7 +43,7 @@ namespace cutlass { namespace arch { constexpr int sm100_smem_capacity_bytes = 232448; -constexpr int sm120_smem_capacity_bytes = 102400; +constexpr int sm120_smem_capacity_bytes = 101376; #if defined(__NVCC__) || defined(__CUDACC_RTC__) || (defined(__clang__) && (defined(__CUDA__) || defined(CUTLASS_ENABLE_SYCL))) diff --git a/include/cutlass/arch/barrier.h b/include/cutlass/arch/barrier.h index 249191371a..6280430d95 100644 --- a/include/cutlass/arch/barrier.h +++ b/include/cutlass/arch/barrier.h @@ -53,6 +53,9 @@ #define CUTLASS_ARCH_TCGEN_ENABLED 1 #endif +#if (defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101F_ENABLED)) +#define CUTLASS_ARCH_TCGEN_ENABLED 1 +#endif namespace cutlass { /// @brief @@ -389,7 +392,7 @@ struct ClusterBarrier { // // Static Versions // - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void init(ValueType const* smem_ptr, uint32_t arrive_count) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -406,7 +409,7 @@ struct ClusterBarrier { } // Static version of wait - in case we don't want to burn a register - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void wait(ValueType const* smem_ptr, uint32_t phase) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -430,7 +433,7 @@ struct ClusterBarrier { #endif } - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static bool test_wait(ValueType const* smem_ptr, uint32_t phase, uint32_t pred) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -455,7 +458,7 @@ struct ClusterBarrier { return 0; } - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static bool try_wait(ValueType const* smem_ptr, uint32_t phase) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -479,7 +482,7 @@ struct ClusterBarrier { } // Static Predicated version of the above - in case we know the address. 
- CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void arrive(ValueType const* smem_ptr, uint32_t cta_id, uint32_t pred) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -501,7 +504,7 @@ struct ClusterBarrier { } // Barrier arrive on local smem - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void arrive(ValueType const* smem_ptr) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -517,7 +520,7 @@ struct ClusterBarrier { #endif } - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void invalidate(ValueType const* smem_ptr) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -578,7 +581,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { // // Performs an arrive operation + expected transaction bytes increment - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void arrive_and_expect_tx(ValueType const* smem_ptr, uint32_t transaction_bytes) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -595,7 +598,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { } // Performs an arrive operation + expected transaction bytes increment for a remote cta_id in a Cluster - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void arrive_and_expect_tx( ValueType const* smem_ptr, uint32_t transaction_bytes, uint32_t cta_id, uint32_t pred) { #if CUDA_BARRIER_ENABLED @@ -616,7 +619,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { } // Performs an expected transaction bytes increment without doing an arrive operation - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void expect_transaction(ValueType const* smem_ptr, uint32_t transaction_bytes) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -633,7 +636,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { } // Performs an expected transaction bytes decrement without doing an arrive operation - CUTLASS_DEVICE + CUTLASS_HOST_DEVICE static void complete_transaction( ValueType const* smem_ptr, uint32_t dst_cta_id, uint32_t transaction_bytes, uint32_t pred = 1) { #if CUDA_BARRIER_ENABLED @@ -728,7 +731,7 @@ void fence_view_async_shared() { } // Arrive on completion of in-flight cp.async operations issued by the calling thread -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void cpasync_barrier_arrive(uint64_t const* smem_ptr) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -745,7 +748,7 @@ void cpasync_barrier_arrive(uint64_t const* smem_ptr) { } // Arrive on completion of in-flight cp.async operations issued by the calling thread (noinc) -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void cpasync_barrier_arrive_noinc(uint64_t const* smem_ptr) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -764,7 +767,7 @@ void cpasync_barrier_arrive_noinc(uint64_t const* smem_ptr) { //////////////////////////////////////////////////////////////////////////////////////////////////// -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive(uint64_t const* smem_ptr) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -779,7 +782,7 @@ void umma_arrive(uint64_t const* smem_ptr) { } //UMMA arrive for MMA_2x1SM -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_2x1SM(uint64_t const* smem_ptr) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -794,7 +797,7 @@ void umma_arrive_2x1SM(uint64_t const* 
smem_ptr) { } // UMMA arrive for MMA_1sm + TMA_LOAD_MULTICAST combination -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_multicast(uint64_t const* smem_ptr, uint16_t cta_mask) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -812,7 +815,7 @@ void umma_arrive_multicast(uint64_t const* smem_ptr, uint16_t cta_mask) { } // UMMA arrive for MMA_2x1SM + TMA_LOAD_MULTICAST combination -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_multicast_2x1SM(uint64_t const* smem_ptr, uint16_t cta_mask) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -824,14 +827,14 @@ void umma_arrive_multicast_2x1SM(uint64_t const* smem_ptr, uint16_t cta_mask) { : :"r"(bar_intptr), "h"(cta_mask)); } -#else +#elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif } // Temporary solution for sparse kernel. // Will remove this when we done tightly elect_one wrap. -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_multicast_no_elect(uint64_t const* smem_ptr, uint16_t cta_mask) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -850,7 +853,7 @@ void umma_arrive_multicast_no_elect(uint64_t const* smem_ptr, uint16_t cta_mask) // Temporary solution for sparse kernel. // UMMA arrive for MMA_2x1SM + TMA_LOAD_MULTICAST combination -CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_multicast_2x1SM_no_elect(uint64_t const* smem_ptr, uint16_t cta_mask) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr); @@ -868,7 +871,7 @@ void umma_arrive_multicast_2x1SM_no_elect(uint64_t const* smem_ptr, uint16_t cta } // Always arrive on even SM of collaborating 2 SMs. 
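This hunk widens many barrier helpers from `CUTLASS_DEVICE` to `CUTLASS_HOST_DEVICE` and, in the same functions, turns the fallback `brkpt` branch from `#else` into `#elif defined(__CUDA_ARCH__)`. A small CUDA sketch of why the two changes belong together (assumed rationale: once the function is also compiled for the host, the device-only inline assembly must be fenced off from the host pass):

```cpp
// Minimal sketch, not the CUTLASS function. Compile as a .cu file with nvcc.
__host__ __device__ inline void arrive_or_trap() {
#if defined(CUTLASS_ARCH_TCGEN_ENABLED)
  // real PTX arrive path elided in this sketch
#elif defined(__CUDA_ARCH__)
  asm volatile("brkpt;\n" ::);   // device-side breakpoint on unsupported arch
#else
  // host compilation pass: nothing to emit
#endif
}
```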
-CUTLASS_DEVICE +CUTLASS_HOST_DEVICE void umma_arrive_2x1SM_sm0(uint64_t const* smem_ptr) { #if defined(CUTLASS_ARCH_TCGEN_ENABLED) uint32_t bar_intptr = cute::cast_smem_ptr_to_uint(smem_ptr) & cute::Sm100MmaPeerBitMask; @@ -879,7 +882,7 @@ void umma_arrive_2x1SM_sm0(uint64_t const* smem_ptr) { : : "r"(bar_intptr)); -#else +#elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif } diff --git a/include/cutlass/arch/config.h b/include/cutlass/arch/config.h index 1dd27f78db..e5daf8292b 100644 --- a/include/cutlass/arch/config.h +++ b/include/cutlass/arch/config.h @@ -92,6 +92,14 @@ #define CUTLASS_ARCH_MMA_SM100A_ENABLED 1 #endif + // SM100f + #if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)) + #define CUTLASS_ARCH_MMA_SM100F_SUPPORTED 1 + #endif + + #if (!defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) && CUDA_ARCH_FAMILY(1000)) + #define CUTLASS_ARCH_MMA_SM100F_ENABLED CUTLASS_ARCH_MMA_SM100F_SUPPORTED + #endif #endif #endif @@ -109,6 +117,14 @@ #define CUTLASS_ARCH_MMA_SM101A_ENABLED 1 #endif + // SM101f + #if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)) + #define CUTLASS_ARCH_MMA_SM101F_SUPPORTED 1 + #endif + + #if (!defined(CUTLASS_ARCH_MMA_SM101F_ENABLED) && CUDA_ARCH_FAMILY(1010)) + #define CUTLASS_ARCH_MMA_SM101F_ENABLED CUTLASS_ARCH_MMA_SM101F_SUPPORTED + #endif #endif #endif @@ -124,12 +140,21 @@ #define CUTLASS_ARCH_MMA_SM120A_ENABLED 1 #endif + // SM120f + #if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)) + #define CUTLASS_ARCH_MMA_SM120F_SUPPORTED 1 + #endif + + #if (!defined(CUTLASS_ARCH_MMA_SM120F_ENABLED) && CUDA_ARCH_FAMILY(1200)) + #define CUTLASS_ARCH_MMA_SM120F_ENABLED CUTLASS_ARCH_MMA_SM120F_SUPPORTED + #endif #endif #endif -#if (defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101A_ENABLED) ||\ - defined(CUTLASS_ARCH_MMA_SM120A_ENABLED)) +#if (defined(CUTLASS_ARCH_MMA_SM100A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM100F_ENABLED) ||\ + defined(CUTLASS_ARCH_MMA_SM101A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM101F_ENABLED) ||\ + defined(CUTLASS_ARCH_MMA_SM120A_ENABLED) || defined(CUTLASS_ARCH_MMA_SM120F_ENABLED)) # define CUTLASS_ARCH_CLC_ENABLED #endif diff --git a/include/cutlass/arch/grid_dependency_control.h b/include/cutlass/arch/grid_dependency_control.h index ae66de279d..e7defb5dbb 100644 --- a/include/cutlass/arch/grid_dependency_control.h +++ b/include/cutlass/arch/grid_dependency_control.h @@ -53,6 +53,20 @@ #endif #endif +#ifndef CUTLASS_GDC_ENABLED + #if(CUDA_BARRIER_ENABLED && \ + defined(CUTLASS_ENABLE_GDC_FOR_SM100) && \ + defined(__CUDA_ARCH__) && \ + ((__CUDA_ARCH__ == 1000 &&\ + (defined(__CUDA_ARCH_FEAT_SM100_ALL) || CUDA_ARCH_FAMILY(1000))) || \ + (__CUDA_ARCH__ == 1010 &&\ + (defined(__CUDA_ARCH_FEAT_SM101_ALL) || CUDA_ARCH_FAMILY(1010))) || \ + (__CUDA_ARCH__ == 1200 &&\ + (defined(__CUDA_ARCH_FEAT_SM120_ALL) || CUDA_ARCH_FAMILY(1200))))) + #define CUTLASS_GDC_ENABLED + #endif +#endif + namespace cutlass { namespace arch { @@ -84,6 +98,5 @@ static constexpr bool IsGdcGloballyEnabled = true; static constexpr bool IsGdcGloballyEnabled = false; #endif - } // namespace arch } // namespace cutlass diff --git a/include/cutlass/arch/memory_sm75.h b/include/cutlass/arch/memory_sm75.h index 9192687763..040f707436 100644 --- a/include/cutlass/arch/memory_sm75.h +++ b/include/cutlass/arch/memory_sm75.h @@ -60,7 +60,7 @@ CUTLASS_DEVICE void ldsm(Array & D, void const* ptr); 
///////////////////////////////////////////////////////////////////////////////////////////////// /// CUTLASS helper to get SMEM pointer -CUTLASS_DEVICE unsigned cutlass_get_smem_pointer(void *ptr) { +CUTLASS_HOST_DEVICE unsigned cutlass_get_smem_pointer(void *ptr) { return cute::cast_smem_ptr_to_uint(ptr); } diff --git a/include/cutlass/arch/reg_reconfig.h b/include/cutlass/arch/reg_reconfig.h index 557643e5e6..a65ee3281f 100644 --- a/include/cutlass/arch/reg_reconfig.h +++ b/include/cutlass/arch/reg_reconfig.h @@ -47,6 +47,14 @@ #define CUDA_CTA_RECONFIG_ACTIVATED 1 #endif + #if defined(__CUDA_ARCH__) && __CUDACC_VER_MAJOR__ >= 12 && ( \ + (__CUDA_ARCH__ == 1000 && CUDA_ARCH_FAMILY(1000)) \ + || (__CUDA_ARCH__ == 1010 && CUDA_ARCH_FAMILY(1010)) \ + || (__CUDA_ARCH__ == 1200 && CUDA_ARCH_FAMILY(1200)) \ + ) + #define CUDA_CTA_RECONFIG_ACTIVATED 1 + #endif + #endif namespace cutlass { diff --git a/include/cutlass/arch/wmma.h b/include/cutlass/arch/wmma.h index 2cafa51085..9cb9c04f95 100644 --- a/include/cutlass/arch/wmma.h +++ b/include/cutlass/arch/wmma.h @@ -34,9 +34,6 @@ #pragma once -// CUTLASS WMMA does not support clang at present. -#if !(defined(__clang__) && defined(__CUDA__)) - #if (__CUDACC_VER_MAJOR__ >= 9) #if (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 700)) #define CUTLASS_ARCH_WMMA_ENABLED @@ -58,8 +55,6 @@ #endif #endif -#endif //!(defined(__clang__) && defined(__CUDA__)) - #if defined(CUTLASS_ARCH_WMMA_ENABLED) #include diff --git a/include/cutlass/array.h b/include/cutlass/array.h index e1e182827f..ce33110aa4 100644 --- a/include/cutlass/array.h +++ b/include/cutlass/array.h @@ -986,6 +986,21 @@ struct multiply_add, Array, Array> { return result; } + CUTLASS_HOST_DEVICE + Array operator()(Array const &a, Array const &b, T const &scalar) const { + + Array result; + multiply_add scalar_op; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < N; ++i) { + result[i] = scalar_op(a[i], b[i], scalar); + } + + return result; + } + + CUTLASS_HOST_DEVICE Array operator()(Array const &a, T const &scalar_b, T const &scalar_c) const { diff --git a/include/cutlass/conv/collective/builders/sm100_umma_builder.inl b/include/cutlass/conv/collective/builders/sm100_umma_builder.inl index db1f7dae0a..9a9d4cb4e9 100644 --- a/include/cutlass/conv/collective/builders/sm100_umma_builder.inl +++ b/include/cutlass/conv/collective/builders/sm100_umma_builder.inl @@ -168,7 +168,7 @@ private: // Calculate SMEM matrix A and B buffers' pipeline stages static constexpr uint32_t AccumulatorPipelineStageCount = 2; - static constexpr uint32_t SchedulerPipelineStageCount = 2; + static constexpr uint32_t SchedulerPipelineStageCount = 1; static constexpr uint32_t CLCResponseSize = 16; // AccumulatorPipeline = PipelineUmmaAsync @@ -179,8 +179,6 @@ private: static constexpr auto LoadOrderBarrierStorage = sizeof(typename cutlass::OrderedSequenceBarrier<1,2>::SharedStorage); // CLC (scheduler) response static constexpr auto CLCResponseStorage = SchedulerPipelineStageCount * CLCResponseSize; - // CLC Throttle pipeline storage - static constexpr auto CLCThrottlePipelineStorage = sizeof(typename cutlass::PipelineAsync::SharedStorage); // Tmem dealloc static constexpr auto TmemDeallocStorage = sizeof(cutlass::arch::ClusterBarrier); // Tmem ptr storage @@ -190,7 +188,6 @@ private: CLCPipelineStorage + LoadOrderBarrierStorage + TmemDeallocStorage + - CLCThrottlePipelineStorage + CLCResponseStorage + TmemBasePtrsStorage); // Reduce SMEM capacity available for buffers considering barrier allocations. 
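The builder above enumerates the barrier and bookkeeping storage it must carve out of shared memory (with the CLC-throttle entry now removed) before sizing the mainloop buffers. A hedged, plain-arithmetic sketch of that stage-count calculation; apart from the SM100 capacity constant, every byte count below is a made-up placeholder, not the builder's real value:

```cpp
#include <cstdio>

int main() {
  int smem_capacity_bytes = 232448;   // sm100_smem_capacity_bytes from arch.h
  int kernel_carveout     = 1536;     // pipelines + CLC responses + tmem ptrs (placeholder)
  int bytes_per_ab_stage  = 24576;    // one pipeline stage of A and B tiles (placeholder)

  int pipeline_stages = (smem_capacity_bytes - kernel_carveout) / bytes_per_ab_stage;
  std::printf("PipelineStages = %d\n", pipeline_stages);
  return 0;
}
```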
@@ -204,7 +201,12 @@ private: constexpr static int NumSpatialDimensions = detail::gmem_layout_tags_to_spatial_dims(); using DispatchPolicy = cutlass::conv::MainloopSm100TmaUmmaWarpSpecializedImplicitGemm< - ConvOp, PipelineStages, NumSpatialDimensions, ClusterShape_MNK>; + ConvOp, + PipelineStages, + NumSpatialDimensions, + SchedulerPipelineStageCount, + AccumulatorPipelineStageCount, + ClusterShape_MNK>; public: using CollectiveOp = cutlass::conv::collective::CollectiveConv< diff --git a/include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp b/include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp index dc75b988d5..278f69f93f 100644 --- a/include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp +++ b/include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp @@ -28,9 +28,7 @@ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * **************************************************************************************************/ -// -// #pragma once @@ -66,6 +64,8 @@ template < conv::Operator ConvOp, int Stages, int NumSpatialDims, + int SchedulerPipelineStageCount, + int AccumulatorPipelineStageCount, class ClusterShape, // Static cluster shape or dynamic (int, int, _1) class TileShapeMNKL_, // (MmaAtomShapeM, MmaAtomShapeN, TileK, optional: TileL) class ElementA_, @@ -75,7 +75,12 @@ template < class TileTraitsB_> struct CollectiveConv< MainloopSm100TmaUmmaWarpSpecializedImplicitGemm< - ConvOp, Stages, NumSpatialDims, ClusterShape>, + ConvOp, + Stages, + NumSpatialDims, + SchedulerPipelineStageCount, + AccumulatorPipelineStageCount, + ClusterShape>, TileShapeMNKL_, ElementA_, ElementB_, @@ -87,7 +92,12 @@ struct CollectiveConv< // Type Aliases // using DispatchPolicy = MainloopSm100TmaUmmaWarpSpecializedImplicitGemm< - ConvOp, Stages, NumSpatialDims, ClusterShape>; + ConvOp, + Stages, + NumSpatialDims, + SchedulerPipelineStageCount, + AccumulatorPipelineStageCount, + ClusterShape>; using TileShape = decltype(cute::take<0,3>(TileShapeMNKL_{})); // (MmaAtomShapeM, MmaAtomShapeN, TileK) using ElementA = ElementA_; using ElementB = ElementB_; @@ -348,10 +358,12 @@ struct CollectiveConv< // Constructor // CUTLASS_DEVICE - CollectiveConv(Params const& params) { + CollectiveConv(Params const& params, ClusterShape cluster_shape, uint32_t block_rank_in_cluster) + : cluster_shape_(cluster_shape) + , block_rank_in_cluster_(block_rank_in_cluster) { if constexpr (IsDynamicCluster) { - dim3 cs = cute::cluster_shape(); - const bool is_fallback_cluster = (cs.x == params.cluster_shape_fallback.x && cs.y == params.cluster_shape_fallback.y); + const bool is_fallback_cluster = (cute::size<0>(cluster_shape_) == params.cluster_shape_fallback.x && + cute::size<1>(cluster_shape_) == params.cluster_shape_fallback.y); observed_tma_load_a_ = is_fallback_cluster ? ¶ms.tma_load_a_fallback : ¶ms.tma_load_a; observed_tma_load_b_ = is_fallback_cluster ? 
¶ms.tma_load_b_fallback : ¶ms.tma_load_b; } @@ -648,28 +660,14 @@ struct CollectiveConv< } /// Issue Tma Descriptor Prefetch -- ideally from a single thread for best performance - CUTLASS_DEVICE static void - prefetch_tma_descriptors(Params const& mainloop_params) { - if constexpr (IsDynamicCluster) { - dim3 cs = cute::cluster_shape(); - const bool is_fallback_cluster = (cs.x == mainloop_params.cluster_shape_fallback.x && cs.y == mainloop_params.cluster_shape_fallback.y); - if (is_fallback_cluster) { - cute::prefetch_tma_descriptor(mainloop_params.tma_load_a_fallback.get_tma_descriptor()); - cute::prefetch_tma_descriptor(mainloop_params.tma_load_b_fallback.get_tma_descriptor()); - } - else { - cute::prefetch_tma_descriptor(mainloop_params.tma_load_a.get_tma_descriptor()); - cute::prefetch_tma_descriptor(mainloop_params.tma_load_b.get_tma_descriptor()); - } - } - else { - cute::prefetch_tma_descriptor(mainloop_params.tma_load_a.get_tma_descriptor()); - cute::prefetch_tma_descriptor(mainloop_params.tma_load_b.get_tma_descriptor()); - } + CUTLASS_DEVICE void + prefetch_tma_descriptors() { + cute::prefetch_tma_descriptor(observed_tma_load_a_->get_tma_descriptor()); + cute::prefetch_tma_descriptor(observed_tma_load_b_->get_tma_descriptor()); } /// Construct A Single Stage's Accumulator Shape - CUTLASS_DEVICE auto + CUTLASS_DEVICE static auto partition_accumulator_shape() { auto acc_shape = partition_shape_C(TiledMma{}, take<0,2>(TileShape{})); // ((MMA_TILE_M,MMA_TILE_N),MMA_M,MMA_N) @@ -794,11 +792,10 @@ struct CollectiveConv< Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.begin()), SmemLayoutA{}); // (MMA,MMA_M,MMA_K,PIPE) Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.begin()), SmemLayoutB{}); // (MMA,MMA_N,MMA_K,PIPE) - auto cluster_shape = cutlass::detail::select_cluster_shape(ClusterShape{}, cute::cluster_shape()); - Layout cta_layout_mnk = make_layout(cluster_shape); + // Define the CTA-in-cluster Layout and Coord + Layout cta_layout_mnk = make_layout(cluster_shape_); Layout cta_layout_vmnk = tiled_divide(cta_layout_mnk, make_tile(typename TiledMma::AtomThrID{})); - int block_rank_in_cluster = cute::block_rank_in_cluster(); - auto cta_coord_vmnk = cta_layout_vmnk.get_flat_coord(block_rank_in_cluster); + auto cta_coord_vmnk = cta_layout_vmnk.get_flat_coord(block_rank_in_cluster_); // Project the cta_layout for tma_a along the n-modes auto [tAgA_mk, tAsA] = tma_partition(*observed_tma_load_a_, @@ -890,7 +887,7 @@ struct CollectiveConv< } CUTLASS_DEVICE auto - mma_init(TensorStorage& shared_tensors) { + mma_init(TensorStorage& shared_tensors) const { Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.data()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.data()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) @@ -909,6 +906,9 @@ struct CollectiveConv< typename Params::TMA_A const* observed_tma_load_a_ = nullptr; typename Params::TMA_B const* observed_tma_load_b_ = nullptr; + + ClusterShape cluster_shape_; + uint32_t block_rank_in_cluster_; }; ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/conv/dispatch_policy.hpp b/include/cutlass/conv/dispatch_policy.hpp index b4bf8a5382..d569cb1c3e 100644 --- a/include/cutlass/conv/dispatch_policy.hpp +++ b/include/cutlass/conv/dispatch_policy.hpp @@ -86,7 +86,10 @@ struct MainloopSm90TmaGmmaWarpSpecializedImplicitGemm { // SM100 tensor op kernel schedule -struct KernelImplicitTmaWarpSpecializedSm100 
{ }; +struct KernelImplicitTmaWarpSpecializedSm100 { + static constexpr int SchedulerPipelineStageCount = 0; + static constexpr int AccumulatorPipelineStageCount = 0; +}; // Pseudo-policies for builder auto override that dispatches to the KernelImplicitTmaWarpSpecializedSm100 // but for opting into 1 or 2 SM atoms @@ -96,11 +99,23 @@ struct KernelImplicitTmaWarpSpecialized2SmSm100 : KernelImplicitTmaWarpSpecializ struct KernelStridedDgradTmaWs1SmSm100 { }; struct KernelStridedDgradTmaWs2SmSm100 { }; +// Policy for implicit gemm kernel +template< + int SchedulerPipelineStageCount_, + int AccumulatorPipelineStageCount_ +> +struct KernelScheduleImplicitTmaWarpSpecializedSm100 : KernelImplicitTmaWarpSpecializedSm100 { + static constexpr int SchedulerPipelineStageCount = SchedulerPipelineStageCount_; + static constexpr int AccumulatorPipelineStageCount = AccumulatorPipelineStageCount_; +}; + // n-buffer in smem (Blackwell TMA), pipelined with Blackwell UMMA and TMA, fprop template< conv::Operator ConvOp_, int Stages_, int NumSpatialDimensions_, + int SchedulerPipelineStageCount_, + int AccumulatorPipelineStageCount_, class ClusterShape_ = cute::Shape,cute::C<1>,cute::C<1>> > struct MainloopSm100TmaUmmaWarpSpecializedImplicitGemm { @@ -109,7 +124,7 @@ struct MainloopSm100TmaUmmaWarpSpecializedImplicitGemm { static constexpr Operator ConvOp = ConvOp_; using ClusterShape = ClusterShape_; using ArchTag = arch::Sm100; - using Schedule = KernelImplicitTmaWarpSpecializedSm100; + using Schedule = KernelScheduleImplicitTmaWarpSpecializedSm100; static_assert(NumSpatialDimensions >= 1); }; diff --git a/include/cutlass/conv/kernel/sm100_implicit_gemm_tma_warpspecialized.hpp b/include/cutlass/conv/kernel/sm100_implicit_gemm_tma_warpspecialized.hpp index 90236e1fd9..0874d8f8ab 100644 --- a/include/cutlass/conv/kernel/sm100_implicit_gemm_tma_warpspecialized.hpp +++ b/include/cutlass/conv/kernel/sm100_implicit_gemm_tma_warpspecialized.hpp @@ -29,8 +29,6 @@ * **************************************************************************************************/ - - #pragma once #include "cutlass/cutlass.h" @@ -110,7 +108,8 @@ class ConvUniversal< static constexpr bool IsGdcEnabled = cutlass::arch::IsGdcGloballyEnabled; // TileID scheduler // CLC pipeline depth determines how many waves (stages-1) the scheduler can race ahead - static constexpr uint32_t SchedulerPipelineStageCount = 2; + static constexpr uint32_t SchedulerPipelineStageCount = DispatchPolicy::Schedule::SchedulerPipelineStageCount; + static constexpr uint32_t AccumulatorPipelineStageCount = DispatchPolicy::Schedule::AccumulatorPipelineStageCount; using TileSchedulerTag = TileSchedulerTag_; using TileScheduler = typename cutlass::gemm::kernel::detail::TileSchedulerSelector< @@ -135,7 +134,6 @@ class ConvUniversal< static constexpr uint32_t NumFixupBarriers = 1; // Pipelines and pipeline states - static constexpr uint32_t AccumulatorPipelineStageCount = SchedulerPipelineStageCount; static constexpr uint32_t CLCResponseSize = sizeof(typename TileScheduler::CLCResponse); // Pipeline and pipeline state types @@ -157,10 +155,6 @@ class ConvUniversal< using CLCPipelineState = cutlass::PipelineDetail::PipelineCLCFetchAsyncPipelineState; using CLCPipelineSharedStorage = cutlass::PipelineDetail::PipelineCLCFetchAsyncSharedStorage; - using CLCThrottlePipeline = cutlass::PipelineAsync; - using CLCThrottlePipelineState = cutlass::PipelineDetail::PipelineAsyncPipelineState; - using CLCThrottlePipelineSharedStorage = 
cutlass::PipelineDetail::PipelineAsyncSharedStorage; - using TmemAllocator = cute::conditional_t(typename TiledMma::ThrLayoutVMNK{})) == 1, cute::TMEM::Allocator1Sm, cute::TMEM::Allocator2Sm>; @@ -172,14 +166,12 @@ class ConvUniversal< using LoadOrderBarrierStorage = typename LoadOrderBarrier::SharedStorage; using CLCPipelineStorage = CLCPipelineSharedStorage; using AccumulatorPipelineStorage = typename AccumulatorPipeline::SharedStorage; - using CLCThrottlePipelineStorage = CLCThrottlePipelineSharedStorage; alignas(16) MainloopPipelineStorage mainloop; alignas(16) EpiLoadPipelineStorage epi_load; alignas(16) LoadOrderBarrierStorage load_order; alignas(16) CLCPipelineStorage clc; alignas(16) AccumulatorPipelineStorage accumulator; - alignas(16) CLCThrottlePipelineStorage clc_throttle; alignas(16) arch::ClusterBarrier tmem_dealloc; } pipelines; @@ -193,7 +185,6 @@ class ConvUniversal< EpilogueTensorStorage epilogue; MainloopTensorStorage mainloop; } tensors; - }; static constexpr int SharedStorageSize = sizeof(SharedStorage); @@ -207,7 +198,7 @@ class ConvUniversal< KernelHardwareInfo hw_info{}; TileSchedulerArguments scheduler{}; }; - + // Kernel device entry point API struct Params { using ProblemShapeMNKL = decltype(CollectiveMainloop::get_problem_shape_MNKL(ProblemShape{})); @@ -398,7 +389,7 @@ class ConvUniversal< : WarpCategory::Epilogue; uint32_t lane_predicate = cute::elect_one_sync(); - auto cluster_shape = cutlass::detail::select_cluster_shape(ClusterShape{}, cute::cluster_shape()); + auto cluster_shape = cutlass::detail::select_cluster_shape(ClusterShape{}); int cluster_size = size(cluster_shape); uint32_t cta_rank_in_cluster = cute::block_rank_in_cluster(); bool is_first_cta_in_cluster = cta_rank_in_cluster == 0; @@ -407,24 +398,23 @@ class ConvUniversal< constexpr bool has_mma_peer_cta = size(AtomThrShapeMNK{}) == 2; [[maybe_unused]] uint32_t mma_peer_cta_rank = has_mma_peer_cta ? 
cta_rank_in_cluster ^ 1 : cta_rank_in_cluster; - // Issue Tma Descriptor Prefetch from a single thread - if ((warp_category == WarpCategory::Sched) && lane_predicate) { - CollectiveMainloop::prefetch_tma_descriptors(params.mainloop); - } - if ((warp_category == WarpCategory::EpilogueLoad) && lane_predicate) { - CollectiveEpilogue::prefetch_tma_descriptors(params.epilogue); - } - // Kernel level shared memory storage SharedStorage& shared_storage = *reinterpret_cast(smem_buf); // In a warp specialized kernel, collectives expose data movement and compute operations separately - CollectiveMainloop collective_mainloop(params.mainloop); + CollectiveMainloop collective_mainloop(params.mainloop, cluster_shape, cta_rank_in_cluster); CollectiveEpilogue collective_epilogue(params.epilogue, shared_storage.tensors.epilogue); + // Issue Tma Descriptor Prefetch from a single thread + if ((warp_category == WarpCategory::Sched) && lane_predicate) { + collective_mainloop.prefetch_tma_descriptors(); + } + if ((warp_category == WarpCategory::EpilogueLoad) && lane_predicate) { + collective_epilogue.prefetch_tma_descriptors(params.epilogue); + } + // Do we load source tensor C or other aux inputs bool is_epi_load_needed = collective_epilogue.is_producer_load_needed(); - IsParticipant is_participant = { (warp_category == WarpCategory::MMA), // mma (warp_category == WarpCategory::Sched) && is_first_cta_in_cluster, // sched @@ -462,7 +452,7 @@ class ConvUniversal< epi_load_pipeline_params.producer_arv_count = NumEpilogueLoadThreads; epi_load_pipeline_params.consumer_arv_count = NumEpilogueThreads; epi_load_pipeline_params.transaction_bytes = CollectiveEpilogue::TmaTransactionBytes; - epi_load_pipeline_params.initializing_warp = 4; + epi_load_pipeline_params.initializing_warp = 1; EpiLoadPipeline epi_load_pipeline(shared_storage.pipelines.epi_load, epi_load_pipeline_params); // Epilogue Store pipeline @@ -474,7 +464,7 @@ class ConvUniversal< typename LoadOrderBarrier::Params load_order_barrier_params; load_order_barrier_params.group_id = (warp_category == WarpCategory::MainloopLoad) ? 0 : 1; load_order_barrier_params.group_size = NumMainloopLoadThreads; - load_order_barrier_params.initializing_warp = 5; + load_order_barrier_params.initializing_warp = 3; LoadOrderBarrier load_order_barrier(shared_storage.pipelines.load_order, load_order_barrier_params); // CLC pipeline @@ -493,7 +483,7 @@ class ConvUniversal< clc_pipeline_params.consumer_arv_count += cluster_size * NumEpilogueLoadThreads; } clc_pipeline_params.transaction_bytes = CLCResponseSize; - clc_pipeline_params.initializing_warp = 1; + clc_pipeline_params.initializing_warp = 4; CLCPipeline clc_pipeline(shared_storage.pipelines.clc, clc_pipeline_params, cluster_shape); // Mainloop-Epilogue pipeline @@ -507,29 +497,13 @@ class ConvUniversal< // Only one producer thread arrives on this barrier. 
accumulator_pipeline_params.producer_arv_count = 1; accumulator_pipeline_params.consumer_arv_count = size(AtomThrShapeMNK{}) * NumEpilogueThreads; - accumulator_pipeline_params.initializing_warp = 2; + accumulator_pipeline_params.initializing_warp = 5; AccumulatorPipeline accumulator_pipeline(shared_storage.pipelines.accumulator, accumulator_pipeline_params, cluster_shape, cute::true_type{}, // Perform barrier init cute::false_type{}); // Delay mask calculation - // CLC throttle pipeline - typename CLCThrottlePipeline::Params clc_throttle_pipeline_params; - if (WarpCategory::MainloopLoad == warp_category) { - clc_throttle_pipeline_params.role = CLCThrottlePipeline::ThreadCategory::Producer; - } - if (WarpCategory::Sched == warp_category) { - clc_throttle_pipeline_params.role = CLCThrottlePipeline::ThreadCategory::Consumer; - } - clc_throttle_pipeline_params.producer_arv_count = NumMainloopLoadThreads; - clc_throttle_pipeline_params.consumer_arv_count = NumSchedThreads; - clc_throttle_pipeline_params.dst_blockid = 0; - clc_throttle_pipeline_params.initializing_warp = 3; - CLCThrottlePipeline clc_throttle_pipeline(shared_storage.pipelines.clc_throttle, clc_throttle_pipeline_params); - CLCThrottlePipelineState clc_pipe_throttle_consumer_state; - CLCThrottlePipelineState clc_pipe_throttle_producer_state = cutlass::make_producer_start_state(); - // Tmem allocator TmemAllocator tmem_allocator{}; @@ -544,12 +518,10 @@ class ConvUniversal< // We need this to guarantee that the Pipeline init is visible // To all producers and consumer threadblocks in the cluster - if (cluster_size > 1) { - cute::cluster_arrive_relaxed(); - } - else { - __syncthreads(); - } + pipeline_init_arrive_relaxed(cluster_size); + + auto load_inputs = collective_mainloop.load_init( + problem_shape_MNKL, params.mainloop, shared_storage.tensors.mainloop); uint32_t tmem_stage_ptrs[AccumulatorPipelineStageCount]; MainloopPipelineState mainloop_pipe_consumer_state; @@ -571,7 +543,7 @@ class ConvUniversal< // Calculate mask after cluster barrier arrival mainloop_pipeline.init_masks(cluster_shape, block_id_in_cluster); - accumulator_pipeline.init_masks(cluster_shape); + accumulator_pipeline.init_masks(cluster_shape, block_id_in_cluster); // TileID scheduler TileScheduler scheduler(&shared_storage.clc_response[0], params.scheduler, problem_shape_MNKL, TileShape{}, block_id_in_cluster); @@ -583,58 +555,13 @@ class ConvUniversal< int TmemColumnsPerAccumulatorTile = cutlass::detail::find_tmem_tensor_col_offset(accumulators); pipeline_init_wait(cluster_size); - if (is_participant.sched) { - - // Whether a new CLC query must be performed. - // See comment below where this variable is updated for a description of - // why this variable is needed. - bool requires_clc_query = true; - - do { - if (requires_clc_query) { - // Throttle CLC query to mitigate workload imbalance caused by skews among persistent workers. - clc_throttle_pipeline.consumer_wait(clc_pipe_throttle_consumer_state); - clc_throttle_pipeline.consumer_release(clc_pipe_throttle_consumer_state); - ++clc_pipe_throttle_consumer_state; - - // Query next clcID and update producer state - clc_pipe_producer_state = scheduler.advance_to_next_work(clc_pipeline, clc_pipe_producer_state); - } - - // Fetch next work tile - auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work( - work_tile_info, - clc_pipeline, - clc_pipe_consumer_state - ); - - // Only perform a new CLC query if we consumed a new CLC query result in - // `fetch_next_work`. 
An example of a case in which CLC `fetch_next_work` does - // not consume a new CLC query response is when processing stream-K units. - // The current stream-K scheduler uses single WorkTileInfo to track multiple - // (potentially-partial) tiles to be computed via stream-K. In this case, - // `fetch_next_work` simply performs in-place updates on the existing WorkTileInfo, - // rather than consuming a CLC query response. - requires_clc_query = increment_pipe; - if (increment_pipe) { - ++clc_pipe_consumer_state; - } - - work_tile_info = next_work_tile_info; - } while (work_tile_info.is_valid()); - clc_pipeline.producer_tail(clc_pipe_producer_state); - } - else if (is_participant.main_load) { - + if (is_participant.main_load) { // Ensure that the prefetched kernel does not touch // unflushed global memory prior to this instruction cutlass::arch::wait_on_dependent_grids(); bool do_load_order_arrive = is_epi_load_needed; - auto load_inputs = collective_mainloop.load_init( - problem_shape_MNKL, params.mainloop, shared_storage.tensors.mainloop); Tensor gA_mk = get<0>(load_inputs); - bool requires_clc_query = true; do { // Get the number of K tiles to compute for this work as well as the starting K tile offset of the work. @@ -642,12 +569,6 @@ class ConvUniversal< auto k_tile_count = scheduler.get_work_k_tile_count(work_tile_info, problem_shape_MNKL, TileShape{}); auto k_tile_prologue = min(MainloopPipeline::Stages, k_tile_count); - if (is_first_cta_in_cluster && requires_clc_query) { - clc_throttle_pipeline.producer_acquire(clc_pipe_throttle_producer_state); - clc_throttle_pipeline.producer_commit(clc_pipe_throttle_producer_state); - ++clc_pipe_throttle_producer_state; - } - auto [mainloop_producer_state_next, k_tile_iter_next] = collective_mainloop.load( params.mainloop, mainloop_pipeline, @@ -683,7 +604,6 @@ class ConvUniversal< ); work_tile_info = next_work_tile_info; cta_coord_mnkl = scheduler.work_tile_to_cta_coord(work_tile_info); - requires_clc_query = increment_pipe; if (increment_pipe) { ++clc_pipe_consumer_state; } @@ -691,60 +611,43 @@ class ConvUniversal< collective_mainloop.load_tail(mainloop_pipeline, mainloop_pipe_producer_state); } - else if (is_participant.epi_load) { - // Ensure that the prefetched kernel does not touch - // unflushed global memory prior to this instruction - cutlass::arch::wait_on_dependent_grids(); + else if (is_participant.sched) { + // Whether a new CLC query must be performed. + // See comment below where this variable is updated for a description of + // why this variable is needed. + bool requires_clc_query = true; - bool do_load_order_wait = true; - bool do_tail_load = false; do { - bool compute_epilogue = TileScheduler::compute_epilogue(work_tile_info, params.scheduler); + if (requires_clc_query) { + // Query next clcID and update producer state + clc_pipe_producer_state = scheduler.advance_to_next_work(clc_pipeline, clc_pipe_producer_state); + } - // Get current work tile and fetch next work tile + // Fetch next work tile auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work( work_tile_info, clc_pipeline, clc_pipe_consumer_state ); - work_tile_info = next_work_tile_info; + // Only perform a new CLC query if we consumed a new CLC query result in + // `fetch_next_work`. An example of a case in which CLC `fetch_next_work` does + // not consume a new CLC query response is when processing stream-K units. 
+ // The current stream-K scheduler uses single WorkTileInfo to track multiple + // (potentially-partial) tiles to be computed via stream-K. In this case, + // `fetch_next_work` simply performs in-place updates on the existing WorkTileInfo, + // rather than consuming a CLC query response. + requires_clc_query = increment_pipe; if (increment_pipe) { ++clc_pipe_consumer_state; } - if (compute_epilogue) { - - if (do_load_order_wait) { - load_order_barrier.wait(); - do_load_order_wait = false; - } - - epi_load_pipe_producer_state = collective_epilogue.load( - epi_load_pipeline, - epi_load_pipe_producer_state, - problem_shape_MNKL, - CtaShape_MNK{}, - cta_coord_mnkl, - TileShape{}, - TiledMma{}, - shared_storage.tensors.epilogue - ); - - do_tail_load = true; - } - - // Calculate the cta coordinates of the next work tile - cta_coord_mnkl = scheduler.work_tile_to_cta_coord(work_tile_info); + work_tile_info = next_work_tile_info; } while (work_tile_info.is_valid()); - - if (do_tail_load) { - collective_epilogue.load_tail( - epi_load_pipeline, epi_load_pipe_producer_state, - epi_store_pipeline, epi_store_pipe_producer_state); - } + clc_pipeline.producer_tail(clc_pipe_producer_state); } + else if (is_participant.mma) { // Tmem allocation sequence tmem_allocator.allocate(TmemAllocator::Sm100TmemCapacityColumns, &shared_storage.tmem_base_ptr); @@ -757,6 +660,7 @@ class ConvUniversal< tmem_stage_ptrs[acc_stage] = tmem_base_ptr + (TmemColumnsPerAccumulatorTile * acc_stage) & cutlass::detail::TmemColMask; } auto mma_inputs = collective_mainloop.mma_init(shared_storage.tensors.mainloop); + do { auto k_tile_count = scheduler.get_work_k_tile_count(work_tile_info, problem_shape_MNKL, TileShape{}); @@ -788,7 +692,6 @@ class ConvUniversal< mma_inputs, k_tile_count ); - accumulator_pipeline.producer_commit(accumulator_pipe_producer_state); } ++accumulator_pipe_producer_state; @@ -802,6 +705,7 @@ class ConvUniversal< // Release the right to allocate before deallocations so that the next CTA can rasterize tmem_allocator.release_allocation_lock(); + // Leader MMA waits for leader + peer epilogues to release accumulator stage if (is_mma_leader_cta) { accumulator_pipeline.producer_tail(accumulator_pipe_producer_state); @@ -816,8 +720,66 @@ class ConvUniversal< // Free entire tmem allocation tmem_allocator.free(tmem_base_ptr, TmemAllocator::Sm100TmemCapacityColumns); + } + + else if (is_participant.epi_load) { + // Ensure that the prefetched kernel does not touch + // unflushed global memory prior to this instruction + cutlass::arch::wait_on_dependent_grids(); + + bool do_load_order_wait = true; + bool do_tail_load = false; + + do { + bool compute_epilogue = TileScheduler::compute_epilogue(work_tile_info, params.scheduler); + + // Get current work tile and fetch next work tile + auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work( + work_tile_info, + clc_pipeline, + clc_pipe_consumer_state + ); + work_tile_info = next_work_tile_info; + + if (increment_pipe) { + ++clc_pipe_consumer_state; + } + + if (compute_epilogue) { + if (do_load_order_wait) { + load_order_barrier.wait(); + do_load_order_wait = false; + } + epi_load_pipe_producer_state = collective_epilogue.load( + epi_load_pipeline, + epi_load_pipe_producer_state, + problem_shape_MNKL, + CtaShape_MNK{}, + cta_coord_mnkl, + TileShape{}, + TiledMma{}, + shared_storage.tensors.epilogue + ); + + do_tail_load = true; + } + + // Calculate the cta coordinates of the next work tile + cta_coord_mnkl = scheduler.work_tile_to_cta_coord(work_tile_info); + 
} while (work_tile_info.is_valid()); + + // Only perform a tail load if one of the work units processed performed + // an epilogue load. An example of a case in which a tail load should not be + // performed is in split-K if a cluster is only assigned non-final splits (for which + // the cluster does not compute the epilogue). + if (do_tail_load) { + collective_epilogue.load_tail( + epi_load_pipeline, epi_load_pipe_producer_state, + epi_store_pipeline, epi_store_pipe_producer_state); + } } + else if (is_participant.epilogue) { // Wait for tmem allocate here tmem_allocation_result_barrier.arrive_and_wait(); @@ -875,13 +837,16 @@ class ConvUniversal< epi_load_pipe_consumer_state = load_state_next; epi_store_pipe_producer_state = store_state_next; accumulator_pipe_consumer_state = acc_state_next; - do_tail_store = true; } work_tile_info = next_work_tile_info; cta_coord_mnkl = scheduler.work_tile_to_cta_coord(work_tile_info); } while (work_tile_info.is_valid()); + // Only perform a tail store if one of the work units processed performed + // an epilogue. An example of a case in which a tail load should not be + // performed is in split-K if a cluster is only assigned non-final splits (for which + // the cluster does not compute the epilogue). if (do_tail_store) { collective_epilogue.store_tail( epi_load_pipeline, epi_load_pipe_consumer_state, @@ -889,19 +854,8 @@ class ConvUniversal< CtaShape_MNK{}); } } - } - -private: - // Synchronization call. Blocks until barriers are initialized in shared memory. - CUTLASS_DEVICE - void - pipeline_init_wait(int cluster_size) { - if (cluster_size > 1) { - cute::cluster_wait(); - } else { - __syncthreads(); } } }; diff --git a/include/cutlass/detail/sm100_blockwise_scale_layout.hpp b/include/cutlass/detail/blockwise_scale_layout.hpp similarity index 67% rename from include/cutlass/detail/sm100_blockwise_scale_layout.hpp rename to include/cutlass/detail/blockwise_scale_layout.hpp index 8f75bd2561..2d545bbd1e 100644 --- a/include/cutlass/detail/sm100_blockwise_scale_layout.hpp +++ b/include/cutlass/detail/blockwise_scale_layout.hpp @@ -179,11 +179,110 @@ struct Sm100BlockwiseScaleConfig { }; +template +struct RuntimeBlockwiseScaleConfig { + + using ShapeSFA = Shape, Shape, int32_t>; + using ShapeSFB = Shape, Shape, int32_t>; + + using StrideSFA = conditional_t,Stride<_0,int32_t>, int32_t>, + Stride,Stride<_0,_1>, int32_t>>; + + using StrideSFB = conditional_t,Stride<_0,int32_t>, int32_t>, + Stride,Stride<_0,_1>, int32_t>>; + + using LayoutSFA = Layout; + using LayoutSFB = Layout; + + CUTE_HOST_DEVICE + static constexpr auto + deduce_layoutSFA() { + return LayoutSFA{}; + } + + CUTE_HOST_DEVICE + static constexpr auto + deduce_layoutSFB() { + return LayoutSFB{}; + } + + // The following function is provided for user fill dynamic problem size to the layout_SFA. 
+ template + CUTE_HOST_DEVICE + static constexpr auto + tile_atom_to_shape_SFA(ProblemShape problem_shape, SFVecShape sf_vec_shape) { + auto problem_shape_MNKL = append<4>(problem_shape, 1); + + auto strides = [&]() CUTLASS_LAMBDA_FUNC_INLINE { + auto [M, N, K, L] = problem_shape_MNKL; + auto [sfm, sfn, sfk] = sf_vec_shape; + if constexpr (majorSFA == UMMA::Major::MN) { + return make_stride(make_stride(_0{}, _1{}), make_stride(_0{}, cute::ceil_div(M, sfm))); + } + else { + return make_stride(make_stride(_0{}, cute::ceil_div(K, sfk)), make_stride(_0{}, _1{})); + } + }(); + + auto [M, N, K, L] = problem_shape_MNKL; + auto [sfm, sfn, sfk] = sf_vec_shape; + auto mk_layout = make_layout( + make_shape(make_shape(sfm, cute::ceil_div(M, sfm)), + make_shape(sfk, cute::ceil_div(K, sfk))), + strides + ); + + return make_layout(append(shape(mk_layout), L), append(stride(mk_layout), size(filter_zeros(mk_layout)))); + } + + // The following function is provided for user fill dynamic problem size to the layout_SFB. + template + CUTE_HOST_DEVICE + static constexpr auto + tile_atom_to_shape_SFB(ProblemShape problem_shape, SFVecShape sf_vec_shape) { + auto problem_shape_MNKL = append<4>(problem_shape, 1); + + auto strides = [&]() CUTLASS_LAMBDA_FUNC_INLINE { + auto [M, N, K, L] = problem_shape_MNKL; + auto [sfm, sfn, sfk] = sf_vec_shape; + + if constexpr (majorSFB == UMMA::Major::MN) { + return make_stride(make_stride(_0{}, _1{}), make_stride(_0{}, cute::ceil_div(N, sfn))); + } + else { + return make_stride(make_stride(_0{}, cute::ceil_div(K, sfk)), make_stride(_0{}, _1{})); + } + }(); + + auto [M, N, K, L] = problem_shape_MNKL; + auto [sfm, sfn, sfk] = sf_vec_shape; + auto nk_layout = make_layout( + make_shape(make_shape(sfn, cute::ceil_div(N, sfn)), + make_shape(sfk, cute::ceil_div(K, sfk))), + strides + ); + + return make_layout(append(shape(nk_layout), L), append(stride(nk_layout), size(filter_zeros(nk_layout)))); + } + +}; + +// Sm90 only supports MN major for SFA and SFB for now +template +using Sm90BlockwiseScaleConfig = Sm100BlockwiseScaleConfig; + template constexpr auto sm100_trivial_blockwise_scale_config(MmaTileShape_MNK) { return Sm100BlockwiseScaleConfig(MmaTileShape_MNK{}), size<1>(MmaTileShape_MNK{}), size<2>(MmaTileShape_MNK{})>{}; } +template +constexpr auto sm90_trivial_blockwise_scale_config(MmaTileShape_MNK) { + return Sm90BlockwiseScaleConfig(MmaTileShape_MNK{}), size<1>(MmaTileShape_MNK{}), size<2>(MmaTileShape_MNK{})>{}; +} + ///////////////////////////////////////////////////////////////////////////////////////////////// } // namespace cutlass::detail diff --git a/include/cutlass/detail/helper_macros.hpp b/include/cutlass/detail/helper_macros.hpp index 758b52d3a0..94634e950f 100644 --- a/include/cutlass/detail/helper_macros.hpp +++ b/include/cutlass/detail/helper_macros.hpp @@ -217,6 +217,35 @@ namespace cutlass { //////////////////////////////////////////////////////////////////////////////////////////////////// +// __CUDA_ARCH_SPECIFIC__ is introduced in CUDA 12.9 +#if !defined(CUDA_ARCH_CONDITIONAL) + +#if defined(__CUDA_ARCH_SPECIFIC__) +#define CUDA_ARCH_CONDITIONAL(ARCH_XXYY) (__CUDA_ARCH_SPECIFIC__ == ARCH_XXYY) +#else +#define CUDA_ARCH_CONDITIONAL(ARCH_XXYY) (false) +#endif + +#endif + +// __CUDA_ARCH_FAMILY_SPECIFIC__ is introduced in CUDA 12.9 +#if !defined(CUDA_ARCH_FAMILY) + +#if defined(__CUDA_ARCH_FAMILY_SPECIFIC__) +#define CUDA_ARCH_FAMILY(ARCH_XXYY) (__CUDA_ARCH_FAMILY_SPECIFIC__ == ARCH_XXYY) +#else +#define CUDA_ARCH_FAMILY(ARCH_XXYY) (false) +#endif + +#endif + 
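For the runtime blockwise-scale config above, the scale-factor tensor extents follow directly from the problem shape and the per-block granularity: one scale covers an (sfm x sfk) block of A, or an (sfn x sfk) block of B. A plain-arithmetic sketch of that shape math; the sizes below are hypothetical examples, not defaults from the library:

```cpp
#include <cstdint>
#include <cstdio>

constexpr int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  int64_t M = 4096, N = 8192, K = 7168, L = 2;   // hypothetical GEMM extents
  int64_t sfm = 128, sfn = 128, sfk = 128;       // hypothetical scale granularity

  int64_t sfa_per_batch = ceil_div(M, sfm) * ceil_div(K, sfk);  // scales for A
  int64_t sfb_per_batch = ceil_div(N, sfn) * ceil_div(K, sfk);  // scales for B

  std::printf("SFA: %lld per batch (%lld total)\n",
              (long long)sfa_per_batch, (long long)(sfa_per_batch * L));
  std::printf("SFB: %lld per batch (%lld total)\n",
              (long long)sfb_per_batch, (long long)(sfb_per_batch * L));
  return 0;
}
```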
+#if !defined(CUDA_ARCH_CONDITIONAL_OR_FAMILY) +#define CUDA_ARCH_CONDITIONAL_OR_FAMILY(ARCH_XXYY) \ + (CUDA_ARCH_CONDITIONAL(ARCH_XXYY) || CUDA_ARCH_FAMILY(ARCH_XXYY)) +#endif + +//////////////////////////////////////////////////////////////////////////////////////////////////// + }; // namespace cutlass //////////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/detail/layout.hpp b/include/cutlass/detail/layout.hpp index a0a183b0ee..562adc65ea 100644 --- a/include/cutlass/detail/layout.hpp +++ b/include/cutlass/detail/layout.hpp @@ -33,10 +33,10 @@ #include "cute/layout.hpp" #include "cute/pointer_sparse.hpp" // cute::is_sparse #include "cute/swizzle.hpp" // cute::Swizzle -#include "cute/swizzle_layout.hpp" // cute::detail::get_swizzle_portion +#include "cute/swizzle_layout.hpp" // cute::get_swizzle_portion #include "cute/util/type_traits.hpp" #include "cute/arch/copy_sm90_tma.hpp" -#include "cute/arch/copy_sm100_tma.hpp" +#include "cute/arch/copy_sm100_tma.hpp" #include "cutlass/layout/matrix.h" #include "cutlass/layout/tensor.h" @@ -219,8 +219,8 @@ stride_to_layout_tag_A() { return layout::ColumnMajor{}; } // Specialize for sparse layout - else if constexpr (cute::get<0>(InternalStrideA{}) == cute::_2{} && - cute::rank(cute::get<1>(InternalStrideA{})) == 2 && + else if constexpr (cute::get<0>(InternalStrideA{}) == cute::_2{} && + cute::rank(cute::get<1>(InternalStrideA{})) == 2 && cute::is_same_v(InternalStrideA{}))>>) { return layout::ColumnMajor{}; } @@ -308,8 +308,8 @@ constexpr bool is_tma_copy_engine() { || cute::is_base_of_v || cute::is_base_of_v || cute::is_base_of_v - || cute::is_base_of_v - || cute::is_base_of_v + || cute::is_base_of_v + || cute::is_base_of_v ) { return true; } @@ -349,7 +349,7 @@ get_alignment_count_from_gmem_tiled_copy() { cutlass::gemm::collective::detail::is_sm10x_f8f6f4_element() && cute::is_same_v::type, uint8_t>) { return 128; } - + // For sparse MMA, alignment in logical elements is increased by sparsity factor if constexpr (cute::is_sparse_v) { return 128 / sizeof_bits::value * ElementMma::sparsity; @@ -366,7 +366,7 @@ get_alignment_count_from_gmem_tiled_copy() { // Return alignment bit requirements for the GEMM inputs. template < class ElementType - , bool IsF8F6F4SubBytes=false + , bool IsF8F6F4SubBytes=false > constexpr int get_input_alignment_bits() { @@ -383,12 +383,12 @@ get_input_alignment_bits() { template constexpr int get_output_alignment_bits() { - + if constexpr (sizeof_bits::value == 6) { // U6 format : The inner tensor size dimension must be a multiple of 96B. 
return 96 * 8; } - + return 128; } @@ -424,7 +424,7 @@ template CUTLASS_HOST_DEVICE constexpr size_t alignment_for_swizzle(Layout layout) { - return alignment_for_swizzle(cute::detail::get_swizzle_portion(layout)); + return alignment_for_swizzle(cute::get_swizzle_portion(layout)); } //////////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/collective/builders/sm100_builder.inl b/include/cutlass/epilogue/collective/builders/sm100_builder.inl index 16eb4fc9f4..176b1f257f 100644 --- a/include/cutlass/epilogue/collective/builders/sm100_builder.inl +++ b/include/cutlass/epilogue/collective/builders/sm100_builder.inl @@ -866,6 +866,45 @@ struct CallbacksBuilder< >; }; +// ptr array aux fusion callbacks builder for sm100 tma epilogue +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + class FusionOp, + class CtaTileShape_MNK, + class EpilogueTile_MN, + class ElementAccumulator, + class AccLoadOp +> +struct CallbacksBuilder< + Sm100PtrArrayTmaWarpSpecialized, + FusionOp, + CtaTileShape_MNK, + EpilogueTile_MN, + ElementAccumulator, + AccLoadOp, + cute::enable_if_t<(FusionOp::IsAuxOutSupported ^ FusionOp::IsAuxInSupported) // only one aux tensor + && not cute::is_subbyte_v> +> { + using GmemStrideTypeAux = gemm::TagToStrideC_t; + using SmemLayoutAtomAux = decltype(detail::sm100_get_epilogue_smem_swizzle_layout_atom< + GmemStrideTypeAux, typename FusionOp::ElementAux, EpilogueTile_MN>()); + using CopyOpR2S = decltype(detail::sm100_get_smem_store_op< + GmemStrideTypeAux, typename FusionOp::ElementAux, ElementAccumulator, AccLoadOp>()); + using CopyOpS2R = decltype(detail::sm100_get_smem_load_op< + GmemStrideTypeAux, typename FusionOp::ElementAux, ElementAccumulator, AccLoadOp>()); + using SmemCopyOpAux = cute::conditional_t; + + using Callbacks = fusion::FusionCallbacks< + Sm100PtrArrayTmaWarpSpecialized, + FusionOp, CtaTileShape_MNK, EpilogueTile_MN, + SmemLayoutAtomAux, SmemCopyOpAux + >; +}; + template < int StagesC, int StagesD, @@ -930,7 +969,7 @@ template < class ElementC_, class GmemLayoutTagC_, int AlignmentC, - class ElementD, + class ElementD_, class GmemLayoutTagD, int AlignmentD, class Schedule, @@ -943,6 +982,9 @@ private: static_assert(Is1SmMma ^ Is2SmMma, "unsupported schedule"); static_assert(not (Is2SmMma && size<0>(ClusterShape_MNK{}) % 2 == 1), "schedule + cluster mismatch"); + static constexpr bool DisableDestination = cute::is_void_v; + using ElementD = cute::conditional_t,ElementD_>; // prevents void ref breakages + // Passing void C disables source load + smem allocation static constexpr bool DisableSource = cute::is_void_v; using ElementC = cute::conditional_t; // prevents void ref breakages @@ -1168,7 +1210,7 @@ public: EpilogueTile_MN, ElementC_, // Need to pass void through to expose via GemmUniversal GmemStrideTypeC, - ElementD, + ElementD_, // Need to pass void through to expose via GemmUniversal GmemStrideTypeD, decltype(fusion_callbacks()), AccLoadOp, diff --git a/include/cutlass/epilogue/collective/builders/sm120_builder.inl b/include/cutlass/epilogue/collective/builders/sm120_builder.inl index ad1f44a062..e1c1bff803 100644 --- a/include/cutlass/epilogue/collective/builders/sm120_builder.inl +++ b/include/cutlass/epilogue/collective/builders/sm120_builder.inl @@ -63,13 +63,27 @@ struct EpilogueSFVecSize> static constexpr int value = FusionOp::SFVecSize; }; +// Helper to deduce NumEpilogueWarpGroups based on Schedule +template +struct 
GetNumEpilogueWarpGroups { + static constexpr int value = 2; +}; + +template +struct GetNumEpilogueWarpGroups> { + static constexpr int value = Schedule::NumEpilogueWarpGroups; +}; + // Returns the parameterized dispatch policy for the TMA epilogue -template +template constexpr auto sm120_get_tma_dispatch_policy() { using namespace cute; constexpr int EpiTiles = size(shape_div(take<0,2>(TileShapeMNK{}), EpilogueTileMN{})); + using StrideD = cutlass::detail::TagToStrideC_t; + using InternalStrideD = cute::remove_pointer_t; + constexpr bool IsGroupedGemmKernel = !cute::is_same_v; // For 120, a FragmentSize of 4 is used to match the // output per thread from each MMA. Epilogue subtiles iterate over multiple of these @@ -86,9 +100,17 @@ sm120_get_tma_dispatch_policy() { // SM120 epilogues use smaller stage counts in order to fit within the limited shared memory capacity. constexpr int StagesC = ReuseSmem ? cute::max(cute::min(EpiTiles, 2), StagesD+1) - : StagesD; - - return Sm120TmaWarpSpecialized{}; + : StagesD; + + constexpr int NumEpilogueWarpGroups = GetNumEpilogueWarpGroups::value; + + if constexpr (IsGroupedGemmKernel) { + return Sm120PtrArrayTmaWarpSpecialized{}; + } + else { + return Sm120TmaWarpSpecialized{}; + } } // Returns the smem layout atom to be used for C or D matrix @@ -291,6 +313,9 @@ struct Sm120TmaBuilderImpl { using GmemStrideTypeC = cutlass::detail::TagToStrideC_t; using GmemStrideTypeD = cutlass::detail::TagToStrideC_t; + using UnderlyingGmemStrideTypeC = cute::remove_pointer_t; + using UnderlyingGmemStrideTypeD = cute::remove_pointer_t; + using CopyOpS2G = cute::conditional_t, SM90_TMA_STORE_IM2COL, @@ -306,15 +331,15 @@ struct Sm120TmaBuilderImpl { // Get the smallest tiled copy we can use to retile the accumulators using CopyAtomC = Copy_Atom; - using SmemLayoutAtomC = decltype(detail::sm120_get_epilogue_smem_swizzle_layout_atom()); - using SmemLayoutAtomD = decltype(detail::sm120_get_epilogue_smem_swizzle_layout_atom()); + using SmemLayoutAtomC = decltype(detail::sm120_get_epilogue_smem_swizzle_layout_atom()); + using SmemLayoutAtomD = decltype(detail::sm120_get_epilogue_smem_swizzle_layout_atom()); - using CopyOpS2R = decltype(detail::sm120_get_smem_load_op_for_source()); + using CopyOpS2R = decltype(detail::sm120_get_smem_load_op_for_source()); - using CopyOpR2S = decltype(detail::sm120_get_smem_store_op_for_accumulator()); + using CopyOpR2S = decltype(detail::sm120_get_smem_store_op_for_accumulator()); // Get register to register tiled copy that happen before shared memory store. - using CopyOpR2R = decltype(detail::sm120_get_register_transform_op()); + using CopyOpR2R = decltype(detail::sm120_get_register_transform_op()); // TMA builder allows for passing callbacks directly, which is either a fusion::FusionCallbacks // instance or a direct visitor implementation, e.g. 
fusion::Sm90LinearCombination @@ -334,8 +359,32 @@ struct Sm120TmaBuilderImpl { constexpr static bool ReuseSmemC = DispatchPolicy::ReuseSmemC; constexpr static bool DelayTmaStore = DispatchPolicy::DelayTmaStore; + //Helper to deduce BaseDispatchPolicy based on DispatchPolicy + template + struct GetBaseDispatchPolicy { + using Type = T; + }; + + template + struct GetBaseDispatchPolicy> { + using Type = typename cutlass::epilogue::Sm90PtrArrayTmaWarpSpecialized; + }; + + template + struct GetBaseDispatchPolicy> { + using Type = typename cutlass::epilogue::Sm90TmaWarpSpecialized; + }; + + using BaseDispatchPolicy = typename GetBaseDispatchPolicy::Type; + using CollectiveOp = cutlass::epilogue::collective::CollectiveEpilogue< - Sm90TmaWarpSpecialized, + BaseDispatchPolicy, TileShape_MNK, EpilogueTile_MN, ElementC_, // Need to pass void through to expose via GemmUniversal @@ -394,13 +443,15 @@ struct CollectiveBuilder< cute::enable_if_t || cute::is_same_v || cute::is_same_v || + cute::is_same_v || + cute::is_same_v || cute::is_same_v >> { private: using EpilogueTile_MN = decltype(detail::sm120_compute_tile_shape_or_override, FusionOperation>()); using DispatchPolicy = - decltype(detail::sm120_get_tma_dispatch_policy, Schedule>()); + decltype(detail::sm120_get_tma_dispatch_policy()); public: diff --git a/include/cutlass/epilogue/collective/builders/sm90_builder.inl b/include/cutlass/epilogue/collective/builders/sm90_builder.inl index f684437580..9cb03fdc21 100644 --- a/include/cutlass/epilogue/collective/builders/sm90_builder.inl +++ b/include/cutlass/epilogue/collective/builders/sm90_builder.inl @@ -116,13 +116,13 @@ sm90_compute_tile_shape_or_override() { auto epi_tile = [&] () { if constexpr (detail::sm90_is_cooperative_v) { auto tile_m = cute::min(_128{}, size<0>(TileShape_MNK{})); - auto tile_n = cute::min(_32{}, size<1>(TileShape_MNK{})); + auto tile_n = cute::gcd(cute::min(_32{}, size<1>(TileShape_MNK{})), size<1>(TileShape_MNK{})); return make_shape(tile_m, tile_n); } else if constexpr (detail::sm90_is_warp_specialized_v) { constexpr int N_perf = sizeof_bits_v == 8 ? 
64 : 32; auto tile_m = cute::min(_64{}, size<0>(TileShape_MNK{})); - auto tile_n = cute::min(Int{}, size<1>(TileShape_MNK{})); + auto tile_n = cute::gcd(cute::min(Int{}, size<1>(TileShape_MNK{})), size<1>(TileShape_MNK{})); return make_shape(tile_m, tile_n); } else { @@ -206,6 +206,46 @@ struct CallbacksBuilder< >; }; +// ptr array aux fusion callbacks builder for sm90 tma epilogue +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + class FusionOp, + class TileShape_MNK, + class EpilogueTile_MN, + class AccLoadOp, + class ElementAccumulator +> +struct CallbacksBuilder< + Sm90PtrArrayTmaWarpSpecialized, + FusionOp, + TileShape_MNK, + EpilogueTile_MN, + ElementAccumulator, + AccLoadOp, + cute::enable_if_t<(FusionOp::IsAuxOutSupported ^ FusionOp::IsAuxInSupported) // only one aux tensor + && not cute::is_subbyte_v> // aux subbyte tensor doesn't use smem +> { + using GmemStrideTypeAux = gemm::TagToStrideC_t; + using SmemLayoutAtomAux = decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom< + GmemStrideTypeAux, typename FusionOp::ElementAux, EpilogueTile_MN>()); + using CopyOpR2S = decltype(detail::sm90_get_smem_store_op_for_accumulator< + GmemStrideTypeAux, typename FusionOp::ElementAux>()); + using CopyOpS2R = decltype(detail::sm90_get_smem_load_op_for_source< + GmemStrideTypeAux, typename FusionOp::ElementAux>()); + using SmemCopyOpAux = cute::conditional_t; + + using Callbacks = fusion::FusionCallbacks< + Sm90PtrArrayTmaWarpSpecialized, + FusionOp, TileShape_MNK, EpilogueTile_MN, + SmemLayoutAtomAux, SmemCopyOpAux + >; +}; + template < int StagesC, int StagesD, diff --git a/include/cutlass/epilogue/collective/default_epilogue.hpp b/include/cutlass/epilogue/collective/default_epilogue.hpp index b7bd6f4077..0d019b1c8c 100644 --- a/include/cutlass/epilogue/collective/default_epilogue.hpp +++ b/include/cutlass/epilogue/collective/default_epilogue.hpp @@ -35,6 +35,7 @@ #pragma once #include "cutlass/cutlass.h" +#include "cutlass/arch/memory.h" #include "cutlass/gemm/dispatch_policy.hpp" #include "cutlass/epilogue/collective/detail.hpp" @@ -225,22 +226,27 @@ class DefaultEpilogue { return; } + using FragCType = remove_cvref_t; + using FragDType = remove_cvref_t; + // source is needed if (epilogue_op.is_source_needed()) { CUTLASS_PRAGMA_UNROLL for (int i = 0; i < size(accumulators); ++i) { - if (elem_less(tCcD(i), residue_tCcD)) { - tCgD(i) = epilogue_op(accumulators(i), tCgC(i)); - } + FragCType fragC; + bool pred = elem_less(tCcD(i), residue_tCcD); + arch::global_load(fragC, &tCgC(i), pred); + FragDType fragD = epilogue_op(accumulators(i), fragC); + arch::global_store(fragD, &tCgD(i), pred); } } // source is not needed, avoid load else { CUTLASS_PRAGMA_UNROLL for (int i = 0; i < size(accumulators); ++i) { - if (elem_less(tCcD(i), residue_tCcD)) { - tCgD(i) = epilogue_op(accumulators(i)); - } + bool pred = elem_less(tCcD(i), residue_tCcD); + FragDType fragD = epilogue_op(accumulators(i)); + arch::global_store(fragD, &tCgD(i), pred); } } } diff --git a/include/cutlass/epilogue/collective/detail.hpp b/include/cutlass/epilogue/collective/detail.hpp index 2759d0c638..2c72c30168 100644 --- a/include/cutlass/epilogue/collective/detail.hpp +++ b/include/cutlass/epilogue/collective/detail.hpp @@ -124,6 +124,23 @@ struct sm90_is_ptr_array_tma_dispatch_policy< NumEpilogueWarpGroups>> : cute::true_type {}; +template< + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int 
NumEpilogueWarpGroups +> +struct sm90_is_ptr_array_tma_dispatch_policy< + Sm120PtrArrayTmaWarpSpecialized> + : cute::true_type {}; + template static constexpr bool sm90_is_ptr_array_tma_dispatch_policy_v = sm90_is_ptr_array_tma_dispatch_policy::value; diff --git a/include/cutlass/epilogue/collective/sm100_epilogue_array_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm100_epilogue_array_tma_warpspecialized.hpp index 9c24913e9a..b9fb5320c1 100644 --- a/include/cutlass/epilogue/collective/sm100_epilogue_array_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/collective/sm100_epilogue_array_tma_warpspecialized.hpp @@ -129,8 +129,13 @@ class CollectiveEpilogue< static_assert(rank(EpilogueTile{}) == 2, "EpilogueTile must be rank-2: [EPI_TILE_M, EPI_TILE_N]"); private: - using GmemElementD = ElementD; - using GmemElementC = cute::conditional_t,ElementD,ElementC>; // prevents void ref breakages + + constexpr static bool is_source_supported = not cute::is_void_v; + constexpr static bool is_destination_supported = not cute::is_void_v; + using GmemElementD = cute::conditional_t>; + using GmemElementC = cute::conditional_t; // prevents void ref breakages + static_assert(not cute::is_void_v, "GmemElementD is void"); + using SmemElementD = typename cutlass::detail::get_unpacked_element_type::type; using SmemElementC = typename cutlass::detail::get_unpacked_element_type::type; constexpr static int StagesC = StagesC_; @@ -138,9 +143,7 @@ class CollectiveEpilogue< static_assert(StagesC >= 1, "StagesC must be >= 1"); static_assert(StagesD >= 1, "StagesD must be >= 1"); - constexpr static bool ReuseSmemC = ReuseSmemC_; - constexpr static bool DelayTmaStore = DelayTmaStore_; - constexpr static bool is_source_supported = not cute::is_void_v; + constexpr static bool ReuseSmemC = ReuseSmemC_ && is_destination_supported; constexpr static bool is_m_major_C = detail::is_m_major(); constexpr static bool is_m_major_D = detail::is_m_major(); @@ -159,7 +162,7 @@ class CollectiveEpilogue< using SmemLayoutC = decltype(cute::append<3>(SmemLayoutStageC{}, Layout, Int>{})); using SmemLayoutD = decltype(cute::append<3>(SmemLayoutStageD{}, Layout, Int>{})); - constexpr static bool support_smem_reuse = is_source_supported && StagesD <= StagesC + constexpr static bool support_smem_reuse = is_source_supported && is_destination_supported && StagesD <= StagesC && MaxStageBits % sizeof_bits_v == 0 && MaxStageBits % sizeof_bits_v == 0; static_assert(not (ReuseSmemC && not support_smem_reuse), "Smem reuse requirements not met"); @@ -168,6 +171,12 @@ class CollectiveEpilogue< constexpr static size_t SmemAlignmentD = cutlass::detail::alignment_for_swizzle(SmemLayoutD{}); constexpr static size_t MaxSmemAlignment = cute::max(SmemAlignmentC, SmemAlignmentD); + // Not unroll epi subtile loop when the activation op is heavy to reduce instruction size and register pressure. 
+ constexpr static bool UnrollEpiLoop = + not cutlass::epilogue::thread::kIsHeavy_member_or_false::value; + // TMA store delay only benefits with loop unrolling + constexpr static bool DelayTmaStore = DelayTmaStore_ and UnrollEpiLoop; + struct CollectiveStorageWithC { alignas(SmemAlignmentC) ArrayEngine> smem_C; alignas(SmemAlignmentD) ArrayEngine> smem_D; @@ -239,7 +248,7 @@ class CollectiveEpilogue< using TMA_C = decltype(make_tma_copy( CopyOpG2S{}, make_tensor( - make_gmem_ptr(static_cast,ElementD,ElementC> const*>(nullptr)), + make_gmem_ptr(static_cast(nullptr)), TensorShapeC{}, append<3>(InternalStrideC{}, _0{})), SmemLayoutStageC{}, @@ -248,7 +257,7 @@ class CollectiveEpilogue< using TMA_D = decltype(make_tma_copy( CopyOpS2G{}, make_tensor( - make_gmem_ptr(static_cast(nullptr)), + make_gmem_ptr(static_cast(nullptr)), TensorShapeD{}, append<3>(InternalStrideD{}, _0{})), SmemLayoutStageD{}, @@ -278,6 +287,8 @@ class CollectiveEpilogue< // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc. // These will be replaced with correct values before the initial tma load. auto init_shape = repeat_like(append<4>(typename ProblemShape::UnderlyingProblemShape{}, 1), int32_t(1)); + // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc. + // These will be replaced with correct values before the initial tma load. constexpr int tma_alignment_bits = 128; auto init_M = tma_alignment_bits; auto init_N = tma_alignment_bits; @@ -308,10 +319,13 @@ class CollectiveEpilogue< tma_load_c = make_tma_copy(CopyOpG2S{}, tensor_c, SmemLayoutStageC{}, EpilogueTile{}, _1{}); } - // Tensor pointers will be fixed before the first access - ElementD* ptr_D_first_batch = nullptr; - Tensor tensor_d = make_tensor(ptr_D_first_batch, make_layout(make_shape(init_M,init_N,init_L), append<3>(stride_d, _0{}))); - typename Params::TMA_D tma_store_d = make_tma_copy(CopyOpS2G{}, tensor_d, SmemLayoutStageD{}, EpilogueTile{}, _1{}); + typename Params::TMA_D tma_store_d{}; + if constexpr (is_destination_supported) { + // Tensor pointers will be fixed before the first access + ElementD* ptr_D_first_batch = nullptr; + Tensor tensor_d = make_tensor(ptr_D_first_batch, make_layout(make_shape(init_M,init_N,init_L), append<3>(stride_d, _0{}))); + tma_store_d = make_tma_copy(CopyOpS2G{}, tensor_d, SmemLayoutStageD{}, EpilogueTile{}, _1{}); + } auto fusion_workspace = static_cast(workspace); auto fusion_workspace_size = round_nearest(FusionCallbacks::get_workspace_size(problem_shape, args.thread), MinTensorMapWorkspaceAlignment); @@ -359,9 +373,11 @@ class CollectiveEpilogue< auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(i), 1); auto [M,N,K,L] = problem_shape_MNKL; - constexpr int tma_alignment_bits_D = cutlass::detail::get_output_alignment_bits(); - constexpr int min_tma_aligned_elements_D = tma_alignment_bits_D / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(M,N,L), InternalStrideD{}); + if constexpr (is_destination_supported) { + constexpr int tma_alignment_bits_D = cutlass::detail::get_output_alignment_bits(); + constexpr int min_tma_aligned_elements_D = tma_alignment_bits_D / cutlass::sizeof_bits::value; + implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(M,N,L), InternalStrideD{}); + } if constexpr (is_source_supported) { constexpr int tma_alignment_bits_C = 
cutlass::detail::get_input_alignment_bits(); @@ -752,13 +768,9 @@ class CollectiveEpilogue< thread_idx }; - auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); - bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); - bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); - // Thread synchronizer for previously issued waits or fences // to ensure visibility of smem reads/writes to threads or TMA unit - auto synchronize = [] () { cutlass::arch::NamedBarrier::sync(ThreadCount, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; + auto synchronize = [] () CUTLASS_LAMBDA_FUNC_INLINE { cutlass::arch::NamedBarrier::sync(ThreadCount, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; // Predication for sub-128 thread T2R tiled copy Layout tmem_warp_layout = typename decltype(make_tmem_warp_partitioner(tAcc_epi(_,_,0,0)))::TiledLayout_TV{}; @@ -795,31 +807,38 @@ class CollectiveEpilogue< [[maybe_unused]] int epi_n_prev = 0; static_assert(not (DelayTmaStore and ReuseSmemC and StagesC <= StagesD), "This TMA epilogue configuration will deadlock"); - auto epi_loop_fn = [&] (auto& cst_callbacks) { - // The TMA store sequence for one subtile iteration - auto tma_store_fn = [&] (int epi_m, int epi_n) { + // The Epilogue Loop + auto epi_loop_fn = [&] (auto& cst_callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); + bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); + + // The TMA store sequence for one epilogue loop iteration + auto tma_store_fn = [&] (int epi_m, int epi_n) CUTLASS_LAMBDA_FUNC_INLINE { // Write the tile from smem to gmem with TMA cutlass::arch::fence_view_async_shared(); // ensure smem writes are visible to TMA synchronize(); // ensure all threads have issued their async fence - if (issue_tma_store) { - copy(params.tma_store_d.with(get<0>(store_tensormap_info)), bSG_sD(_,_,_,store_pipe_producer_state.index()), bSG_gD(_,_,_,epi_m,epi_n)); - } + if constexpr (is_destination_supported) { + if (issue_tma_store) { + copy(params.tma_store_d.with(get<0>(store_tensormap_info)), bSG_sD(_,_,_,store_pipe_producer_state.index()), bSG_gD(_,_,_,epi_m,epi_n)); + } + } + // Post async fence, pre TMA commit callback entry point cst_callbacks.tma_store(epi_m, epi_n, store_pipe_producer_state.count(), issue_tma_store); - + // Commit the TMA stores for this stage if (issue_tma_store) { store_pipeline.producer_commit(store_pipe_producer_state); } ++store_pipe_producer_state; - + // Wait for the next smem buffer to be available if (issue_tma_store) { store_pipeline.producer_acquire(store_pipe_producer_state); } synchronize(); - + if constexpr (ReuseSmemC) { // producer_acquire returns when at most StagesD-1 committed stores are pending bool store_finished = store_pipe_producer_state.count() > StorePipeline::UnacquiredStages; @@ -831,11 +850,7 @@ class CollectiveEpilogue< ++load_pipe_consumer_state; } } - }; - - // - // BEGIN EPILOGUE - // + }; // tma_store_fn // Begin the wait for the producer load results ConsumerToken load_wait_token{BarrierStatus::WaitDone}; @@ -850,10 +865,12 @@ class CollectiveEpilogue< synchronize(); } // For each epilogue subtile within the CTA tile - CUTLASS_PRAGMA_UNROLL - for (int iter_n = 0; iter_n < size<3>(gD_epi); ++iter_n) { - CUTLASS_PRAGMA_UNROLL - for (int iter_m = 0; iter_m < size<2>(gD_epi); ++iter_m) { + constexpr int NumEpiSubtilesN = CUTE_STATIC_V(size<3>(gD_epi)); + constexpr 
int NumEpiSubtilesM = CUTE_STATIC_V(size<2>(gD_epi)); + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesN : 1) + for (int iter_n = 0; iter_n < NumEpiSubtilesN; ++iter_n) { + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesM : 1) + for (int iter_m = 0; iter_m < NumEpiSubtilesM; ++iter_m) { int epi_m = iter_m, epi_n = iter_n; bool is_first_iteration = iter_m == 0 && iter_n == 0; bool is_last_iteration = iter_m == size<2>(gD_epi)-1 && iter_n == size<3>(gD_epi)-1; @@ -953,8 +970,10 @@ class CollectiveEpilogue< // Copy output tile from register to smem bool issue_smem_store = issue_tmem_load; - if (issue_smem_store) { - copy(tiled_r2s, tRS_rD, tRS_sD(_,_,_,store_pipe_producer_state.index())); + if constexpr (is_destination_supported) { + if (issue_smem_store) { + copy(tiled_r2s, tRS_rD, tRS_sD(_,_,_,store_pipe_producer_state.index())); + } } // Post reduction, pre TMA store callback entry point @@ -982,9 +1001,11 @@ class CollectiveEpilogue< cst_callbacks.end(); }; - epi_loop_fn(cst_callbacks); - cst_callbacks.end(); - + // + // BEGIN EPILOGUE + // + auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); + epi_loop_fn(cst_callbacks); return cute::make_tuple(load_pipe_consumer_state, store_pipe_producer_state, acc_pipe_consumer_state); } @@ -1201,10 +1222,12 @@ class CollectiveEpilogue< } // For each epilogue subtile within the CTA tile - CUTLASS_PRAGMA_UNROLL - for (int iter_n = 0; iter_n < size<3>(gD_epi); ++iter_n) { - CUTLASS_PRAGMA_UNROLL - for (int iter_m = 0; iter_m < size<2>(gD_epi); ++iter_m) { + constexpr int NumEpiSubtilesN = CUTE_STATIC_V(size<3>(gD_epi)); + constexpr int NumEpiSubtilesM = CUTE_STATIC_V(size<2>(gD_epi)); + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesN : 1) + for (int iter_n = 0; iter_n < NumEpiSubtilesN; ++iter_n) { + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesM : 1) + for (int iter_m = 0; iter_m < NumEpiSubtilesM; ++iter_m) { int epi_m = iter_m, epi_n = iter_n; bool is_first_iteration = iter_m == 0 && iter_n == 0; bool is_last_iteration = iter_m == size<2>(gD_epi)-1 && iter_n == size<3>(gD_epi)-1; @@ -1343,7 +1366,7 @@ class CollectiveEpilogue< } syncwarp(); } - } else { + } else if constexpr (is_destination_supported) { int const offset_Ddesc = cute::is_void_v ? 0 : sm_count; tma_desc = &gmem_tensormap[sm_idx + offset_Ddesc]; if (cute::elect_one_sync()) { @@ -1374,7 +1397,7 @@ class CollectiveEpilogue< params.ptr_C[next_batch]); } } - } else { + } else if constexpr (is_destination_supported) { cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormap.smem_tensormap_D, params.ptr_D[next_batch]); } @@ -1414,7 +1437,7 @@ class CollectiveEpilogue< } } } - else { + else if constexpr (is_destination_supported) { ElementD const* ptr_D = nullptr; Tensor tensor_d = make_tensor(ptr_D, make_layout(make_shape(M,N,Int<1>{}), params.dD[next_group])); @@ -1464,16 +1487,23 @@ class CollectiveEpilogue< tensormaps_cp_fence_release( TensorMapStorage& shared_tensormap, cute::TmaDescriptor const* tensormap) { + // Commit and wait for all TMA load/store instructions before updating the tensormap in gmem. + // This operation only happens when the group/batch changes between consecutive tiles. + // If there are no uncommitted instructions then tma_desc_commit_group results in an empty bulk async-group. 
+ auto tma_desc_wait_all_fn = [] () CUTLASS_LAMBDA_FUNC_INLINE { + if (cute::elect_one_sync()) { + cute::tma_desc_commit_group(); + cute::tma_desc_wait_group(); + } + }; // Entire warp must do this (ie its aligned) if constexpr (IsLoad) { if (is_source_supported) { - if (cute::elect_one_sync()) { - cute::tma_desc_commit_group(); - cute::tma_desc_wait_group(); - } + tma_desc_wait_all_fn(); tma_descriptor_cp_fence_release(tensormap, shared_tensormap.smem_tensormap_C); } - } else { + } else if constexpr (is_destination_supported) { + tma_desc_wait_all_fn(); tma_descriptor_cp_fence_release(tensormap, shared_tensormap.smem_tensormap_D); } } @@ -1486,7 +1516,7 @@ class CollectiveEpilogue< if (is_source_supported) { cute::tma_descriptor_fence_acquire(tensormap); } - } else { + } else if constexpr (is_destination_supported) { cute::tma_descriptor_fence_acquire(tensormap); } } diff --git a/include/cutlass/epilogue/collective/sm100_epilogue_nosmem.hpp b/include/cutlass/epilogue/collective/sm100_epilogue_nosmem.hpp index ba85a75e54..f58f61fcb4 100644 --- a/include/cutlass/epilogue/collective/sm100_epilogue_nosmem.hpp +++ b/include/cutlass/epilogue/collective/sm100_epilogue_nosmem.hpp @@ -462,6 +462,10 @@ class CollectiveEpilogue< || is_same_v; // alloc reduction buffer for custom EVTs constexpr static size_t ImplicitSharedStorageSize = IsReductionBufferNeeded ? size(EpilogueTile{}) : 0; + // Not unroll epi subtile loop when the activation op is heavy to reduce instruction size and register pressure. + constexpr static bool UnrollEpiLoop = + not cutlass::epilogue::thread::kIsHeavy_member_or_false::value; + public: constexpr static int ThreadCount = 128; constexpr static uint32_t TmaTransactionBytes = 0; @@ -646,12 +650,12 @@ class CollectiveEpilogue< thread_idx }; - auto cst_callbacks = fusion_callbacks.get_consumer_store_callbacks(cst_args); - bool is_C_load_needed = fusion_callbacks.is_C_load_needed(); - auto synchronize = [] () CUTLASS_LAMBDA_FUNC_INLINE { cutlass::arch::NamedBarrier::sync(ThreadCount, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; + // The Epilogue Loop auto epi_loop_fn = [&] (auto& cst_callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + bool is_C_load_needed = fusion_callbacks.is_C_load_needed(); + // Ensure there are no threads from the previous wave writing to shared memory being utilized for the current wave. synchronize(); cst_callbacks.begin(); @@ -669,10 +673,12 @@ class CollectiveEpilogue< static_assert(not (ReuseTmem && AccumulatorPipeline::Stages != 1), "Tmem reuse requires 1 accumulator stage"); // For each epilogue subtile within the CTA tile - CUTLASS_PRAGMA_UNROLL - for (int iter_n = 0; iter_n < size<4>(tTR_tAcc); ++iter_n) { - CUTLASS_PRAGMA_UNROLL - for (int iter_m = 0; iter_m < size<3>(tTR_tAcc); ++iter_m) { + constexpr int NumEpiSubtilesN = CUTE_STATIC_V(size<4>(tTR_tAcc)); + constexpr int NumEpiSubtilesM = CUTE_STATIC_V(size<3>(tTR_tAcc)); + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesN : 1) + for (int iter_n = 0; iter_n < NumEpiSubtilesN; ++iter_n) { + #pragma unroll(UnrollEpiLoop ? 
NumEpiSubtilesM : 1) + for (int iter_m = 0; iter_m < NumEpiSubtilesM; ++iter_m) { int epi_m = iter_m, epi_n = iter_n; bool is_last_iteration = iter_m == size<3>(tTR_tAcc)-1 && iter_n == size<4>(tTR_tAcc)-1; @@ -747,6 +753,10 @@ class CollectiveEpilogue< cst_callbacks.end(); }; + // + // BEGIN EPILOGUE + // + auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); epi_loop_fn(cst_callbacks); return cute::make_tuple(acc_pipe_consumer_state); } diff --git a/include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp index 37acf23ae3..6c3f111c11 100644 --- a/include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp @@ -140,7 +140,6 @@ class CollectiveEpilogue< static_assert(StagesD >= 1, "StagesD must be >= 1"); constexpr static bool ReuseSmemC = ReuseSmemC_; - constexpr static bool DelayTmaStore = DelayTmaStore_; constexpr static bool is_source_supported = not cute::is_void_v; constexpr static bool is_m_major_C = detail::is_m_major(); @@ -172,6 +171,12 @@ class CollectiveEpilogue< constexpr static size_t SmemAlignmentD = cutlass::detail::alignment_for_swizzle(SmemLayoutD{}); constexpr static size_t MaxSmemAlignment = cute::max(SmemAlignmentC, SmemAlignmentD); + // Not unroll epi subtile loop when the activation op is heavy to reduce instruction size and register pressure. + constexpr static bool UnrollEpiLoop = + not cutlass::epilogue::thread::kIsHeavy_member_or_false::value; + // TMA store delay only benefits with loop unrolling + constexpr static bool DelayTmaStore = DelayTmaStore_ and UnrollEpiLoop; + struct CollectiveStorageWithC { alignas(SmemAlignmentC) ArrayEngine> smem_C; alignas(SmemAlignmentD) ArrayEngine> smem_D; @@ -687,7 +692,7 @@ class CollectiveEpilogue< // OOB predication for tile quantization "residue" // Absolute coordinate tensors (dynamic) Tensor mD_crd = make_identity_tensor(make_shape(M,N)); // (M,N) - Tensor cD_mn = local_tile(mD_crd, take<0,2>(cta_tile_mnk), make_coord(m_coord, n_coord)); // (CTA_M,CTA_N) + Tensor cD_mn = local_tile(mD_crd, take<0,2>(cta_tile_mnk), make_coord(m_coord, n_coord)); // (CTA_M,CTA_N) Tensor tTR_cD_mn = thread_t2r.partition_D(flat_divide(cD_mn, EpilogueTile{})); // (T2R,T2R_M,T2R_N,EPI_M,EPI_N) // Relative coordinate tensors (static) Tensor cD = make_counting_tensor(cD_mn.layout()); // (CTA_M,CTA_N) @@ -696,7 +701,7 @@ class CollectiveEpilogue< auto residue_cD = make_coord(M,N) - cD_mn(_0{}); // (m,n) auto residue_tTR_cD = make_coord(M,N) - tTR_cD_mn(_0{}); // (m,n) - // Get the fusion callbacks for the consumer store warps + // Arguments for the fusion callbacks for the consumer store warps constexpr bool RefSrc = false; // Register tensors reference T2R copy dst layout auto cst_args = cutlass::epilogue::fusion::detail::ConsumerStoreArgs{ problem_shape_mnkl, @@ -713,10 +718,6 @@ class CollectiveEpilogue< thread_idx }; - auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); - bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); - bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); - // Thread synchronizer for previously issued waits or fences // to ensure visibility of smem reads/writes to threads or TMA unit auto synchronize = [] () { cutlass::arch::NamedBarrier::sync(ThreadCount, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; 
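Several epilogue hunks above (the SM100 array, TMA warp-specialized, and no-smem epilogues) replace the unconditional `CUTLASS_PRAGMA_UNROLL` over epilogue subtiles with `#pragma unroll(UnrollEpiLoop ? N : 1)`, where `UnrollEpiLoop` is false when the activation functor reports itself as heavy via `kIsHeavy_member_or_false`; the TMA-store delay is likewise disabled in that case, since delaying the store only pays off when the loop is unrolled. The following is a standalone sketch of that pragma idiom under assumed names (`kUnrollEpiLoop` and `kNumSubtiles` are illustrative constants, not CUTLASS symbols).

```cpp
// Standalone .cu sketch of the conditional-unroll idiom. kUnrollEpiLoop and kNumSubtiles
// stand in for UnrollEpiLoop and the CUTE_STATIC_V(...) subtile counts in the hunks above.
constexpr bool kUnrollEpiLoop = false;  // false when the activation op is heavy (kIsHeavy == true)
constexpr int  kNumSubtiles   = 8;      // illustrative compile-time subtile count

__device__ void subtile_loop_sketch(float* frag, float scale) {
  // Fully unroll only when the epilogue op is cheap; otherwise keep the loop rolled
  // (#pragma unroll 1) to limit instruction footprint and register pressure.
  #pragma unroll(kUnrollEpiLoop ? kNumSubtiles : 1)
  for (int i = 0; i < kNumSubtiles; ++i) {
    frag[i] *= scale;  // per-subtile work goes here
  }
}
```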
@@ -756,8 +757,12 @@ class CollectiveEpilogue< [[maybe_unused]] int epi_n_prev = 0; static_assert(not (DelayTmaStore and ReuseSmemC and StagesC <= StagesD), "This TMA epilogue configuration will deadlock"); + // The Epilogue Loop auto epi_loop_fn = [&] (auto& cst_callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - // The TMA store sequence for one subtile iteration + bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); + bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); + + // The TMA store sequence for one epilogue loop iteration auto tma_store_fn = [&] (int epi_m, int epi_n) CUTLASS_LAMBDA_FUNC_INLINE { // Write the tile from smem to gmem with TMA cutlass::arch::fence_view_async_shared(); // ensure smem writes are visible to TMA @@ -765,22 +770,22 @@ class CollectiveEpilogue< if (issue_tma_store) { copy(params.tma_store_d, bSG_sD(_,_,_,store_pipe_producer_state.index()), bSG_gD(_,_,_,epi_m,epi_n)); } - + // Post async fence, pre TMA commit callback entry point cst_callbacks.tma_store(epi_m, epi_n, store_pipe_producer_state.count(), issue_tma_store); - + // Commit the TMA stores for this stage if (issue_tma_store) { store_pipeline.producer_commit(store_pipe_producer_state); } ++store_pipe_producer_state; - + // Wait for the next smem buffer to be available if (issue_tma_store) { store_pipeline.producer_acquire(store_pipe_producer_state); } synchronize(); - + if constexpr (ReuseSmemC) { // producer_acquire returns when at most StagesD-1 committed stores are pending bool store_finished = store_pipe_producer_state.count() > StorePipeline::UnacquiredStages; @@ -792,11 +797,8 @@ class CollectiveEpilogue< ++load_pipe_consumer_state; } } - }; + }; // tma_store_fn - // - // BEGIN EPILOGUE - // cst_callbacks.begin(); if (cst_callbacks.begin_sync_needed()) { synchronize(); @@ -811,10 +813,12 @@ class CollectiveEpilogue< ConsumerToken acc_wait_token = acc_pipeline.consumer_try_wait(acc_pipe_consumer_state); // For each epilogue subtile within the CTA tile - CUTLASS_PRAGMA_UNROLL - for (int iter_n = 0; iter_n < size<3>(gD_epi); ++iter_n) { - CUTLASS_PRAGMA_UNROLL - for (int iter_m = 0; iter_m < size<2>(gD_epi); ++iter_m) { + constexpr int NumEpiSubtilesN = CUTE_STATIC_V(size<3>(gD_epi)); + constexpr int NumEpiSubtilesM = CUTE_STATIC_V(size<2>(gD_epi)); + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesN : 1) + for (int iter_n = 0; iter_n < NumEpiSubtilesN; ++iter_n) { + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesM : 1) + for (int iter_m = 0; iter_m < NumEpiSubtilesM; ++iter_m) { int epi_m = iter_m, epi_n = iter_n; bool is_first_iteration = iter_m == 0 && iter_n == 0; bool is_last_iteration = iter_m == size<2>(gD_epi)-1 && iter_n == size<3>(gD_epi)-1; @@ -941,9 +945,13 @@ class CollectiveEpilogue< } cst_callbacks.end(); - }; + }; // epi_loop_fn - epi_loop_fn(cst_callbacks); + // + // BEGIN EPILOGUE + // + auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); + epi_loop_fn(cst_callbacks); return cute::make_tuple(load_pipe_consumer_state, store_pipe_producer_state, acc_pipe_consumer_state); } @@ -1161,10 +1169,12 @@ class CollectiveEpilogue< } // For each epilogue subtile within the CTA tile - CUTLASS_PRAGMA_UNROLL - for (int iter_n = 0; iter_n < size<3>(gD_epi); ++iter_n) { - CUTLASS_PRAGMA_UNROLL - for (int iter_m = 0; iter_m < size<2>(gD_epi); ++iter_m) { + constexpr int NumEpiSubtilesN = CUTE_STATIC_V(size<3>(gD_epi)); + constexpr int NumEpiSubtilesM = CUTE_STATIC_V(size<2>(gD_epi)); + #pragma unroll(UnrollEpiLoop ? 
NumEpiSubtilesN : 1) + for (int iter_n = 0; iter_n < NumEpiSubtilesN; ++iter_n) { + #pragma unroll(UnrollEpiLoop ? NumEpiSubtilesM : 1) + for (int iter_m = 0; iter_m < NumEpiSubtilesM; ++iter_m) { int epi_m = iter_m, epi_n = iter_n; bool is_first_iteration = iter_m == 0 && iter_n == 0; bool is_last_iteration = iter_m == size<2>(gD_epi)-1 && iter_n == size<3>(gD_epi)-1; diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp index b5cdfdcb87..c625f43d2e 100644 --- a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp @@ -41,6 +41,7 @@ #include "cutlass/epilogue/thread/scale_type.h" #include "cutlass/epilogue/fusion/callbacks.hpp" #include "cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp" +#include "cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp" #include "cutlass/detail/collective.hpp" #include "cutlass/detail/layout.hpp" #include "cutlass/trace.h" @@ -304,8 +305,9 @@ class CollectiveEpilogue< // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc. // These will be replaced with correct values before the initial tma load. auto init_shape = repeat_like(append<4>(typename ProblemShape::UnderlyingProblemShape{}, 1), int32_t(1)); - auto init_M = get<0>(init_shape); - auto init_N = get<1>(init_shape); + constexpr int tma_alignment_bits = 128; + auto init_M = tma_alignment_bits; + auto init_N = tma_alignment_bits; auto init_L = get<3>(init_shape); static_assert(!is_im2col_C and !is_im2col_D, "Im2Col not supported on C or D"); @@ -761,7 +763,14 @@ class CollectiveEpilogue< CUTE_STATIC_ASSERT(epi_tile_m % mma_tile_m == 0, "MMA_TILE_M must divide EPI_TILE_M"); + if constexpr (epi_tile_m * epi_tile_n > mma_tile_m * mma_tile_n) { + // When the epilogue subtile is larger than the MMA tiles, loop over multiple MMA tiles + CUTE_STATIC_ASSERT(epi_tile_n % mma_tile_n == 0, "MMA_TILE_N must divide EPI_TILE_N"); + } + else { CUTE_STATIC_ASSERT(mma_tile_n % epi_tile_n == 0, "EPI_TILE_N must divide MMA_TILE_N"); + } + // Get TiledCopy for partition reference when consumer store. 
TiledCopy tiled_copy_partition_ref = make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); // Get the fusion callbacks for the consumer store warps @@ -784,6 +793,12 @@ class CollectiveEpilogue< bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); + using FragmentVisit = decltype(cst_callbacks.visit(tRS_rAcc_frg(0), 0, 0, 0)); + constexpr bool IsDirectR2S = cute::is_same_v>; + using RegisterElementD = cute::conditional_t; + Tensor tRS_rCompute = make_tensor(tRS_rD_layout); // (R2S,R2S_M,R2S_N) + Tensor tRS_rCompute_frg = recast>(tRS_rCompute); + // Thread synchronizer for previously issued waits or fences // to ensure visibility of smem reads/writes to threads or TMA unit auto synchronize = [&] () { cutlass::arch::NamedBarrier::sync(size(TiledMma{}), cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; @@ -894,17 +909,41 @@ class CollectiveEpilogue< ++load_wait_state; } - int mma_m = epi_m; - int mma_n = (epi_n * size<1>(EpilogueTile{})) / mma_tile_n; - Tensor tRS_rAcc_frg_mn = tRS_rAcc_frg(_,mma_m,mma_n); - - // Vectorized fragment loop with visitor callback entry point - int epi_n_in_mma = epi_n % (mma_tile_n / epi_tile_n); - int r2s_v = epi_n_in_mma * size(tRS_rD_frg); - CUTLASS_PRAGMA_UNROLL - for (int epi_v = 0; epi_v < size(tRS_rD_frg); ++epi_v) { - tRS_rD_frg(epi_v) = cst_callbacks.visit(tRS_rAcc_frg_mn(r2s_v + epi_v), epi_v, epi_m, epi_n); + if constexpr (epi_tile_m * epi_tile_n > mma_tile_m * mma_tile_n) { + // When the epilogue subtile is larger than the MMA tiles, loop over multiple + // MMA tiles + static constexpr int MmaMPerEpiM = epi_tile_m / mma_tile_m; + static constexpr int MmaNPerEpiN = epi_tile_n / mma_tile_n; + + CUTLASS_PRAGMA_UNROLL + for (int mma_n_in_epi = 0; mma_n_in_epi < MmaNPerEpiN; ++mma_n_in_epi) { + int mma_n = (epi_n * MmaNPerEpiN) + mma_n_in_epi; + + CUTLASS_PRAGMA_UNROLL + for (int mma_m_in_epi = 0; mma_m_in_epi < MmaMPerEpiM; ++mma_m_in_epi) { + int mma_m = (epi_m * MmaMPerEpiM) + mma_m_in_epi; + Tensor tRS_rAcc_frg_mn = tRS_rAcc_frg(_,mma_m,mma_n); + int idx_in_epi_subtile = (mma_n_in_epi * MmaMPerEpiM + mma_m_in_epi); + + tRS_rCompute_frg(idx_in_epi_subtile) = cst_callbacks.visit( + tRS_rAcc_frg_mn(0), idx_in_epi_subtile, epi_m, epi_n); + } + } + } + else { + int mma_m = epi_m; + int mma_n = (epi_n * size<1>(EpilogueTile{})) / mma_tile_n; + Tensor tRS_rAcc_frg_mn = tRS_rAcc_frg(_,mma_m,mma_n); + + // Vectorized fragment loop with visitor callback entry point + int epi_n_in_mma = epi_n % (mma_tile_n / epi_tile_n); + int r2s_v = epi_n_in_mma * size(tRS_rCompute_frg); + CUTLASS_PRAGMA_UNROLL + for (int epi_v = 0; epi_v < size(tRS_rCompute_frg); ++epi_v) { + tRS_rCompute_frg(epi_v) = cst_callbacks.visit(tRS_rAcc_frg_mn(r2s_v + epi_v), epi_v, epi_m, epi_n); + } } + // The latest we can delay the TMA store is right before the smem store of the next iteration // since the current TMA store needs to be committed before we can acquire the next smem buffer if constexpr (DelayTmaStore) { @@ -918,7 +957,7 @@ class CollectiveEpilogue< // Smem reduction callback entry point using current store buffer for workspace cst_callbacks.reduce(sD_epi(_,_,store_pipe_producer_state.index()), - synchronize, epi_m, epi_n, is_last_iteration, tRS_rD_frg); + synchronize, epi_m, epi_n, is_last_iteration, tRS_rCompute_frg); // Copy tile from register to regiser if needed if constexpr (IsUseR2R) { @@ -930,6 +969,11 @@ class CollectiveEpilogue< copy(tiled_r2r, tRR_rD_src, 
tRR_rD_dst); } + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tRS_rD_frg); ++i) { + tRS_rD_frg(i) = cutlass::NumericArrayConverter{}(tRS_rCompute_frg(i)); + } + // Copy tile from register to smem if constexpr (is_destination_supported) { copy(tiled_r2s, tRS_rD, tRS_sD(_,_,_,store_pipe_producer_state.index())); @@ -1140,7 +1184,6 @@ class CollectiveEpilogue< ProblemShape_MNKL problem_shape_mnkl, int32_t next_batch, int32_t warp_group_idx) { - if (cute::elect_one_sync()) { // Replacing global_address for the next batch tensormaps_replace_global_address(shared_tensormaps, params, next_batch, warp_group_idx); @@ -1161,14 +1204,24 @@ class CollectiveEpilogue< TensorMapStorage& shared_tensormaps, cute::TmaDescriptor const* tensormap, const int32_t warp_group_idx = 0) { - + // Commit and wait for all TMA load/store instructions before updating the tensormap in gmem. + // This operation only happens when the group/batch changes between consecutive tiles. + // If there are no uncommitted instructions then tma_desc_commit_group results in an empty bulk async-group. + auto tma_desc_wait_all_fn = [] () CUTLASS_LAMBDA_FUNC_INLINE { + if (cute::elect_one_sync()) { + cute::tma_desc_commit_group(); + cute::tma_desc_wait_group(); + } + }; // Entire warp must do this (ie its aligned) if constexpr (IsLoad) { if constexpr (is_source_supported) { + tma_desc_wait_all_fn(); tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_C); } } else if constexpr (is_destination_supported) { + tma_desc_wait_all_fn(); tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_D[warp_group_idx]); } } diff --git a/include/cutlass/epilogue/dispatch_policy.hpp b/include/cutlass/epilogue/dispatch_policy.hpp index a2a46b73c9..db53153c50 100644 --- a/include/cutlass/epilogue/dispatch_policy.hpp +++ b/include/cutlass/epilogue/dispatch_policy.hpp @@ -255,6 +255,23 @@ struct Sm120TmaWarpSpecialized { constexpr static bool DelayTmaStore = DelayTmaStore_; }; +template< + int StagesC_, + int StagesD_, + int FragmentSize_, + bool ReuseSmemC_, + bool DelayTmaStore_, + int NumEpilogueWarpGroups_ +> +struct Sm120PtrArrayTmaWarpSpecialized { + constexpr static int StagesC = StagesC_; + constexpr static int StagesD = StagesD_; + constexpr static int FragmentSize = FragmentSize_; + constexpr static bool ReuseSmemC = ReuseSmemC_; + constexpr static bool DelayTmaStore = DelayTmaStore_; + constexpr static int NumEpilogueWarpGroups = NumEpilogueWarpGroups_; +}; + #if defined (SYCL_INTEL_TARGET) // Specialization of the GEMM Epilogue for Intel Xe architectures. // This version is tuned for operations with a subgroup size of 16. diff --git a/include/cutlass/epilogue/fusion/sm100_visitor_store_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm100_visitor_store_tma_warpspecialized.hpp index 5c47d70627..28099b2116 100644 --- a/include/cutlass/epilogue/fusion/sm100_visitor_store_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm100_visitor_store_tma_warpspecialized.hpp @@ -78,12 +78,13 @@ namespace detail { } }(); + // norm_constant and qpvscale_rcps are all positive numbers. + auto acc_scales = cutlass::multiplies>{}(norm_constant, qpvscale_rcps); + CUTLASS_PRAGMA_UNROLL for (int sf_v = 0; sf_v < NumVecs; ++sf_v) { - // norm_constant and qpvscale_rcps[sf_v] are all positive numbers. 
- ElementCompute acc_scale = mul(norm_constant, qpvscale_rcps[sf_v]); // Map INF to fp32::max - acc_scale = minimum_with_nan_propagation{}(acc_scale, cutlass::platform::numeric_limits::max()); + auto acc_scale = minimum_with_nan_propagation{}(acc_scales[sf_v], cutlass::platform::numeric_limits::max()); // Convert to output type output_frgs[sf_v] = cutlass::NumericArrayConverter{}(mul_array(compute_frgs[sf_v], acc_scale)); } @@ -240,17 +241,19 @@ struct Sm100BlockScaleFactorRowStore { cutlass::multiplies mul; cutlass::maximum_absolute_value_reduction, true> amax_reduction; + cutlass::Array vec_maxs; cutlass::Array pvscales; // SF generation CUTLASS_PRAGMA_UNROLL for (int sf_v = 0; sf_v < NumVecs; ++sf_v) { compute_frgs[sf_v] = NumericArrayConverter{}(input_frgs[sf_v]); /// Step1: get max across a vector - ElementCompute vec_max = amax_reduction(ElementCompute(0), compute_frgs[sf_v]); - /// Step2: Compute Scale - pvscales[sf_v] = mul(vec_max, norm_constant_scaled_down); + vec_maxs[sf_v] = amax_reduction(ElementCompute(0), compute_frgs[sf_v]); } + /// Step2: Compute Scale + pvscales = cutlass::multiplies>{}(vec_maxs, norm_constant_scaled_down); + tC_rSFD_frg(_0{}) = cutlass::NumericArrayConverter{}(pvscales); Tensor tCgSFD_flt = filter_zeros(tC_gSFD(_,_,_,_0{},_0{},get<0>(epi_tile_coord_mn) + epi_m, get<1>(epi_tile_coord_mn) + epi_n)); diff --git a/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp index 8f391aace0..b769b1f0fb 100644 --- a/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp @@ -1317,6 +1317,277 @@ struct FusionCallbacks< using Impl::Impl; }; +// Sm120 Ptr array tma warp specialized callbacks just alias to their sm90 counterpart +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + class Operation, + class CtaTile_MNK, + class EpilogueTile_MN, + class... Args +> +struct FusionCallbacks< + epilogue::Sm120PtrArrayTmaWarpSpecialized, + Operation, + CtaTile_MNK, + EpilogueTile_MN, + Args... +> : FusionCallbacks< + epilogue::Sm90PtrArrayTmaWarpSpecialized, + Operation, + CtaTile_MNK, + EpilogueTile_MN, + Args... + > { + using FusionCallbacks< + epilogue::Sm90PtrArrayTmaWarpSpecialized, + Operation, + CtaTile_MNK, + EpilogueTile_MN, + Args...>::FusionCallbacks; +}; + +// For Ptr-Array and Grouped GEMM +// D = alpha * acc + beta * C, where alpha and beta can be vectors for each batch/group +// With Row BlockScaleFactor Generation, separate tensors per batch/group. 
+template< + int SFVecsize, + class EpilogueTile, + class CtaTileShapeMNK, + int FragmentSize, + class ElementOutput, + class ElementCompute, + class ElementBlockScaleFactor, + class ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm120LinearCombRowBlockScaleFactorPtrArray = + Sm90EVT< + Sm120BlockScaleFactorRowStore< + SFVecsize, EpilogueTile, CtaTileShapeMNK, FragmentSize, ElementOutput, + ElementCompute, ElementBlockScaleFactor *, RoundStyle + >, // gen scalefactor + Sm90LinearCombinationPtrArray< ElementCompute, ElementCompute, + ElementSource, ElementScalar, RoundStyle + > // beta * C + (alpha * acc) + >; + +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + class ElementOutput, + class ElementCompute, + class ElementBlockScaleFactor, + int SFVecSize, + class ElementSource, + class ElementScalar, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm120PtrArrayTmaWarpSpecialized, + fusion::LinCombBlockScaleFactor< + SFVecSize, ElementOutput, ElementCompute, + ElementBlockScaleFactor, cutlass::layout::RowMajor, + ElementSource, ElementScalar, RoundStyle + >, + CtaTileShapeMNK, + EpilogueTile +> : Sm120LinearCombRowBlockScaleFactorPtrArray< + SFVecSize, EpilogueTile, CtaTileShapeMNK, FragmentSize, + typename cutlass::detail::get_unpacked_element_type::type, + ElementCompute, ElementBlockScaleFactor, ElementSource, ElementScalar, RoundStyle + > { + + using Impl = + Sm120LinearCombRowBlockScaleFactorPtrArray< + SFVecSize, EpilogueTile, CtaTileShapeMNK, FragmentSize, + typename cutlass::detail::get_unpacked_element_type::type, + ElementCompute, ElementBlockScaleFactor, ElementSource, ElementScalar, RoundStyle + >; + + using Operation = + fusion::LinCombBlockScaleFactor< + SFVecSize, ElementOutput, ElementCompute, + ElementBlockScaleFactor, cutlass::layout::RowMajor, + ElementSource, ElementScalar, RoundStyle + >; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementScalar const* const* alpha_ptr_array = nullptr; + ElementScalar const* const* beta_ptr_array = nullptr; + ElementBlockScaleFactor ** block_scale_factor_ptr = nullptr; + + // A matrix wide constant value to scale the output matrix + // Avoids generating small FP4 values. 
+ using StrideNormConst = Stride<_0,_0,int64_t>; + ElementCompute const* norm_constant_ptr = nullptr; + StrideNormConst dNormConst = {_0{}, _0{}, 0}; + + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + + operator typename Impl::Arguments() const { + return + { + { // ternary op : beta * C + (alpha * acc + bias) + {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}}, // leaf args : beta + {}, // leaf args : C + { // ternary op : alpha * acc + bias + {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}}, // leaf args : alpha + {}, // leaf args : acc + {} // ternary args : multiply_add + }, // end ternary op + {} // ternary args : multiply_add + }, // end ternary op + {block_scale_factor_ptr, norm_constant_ptr, dNormConst} // BlockScaleFactor args + }; // end ternary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; + + +// For Ptr-Array and Grouped GEMM +// D = activation(alpha * acc + beta * C), where alpha and beta can be vectors for each batch/group +// With Row BlockScaleFactor Generation, separate tensors per batch/group. +template< + int SFVecsize, + class EpilogueTile, + class CtaTileShapeMNK, + int FragmentSize, + template class ActivationFn, + class ElementOutput, + class ElementCompute, + class ElementBlockScaleFactor, + class ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm120LinCombEltActRowBlockScaleFactorPtrArray = + Sm90EVT< + Sm120BlockScaleFactorRowStore< + SFVecsize, EpilogueTile, CtaTileShapeMNK, FragmentSize, ElementOutput, + ElementCompute, ElementBlockScaleFactor *, RoundStyle + >, // gen scalefactor + Sm90LinCombEltActPtrArray // activation(beta * C + (alpha * acc)) + >; + +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + template class ActivationFn, + class ElementOutput, + class ElementCompute, + class ElementBlockScaleFactor, + int SFVecSize, + class ElementSource, + class ElementScalar, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm120PtrArrayTmaWarpSpecialized, + fusion::LinCombEltActBlockScaleFactor< + ActivationFn, SFVecSize, ElementOutput, ElementCompute, + ElementBlockScaleFactor, cutlass::layout::RowMajor, + ElementSource, ElementScalar, RoundStyle + >, + CtaTileShapeMNK, + EpilogueTile +> : Sm120LinCombEltActRowBlockScaleFactorPtrArray< + SFVecSize, EpilogueTile, CtaTileShapeMNK, FragmentSize, ActivationFn, + typename cutlass::detail::get_unpacked_element_type::type, + ElementCompute, ElementBlockScaleFactor, ElementSource, ElementScalar, RoundStyle + > { + + using Impl = + Sm120LinCombEltActRowBlockScaleFactorPtrArray< + SFVecSize, EpilogueTile, CtaTileShapeMNK, FragmentSize, ActivationFn, + typename cutlass::detail::get_unpacked_element_type::type, + ElementCompute, ElementBlockScaleFactor, ElementSource, ElementScalar, RoundStyle + >; + + using Operation = + fusion::LinCombEltActBlockScaleFactor< + ActivationFn, SFVecSize, ElementOutput, ElementCompute, + ElementBlockScaleFactor, cutlass::layout::RowMajor, + ElementSource, ElementScalar, RoundStyle + >; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementScalar const* const* 
alpha_ptr_array = nullptr; + ElementScalar const* const* beta_ptr_array = nullptr; + ElementBlockScaleFactor ** block_scale_factor_ptr = nullptr; + + // A matrix wide constant value to scale the output matrix + // Avoids generating small FP4 values. + using StrideNormConst = Stride<_0,_0,int64_t>; + ElementCompute const* norm_constant_ptr = nullptr; + StrideNormConst dNormConst = {_0{}, _0{}, 0}; + + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using ActivationArguments = typename Sm90Compute::Arguments; + ActivationArguments activation = ActivationArguments(); + + operator typename Impl::Arguments() const { + return + { + { // unary op : activation(beta * C + (alpha * acc + bias)) + { // ternary op : beta * C + (alpha * acc + bias) + {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}}, // leaf args : beta + {}, // leaf args : C + { // ternary op : alpha * acc + bias + {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}}, // leaf args : alpha + {}, // leaf args : acc + {} // ternary args : multiply_add + }, // end ternary op + {} // ternary args : multiply_add + }, // end ternary op + activation // unary args : activation + }, // end unary op + {block_scale_factor_ptr, norm_constant_ptr, dNormConst} // BlockScaleFactor args + }; // end ternary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; } // namespace cutlass::epilogue::fusion ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp index 59a9d03026..e72e971bd8 100644 --- a/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp @@ -94,6 +94,8 @@ struct Sm120BlockScaleFactorRowStore { using Params = Arguments; + using UnderlyingElementBlockScaleFactor = cute::remove_pointer_t; + template static constexpr Params to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { @@ -390,21 +392,21 @@ struct Sm120BlockScaleFactorRowStore { } ElementCompute pvscale = mul(amax, norm_constant_scaled_down); - ElementBlockScaleFactor qpvscale = NumericConverter{}(pvscale); + UnderlyingElementBlockScaleFactor qpvscale = NumericConverter{}(pvscale); tC_rSFD_flt(coord) = qpvscale; // // Apply the scale factor to the output // ElementCompute qpvscale_rcp = [&]() { - if constexpr (cute::is_same_v) { + if constexpr (cute::is_same_v) { // UE8M0: Use integer subtraction to do the fast rcp in ue8m0 and then convert to float. - auto e8m0_qpvscale_rcp = cutlass::reciprocal_approximate{}(qpvscale); - return cutlass::NumericConverter{}(e8m0_qpvscale_rcp); + auto e8m0_qpvscale_rcp = cutlass::reciprocal_approximate{}(qpvscale); + return cutlass::NumericConverter{}(e8m0_qpvscale_rcp); } else { // UE4M3: Do the rcp in fp32 data type. 
- auto qpvscale_up = cutlass::NumericConverter{}(qpvscale); + auto qpvscale_up = cutlass::NumericConverter{}(qpvscale); return cutlass::reciprocal_approximate_ftz{}(qpvscale_up); } }(); @@ -458,15 +460,24 @@ struct Sm120BlockScaleFactorRowStore { auto [M, N, K, L] = args.problem_shape_mnkl; auto [m, n, k, l] = args.tile_coord_mnkl; using Sm1xxBlockScaledOutputConfig = cutlass::detail::Sm1xxBlockScaledOutputConfig; + UnderlyingElementBlockScaleFactor* ptr_scale_factor = nullptr; + // If Ptr-Array/Grouped GEMM with BlockScaleFactor per batch/group + if constexpr (!cute::is_same_v) { + ptr_scale_factor = params_ptr->ptr_scale_factor[l]; + l = 0; + } + else { + ptr_scale_factor = params_ptr->ptr_scale_factor; + } auto epi_tile_mn = shape<1>(zipped_divide(make_layout(take<0,2>(args.tile_shape_mnk)), args.epi_tile)); - Tensor mSFD = make_tensor(make_gmem_ptr(params_ptr->ptr_scale_factor), Sm1xxBlockScaledOutputConfig::tile_atom_to_shape_SFD(args.problem_shape_mnkl)); + Tensor mSFD = make_tensor(make_gmem_ptr(ptr_scale_factor), Sm1xxBlockScaledOutputConfig::tile_atom_to_shape_SFD(args.problem_shape_mnkl)); static_assert(size<1>(EpilogueTile{}) && ((size<1>(EpilogueTile{}) & (size<1>(EpilogueTile{}) - 1)) == 0), "Epilogue Tile N should be pow of 2"); Tensor gSFD = local_tile(mSFD, args.epi_tile, make_coord(_, _,l)); // (EPI_M,EPI_N, #EPI_Ms, #EPI_Ns) Tensor tCgSFD = sm90_partition_for_epilogue( // (CPY,CPY_M,CPY_N,EPI_M,EPI_N,#EPI_Ms, #EPI_Ns) gSFD, args.epi_tile, args.tiled_copy, args.thread_idx); - Tensor tCrSFD = make_tensor_like(take<0,3>(cute::layout(tCgSFD))); // (CPY,CPY_M,CPY_N) + Tensor tCrSFD = make_tensor_like(take<0,3>(cute::layout(tCgSFD))); // (CPY,CPY_M,CPY_N) auto tile_coord_mn = make_coord(m * size<0>(epi_tile_mn), n * size<1>(epi_tile_mn)); @@ -537,6 +548,8 @@ struct Sm120BlockScaleFactorColStore { }; using Params = Arguments; + using UnderlyingElementBlockScaleFactor = cute::remove_pointer_t; + template static constexpr Params to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { @@ -770,21 +783,21 @@ struct Sm120BlockScaleFactorColStore { synchronize(); ElementCompute pvscale = mul(amax, norm_constant_scaled_down); - ElementBlockScaleFactor qpvscale = NumericConverter{}(pvscale); + UnderlyingElementBlockScaleFactor qpvscale = NumericConverter{}(pvscale); filter(tC_rSFD)(sf_id + mma_in_epi*ColsPerThreadAccFrag) = qpvscale; // // Apply the scale factor to the output // ElementCompute qpvscale_rcp = [&]() { - if constexpr (cute::is_same_v) { + if constexpr (cute::is_same_v) { // UE8M0: Use integer subtraction to do the fast rcp in ue8m0 and then convert to float. - auto e8m0_qpvscale_rcp = cutlass::reciprocal_approximate{}(qpvscale); - return cutlass::NumericConverter{}(e8m0_qpvscale_rcp); + auto e8m0_qpvscale_rcp = cutlass::reciprocal_approximate{}(qpvscale); + return cutlass::NumericConverter{}(e8m0_qpvscale_rcp); } else { // UE4M3: Do the rcp in fp32 data type. 
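//
// Illustrative sketch (editor's addition, not part of the patch): the hunks above distinguish
// "one scale-factor tensor for all batches" from "one pointer per batch/group" purely through the
// element type. In the grouped case the node's ElementBlockScaleFactor parameter is itself a
// pointer type, so remove_pointer_t strips one level, the batch coordinate l picks that group's
// tensor, and l is then reset to 0 so the tile offset is computed within the selected tensor only.
// Hypothetical standalone analogue using the standard library:
//
#include <cstdio>
#include <type_traits>

template <class Element>   // Element == SF for a single tensor, SF* for per-group pointers
std::remove_pointer_t<Element>* select_scale_factor(Element* params_ptr, int& l) {
  using Underlying = std::remove_pointer_t<Element>;
  if constexpr (!std::is_same_v<Underlying, Element>) {
    Underlying* ptr = params_ptr[l];   // params_ptr is SF**: pick group l's tensor
    l = 0;                             // batch index is consumed by the pointer lookup
    return ptr;
  } else {
    return params_ptr;                 // params_ptr is SF*: one tensor, still indexed by l later
  }
}

int main() {
  float g0[2] = {1.f, 2.f}, g1[2] = {3.f, 4.f};
  float* groups[2] = {g0, g1};

  int l = 1;
  float* per_group = select_scale_factor(groups, l);   // Element deduced as float*
  std::printf("group tensor starts at %g, l -> %d\n", per_group[0], l);

  l = 1;
  float* single = select_scale_factor(g0, l);           // Element deduced as float
  std::printf("single tensor starts at %g, l stays %d\n", single[0], l);
  return 0;
}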
- auto qpvscale_up = cutlass::NumericConverter{}(qpvscale); + auto qpvscale_up = cutlass::NumericConverter{}(qpvscale); return cutlass::reciprocal_approximate_ftz{}(qpvscale_up); } }(); @@ -829,18 +842,27 @@ struct Sm120BlockScaleFactorColStore { auto [M, N, K, L] = args.problem_shape_mnkl; auto [m, n, k, l] = args.tile_coord_mnkl; using Sm1xxBlockScaledOutputConfig= cutlass::detail::Sm1xxBlockScaledOutputConfig; + UnderlyingElementBlockScaleFactor* ptr_scale_factor = nullptr; + // If Ptr-Array/Grouped GEMM with BlockScaleFactor per batch/group + if constexpr (!cute::is_same_v) { + ptr_scale_factor = params_ptr->ptr_scale_factor[l]; + l = 0; + } + else { + ptr_scale_factor = params_ptr->ptr_scale_factor; + } static_assert(size<0>(EpilogueTile{}) && ((size<0>(EpilogueTile{}) & (size<1>(EpilogueTile{}) - 1)) == 0), "Epilogue Tile N should be pow of 2"); auto epi_tile_mn = shape<1>(zipped_divide(make_layout(take<0,2>(args.tile_shape_mnk)), args.epi_tile)); - Tensor mSFD = make_tensor(make_gmem_ptr(params_ptr->ptr_scale_factor), + Tensor mSFD = make_tensor(make_gmem_ptr(ptr_scale_factor), Sm1xxBlockScaledOutputConfig::tile_atom_to_shape_SFD(args.problem_shape_mnkl)); Tensor gSFD = local_tile(mSFD, args.epi_tile, make_coord(_, _,l)); // (EPI_M,EPI_N, #EPI_Ms, #EPI_Ns) Tensor tCgSFD = sm90_partition_for_epilogue( // (CPY,CPY_M,CPY_N,EPI_M,EPI_N,#EPI_Ms, #EPI_Ns) gSFD, args.epi_tile, args.tiled_copy, args.thread_idx); - Tensor tCrSFD = make_tensor_like(take<0,3>(cute::layout(tCgSFD))); // (CPY,CPY_M,CPY_N) + Tensor tCrSFD = make_tensor_like(take<0,3>(cute::layout(tCgSFD))); // (CPY,CPY_M,CPY_N) auto tile_coord_mn = make_coord(m * size<0>(epi_tile_mn), n * size<1>(epi_tile_mn)); diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp index c498a3829f..cd470f84f7 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp @@ -1191,9 +1191,11 @@ struct Sm90RowBroadcast { auto layout_M = make_layout(M, repeat_like(M, _0{})); auto layout_L = make_layout(L, get<2>(params.dRow)); - ElementInput const* ptr_row; + ElementInput const* ptr_row = nullptr; if constexpr(IsArrayOfPointers) { - ptr_row = params.ptr_row[l]; + if (!(EnableNullptr && params.ptr_row == nullptr)) { + ptr_row = params.ptr_row[l]; + } } else { ptr_row = params.ptr_row; } @@ -1439,9 +1441,11 @@ struct Sm90ColBroadcast { auto layout_N = make_layout(N, repeat_like(N, _0{})); auto layout_L = make_layout(L, get<2>(params.dCol)); - ElementInput const* ptr_col; + ElementInput const* ptr_col = nullptr; if constexpr(IsArrayOfPointers) { - ptr_col = params.ptr_col[l]; + if (!(EnableNullptr && params.ptr_col == nullptr)) { + ptr_col = params.ptr_col[l]; + } } else { ptr_col = params.ptr_col; } diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp index ce841bf28b..93720f8d3d 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp @@ -116,6 +116,172 @@ sm90_partition_for_epilogue( // ///////////////////////////////////////////////////////////////////////////////////////////////// +// +// Producer load callbacks, called by the epilogue load warp. +// Operations usually only define this if TMA load is needed. 
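//
// Illustrative sketch (editor's addition, not part of the patch): the Sm90RowBroadcast and
// Sm90ColBroadcast hunks above make the per-batch pointer lookup safe when the broadcast input is
// disabled. With an array of per-group pointers, dereferencing params.ptr_row[l] while the whole
// array is nullptr would fault, so the lookup is skipped and the null pointer is kept for the
// EnableNullptr path downstream. Hypothetical standalone analogue:
//
#include <cassert>

template <bool EnableNullptr>
float const* resolve_row_ptr(float const* const* ptr_row_array, int l) {
  float const* ptr_row = nullptr;
  if (!(EnableNullptr && ptr_row_array == nullptr)) {
    ptr_row = ptr_row_array[l];   // only index into the array when the input is actually provided
  }
  return ptr_row;
}

int main() {
  float bias[2] = {0.5f, 1.5f};
  float const* per_group[1] = {bias};

  assert(resolve_row_ptr<true>(per_group, 0) == bias);   // normal grouped lookup
  assert(resolve_row_ptr<true>(nullptr, 0) == nullptr);  // disabled input: no [l] dereference
  return 0;
}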
Most operations will reuse this empy implementation +// Load callbacks are responsible for issuing corresponding mbarrier expect-tx ops for any TMA loads issued, but +// are not responsible for issuing the producer_commit barrier arrival, which is issued by the collective instead +// If this is non-empty, is_producer_load_needed must be true. +// +template +struct ProducerLoadCallbacksImpl { + // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables + CallbacksTuple callbacks_tuple; + + // Before entry of the subtile load loop + CUTLASS_DEVICE void + begin() { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.begin(); + } + ); + } + + // Entry of the subtile load loop. Aux loads usually performed here + // Upon entry the producer acquire of the current subtile lock has completed. + // Upon exit all TMA loads for this subtile must have been issued, with corresponding expect-tx operations + CUTLASS_DEVICE void + step(uint64_t* full_mbarrier_ptr, int epi_m, int epi_n, int load_iteration, bool issue_tma_load) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.step(full_mbarrier_ptr, epi_m, epi_n, load_iteration, issue_tma_load); + } + ); + } + + // Exit of the subtile load loop. + CUTLASS_DEVICE void + end() { + for_each(callbacks_tuple, + [] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.end(); + } + ); + } +}; + + +// +// Consumer store callbacks, called by the epilogue store warps. +// All operations must redefine this, with optional inheritance from this empty implementation. +// +template +struct ConsumerStoreCallbacksImpl { + // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables + CallbacksTuple callbacks_tuple; + + // Before entry of subtile store loop. Gmem broadcasts usually performed here. + CUTLASS_DEVICE void + begin() { + for_each(callbacks_tuple, + [] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.begin(); + } + ); + } + + // Is a thread sync needed after begin(). Allows chaining async copies across multiple nodes + CUTLASS_DEVICE bool + begin_sync_needed() const { + return cute::apply(callbacks_tuple, + [] (auto const&... callbacks) { + return (false || ... || callbacks.begin_sync_needed()); + } + ); + } + + // Start of subtile store iteration + CUTLASS_DEVICE void + begin_loop(int epi_m, int epi_n) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.begin_loop(epi_m, epi_n); + } + ); + } + + // Before visit callback. Smem broadcasts usually performed here. + // Upon entry, all producer loads for this subtile are completed and visible. + CUTLASS_DEVICE void + previsit(int epi_m, int epi_n, int load_iteration, bool is_producer_load_needed) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.previsit(epi_m, epi_n, load_iteration, is_producer_load_needed); + } + ); + } + + // Perform the fused elementwise computation + template + CUTLASS_DEVICE auto // returns an Array + visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n, + Array const&... frg_inputs) // depends on the N-naryness of the op + = delete; // Must be implemented for each operation + + // After visit call. 
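//
// Illustrative sketch (editor's addition, not part of the patch): ProducerLoadCallbacksImpl and
// ConsumerStoreCallbacksImpl above simply broadcast each hook to every node's callbacks stored in
// a tuple, and begin_sync_needed() ORs the per-node answers with a fold expression. Hypothetical
// standalone analogue using std::apply in place of the cute helpers:
//
#include <cstdio>
#include <tuple>

struct NodeA {
  void begin() { std::printf("A.begin\n"); }
  bool begin_sync_needed() const { return false; }
};
struct NodeB {
  void begin() { std::printf("B.begin\n"); }
  bool begin_sync_needed() const { return true; }   // e.g. this node issued an async copy
};

template <class CallbacksTuple>
struct CallbacksImpl {
  CallbacksTuple callbacks_tuple;

  void begin() {                                    // broadcast to every node, in tuple order
    std::apply([](auto&... cb) { (cb.begin(), ...); }, callbacks_tuple);
  }
  bool begin_sync_needed() const {                  // true if any node needs a post-begin sync
    return std::apply([](auto const&... cb) { return (false || ... || cb.begin_sync_needed()); },
                      callbacks_tuple);
  }
};

int main() {
  CallbacksImpl<std::tuple<NodeA, NodeB>> impl{ std::make_tuple(NodeA{}, NodeB{}) };
  impl.begin();
  std::printf("sync needed: %d\n", impl.begin_sync_needed());
  return 0;
}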
Smem reductions usually performed here + // reduction_buffer is an arbitrary smem tensor that can be used for workspace + // It is each nodes reponsibility to assert that this buffer is sufficiently sized + // and to ensure that this buffer is no longer needed upon callback exit + // i.e. results are synchronized and no longer in the reduction buffer + // + // visit_results is a rmem tensor that contains the results of visit() for an entire + // on the current epilogue subtile + template + CUTLASS_DEVICE void + reduce(STensor&& reduction_buffer, SyncFn const& sync_fn, int epi_m, int epi_n, bool is_last_iteration, VTensor visit_results) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.reduce(reduction_buffer, sync_fn, epi_m, epi_n, is_last_iteration, visit_results); + } + ); + } + + // After reduce call, before smem async fence. Smem stores usually performed here. + // Upon exit, all smem stores for TMA must have been issued + CUTLASS_DEVICE void + postreduce(int epi_m, int epi_n, int store_iteration, bool issue_smem_store) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.postreduce(epi_m, epi_n, store_iteration, issue_smem_store); + } + ); + } + + // After smem async fence, before TMA store commit. Aux stores usually performed here + // Upon exit, all TMA stores for this subtile must have been issued + // Because of the TMA store delay optimization, this entry point must ONLY be used for TMA stores + // other gmem stores can be placed in the reduce or postreduce entry points + CUTLASS_DEVICE void + tma_store(int epi_m, int epi_n, int store_iteration, bool issue_tma_store) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.tma_store(epi_m, epi_n, store_iteration, issue_tma_store); + } + ); + } + + // End of subtile store iteration + CUTLASS_DEVICE void + end_loop(int epi_m, int epi_n) { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.end_loop(epi_m, epi_n); + } + ); + } + + // Exit of subtile store loop. Gmem reductions usually performed here. + CUTLASS_DEVICE void + end() { + for_each(callbacks_tuple, + [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { + callbacks.end(); + } + ); + } +}; + template< class ProblemShapeMNKL, class TileShapeMNK, @@ -349,51 +515,6 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase { ); } - // - // Producer load callbacks, called by the epilogue load warp. - // Operations usually only define this if TMA load is needed. Most operations will reuse this empy implementation - // Load callbacks are responsible for issuing corresponding mbarrier expect-tx ops for any TMA loads issued, but - // are not responsible for issuing the producer_commit barrier arrival, which is issued by the collective instead - // If this is non-empty, is_producer_load_needed must be true. - // - template - struct ProducerLoadCallbacks { - // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables - CallbacksTuple callbacks_tuple; - - // Before entry of the subtile load loop - CUTLASS_DEVICE void - begin() { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.begin(); - } - ); - } - - // Entry of the subtile load loop. Aux loads usually performed here - // Upon entry the producer acquire of the current subtile lock has completed. 
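//
// Illustrative sketch (editor's addition, not part of the patch): the call order that the hook
// comments above describe, written out as a plain loop skeleton. The real collective interleaves
// pipeline barriers, the smem async fence, and TMA store commits between these calls, and drives
// visit()/reduce() with partitioned tensors; only the ordering is shown here, with hypothetical
// names.
//
#include <cstdio>

struct PrintCallbacks {   // minimal stand-in that just records the call order
  void begin() { std::puts("begin"); }
  void begin_loop(int, int) { std::puts("  begin_loop"); }
  void previsit(int, int, int, bool) { std::puts("  previsit"); }
  // visit()/reduce() omitted: visit is deleted in the base and defined per operation
  void postreduce(int, int, int, bool) { std::puts("  postreduce"); }
  void tma_store(int, int, int, bool) { std::puts("  tma_store"); }
  void end_loop(int, int) { std::puts("  end_loop"); }
  void end() { std::puts("end"); }
};

template <class Callbacks>
void store_loop_skeleton(Callbacks& cb, int epi_m_tiles, int epi_n_tiles) {
  cb.begin();                                   // gmem broadcasts happen here
  int store_iteration = 0;
  for (int epi_n = 0; epi_n < epi_n_tiles; ++epi_n) {
    for (int epi_m = 0; epi_m < epi_m_tiles; ++epi_m) {
      cb.begin_loop(epi_m, epi_n);
      cb.previsit(epi_m, epi_n, /*load_iteration*/ store_iteration, /*is_producer_load_needed*/ true);
      // ... visit() computes the fused output fragment, reduce() runs smem reductions ...
      cb.postreduce(epi_m, epi_n, store_iteration, /*issue_smem_store*/ true);
      // ... smem async fence is issued by the collective here ...
      cb.tma_store(epi_m, epi_n, store_iteration, /*issue_tma_store*/ true);
      cb.end_loop(epi_m, epi_n);
      ++store_iteration;
    }
  }
  cb.end();                                     // gmem reductions happen here
}

int main() {
  PrintCallbacks cb;
  store_loop_skeleton(cb, 1, 2);
  return 0;
}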
- // Upon exit all TMA loads for this subtile must have been issued, with corresponding expect-tx operations - CUTLASS_DEVICE void - step(uint64_t* full_mbarrier_ptr, int epi_m, int epi_n, int load_iteration, bool issue_tma_load) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.step(full_mbarrier_ptr, epi_m, epi_n, load_iteration, issue_tma_load); - } - ); - } - - // Exit of the subtile load loop. - CUTLASS_DEVICE void - end() { - for_each(callbacks_tuple, - [] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.end(); - } - ); - } - }; - // Producer load callbacks factory // All operations must redefine this, but most can just dispatch to the base impl template @@ -405,131 +526,11 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase { }, [] (auto&&... callbacks) CUTLASS_LAMBDA_FUNC_INLINE { auto callbacks_tuple = cute::make_tuple(callbacks...); - return ProducerLoadCallbacks{callbacks_tuple}; + return ProducerLoadCallbacksImpl{callbacks_tuple}; } ); } - // - // Consumer store callbacks, called by the epilogue store warps. - // All operations must redefine this, with optional inheritance from this empty implementation. - // - template - struct ConsumerStoreCallbacks { - // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables - CallbacksTuple callbacks_tuple; - - // Before entry of subtile store loop. Gmem broadcasts usually performed here. - CUTLASS_DEVICE void - begin() { - for_each(callbacks_tuple, - [] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.begin(); - } - ); - } - - // Is a thread sync needed after begin(). Allows chaining async copies across multiple nodes - CUTLASS_DEVICE bool - begin_sync_needed() const { - return cute::apply(callbacks_tuple, - [] (auto const&... callbacks) { - return (false || ... || callbacks.begin_sync_needed()); - } - ); - } - - // Start of subtile store iteration - CUTLASS_DEVICE void - begin_loop(int epi_m, int epi_n) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.begin_loop(epi_m, epi_n); - } - ); - } - - // Before visit callback. Smem broadcasts usually performed here. - // Upon entry, all producer loads for this subtile are completed and visible. - CUTLASS_DEVICE void - previsit(int epi_m, int epi_n, int load_iteration, bool is_producer_load_needed) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.previsit(epi_m, epi_n, load_iteration, is_producer_load_needed); - } - ); - } - - // Perform the fused elementwise computation - template - CUTLASS_DEVICE auto // returns an Array - visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n, - Array const&... frg_inputs) // depends on the N-naryness of the op - = delete; // Must be implemented for each operation - - // After visit call. Smem reductions usually performed here - // reduction_buffer is an arbitrary smem tensor that can be used for workspace - // It is each nodes reponsibility to assert that this buffer is sufficiently sized - // and to ensure that this buffer is no longer needed upon callback exit - // i.e. 
results are synchronized and no longer in the reduction buffer - // - // visit_results is a rmem tensor that contains the results of visit() for an entire - // on the current epilogue subtile - template - CUTLASS_DEVICE void - reduce(STensor&& reduction_buffer, SyncFn const& sync_fn, int epi_m, int epi_n, bool is_last_iteration, VTensor visit_results) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.reduce(reduction_buffer, sync_fn, epi_m, epi_n, is_last_iteration, visit_results); - } - ); - } - - // After reduce call, before smem async fence. Smem stores usually performed here. - // Upon exit, all smem stores for TMA must have been issued - CUTLASS_DEVICE void - postreduce(int epi_m, int epi_n, int store_iteration, bool issue_smem_store) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.postreduce(epi_m, epi_n, store_iteration, issue_smem_store); - } - ); - } - - // After smem async fence, before TMA store commit. Aux stores usually performed here - // Upon exit, all TMA stores for this subtile must have been issued - // Because of the TMA store delay optimization, this entry point must ONLY be used for TMA stores - // other gmem stores can be placed in the reduce or postreduce entry points - CUTLASS_DEVICE void - tma_store(int epi_m, int epi_n, int store_iteration, bool issue_tma_store) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.tma_store(epi_m, epi_n, store_iteration, issue_tma_store); - } - ); - } - - // End of subtile store iteration - CUTLASS_DEVICE void - end_loop(int epi_m, int epi_n) { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.end_loop(epi_m, epi_n); - } - ); - } - - // Exit of subtile store loop. Gmem reductions usually performed here. - CUTLASS_DEVICE void - end() { - for_each(callbacks_tuple, - [&] (auto& callbacks) CUTLASS_LAMBDA_FUNC_INLINE { - callbacks.end(); - } - ); - } - }; - // Consumer store callbacks factory // All operations must redefine this template < @@ -544,7 +545,7 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase { }, [] (auto&&... 
callbacks) CUTLASS_LAMBDA_FUNC_INLINE { auto callbacks_tuple = cute::make_tuple(callbacks...); - return ConsumerStoreCallbacks{callbacks_tuple}; + return ConsumerStoreCallbacksImpl{callbacks_tuple}; } ); } @@ -553,8 +554,8 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase { ///////////////////////////////////////////////////////////////////////////////////////////////// // Convenience aliases -using EmptyProducerLoadCallbacks = Sm90VisitorImpl<>::ProducerLoadCallbacks>; -using EmptyConsumerStoreCallbacks = Sm90VisitorImpl<>::ConsumerStoreCallbacks>; +using EmptyProducerLoadCallbacks = ProducerLoadCallbacksImpl>; +using EmptyConsumerStoreCallbacks = ConsumerStoreCallbacksImpl>; ///////////////////////////////////////////////////////////////////////////////////////////////// @@ -614,9 +615,9 @@ struct Sm90TreeVisitor : Sm90VisitorImpl { > CUTLASS_DEVICE auto get_consumer_store_callbacks(ConsumerStoreArgs const& args) { - auto callbacks_tuple = Sm90VisitorImpl:: + auto callbacks_impl = Sm90VisitorImpl:: template get_consumer_store_callbacks(args); - return ConsumerStoreCallbacks(std::move(callbacks_tuple)); + return ConsumerStoreCallbacks(cute::move(callbacks_impl)); } }; @@ -663,9 +664,9 @@ struct Sm90SplitTreeVisitor : Sm90VisitorImpl CUTLASS_DEVICE auto get_consumer_store_callbacks(ConsumerStoreArgs const& args) { - auto callbacks_tuple = Sm90VisitorImpl:: + auto callbacks_impl = Sm90VisitorImpl:: template get_consumer_store_callbacks(args); - return ConsumerStoreCallbacks(std::move(callbacks_tuple)); + return ConsumerStoreCallbacks(cute::move(callbacks_impl)); } }; ///////////////////////////////////////////////////////////////////////////////////////////////// @@ -739,9 +740,9 @@ struct Sm90TopologicalVisitor : Sm90VisitorImpl { > CUTLASS_DEVICE auto get_consumer_store_callbacks(ConsumerStoreArgs const& args) { - auto callbacks_tuple = Sm90VisitorImpl:: + auto callbacks_impl = Sm90VisitorImpl:: template get_consumer_store_callbacks(args); - return ConsumerStoreCallbacks(std::move(callbacks_tuple)); + return ConsumerStoreCallbacks(cute::move(callbacks_impl)); } }; diff --git a/include/cutlass/epilogue/thread/activation.h b/include/cutlass/epilogue/thread/activation.h index 04935e3421..44c606c4ea 100644 --- a/include/cutlass/epilogue/thread/activation.h +++ b/include/cutlass/epilogue/thread/activation.h @@ -52,6 +52,18 @@ namespace thread { ///////////////////////////////////////////////////////////////////////////////////////////////// +// If kIsHeavy is a member, use it. Otherwise, assume that it's false. +template +struct kIsHeavy_member_or_false { + static constexpr bool value = false; +}; +template +struct kIsHeavy_member_or_false::type> { + static constexpr bool value = Op::kIsHeavy; +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// + // Identity operator template struct Identity { @@ -113,6 +125,8 @@ template
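//
// Illustrative sketch (editor's addition, not part of the patch): the activation.h hunk above adds
// a trait that reads Op::kIsHeavy when that member exists and otherwise defaults to false. The
// standard member-detection idiom below (std::void_t) shows the same behavior; the in-tree
// implementation may differ in detail, and the operator types here are made up.
//
#include <cstdio>
#include <type_traits>

template <class Op, class = void>
struct kIsHeavy_member_or_false_sketch {
  static constexpr bool value = false;          // no Op::kIsHeavy member: assume "not heavy"
};

template <class Op>
struct kIsHeavy_member_or_false_sketch<Op, std::void_t<decltype(Op::kIsHeavy)>> {
  static constexpr bool value = Op::kIsHeavy;   // member present: use it
};

struct HeavyOp { static constexpr bool kIsHeavy = true; };
struct LightOp { /* no kIsHeavy member */ };

static_assert(kIsHeavy_member_or_false_sketch<HeavyOp>::value);
static_assert(!kIsHeavy_member_or_false_sketch<LightOp>::value);

int main() {
  std::printf("HeavyOp heavy: %d, LightOp heavy: %d\n",
              kIsHeavy_member_or_false_sketch<HeavyOp>::value,
              kIsHeavy_member_or_false_sketch<LightOp>::value);
  return 0;
}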