Cutlass 3.9.2 #371

aacostadiaz · 2025-05-14T11:18:45Z

This PR adds the changes from Cutlass 3.9.2

* v3.9 update
* voidD

Co-authored-by: yuzhai <[email protected]>

Co-authored-by: yuzhai <[email protected]>

remove useless code

Co-authored-by: wenju.li <[email protected]> Co-authored-by: Haicheng Wu <[email protected]>

* add support for sm89 in cute and the unit tests
* rebase v3.9 and format code
* minor fix

Co-authored-by: Haicheng Wu <[email protected]>

Co-authored-by: Haicheng Wu <[email protected]>

Some typos in comments

…2234)

* Fix broken links in cluster launch control docs
* Improve titles and alt text

…IA#2219) With the usual register allocation (producer 40, consumer 232), compiling a Gemm with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) produces heavy register spilling (e.g. ~3000 bytes of spills). For this case we can change the register allocation to producer 24, consumer 240, which avoids the spills.
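The two register allocations quoted above can be compared with a little arithmetic. The sketch below assumes a layout that the commit message does not state: one producer warp group and two consumer warp groups of 128 threads each, and a register file of 64K 32-bit registers per SM. Under those assumptions both settings use the same total, so the change only shifts registers from the producer to the consumers. The function name `registers_used` is illustrative and is not part of CUTLASS.

```python
# Assumptions (not stated in the commit message): 128 threads per warp group,
# 1 producer warp group, 2 consumer warp groups, 64K registers per SM.
THREADS_PER_WARP_GROUP = 128
REGISTERS_PER_SM = 64 * 1024


def registers_used(producer_regs, consumer_regs,
                   producer_groups=1, consumer_groups=2):
    """Total registers claimed by all warp groups of one thread block."""
    per_thread_total = (producer_groups * producer_regs
                        + consumer_groups * consumer_regs)
    return THREADS_PER_WARP_GROUP * per_thread_total


default_total = registers_used(40, 232)  # usual allocation
tuned_total = registers_used(24, 240)    # allocation used for TileN = 208

print(default_total, tuned_total, REGISTERS_PER_SM)
```

Each consumer thread gains 8 registers while the producer thread gives up 16, which is the trade the commit message describes.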

If TileN is not divisible by 32 (e.g., 208), EpiTile would default to 128 x 32, which does not compile because EpiTileN is required to divide TileN.
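The divisibility constraint can be illustrated with a small helper. This is only a sketch of the idea, not the logic the fix actually adds to CUTLASS: the function `pick_epi_tile_n` and its candidate widths (32, 16, 8) are hypothetical and chosen for illustration.

```python
def pick_epi_tile_n(tile_n, candidates=(32, 16, 8)):
    """Return the widest candidate epilogue tile width that divides tile_n.

    Hypothetical helper: the candidate list is an assumption made for this
    example, not the set of widths CUTLASS considers.
    """
    for width in candidates:
        if tile_n % width == 0:
            return width
    raise ValueError(f"no candidate epilogue width divides TileN={tile_n}")


# 208 is not a multiple of 32, so a fixed width of 32 cannot tile it evenly.
print(208 % 32)              # remainder is non-zero
print(pick_epi_tile_n(208))  # falls back to a narrower width
print(pick_epi_tile_n(256))  # the default width still works here
```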

* Update config.hpp
* Update config.hpp
* Update config.hpp

* cutlass 3.9 update
* rebase
* fixes out of shared memory for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity

Co-authored-by: yuzhai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>

…DIA#2270)

…dule. (NVIDIA#2256) Co-authored-by: Yuhang Qi <[email protected]>

* Lazy cuda import
* More lazy cuda import
* More lazy cuda imports
* minor fixes

Co-authored-by: Haicheng Wu <[email protected]>

Co-authored-by: Jiazhen Han <[email protected]>

Adds "Generalized Neighborhood Attention" to list of publications using CUTLASS. https://arxiv.org/abs/2504.16922 Co-authored-by: Ali Hassani <[email protected]>

* 3.9.2 doc/version
* whitespace

# Conflicts:
# examples/CMakeLists.txt
# examples/README.md
# include/cute/arch/tmem_allocator_sm100.hpp
# include/cutlass/epilogue/dispatch_policy.hpp
# include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp
# include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp
# include/cutlass/gemm/dispatch_policy.hpp
# include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp
# include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp
# include/cutlass/gemm/kernel/tile_scheduler.hpp
# media/docs/cpp/build/building_with_sycl_support.md
# python/cutlass/backend/operation.py
# python/cutlass/backend/utils/device.py
# python/cutlass/library_defaults.py
# python/cutlass/op/gemm.py
# test/unit/common/filter_architecture.cpp
# tools/util/include/cutlass/util/reference/device/tensor_compare.h

Add Xe Group Scheduler

yzhaiustc and others added 30 commits April 2, 2025 15:11

v3.9 update (NVIDIA#2203)

6f49218

* v3.9 update
* voidD

Co-authored-by: yuzhai <[email protected]>

v3.9 update (NVIDIA#2213)

79fc51f

Co-authored-by: yuzhai <[email protected]>

Update mma_atom.hpp (NVIDIA#2159)

df8a550

remove useless code

[Doc]fix typo (NVIDIA#2174)

09df6ac

Co-authored-by: wenju.li <[email protected]> Co-authored-by: Haicheng Wu <[email protected]>

add support for sm89 in cute and the unit tests (NVIDIA#2177)

19cc2a5

* add support for sm89 in cute and the unit tests
* rebase v3.9 and format code
* minor fix

Co-authored-by: Haicheng Wu <[email protected]>

[Doc] Make C++ code more plausible (NVIDIA#2156)

dd76dec

Co-authored-by: Haicheng Wu <[email protected]>

suppress compilation warnings (NVIDIA#2195)

5120b21

fix-left-inverse-for-nvcc114 (NVIDIA#2196)

9e1b649

Update tile_iterator.cu (NVIDIA#2204)

b3f3c77

Some typos in comments

fix: fig link in cute docs (NVIDIA#2216)

5e49724

Fix broken links and alt text in cluster launch control docs (NVIDIA#…

bb4dd68

…2234)

* Fix broken links in cluster launch control docs
* Improve titles and alt text

Set EpiTile correctly when TileN is not divisible by 32 (NVIDIA#2220)

81a43e6

If TileN is not divisible by 32 (e.g., 208), EpiTile would default to 128 x 32, which does not compile because EpiTileN is required to divide TileN.

fix_missing_stdint (NVIDIA#2199)

8e345c5

* Update config.hpp
* Update config.hpp
* Update config.hpp

Update README.md for 3.9

f02a7c2

Update CHANGELOG.md for 3.9

be73ad2

Update CHANGELOG.md

e94e888

fix blackwell grouped groupwise hang (NVIDIA#2267)

6971260

cherry-pick feature/hopper-blockwise-generalization-optimization (NVI…

2b78c2f

…DIA#2270)

Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation sche…

e5b810b

…dule. (NVIDIA#2256) Co-authored-by: Yuhang Qi <[email protected]>

Fix wrong detection of python version for use_rmm. (NVIDIA#2224)

35136f5

Import pydot lazily (NVIDIA#2248)

fe75ead

Make cc a positional argument (NVIDIA#2249)

b3ce7e1

Lazy scipy import (NVIDIA#2250)

c4bdfe8

Import cuda, cudart, nvrtc lazily (NVIDIA#2251)

e3cb8a7

* Lazy cuda import
* More lazy cuda import
* More lazy cuda imports
* minor fixes

Co-authored-by: Haicheng Wu <[email protected]>

3.9.1 doc/version change (NVIDIA#2273)

f535c33

Fix group scale gemm when K==128 (NVIDIA#2275)

89f6bf2

Co-authored-by: Jiazhen Han <[email protected]>

[CUTLASS] Add GNA to PUBLICATIONS.md (NVIDIA#2276)

40f124e

Adds "Generalized Neighborhood Attention" to list of publications using CUTLASS. https://arxiv.org/abs/2504.16922 Co-authored-by: Ali Hassani <[email protected]>

3.9.2 doc/version (NVIDIA#2279)

ad7b2f5

* 3.9.2 doc/version
* whitespace

aacostadiaz and others added 11 commits May 13, 2025 16:59

Use gpu_generics functions

443c793

Add Xe Group Scheduler

aedee57

Merge pull request #1 from muhammad-tanvir-1211/xe_group_scheduler

93b3324

Add Xe Group Scheduler

Fix tests

692e584

Merge remote-tracking branch 'origin/aacosta/3.9.2' into aacosta/3.9.2

33b332f

Merge branch 'sycl-develop' into aacosta/3.9.2

b53d801

fix python

c2efea2

Merge remote-tracking branch 'origin/aacosta/3.9.2' into aacosta/3.9.2

3e47ace

fix python

a121708

fix python

f909ce4

aacostadiaz marked this pull request as ready for review May 20, 2025 17:27

aacostadiaz merged commit dee3370 into codeplaysoftware:sycl-develop May 20, 2025
12 of 14 checks passed
