
Conversation

@drisspg (Collaborator) commented Dec 21, 2025

Summary

  • Update to dsl 3.4.3

Confirmed this fixes pytorch/pytorch#169921 and also fixes #2084.

:)
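
For reference, a quick way to confirm which DSL version is installed locally. This is a minimal sketch and assumes the DSL ships as the `nvidia-cutlass-dsl` PyPI package, which is not stated in this PR:

```python
# Minimal version check (assumption: the CuTe DSL is distributed as the
# "nvidia-cutlass-dsl" package; adjust the name if your environment differs).
import importlib.metadata

print(importlib.metadata.version("nvidia-cutlass-dsl"))  # expect "3.4.3" after this update
```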

@tridao merged commit ceb4110 into Dao-AILab:main on Dec 22, 2025
Fridge003 added a commit to Fridge003/sgl-flash-attn that referenced this pull request Dec 30, 2025
0xDELUXA pushed a commit to 0xDELUXA/flash-attention that referenced this pull request Jan 24, 2026
LucasWilkinson added a commit to vllm-project/flash-attention that referenced this pull request Jan 29, 2026
* Remove old xentropy kernel

This hasn't been used since 2023-09

* Remove old fused softmax kernel from apex/Megatron

* Remove old attn decode kernel from FasterTransformer

* Remove old rotary kernel

* [Cute] Implement page table with TMA for fwd_sm100

* [Cute] Remove trailing bracket (Dao-AILab#1809)

This fixes Commit 81cdf4c

* [Cute] Make sure R2P happens

* feat: add support for pytorch2.8 (Dao-AILab#1801)

* [Cute] Implement PackGQA with TMA for fwd_sm100

Credit: Jay Shah's idea

* Bump to v2.8.3

* [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen (Dao-AILab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), and thus
`cache_seqlen` was expanded to shape (num_blocks) instead of (batch_size),
which is wrong. This fix uses the shape of `q`, whose first dimension is
always `batch_size`.
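
A minimal sketch of the expansion described above (the helper name is hypothetical; the real logic lives in the flash-attn interface): a scalar `cache_seqlens` is broadcast to one entry per batch element, sized from `q` rather than from `k_cache`.

```python
import torch

def expand_cache_seqlens(cache_seqlens, q, k_cache):
    """Sketch: expand a scalar cache_seqlens to a vector of shape (batch_size,)."""
    if isinstance(cache_seqlens, int):
        # Use q.shape[0], which is always batch_size, instead of k_cache.shape[0],
        # which is num_blocks whenever a block_table / paged KV cache is used.
        batch_size = q.shape[0]
        cache_seqlens = torch.full(
            (batch_size,), cache_seqlens, dtype=torch.int32, device=q.device
        )
    return cache_seqlens
```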

* [Cute] Port fwd_combine kernel from C++ to cute-dsl

* [Cute] Simplify tile scheduler storing params

* [Cute] Implement sink for fwd_sm90

* [Cute] Implement PackGQA with TMA for fwd_sm90

* [Cute] Use R2P for masking in fwd_sm90

Actually doesn't seem to make it faster

* Add sorting and head swizzle to varlen scheduler (Dao-AILab#1823)

* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
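
The LPT ("longest processing time first") order mentioned above can be pictured with a small host-side sketch (names hypothetical; the real sort runs on device in the prepare kernel): batches are visited in order of decreasing sequence length so the longest work items start first and the scheduling tail is reduced.

```python
import torch

def lpt_order(seqlens_k: torch.Tensor) -> torch.Tensor:
    """Sketch: batch indices sorted by descending K length (LPT heuristic)."""
    # Issuing the longest batches first avoids one long sequence finishing
    # last while the remaining SMs sit idle.
    return torch.argsort(seqlens_k, descending=True)

# Example: K lengths [128, 4096, 512, 2048] are scheduled as batches [1, 3, 2, 0].
```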

* Fixes incorrect variable reference in comment (Dao-AILab#1775)

Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.

* Update the initialization of dk/dv_semaphore (Dao-AILab#1839)

When testing the deterministic option for the GQA case, we found it could fall into a deadlock. Initializing dk_semaphore and dv_semaphore to zeros fixes this issue.
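
A rough sketch of the fix, assuming the semaphores are int32 device tensors (shapes here are illustrative; the actual allocation happens in the C++ API): zero-initializing them gives every CTA a defined starting value, so no consumer waits forever on garbage.

```python
import torch

def alloc_dkv_semaphores(batch_size: int, num_heads_k: int, device):
    # Sketch: use zeros (a defined initial state) rather than uninitialized
    # memory, whose leftover contents can leave waiters stuck -> deadlock.
    dk_semaphore = torch.zeros(batch_size, num_heads_k, dtype=torch.int32, device=device)
    dv_semaphore = torch.zeros(batch_size, num_heads_k, dtype=torch.int32, device=device)
    return dk_semaphore, dv_semaphore
```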

* Update tile_scheduler.hpp (Dao-AILab#1841)

* ci: Move build job to workflow template (Dao-AILab#1835)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Build via workflow template (Dao-AILab#1844)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Switch to workflow_dispatch (Dao-AILab#1847)

* [`FA3`] Allow returning LSE via kwarg (Dao-AILab#1851)

* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse

* [BugFix] fix flash_fwd.FlashAttentionForwardSm80  bugs (Dao-AILab#1856)

* [BugFix] fix softcap condition

softcap should only be referenced when it is not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional, to align with sm_100 and all other interfaces.
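
In Python terms the corrected guard looks roughly like this (a sketch, not the kernel code; the helper name is hypothetical):

```python
import torch

def apply_softcap(scores: torch.Tensor, softcap: float | None = None) -> torch.Tensor:
    # Corrected condition: softcap is only referenced when it is not None.
    # (The bug was the reversed check, which dereferenced softcap when it was None.)
    if softcap is not None and softcap > 0.0:
        scores = softcap * torch.tanh(scores / softcap)
    return scores
```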

* Fix typo of range_constexpr

* Fix seqlen

* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs

* make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860)

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
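
The change amounts to swapping floor division for ceiling division when deriving the warp-group count; a sketch in Python for clarity (the actual code is C++):

```python
NUM_THREADS_PER_WARPGROUP = 128

def num_consumer_warpgroups_per_cluster(num_consumers: int) -> int:
    # Floor division gave 32 // 128 == 0, which broke barrier initialization.
    # Round-up division guarantees at least one warp group.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

assert num_consumer_warpgroups_per_cluster(32) == 1
assert num_consumer_warpgroups_per_cluster(256) == 2
```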

* [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865)

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml

* benchmark: qualify all attention backends by methods list (Dao-AILab#1881)

* ABI stable fa3 (Dao-AILab#1791)

* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=

* [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882)

* fix typo

* Update setup.py

* Update setup.py

* Update setup.py

* Update setup.py

* fix typo in flops calculation for local attention (Dao-AILab#1883)

* flash-attn-cute bwd sm90 (Dao-AILab#1868)

* [Cute] Make testing utils standalone for cute (Dao-AILab#1892)

* Bump pin for CuTeDSL (Dao-AILab#1891)

* Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893)

* add spt scheduler for causal bwd determinism

* add new torch check for det hdim 256 to stable api

* Upgrade to cutlass v4.2.1 (Dao-AILab#1905)

* switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906)

* Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908)

Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>

* C++11 fix warnings (Dao-AILab#1904)

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Update flash_api_stable.cpp

* upstream cutlass v4.2.1 csrc

* [Cute] Write ex2 emulation in a more readable form

* [Cute] Simplify utils.py a bit

* [Cute] Remove arith & vector import in utils.py

* [CuteDSL] Fix test (Dao-AILab#1925)

* Refactors to enable FlexAttention (Dao-AILab#1840)

* Refactors to enable FlexAttention

* Thread throught the buffers to the score_mod

* add-test

* add fastdivmod

* comments

* comments

* [Cute] Fix softmax for cutlass-dsl==4.2.1

* [Cute] Fix softmax for fwd_sm100

* [Cute,Bwd] Simplify bwd_preprocessing kernel

* [Cute,Fwd,Sm90] Simplify by passing around functions

* [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn

* [Cute] Optionally dump cubin and sass

* [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n

* [Cute,Bwd,Sm90] Format file w ruff

* [Cute,Bwd,Sm90] Fix bwd dK & dV, more async

* [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum

* [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V

* [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead

* [Cute,Bwd,Sm90] Use TMA to store dK & dV

* [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration

* [Cute,Sm90] Move gemm helper functions to hopper_helpers.py

* Swap masking to not use R2P

* Pre-indent to make commit diffs readable

* Adding varlen support + tests

* Remove self refs in softmax for loop (Dao-AILab#1924)

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

* [Cute,Bwd,Sm90] Make postprocessing kernel work

* [Cute] Run ruff format on bwd files

* [CI] Add pre-commit GH action

* [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1

* [Cute,Bwd,Sm90] Make causal work

* [Cute,Bwd,Sm90] Implement dQ_swapAB

* [Cute,Bwd,Sm90] Implement SdP_swapAB

* [AMD] Torch Compile Issues (Dao-AILab#1756)

* fix rounding and dropout metadata bug

* fix lse shape and bug in interface

* return softmax is true

* [Cute,Bwd,Sm90] Implement mma_dkv_is_rs

* [Cute,Bwd,Sm90] Use block size 80x128

* [CUTE] Enable Pack GQA for score mods (Dao-AILab#1937)

* Add precommit list and then uncomment in chunks (Dao-AILab#1941)

* create list to work through

* include ampere

* [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944)

See pytorch/pytorch#151845.
pytorch has removed caffe2, but hipify still contained
work-arounds for caffe2 vs torch compatibility.
As a result of hipify v2 changes, some torch APIs are changing.

* [Cute] Add flake8 config file

* [Cute,Fwd,Sm90] Load Q & K using the same mbarrier

* [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage

* [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups

* [Cute,Bwd,Sm90] Implement masking

* [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in

* [Cute,Fwd,Sm100] Clean up

* [Cute,Fwd,Sm100] Clean up mask

* [Cute] Reformat blackwell_helpers.py, block_info.py

* [Cute] Format mma_sm100_desc.py, seqlen_info.py

* sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945)

* [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess

* [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n}

* [Cute,Bwd,Sm100] Clean up a bit

* add barrier module (Dao-AILab#1946)

* [Cute,Bwd,Sm100] Have a separate function to set up the mma

* [Cute,Bwd,Sm100] Load LSE with cpasync_bulk

* [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk

* [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO

* [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration

* [Cute,Bwd,Sm100] Simplify mma by using functools.partial

* [Cute,Bwd,Sm100] Don't need q_dk_consumer_state

* [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier

* [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max

* [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem

* [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one

* [Cute,Bwd,Sm100] All compute warps wait for lse_barrier

* [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle

* [Cute,Bwd,Sm100] Try gemm_ptx

* [Cute,Bwd,Sm100] Clean up compute fn

* [Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1

* [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables

* [Cute,Bwd,Sm100] Hardcode dS_stage = 1

* [Cute,Bwd,Sm100] Add option for delay tma store

* Fix hopper cuda 13 build (Dao-AILab#1949)

* [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953)

* Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* cutlass v4.3.0 (Dao-AILab#1952)

* [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx

* [Cute,Bwd,Sm100] More cleanup

* [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* rename buffers -> aux_tensors, fix score_mod test in sm90 fwd

* fix mask mod interface issues and tests

* remove newline at end of file

* format with ruff

* format mask & sm100 with ruff

* format more files with ruff

* format barrier.py with ruff

* Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957)

The ABI stable implementation incorrectly used getCurrentStream().id()
which returns a StreamId (int64_t) instead of the actual cudaStream_t
pointer. Casting an integer ID to a stream pointer caused segmentation
faults when using custom CUDA streams.

Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream()
which returns the actual CUDA stream pointer.

* [Cute,Fwd,Sm100] Fix interface w score mod to get it to run

* [Cute,Sm100] In gemm ptx, add to base smem_address instead

* [Cute,Bwd,Sm100] Make postprocessing work, add interface

* [Cute,Bwd,Sm100] Simplify layouts in compute_loop

* [Cute,Bwd,Sm100] Causal mask

* [Cute,Bwd,Sm100] Enable bwd tests

* [Cute,Bwd] Enable bwd benchmarks

* [Cute] Add store_shared_remote_fp32x4 util function

* [Cute,Bwd,Sm100] Tune registers

* [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr

* [Cute,Bwd,Sm100] Reduce sync

* [Cute] Change utils.view_transpose back

* [Cute,Bwd,Sm100] Remove delay_tma_store option

* [Cute,Bwd,Sm100] Implement cluster

Co-authored-by: Ted Zadouri <tz6037@princeton.edu>

* [Cute] Copy benchmark util functions to cute directory

Easier to benchmark without having to install FA2

* [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum

* [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS

* [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load

* [Cute] Blocks tweaks (Dao-AILab#1964)

* [Cute,Bwd,Sm100] Use TS MMA for dK

* [Cute,Blocksparse] Group block sparse input torch tensors

* [Cute,Bwd,Sm100] Separate mma_S and mma_dP

* [Cute,Bwd,Sm100] Try LPTBwdScheduler

* [Cute,Bwd,Sm100] Try separating warps loading Q and dO

* BlockSparse Tweaks (Dao-AILab#1970)

* Tweaks

* better errors

* Switch to new API

* [Cute] Fix main (Dao-AILab#1982)

* [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940)

* Implement split KV

* Remove modal bench harness

* Fixes

* [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984)

- Create block_sparse_utils.py with SM80/90 block-sparse logic
- Refactor flash_fwd.py to use extracted utilities
- Clean up whitespace in block_sparsity.py

This extracts the block-sparse consumer loop and related utilities
from flash_fwd.py into a reusable module for SM80/90 architectures.

* Enable python-3.10+ (Dao-AILab#1998)

* [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004)

* add gqa for sm100 bwd

* remove mha guard for test

* change to cluster size 1

* [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006)

* [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983)

* begin block sparsity computation kernel

* block sparsity computation kernel and benchmark working

* loop range_constexpr

* add fast kernel

* merge fast and regular kernel

* use TensorSSA approach to mask mod

* update with OOB check

* tests and benchmarks for block sparsity working

* remove extraneous files

* Revert mask.py to previous state - removing unintended changes from block sparsity work

* remove flex attn test stub

* add sleeps to benchmark

* correct block sparsity benchmark to use torch.compile

* Restore missing mask definitions and fix benchmark window_size handling

* move benchmarks into new directory

* compute_block_sparsity docstring

* streamline compute block sparsity benchmark script

* [NVIDIA] bump github actions (Dao-AILab#1996)

* Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support

* revert changes

* revert

* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

* cuda-toolkit@v0.2.29

* [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999)

* modal bench and correctness

* implement for one thread per row

* coalesced(?) gmem loads

* use cp async

* use 64 threads to load

* fill in smem for V

* pass tests

* fixes

* removed extra files

* handle V loading for n_block < 0

* Add torch.compile support to flash attention 3

* Don't return mutated variables in mha_bwd

* Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage

* Remove print statements and update exception message

* Fix flash_attn_backward_fake

* Add `safe_aot_autograd_check`

* Update namespace to flash_attn_3

* Add `flash_attn_forward.register_autograd`

* Fix bug in `flash_attn_backward_fake`

* Add support and tests for torch.export and aoti_compile_and_package

* format code

* update flash_api_stable.cpp

* Fix flash_api_stable.cpp build

* Only run schema_check if dtype is not float8_e4m3fn

* Correctly compute kBlockM for sm88/86/80

* Fix bug in boxed_mha_bwd

* don't run autograd_check when num_splits > 0
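
The torch.compile items above follow the standard PyTorch custom-op pattern (custom op + fake/meta kernel + autograd registration). A self-contained toy version, not the real flash_attn_3 registrations:

```python
import torch

# Toy op illustrating the pattern; the namespace and name are illustrative only.
@torch.library.custom_op("toy_flash::square", mutates_args=())
def toy_square(x: torch.Tensor) -> torch.Tensor:
    return x * x

@toy_square.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Fake (meta) implementation: shapes and dtypes only, no real compute.
    return torch.empty_like(x)

def _setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def _backward(ctx, grad_out):
    (x,) = ctx.saved_tensors
    return 2 * x * grad_out

toy_square.register_autograd(_backward, setup_context=_setup_context)

# The op is now traceable by torch.compile / torch.export without graph breaks.
y = toy_square(torch.randn(4, requires_grad=True))
y.sum().backward()
```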

* [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985)

- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.
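
A small sketch of the block-size bookkeeping described above (function names hypothetical): on SM100 one CTA covers 2 * tile_m query rows, so the effective m block size used for sparsity indexing is doubled.

```python
def effective_m_block_size(tile_m: int, arch: int) -> int:
    # Sketch: SM100 (arch 100) processes 2 * tile_m rows per CTA.
    return 2 * tile_m if arch >= 100 else tile_m

def num_m_blocks(seqlen_q: int, tile_m: int, arch: int) -> int:
    m_block = effective_m_block_size(tile_m, arch)
    return (seqlen_q + m_block - 1) // m_block  # ceiling division
```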

* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests

* Raise TypeError if out is specified when compiling _flash_attn_forward

* add fastdivmod for oob reads in mask_mods (Dao-AILab#2020)

* add fastdivmod for oob reads in mask_mods

* Updates for h100

* don't pass mask_fn to softmax_step generically (Dao-AILab#2026)

* swap order of decorators (Dao-AILab#2029)

* [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default

* [NFC] Trivial fix to silence linter (Dao-AILab#1928)

Not much to see here, but this causes linter noise

* Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032)

* [Cute] Add authors

* [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031)

* Bump pin (Dao-AILab#2025)

* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install

* ruff all the smaller files (Dao-AILab#2040)

* [Flash] Fix head dim 64 bwd (Dao-AILab#2035)

* Add headdim64 tests (Dao-AILab#2041)

* [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046)

* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults

* Add hash attr to shortcut expensive check (Dao-AILab#2048)

* [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052)

* Update CK and c++ version

* update CK

* update ck

* Update comment to reflect qscale_type in fmha_fwd_traits

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>

* fixing cute bwd func def (Dao-AILab#2056)

* Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. (Dao-AILab#2063)

* [CUTE] Allow grads to be preallocated (Dao-AILab#2065)

* [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043)

* rebase to main

* varlen support for score mod

* interface change for varlen score mod

* implement varlen support for score mod

* varlen score mod working; updated tests

* modify varlen score mod to use fastdiv_mods updated per sequence

* updated test suite

* current working state of varlen score mod

* refactor varlen score mod tests

* fix to transpose

* refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel

* refactor test_score_mod.py to use external score mod definition file

* update flash_fwd.py for varlen score mod

* sm90 varlen score mod working; test revisions

* enable packgqa for varlen score mod; set up fastdiv_mod recomputation

* update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test

* Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions

* rebase to main

* fix test rebase artifacts

* fix floor_if_packed redundancy

* correct sm90 divmods mismatch

* revert test_flash_attn to main

* add varlen score mod benchmark script

* packgqa for varlen (independent of score mod)

* rm benchmark from PR

* move score mod arg wrapping to utils.py

* format with ruff

* major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly

* reinstate varlen packgqa exclusion checks

* move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support

* remove duplicate fastdiv_mod recomputation

* [Fix] fastdiv_mods for paged attn and seqused_*

* clean up PR; fix paged_kv varlen for sm90

* update to varlen score mod test script (paged kv)

* remove premature seqlen arguments from sm90 apply_mask_mod

* [CUTE] Seeing if tvvm reduces cpu overhead (Dao-AILab#2042)

* [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072)

* basics working (Dao-AILab#2070)

* Blocksparse impl (Dao-AILab#2085)

* Fix IMA in fwd on m boundary (Dao-AILab#2091)

* Fix IMA in fwd on m boundary

* Fix completely OOB loads

* Update to dsl 3.4.3 (Dao-AILab#2092)

* README for AMD ROCm (Dao-AILab#2068)

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

---------

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* fix shuffle sync for pack gqa epilogue (Dao-AILab#2097)

* improve paged cpasync

* Enable Thor (Dao-AILab#2108)

* [Cute] Add quack as dependency

* [Cute,Fwd,Sm90] Change PipelineTMAAsync sublass to signal per warp

Previously we signaled per warp group, but that made the code more complicated
for a tiny bit of perf gain.

* Add pack-gqa support for blocksparse impl w/ broadcasted H dim (Dao-AILab#2098)

* [Cute,Fwd] improved block sparsity (Dao-AILab#2100)

* improved block sparsity computation

* refactor blocksparsity computation for tvm-ffi

* refactor mask mod definitions and tests

* refactor of block sparsity and mask mod application; eventually allow varlen

* remove fastdivmods from compute block sparsity

* remove unnecessary imports

* revert to 1-phase block sparsity computation

* update bwd kernels to use new AttentionMaskCls api

* fix linter error

* [Cute] Fix minor lint issue in shuffle_sync

* Misc tests that should be xfailed for now (Dao-AILab#2127)

* Update cutlass to fix undefined symbol: cuDriverGetVersion. (Dao-AILab#2142)

* [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993)

* use q_stage=1 for split kv

* determine q_stage via seqlen_q for sm100

* repurpose softmax1 warps for cp.async load

* address comments

* [Cute] Fix two tests that were failing  (Dao-AILab#2149)

* [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py

The paged KV cache tests (test_score_mod_with_paged_kvcache and
test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY
to skip tests on SM90 since paged KV cache is only supported on SM100.
However, the variable was never defined, causing a NameError.

This adds the same definition used in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]
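
The added definition, together with the kind of skip it enables (whether the check lives in a marker or inside the test body is illustrative here):

```python
import pytest
import torch

# Same definition as in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]

# Paged KV cache is only supported on SM100, so skip on SM90 and below.
requires_sm100 = pytest.mark.skipif(
    COMPUTE_CAPABILITY < 10, reason="paged KV cache requires SM100"
)

@requires_sm100
def test_score_mod_with_paged_kvcache():
    ...  # placeholder for the actual test body
```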

* [Cute] Fix missing seqlen_info parameter in mask_mod call

The mask_mod call in apply_mask_sm100_transposed was missing the
seqlen_info parameter. All mask functions expect the signature:
(batch, head, m_idx, n_idx, seqlen_info, aux_tensors)

The other two mask_mod calls in the same file correctly pass all 6
arguments, but this one only passed 5, causing:
TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors'

This fixes test_mask_mod.py::test_mask_mod_ima_partial_block.
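
For reference, a minimal mask_mod following the six-argument signature quoted above (a sketch; it assumes m_idx/n_idx are absolute query/key positions):

```python
def causal_mask_mod(batch, head, m_idx, n_idx, seqlen_info, aux_tensors):
    # Signature: (batch, head, m_idx, n_idx, seqlen_info, aux_tensors).
    # Return True for positions to keep: a query may only attend to keys
    # at or before it, i.e. key index <= query index.
    return n_idx <= m_idx
```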

* cleanup

* [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150)

* varlen bwd with rounded padded offsets

* fix mha

* change offset mode to round down multiple

* enable varlen bwd tests

* enable deterministic mode

* fix deadlock and switch mha to no postprocess

* reenable tests

* fix lint error

* use head swizzle/spt for deterministic, update tests

* change padding offset based on arch

* rebase and update interface, tests

* add arch dispatch for padded offset q to postprocess

* address comments

* remove tile sizes from seqlen info class vars

* block-sparse backward SM90 (Dao-AILab#2136)

* score-mod backward SM90 (Dao-AILab#2137)

* [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143)

* [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146)

* [CUTE][SM90]Enable pack-gqa with broadcasted maskmods (Dao-AILab#2145)

* [CUTE][SM90] GQA backward non deterministic (Dao-AILab#2158)

* [Cute,Bwd,Sm100] fix seqused in varlen bwd (Dao-AILab#2167)

* fix seqused in varlen bwd

* enable store zero for zero len seqused q

* [CUTE] Bump cutedsl to 4.3.5 (Dao-AILab#2170)

* [Cute,Flex] Add option to create and cache __cute_hash__ (Dao-AILab#2171)

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing
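
A rough sketch of the caching idea (the attribute name comes from the commit; the surrounding helper is hypothetical): compute the expensive hash once and stash it on the object so later cache-key computations reuse it.

```python
def cute_hash(obj) -> int:
    # Sketch: reuse a previously computed hash if present, otherwise compute it
    # once (stand-in for the expensive hashing) and cache it on the object.
    cached = getattr(obj, "__cute_hash__", None)
    if cached is None:
        cached = hash(repr(obj))  # placeholder for the real, expensive hash
        try:
            obj.__cute_hash__ = cached
        except AttributeError:
            pass  # objects without a writable __dict__ cannot cache the value
    return cached
```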

* remove unnecessary reformatting

* reinstate changes

* [Cute][Flex] Remove no longer needed contig (Dao-AILab#2172)

* [Cute] update row_max before safe overwrite for online_softmax (Dao-AILab#2174)

* update row_max before safe overwrite

* move up row_max_prev

* [Cute][Flex] add back in contig (Dao-AILab#2177)

* [Cute][Flex]Add pack-gqa divmod (Dao-AILab#2180)

* baseline local flops

* [Cute,Fwd,Sm100] distributed offset calculation for paged KV (Dao-AILab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.

* remove benchmark result, undo changes to benchmark

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
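
A scalar sketch of the dual-bound arithmetic (pure Python over one 32-column fragment; the kernel does this with register bitmasks and R2P-style predication):

```python
def dual_bound_oob_mask(col_limit_left: int, col_limit_right: int, width: int = 32) -> int:
    """Sketch: bitmask of columns OUTSIDE [col_limit_left, col_limit_right)."""
    def below(t: int) -> int:
        # Bit i is set for every column i < t; the clamp mirrors the kernel's
        # clamp that avoids over-wide shifts.
        t = max(0, min(t, width))
        return (1 << t) - 1
    # below(right) & ~below(left) is exactly the in-range window [left, right);
    # its complement is the out-of-bounds mask used for local attention.
    in_range = below(col_limit_right) & ~below(col_limit_left)
    return ~in_range & ((1 << width) - 1)
```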

* switch from xor to mask_right & ~ mask_left

* flip in_bound to out_bound

* remove zero logic for right_s and left_s

* remove 24 clamp

* doc

* lint

* added back clamp to avoid "OverflowError: Python int too large to convert to C long"

* add comment

* [Cute][Flex] Fix expanded tensor bug (Dao-AILab#2189)

* [Cute, SM90] fix fwd varlen Cute implementation bug for H100 (Dao-AILab#2194)

* fix

* same fix for bwd and SM80

* reduce chance of build oom (Dao-AILab#2079)

* [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases (Dao-AILab#2187)

* Remove hopper/flash_api_torch_lib.cpp from CMakeLists.txt

Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.

* Fix compatibility between upstream flash_api.cpp and downstream flash.h

- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)

* Restore entire hopper/ folder from downstream

Using downstream's hopper code (with n_offset, CP, varlen combine) for full
compatibility. Upstream changes are kept in non-hopper files.

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: seungrok.jung <seungrok.jung@amd.com>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: Jean-Luc Duprat <jld@acm.org>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Chao Shi <stepinto@live.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
Co-authored-by: Jingze Shi <losercheems@gmail.com>
Co-authored-by: y-sq <58683402+y-sq@users.noreply.github.com>
Co-authored-by: Ravi Ghadia <40660742+ghadiaravi13@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Mingyang <mhao1999@outlook.com>
Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Henry Tsang <henrylhtsang@meta.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: brandonsun <brandons@nvidia.com>
Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com>
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com>
Co-authored-by: Kevin Tong <kevin@augmentcode.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Kevin Wang <kevmo314@gmail.com>
Co-authored-by: Ted Zadouri <tz6037@princeton.edu>
Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com>
Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com>
Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com>
Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: skarupke <malteskarupke@fastmail.fm>
Co-authored-by: Leo Dong <leodong0315@gmail.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026


Development

Successfully merging this pull request may close these issues.

  • Intermittent segfault with tvm-ffi
  • [Cutedsl] Compile and cache not freeing all memory
