
Conversation

@drisspg (Collaborator) commented Dec 21, 2025

Summary

  • Update to dsl 3.4.3

Confirmed this fixes pytorch/pytorch#169921 and also fixes #2084.

:)
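
For reference, a quick way to confirm which DSL version is installed locally. This is a minimal sketch and assumes the DSL ships as the `nvidia-cutlass-dsl` PyPI package, which is not stated in this PR:

```python
# Minimal version check (assumption: the CuTe DSL is distributed as the
# "nvidia-cutlass-dsl" package; adjust the name if your environment differs).
import importlib.metadata

print(importlib.metadata.version("nvidia-cutlass-dsl"))  # expect "3.4.3" after this update
```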

@tridao merged commit ceb4110 into Dao-AILab:main on Dec 22, 2025
Fridge003 added a commit to Fridge003/sgl-flash-attn that referenced this pull request Dec 30, 2025
0xDELUXA pushed a commit to 0xDELUXA/flash-attention that referenced this pull request Jan 24, 2026
LucasWilkinson added a commit to vllm-project/flash-attention that referenced this pull request Jan 29, 2026
* Remove old xentropy kernel

This hasn't been used since 2023-09

* Remove old fused softmax kernel from apex/Megatron

* Remove old attn decode kernel from FasterTransformer

* Remove old rotary kernel

* [Cute] Implement page table with TMA for fwd_sm100

* [Cute] Remove trailing bracket (Dao-AILab#1809)

This fixes Commit 81cdf4c

* [Cute] Make sure R2P happens

* feat: add support for pytorch2.8 (Dao-AILab#1801)

* [Cute] Implement PackGQA with TMA for fwd_sm100

Credit: Jay Shah's idea

* Bump to v2.8.3

* [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen (Dao-AILab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), and thus
`cache_seqlen` was expanded to shape (num_blocks) instead of (batch_size),
which is wrong. This fix uses the shape of `q`, whose first dimension is
always `batch_size`.
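
A minimal sketch of the expansion described above (the helper name is hypothetical; the real logic lives in the flash-attn interface): a scalar `cache_seqlens` is broadcast to one entry per batch element, sized from `q` rather than from `k_cache`.

```python
import torch

def expand_cache_seqlens(cache_seqlens, q, k_cache):
    """Sketch: expand a scalar cache_seqlens to a vector of shape (batch_size,)."""
    if isinstance(cache_seqlens, int):
        # Use q.shape[0], which is always batch_size, instead of k_cache.shape[0],
        # which is num_blocks whenever a block_table / paged KV cache is used.
        batch_size = q.shape[0]
        cache_seqlens = torch.full(
            (batch_size,), cache_seqlens, dtype=torch.int32, device=q.device
        )
    return cache_seqlens
```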

* [Cute] Port fwd_combine kernel from C++ to cute-dsl

* [Cute] Simplify tile scheduler storing params

* [Cute] Implement sink for fwd_sm90

* [Cute] Implement PackGQA with TMA for fwd_sm90

* [Cute] Use R2P for masking in fwd_sm90

Actually doesn't seem to make it faster

* Add sorting and head swizzle to varlen scheduler (Dao-AILab#1823)

* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
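
The LPT ("longest processing time first") order mentioned above can be pictured with a small host-side sketch (names hypothetical; the real sort runs on device in the prepare kernel): batches are visited in order of decreasing sequence length so the longest work items start first and the scheduling tail is reduced.

```python
import torch

def lpt_order(seqlens_k: torch.Tensor) -> torch.Tensor:
    """Sketch: batch indices sorted by descending K length (LPT heuristic)."""
    # Issuing the longest batches first avoids one long sequence finishing
    # last while the remaining SMs sit idle.
    return torch.argsort(seqlens_k, descending=True)

# Example: K lengths [128, 4096, 512, 2048] are scheduled as batches [1, 3, 2, 0].
```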

* Fixes incorrect variable reference in comment (Dao-AILab#1775)

Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.

* Update the initialization of dk/dv_semaphore (Dao-AILab#1839)

When testing the deterministic option for the GQA case, we found it could fall into a deadlock. Initializing dk_semaphore and dv_semaphore to zeros fixes this issue.
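
A rough sketch of the fix, assuming the semaphores are int32 device tensors (shapes here are illustrative; the actual allocation happens in the C++ API): zero-initializing them gives every CTA a defined starting value, so no consumer waits forever on garbage.

```python
import torch

def alloc_dkv_semaphores(batch_size: int, num_heads_k: int, device):
    # Sketch: use zeros (a defined initial state) rather than uninitialized
    # memory, whose leftover contents can leave waiters stuck -> deadlock.
    dk_semaphore = torch.zeros(batch_size, num_heads_k, dtype=torch.int32, device=device)
    dv_semaphore = torch.zeros(batch_size, num_heads_k, dtype=torch.int32, device=device)
    return dk_semaphore, dv_semaphore
```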

* Update tile_scheduler.hpp (Dao-AILab#1841)

* ci: Move build job to workflow template (Dao-AILab#1835)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Build via workflow template (Dao-AILab#1844)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Switch to workflow_dispatch (Dao-AILab#1847)

* [`FA3`] Allow returning LSE via kwarg (Dao-AILab#1851)

* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse

* [BugFix] fix flash_fwd.FlashAttentionForwardSm80  bugs (Dao-AILab#1856)

* [BugFix] fix softcap condition

softcap should only be referenced when it is not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional, to align with sm_100 and all other interfaces.
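
In Python terms the corrected guard looks roughly like this (a sketch, not the kernel code; the helper name is hypothetical):

```python
import torch

def apply_softcap(scores: torch.Tensor, softcap: float | None = None) -> torch.Tensor:
    # Corrected condition: softcap is only referenced when it is not None.
    # (The bug was the reversed check, which dereferenced softcap when it was None.)
    if softcap is not None and softcap > 0.0:
        scores = softcap * torch.tanh(scores / softcap)
    return scores
```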

* Fix typo of range_constexpr

* Fix seqlen

* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs

* make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860)

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
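
The change amounts to swapping floor division for ceiling division when deriving the warp-group count; a sketch in Python for clarity (the actual code is C++):

```python
NUM_THREADS_PER_WARPGROUP = 128

def num_consumer_warpgroups_per_cluster(num_consumers: int) -> int:
    # Floor division gave 32 // 128 == 0, which broke barrier initialization.
    # Round-up division guarantees at least one warp group.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

assert num_consumer_warpgroups_per_cluster(32) == 1
assert num_consumer_warpgroups_per_cluster(256) == 2
```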

* [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865)

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml

* benchmark: qualify all attention backends by methods list (Dao-AILab#1881)

* ABI stable fa3 (Dao-AILab#1791)

* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=

* [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882)

* fix typo

* Update setup.py

* Update setup.py

* Update setup.py

* Update setup.py

* fix typo in flops calculation for local attention (Dao-AILab#1883)

* flash-attn-cute bwd sm90 (Dao-AILab#1868)

* [Cute] Make testing utils standalone for cute (Dao-AILab#1892)

* Bump pin for CuTeDSL (Dao-AILab#1891)

* Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893)

* add spt scheduler for causal bwd determinism

* add new torch check for det hdim 256 to stable api

* Upgrade to cutlass v4.2.1 (Dao-AILab#1905)

* switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906)

* Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908)

Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>

* C++11 fix warnings (Dao-AILab#1904)

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Update flash_api_stable.cpp

* upstream cutlass v4.2.1 csrc

* [Cute] Write ex2 emulation in a more readable form

* [Cute] Simplify utils.py a bit

* [Cute] Remove arith & vector import in utils.py

* [CuteDSL] Fix test (Dao-AILab#1925)

* Refactors to enable FlexAttention (Dao-AILab#1840)

* Refactors to enable FlexAttention

* Thread throught the buffers to the score_mod

* add-test

* add fastdivmod

* comments

* comments

* [Cute] Fix softmax for cutlass-dsl==4.2.1

* [Cute] Fix softmax for fwd_sm100

* [Cute,Bwd] Simplify bwd_preprocessing kernel

* [Cute,Fwd,Sm90] Simplify by passing around functions

* [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn

* [Cute] Optionally dump cubin and sass

* [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n

* [Cute,Bwd,Sm90] Format file w ruff

* [Cute,Bwd,Sm90] Fix bwd dK & dV, more async

* [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum

* [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V

* [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead

* [Cute,Bwd,Sm90] Use TMA to store dK & dV

* [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration

* [Cute,Sm90] Move gemm helper functions to hopper_helpers.py

* Swap masking to not use R2P

* Pre-indent to make commit diffs readable

* Adding varlen support + tests

* Remove self refs in softmax for loop (Dao-AILab#1924)

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

* [Cute,Bwd,Sm90] Make postprocessing kernel work

* [Cute] Run ruff format on bwd files

* [CI] Add pre-commit GH action

* [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1

* [Cute,Bwd,Sm90] Make causal work

* [Cute,Bwd,Sm90] Implement dQ_swapAB

* [Cute,Bwd,Sm90] Implement SdP_swapAB

* [AMD] Torch Compile Issues (Dao-AILab#1756)

* fix rounding and dropout metadata bug

* fix lse shape and bug in interface

* return softmax is true

* [Cute,Bwd,Sm90] Implement mma_dkv_is_rs

* [Cute,Bwd,Sm90] Use block size 80x128

* [CUTE] Enable Pack GQA for score mods (Dao-AILab#1937)

* Add precommit list and then uncomment in chunks (Dao-AILab#1941)

* create list to work through

* include ampere

* [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944)

See pytorch/pytorch#151845.
pytorch has removed caffe2, but hipify still contained
work-arounds for caffe2 vs torch compatibility.
As a result of hipify v2 changes, some torch APIs are changing.

* [Cute] Add flake8 config file

* [Cute,Fwd,Sm90] Load Q & K using the same mbarrier

* [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage

* [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups

* [Cute,Bwd,Sm90] Implement masking

* [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in

* [Cute,Fwd,Sm100] Clean up

* [Cute,Fwd,Sm100] Clean up mask

* [Cute] Reformat blackwell_helpers.py, block_info.py

* [Cute] Format mma_sm100_desc.py, seqlen_info.py

* sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945)

* [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess

* [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n}

* [Cute,Bwd,Sm100] Clean up a bit

* add barrier module (Dao-AILab#1946)

* [Cute,Bwd,Sm100] Have a separate function to set up the mma

* [Cute,Bwd,Sm100] Load LSE with cpasync_bulk

* [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk

* [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO

* [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration

* [Cute,Bwd,Sm100] Simplify mma by using functools.partial

* [Cute,Bwd,Sm100] Don't need q_dk_consumer_state

* [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier

* [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max

* [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem

* [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one

* [Cute,Bwd,Sm100] All compute warps wait for lse_barrier

* [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle

* [Cute,Bwd,Sm100] Try gemm_ptx

* [Cute,Bwd,Sm100] Clean up compute fn

* [Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1

* [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables

* [Cute,Bwd,Sm100] Hardcode dS_stage = 1

* [Cute,Bwd,Sm100] Add option for delay tma store

* Fix hopper cuda 13 build (Dao-AILab#1949)

* [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953)

* Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* cutlass v4.3.0 (Dao-AILab#1952)

* [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx

* [Cute,Bwd,Sm100] More cleanup

* [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* rename buffers -> aux_tensors, fix score_mod test in sm90 fwd

* fix mask mod interface issues and tests

* remove newline at end of file

* format with ruff

* format mask & sm100 with ruff

* format more files with ruff

* format barrier.py with ruff

* Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957)

The ABI stable implementation incorrectly used getCurrentStream().id()
which returns a StreamId (int64_t) instead of the actual cudaStream_t
pointer. Casting an integer ID to a stream pointer caused segmentation
faults when using custom CUDA streams.

Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream()
which returns the actual CUDA stream pointer.

* [Cute,Fwd,Sm100] Fix interface w score mod to get it to run

* [Cute,Sm100] In gemm ptx, add to base smem_address instead

* [Cute,Bwd,Sm100] Make postprocessing work, add interface

* [Cute,Bwd,Sm100] Simplify layouts in compute_loop

* [Cute,Bwd,Sm100] Causal mask

* [Cute,Bwd,Sm100] Enable bwd tests

* [Cute,Bwd] Enable bwd benchmarks

* [Cute] Add store_shared_remote_fp32x4 util function

* [Cute,Bwd,Sm100] Tune registers

* [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr

* [Cute,Bwd,Sm100] Reduce sync

* [Cute] Change utils.view_transpose back

* [Cute,Bwd,Sm100] Remove delay_tma_store option

* [Cute,Bwd,Sm100] Implement cluster

Co-authored-by: Ted Zadouri <tz6037@princeton.edu>

* [Cute] Copy benchmark util functions to cute directory

Easier to benchmark without having to install FA2

* [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum

* [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS

* [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load

* [Cute] Blocks tweaks (Dao-AILab#1964)

* [Cute,Bwd,Sm100] Use TS MMA for dK

* [Cute,Blocksparse] Group block sparse input torch tensors

* [Cute,Bwd,Sm100] Separate mma_S and mma_dP

* [Cute,Bwd,Sm100] Try LPTBwdScheduler

* [Cute,Bwd,Sm100] Try separating warps loading Q and dO

* BlockSparse Tweaks (Dao-AILab#1970)

* Tweaks

* better errors

* Switch to new API

* [Cute] Fix main (Dao-AILab#1982)

* [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940)

* Implement split KV

* Remove modal bench harness

* Fixes

* [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984)

- Create block_sparse_utils.py with SM80/90 block-sparse logic
- Refactor flash_fwd.py to use extracted utilities
- Clean up whitespace in block_sparsity.py

This extracts the block-sparse consumer loop and related utilities
from flash_fwd.py into a reusable module for SM80/90 architectures.

* Enable python-3.10+ (Dao-AILab#1998)

* [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004)

* add gqa for sm100 bwd

* remove mha guard for test

* change to cluster size 1

* [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006)

* [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983)

* begin block sparsity computation kernel

* block sparsity computation kernel and benchmark working

* loop range_constexpr

* add fast kernel

* merge fast and regular kernel

* use TensorSSA approach to mask mod

* update with OOB check

* tests and benchmarks for block sparsity working

* remove extraneous files

* Revert mask.py to previous state - removing unintended changes from block sparsity work

* remove flex attn test stub

* add sleeps to benchmark

* correct block sparsity benchmark to use torch.compile

* Restore missing mask definitions and fix benchmark window_size handling

* move benchmarks into new directory

* compute_block_sparsity docstring

* streamline compute block sparsity benchmark script

* [NVIDIA] bump github actions (Dao-AILab#1996)

* Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support

* revert changes

* revert

* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

* cuda-toolkit@v0.2.29

* [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999)

* modal bench and correctness

* implement for one thread per row

* coalesced(?) gmem loads

* use cp async

* use 64 threads to load

* fill in smem for V

* pass tests

* fixes

* removed extra files

* handle V loading for n_block < 0

* Add torch.compile support to flash attention 3

* Don't return mutated variables in mha_bwd

* Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage

* Remove print statements and update exception message

* Fix flash_attn_backward_fake

* Add `safe_aot_autograd_check`

* Update namespace to flash_attn_3

* Add `flash_attn_forward.register_autograd`

* Fix bug in `flash_attn_backward_fake`

* Add support and tests for torch.export and aoti_compile_and_package

* format code

* update flash_api_stable.cpp

* Fix flash_api_stable.cpp build

* Only run schema_check if dtype is not float8_e4m3fn

* Correctly compute kBlockM for sm88/86/80

* Fix bug in boxed_mha_bwd

* don't run autograd_check when num_splits > 0
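
The torch.compile items above follow the standard PyTorch custom-op pattern (custom op + fake/meta kernel + autograd registration). A self-contained toy version, not the real flash_attn_3 registrations:

```python
import torch

# Toy op illustrating the pattern; the namespace and name are illustrative only.
@torch.library.custom_op("toy_flash::square", mutates_args=())
def toy_square(x: torch.Tensor) -> torch.Tensor:
    return x * x

@toy_square.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Fake (meta) implementation: shapes and dtypes only, no real compute.
    return torch.empty_like(x)

def _setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def _backward(ctx, grad_out):
    (x,) = ctx.saved_tensors
    return 2 * x * grad_out

toy_square.register_autograd(_backward, setup_context=_setup_context)

# The op is now traceable by torch.compile / torch.export without graph breaks.
y = toy_square(torch.randn(4, requires_grad=True))
y.sum().backward()
```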

* [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985)

- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.
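
A small sketch of the block-size bookkeeping described above (function names hypothetical): on SM100 one CTA covers 2 * tile_m query rows, so the effective m block size used for sparsity indexing is doubled.

```python
def effective_m_block_size(tile_m: int, arch: int) -> int:
    # Sketch: SM100 (arch 100) processes 2 * tile_m rows per CTA.
    return 2 * tile_m if arch >= 100 else tile_m

def num_m_blocks(seqlen_q: int, tile_m: int, arch: int) -> int:
    m_block = effective_m_block_size(tile_m, arch)
    return (seqlen_q + m_block - 1) // m_block  # ceiling division
```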

* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests

* Raise TypeError if out is specified when compiling _flash_attn_forward

* add fastdivmod for oob reads in mask_mods (Dao-AILab#2020)

* add fastdivmod for oob reads in mask_mods

* Updates for h100

* don't pass mask_fn to softmax_step generically (Dao-AILab#2026)

* swap order of decorators (Dao-AILab#2029)

* [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default

* [NFC] Trivial fix to silence linter (Dao-AILab#1928)

Not much to see here, but this causes linter noise

* Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032)

* [Cute] Add authors

* [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031)

* Bump pin (Dao-AILab#2025)

* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install

* ruff all the smaller files (Dao-AILab#2040)

* [Flash] Fix head dim 64 bwd (Dao-AILab#2035)

* Add headdim64 tests (Dao-AILab#2041)

* [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046)

* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults

* Add hash attr to shortcut expensive check (Dao-AILab#2048)

* [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052)

* Update CK and c++ version

* update CK

* update ck

* Update comment to reflect qscale_type in fmha_fwd_traits

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>

* fixing cute bwd func def (Dao-AILab#2056)

* Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. (Dao-AILab#2063)

* [CUTE] Allow grads to be preallocated (Dao-AILab#2065)

* [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043)

* rebase to main

* varlen support for score mod

* interface change for varlen score mod

* implement varlen support for score mod

* varlen score mod working; updated tests

* modify varlen score mod to use fastdiv_mods updated per sequence

* updated test suite

* current working state of varlen score mod

* refactor varlen score mod tests

* fix to transpose

* refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel

* refactor test_score_mod.py to use external score mod definition file

* update flash_fwd.py for varlen score mod

* sm90 varlen score mod working; test revisions

* enable packgqa for varlen score mod; set up fastdiv_mod recomputation

* update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test

* Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions

* rebase to main

* fix test rebase artifacts

* fix floor_if_packed redundancy

* correct sm90 divmods mismatch

* revert test_flash_attn to main

* add varlen score mod benchmark script

* packgqa for varlen (independent of score mod)

* rm benchmark from PR

* move score mod arg wrapping to utils.py

* format with ruff

* major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly

* reinstate varlen packgqa exclusion checks

* move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support

* remove duplicate fastdiv_mod recomputation

* [Fix] fastdiv_mods for paged attn and seqused_*

* clean up PR; fix paged_kv varlen for sm90

* update to varlen score mod test script (paged kv)

* remove premature seqlen arguments from sm90 apply_mask_mod

* [CUTE] Seeing if tvvm reduces cpu overhead (Dao-AILab#2042)

* [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072)

* basics working (Dao-AILab#2070)

* Blocksparse impl (Dao-AILab#2085)

* Fix IMA in fwd on m boundary (Dao-AILab#2091)

* Fix IMA in fwd on m boundary

* Fix completely OOB loads

* Update to dsl 3.4.3 (Dao-AILab#2092)

* README for AMD ROCm (Dao-AILab#2068)

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

---------

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* fix shuffle sync for pack gqa epilogue (Dao-AILab#2097)

* improve paged cpasync

* Enable Thor (Dao-AILab#2108)

* [Cute] Add quack as dependency

* [Cute,Fwd,Sm90] Change PipelineTMAAsync sublass to signal per warp

Previously we signaled per warp group, but that made the code more complicated
for a tiny bit of perf gain.

* Add pack-gqa support for blocksparse impl w/ broadcasted H dim (Dao-AILab#2098)

* [Cute,Fwd] improved block sparsity (Dao-AILab#2100)

* improved block sparsity computation

* refactor blocksparsity computation for tvm-ffi

* refactor mask mod definitions and tests

* refactor of block sparsity and mask mod application; eventually allow varlen

* remove fastdivmods from compute block sparsity

* remove unnecessary imports

* revert to 1-phase block sparsity computation

* update bwd kernels to use new AttentionMaskCls api

* fix linter error

* [Cute] Fix minor lint issue in shuffle_sync

* Misc tests that should be xfailed for now (Dao-AILab#2127)

* Update cutlass to fix undefined symbol: cuDriverGetVersion. (Dao-AILab#2142)

* [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993)

* use q_stage=1 for split kv

* determine q_stage via seqlen_q for sm100

* repurpose softmax1 warps for cp.async load

* address comments

* [Cute] Fix two tests that were failing  (Dao-AILab#2149)

* [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py

The paged KV cache tests (test_score_mod_with_paged_kvcache and
test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY
to skip tests on SM90 since paged KV cache is only supported on SM100.
However, the variable was never defined, causing a NameError.

This adds the same definition used in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]
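
The added definition, together with the kind of skip it enables (whether the check lives in a marker or inside the test body is illustrative here):

```python
import pytest
import torch

# Same definition as in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]

# Paged KV cache is only supported on SM100, so skip on SM90 and below.
requires_sm100 = pytest.mark.skipif(
    COMPUTE_CAPABILITY < 10, reason="paged KV cache requires SM100"
)

@requires_sm100
def test_score_mod_with_paged_kvcache():
    ...  # placeholder for the actual test body
```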

* [Cute] Fix missing seqlen_info parameter in mask_mod call

The mask_mod call in apply_mask_sm100_transposed was missing the
seqlen_info parameter. All mask functions expect the signature:
(batch, head, m_idx, n_idx, seqlen_info, aux_tensors)

The other two mask_mod calls in the same file correctly pass all 6
arguments, but this one only passed 5, causing:
TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors'

This fixes test_mask_mod.py::test_mask_mod_ima_partial_block.
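
For reference, a minimal mask_mod following the six-argument signature quoted above (a sketch; it assumes m_idx/n_idx are absolute query/key positions):

```python
def causal_mask_mod(batch, head, m_idx, n_idx, seqlen_info, aux_tensors):
    # Signature: (batch, head, m_idx, n_idx, seqlen_info, aux_tensors).
    # Return True for positions to keep: a query may only attend to keys
    # at or before it, i.e. key index <= query index.
    return n_idx <= m_idx
```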

* cleanup

* [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150)

* varlen bwd with rounded padded offsets

* fix mha

* change offset mode to round down multiple

* enable varlen bwd tests

* enable deterministic mode

* fix deadlock and switch mha to no postprocess

* reenable tests

* fix lint error

* use head swizzle/spt for deterministic, update tests

* change padding offset based on arch

* rebase and update interface, tests

* add arch dispatch for padded offset q to postprocess

* address comments

* remove tile sizes from seqlen info class vars

* block-sparse backward SM90 (Dao-AILab#2136)

* score-mod backward SM90 (Dao-AILab#2137)

* [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143)

* [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146)

* [CUTE][SM90]Enable pack-gqa with broadcasted maskmods (Dao-AILab#2145)

* [CUTE][SM90] GQA backward non deterministic (Dao-AILab#2158)

* [Cute,Bwd,Sm100] fix seqused in varlen bwd (Dao-AILab#2167)

* fix seqused in varlen bwd

* enable store zero for zero len seqused q

* [CUTE] Bump cutedsl to 4.3.5 (Dao-AILab#2170)

* [Cute,Flex] Add option to create and cache __cute_hash__ (Dao-AILab#2171)

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing
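
A rough sketch of the caching idea (the attribute name comes from the commit; the surrounding helper is hypothetical): compute the expensive hash once and stash it on the object so later cache-key computations reuse it.

```python
def cute_hash(obj) -> int:
    # Sketch: reuse a previously computed hash if present, otherwise compute it
    # once (stand-in for the expensive hashing) and cache it on the object.
    cached = getattr(obj, "__cute_hash__", None)
    if cached is None:
        cached = hash(repr(obj))  # placeholder for the real, expensive hash
        try:
            obj.__cute_hash__ = cached
        except AttributeError:
            pass  # objects without a writable __dict__ cannot cache the value
    return cached
```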

* remove unnecessary reformatting

* reinstate changes

* [Cute][Flex] Remove no longer needed contig (Dao-AILab#2172)

* [Cute] update row_max before safe overwrite for online_softmax (Dao-AILab#2174)

* update row_max before safe overwrite

* move up row_max_prev

* [Cute][Flex] add back in contig (Dao-AILab#2177)

* [Cute][Flex]Add pack-gqa divmod (Dao-AILab#2180)

* baseline local flops

* [Cute,Fwd,Sm100] distributed offset calculation for paged KV (Dao-AILab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.

* remove benchmark result, undo changes to benchmark

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
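
A scalar sketch of the dual-bound arithmetic (pure Python over one 32-column fragment; the kernel does this with register bitmasks and R2P-style predication):

```python
def dual_bound_oob_mask(col_limit_left: int, col_limit_right: int, width: int = 32) -> int:
    """Sketch: bitmask of columns OUTSIDE [col_limit_left, col_limit_right)."""
    def below(t: int) -> int:
        # Bit i is set for every column i < t; the clamp mirrors the kernel's
        # clamp that avoids over-wide shifts.
        t = max(0, min(t, width))
        return (1 << t) - 1
    # below(right) & ~below(left) is exactly the in-range window [left, right);
    # its complement is the out-of-bounds mask used for local attention.
    in_range = below(col_limit_right) & ~below(col_limit_left)
    return ~in_range & ((1 << width) - 1)
```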

* switch from xor to mask_right & ~ mask_left

* flip in_bound to out_bound

* remove zero logic for right_s and left_s

* remove 24 clamp

* doc

* lint

* added back clamp to avoid "OverflowError: Python int too large to convert to C long"

* add comment

* [Cute][Flex] Fix expanded tensor bug (Dao-AILab#2189)

* [Cute, SM90] fix fwd varlen Cute implementation bug for H100 (Dao-AILab#2194)

* fix

* same fix for bwd and SM80

* reduce chance of build oom (Dao-AILab#2079)

* [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases (Dao-AILab#2187)

* Remove hopper/flash_api_torch_lib.cpp from CMakeLists.txt

Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.

* Fix compatibility between upstream flash_api.cpp and downstream flash.h

- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)

* Restore entire hopper/ folder from downstream

Using downstream's hopper code (with n_offset, CP, varlen combine) for full
compatibility. Upstream changes are kept in non-hopper files.

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: seungrok.jung <seungrok.jung@amd.com>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: Jean-Luc Duprat <jld@acm.org>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Chao Shi <stepinto@live.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
Co-authored-by: Jingze Shi <losercheems@gmail.com>
Co-authored-by: y-sq <58683402+y-sq@users.noreply.github.com>
Co-authored-by: Ravi Ghadia <40660742+ghadiaravi13@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Mingyang <mhao1999@outlook.com>
Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Henry Tsang <henrylhtsang@meta.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: brandonsun <brandons@nvidia.com>
Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com>
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com>
Co-authored-by: Kevin Tong <kevin@augmentcode.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Kevin Wang <kevmo314@gmail.com>
Co-authored-by: Ted Zadouri <tz6037@princeton.edu>
Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com>
Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com>
Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com>
Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: skarupke <malteskarupke@fastmail.fm>
Co-authored-by: Leo Dong <leodong0315@gmail.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026


Development

Successfully merging this pull request may close these issues.

  • Intermittent segfault with tvm-ffi
  • [Cutedsl] Compile and cache not freeing all memory
