
[fix] Fix barrier deadlock in fmha_v2 fp8+head_dim=256 transpose_v_tile#2957

Open
bobboli wants to merge 3 commits into flashinfer-ai:main from bobboli:fix/fmha-v2-fp8-head256-barrier-deadlock

Conversation

bobboli (Contributor) commented Apr 2, 2026

Summary

This PR fixes the SM90 FP8 fmha_v2 deadlock on the warp-specialized QGMMA path and re-enables FP8-output prefill coverage.

Cleaned commit stack:

  1. fix(fmha_v2): fix fp8 transpose barrier pipeline on SM90
  2. fix(fmha_v2): fix fp8 persistent scheduler for ragged q-tiles
  3. test(fmha_v2): enable fp8 output prefill coverage

Root Causes

There were three separate issues:

  • FP8 transpose / barrier / scratch-slot correctness bug in the DMA path
  • FP8 persistent-scheduler bug on mixed-length launches, caused by decoding work from a uniform num_tiles_per_head and skipping invalid tiles later
  • Separate H100 shared-memory budget issue for FP8-output head_dim=256

What Changed

  • Fixed the FP8 transpose / barrier pipeline on SM90
  • Kept FP8 on persistent scheduling, but switched the FP8 transpose path to exact dynamic tile decode from cu_q_seqlens
  • Reduced kv_tile_buffers to 1 for SM90 FP8-output head_dim > 128
  • Removed stale test skips that were masking the fixed behavior
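The exact dynamic tile decode can be sketched on the host. This is a simplified analogue of what the DMA path now does, not the kernel's actual code; `step_q` and the ceil-divide tile layout here are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side sketch of exact dynamic tile decode from
// cu_q_seqlens. Instead of assuming a uniform num_tiles_per_head and
// discarding invalid tiles later, each batch's true length is read from
// the prefix sums and converted into its own tile count.
int num_q_tiles(const std::vector<int>& cu_q_seqlens, int batch_idx, int step_q) {
  // Actual sequence length for this batch entry.
  int actual_q_seqlen = cu_q_seqlens[batch_idx + 1] - cu_q_seqlens[batch_idx];
  // Ceil-divide into q-tiles; an empty sequence contributes zero tiles,
  // so no invalid tile is ever scheduled and skipped afterwards.
  return (actual_q_seqlen + step_q - 1) / step_q;
}
```

For example, with `cu_q_seqlens = {0, 100, 228, 228}` and `step_q = 64`, the three batches contribute 2, 2, and 0 tiles respectively, whereas a uniform decode would schedule the maximum count for every batch and rely on later filtering.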

Validation

Validated locally on H100 / SM90:

  • FP8 mixed-length PACKED_QKV repro: racecheck clean, synccheck clean
  • FP8 ragged PACKED_QKV repro: racecheck clean, synccheck clean
  • FP8 mixed-length CONTIGUOUS_Q_KV repro: racecheck clean, synccheck clean
  • FP16 control: racecheck clean
  • Overnight stress: 50000 rounds completed without hanging
  • Focused FP8-output causal matrix: 96 passed, 16 skipped (SEPARATE_Q_K_V unsupported cases only)

coderabbitai Bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR refines FMHA v2 synchronization mechanisms by simplifying mutex coordination logic, updating barrier assembly mnemonics from legacy `bar` to `barrier`, adjusting DMA transposer unroll thresholds, and enabling test execution by removing a module-level skip marker.

Changes

  • csrc/fmha_v2/fmha/warpspec/dma.h, csrc/fmha_v2/fmha/warpspec/compute.h (DMA and synchronization refinements): added a named-barrier wait in the DMA transpose flow, unified mutex coordination into a single if constexpr (ENABLE_MUTEX) block (removing the element-byte-specific branches), removed redundant inter-warpgroup barrier arrivals in the softmax paths, and changed the DMA unroll discriminator threshold from > to >= for the STEP_KV == 128 case.
  • csrc/fmha_v2/fmha/hopper/arrive_wait.h (barrier assembly updates): updated PTX assembly mnemonics from legacy bar.arrive/bar.sync to modern barrier.arrive/barrier.sync, with no logic changes.
  • tests/attention/test_fmha_v2_prefill.py (test infrastructure): removed the unconditional module-level skip marker so the tests execute; per-test skip conditions for SM support and specific hang cases remain unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • sricketts
  • aleozlx
  • yongwww
  • yzh119
  • cyx-6
  • samuellees
  • saltyminty
  • bkryu

Poem

🐰 A bunny hops through synchronization's dance,
Barriers sync and mutexes prance,
Assembly updated with modern flair,
Tests now run without a care,
FMHA flows smoother everywhere! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 36.36%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — the title accurately describes the main change: fixing a barrier deadlock specific to fp8 with head_dim=256 in the transpose_v_tile function, which is the core issue addressed by the PR.
  • Description check ✅ — the PR description is comprehensive and well-structured, covering root causes, changes, and validation results.


gemini-code-assist Bot left a comment

Code Review

This pull request addresses a phase-flip race in the FMHA v2 DMA logic by adding a named_barrier_wait to prevent deadlocks. In the test suite, a general skip was removed, but a new skip was added for FP8 configurations with a head dimension of 256. Feedback indicates that this new skip contradicts the PR's objective of fixing the deadlock and should likely be removed if the fix is verified.

Comment thread on tests/attention/test_fmha_v2_prefill.py (outdated)
@bobboli bobboli marked this pull request as ready for review April 6, 2026 16:55
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from b76c689 to 4a775fb Compare April 6, 2026 16:55
bobboli (author) commented Apr 6, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been created, and the CI pipeline #47843767 is currently running. I'll report back once the pipeline job completes.

bobboli (author) commented Apr 7, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #47888948 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot (Collaborator):

[FAILED] Pipeline #47888948: 8/20 passed

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch 3 times, most recently from 794b3f8 to 81a6653 Compare April 13, 2026 17:28
bobboli (author) commented Apr 13, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48421201 is currently running. I'll report back once the pipeline job completes.

coderabbitai Bot left a comment

🧹 Nitpick comments (1)
csrc/fmha_v2/fmha/warpspec/dma.h (1)

615-622: Consider turning reserve+sync into a single API.

This bug class came from forgetting the barrier after a reserve-like operation. load_q(), load_kv(), and now transpose_v_tile() all depend on the same “reserve, then named-barrier sync” invariant, so a small helper would make that ordering much harder to miss again.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fmha_v2/fmha/warpspec/dma.h` around lines 615 - 622, The code relies on
the repeated pattern "reserve then named_barrier_wait" (seen at
cbw_v.threadReserve() followed by named_barrier_wait(...)) which is easy to
forget in functions like load_q(), load_kv(), and transpose_v_tile(); introduce
a single helper API (e.g., threadReserveAndSync() or cbw_v.reserve_and_sync())
that calls threadReserve() and immediately performs
named_barrier_wait(SYNC_BARRIER, NUM_THREADS_IN_DMA_GROUP) internally, replace
the separate call sites (the current cbw_v.threadReserve();
named_barrier_wait(...) pair) with the new single call to ensure the
reserve+sync invariant cannot be missed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f8b2b9e-6928-453d-a4a6-0713f5814892

📥 Commits

Reviewing files that changed from the base of the PR and between 7143fd466ba28ff7f7ea4679be6de7aedf84bb83 and 794b3f827c20c21c52f055e13323bf3416edee5a.

📒 Files selected for processing (3)
  • csrc/fmha_v2/fmha/hopper/arrive_wait.h
  • csrc/fmha_v2/fmha/warpspec/compute.h
  • csrc/fmha_v2/fmha/warpspec/dma.h

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 09f361f to 42c9b4d Compare April 14, 2026 03:17
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 42c9b4d to 0c26b04 Compare April 14, 2026 03:28
@bobboli bobboli requested a review from qsang-nv as a code owner April 14, 2026 14:42
bobboli (author) commented Apr 14, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48504914 is currently running. I'll report back once the pipeline job completes.

@jimmyzho jimmyzho mentioned this pull request Apr 14, 2026
5 tasks
bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48571675 is currently running. I'll report back once the pipeline job completes.

bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48577898 is currently running. I'll report back once the pipeline job completes.

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from ede6e2e to 4b24359 Compare April 15, 2026 16:39
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 4b24359 to 050cbad Compare April 15, 2026 16:46
bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48611098 is currently running. I'll report back once the pipeline job completes.

jimmyzho (Contributor):

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been created, and the CI pipeline #48806541 is currently running. I'll report back once the pipeline job completes.

#pragma unroll 1
for (int batch_idx = 0; batch_idx < params.b; ++batch_idx) {
int const actual_q_seqlen =
params.cu_q_seqlens[batch_idx + 1] - params.cu_q_seqlens[batch_idx];
A contributor commented on the lines above:

There will be B accesses for every DMA iteration, will this be a concern?

// which is UB per PTX spec (bar.sync = barrier.sync.aligned requires all CTA threads
// to execute the same instruction) and was flagged by compute-sanitizer synccheck.
mutex.arrive();
mutex.wait();
A contributor commented on the lines above:

Does this have any functional impact?
