
[fix] Fix barrier deadlock in fmha_v2 fp8+head_dim=256 transpose_v_tile#2957

Open
bobboli wants to merge 3 commits into flashinfer-ai:main from bobboli:fix/fmha-v2-fp8-head256-barrier-deadlock

Conversation

bobboli (Contributor) commented Apr 2, 2026

Summary

This PR fixes the SM90 FP8 fmha_v2 deadlock on the warp-specialized QGMMA path and re-enables FP8-output prefill coverage.

Cleaned commit stack:

  1. fix(fmha_v2): fix fp8 transpose barrier pipeline on SM90
  2. fix(fmha_v2): fix fp8 persistent scheduler for ragged q-tiles
  3. test(fmha_v2): enable fp8 output prefill coverage

Root Causes

There were three separate issues:

  • FP8 transpose / barrier / scratch-slot correctness bug in the DMA path
  • FP8 persistent-scheduler bug on mixed-length launches, caused by decoding work from a uniform num_tiles_per_head and skipping invalid tiles later
  • Separate H100 shared-memory budget issue for FP8-output head_dim=256

What Changed

  • Fixed the FP8 transpose / barrier pipeline on SM90
  • Kept FP8 on persistent scheduling, but switched the FP8 transpose path to exact dynamic tile decode from cu_q_seqlens
  • Reduced kv_tile_buffers to 1 for SM90 FP8-output head_dim > 128
  • Removed stale test skips that were masking the fixed behavior
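The exact dynamic tile decode can be sketched on the host. This is a simplified analogue of what the DMA path now does, not the kernel's actual code; `step_q` and the ceil-divide tile layout here are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side sketch of exact dynamic tile decode from
// cu_q_seqlens. Instead of assuming a uniform num_tiles_per_head and
// discarding invalid tiles later, each batch's true length is read from
// the prefix sums and converted into its own tile count.
int num_q_tiles(const std::vector<int>& cu_q_seqlens, int batch_idx, int step_q) {
  // Actual sequence length for this batch entry.
  int actual_q_seqlen = cu_q_seqlens[batch_idx + 1] - cu_q_seqlens[batch_idx];
  // Ceil-divide into q-tiles; an empty sequence contributes zero tiles,
  // so no invalid tile is ever scheduled and skipped afterwards.
  return (actual_q_seqlen + step_q - 1) / step_q;
}
```

For example, with `cu_q_seqlens = {0, 100, 228, 228}` and `step_q = 64`, the three batches contribute 2, 2, and 0 tiles respectively, whereas a uniform decode would schedule the maximum count for every batch and rely on later filtering.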

Validation

Validated locally on H100 / SM90:

  • FP8 mixed-length PACKED_QKV repro: racecheck clean, synccheck clean
  • FP8 ragged PACKED_QKV repro: racecheck clean, synccheck clean
  • FP8 mixed-length CONTIGUOUS_Q_KV repro: racecheck clean, synccheck clean
  • FP16 control: racecheck clean
  • Overnight stress: 50000 rounds completed without hanging
  • Focused FP8-output causal matrix: 96 passed, 16 skipped (SEPARATE_Q_K_V unsupported cases only)

coderabbitai Bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR refines FMHA v2 synchronization mechanisms by simplifying mutex coordination logic, updating barrier assembly mnemonics from legacy `bar` to `barrier`, adjusting DMA transposer unroll thresholds, and enabling test execution by removing a module-level skip marker.

Changes

  • csrc/fmha_v2/fmha/warpspec/dma.h, csrc/fmha_v2/fmha/warpspec/compute.h (DMA and synchronization refinements): added a named-barrier wait in the DMA transpose flow, unified mutex coordination into a single if constexpr (ENABLE_MUTEX) block (removing the element-byte-specific branches), removed redundant inter-warpgroup barrier arrivals in the softmax paths, and changed the DMA unroll discriminator threshold from > to >= for the STEP_KV == 128 case.
  • csrc/fmha_v2/fmha/hopper/arrive_wait.h (barrier assembly updates): updated PTX assembly mnemonics from legacy bar.arrive/bar.sync to modern barrier.arrive/barrier.sync, with no logic changes.
  • tests/attention/test_fmha_v2_prefill.py (test infrastructure): removed the unconditional module-level skip marker so the tests execute; per-test skip conditions for SM support and specific hang cases remain unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • sricketts
  • aleozlx
  • yongwww
  • yzh119
  • cyx-6
  • samuellees
  • saltyminty
  • bkryu

Poem

🐰 A bunny hops through synchronization's dance,
Barriers sync and mutexes prance,
Assembly updated with modern flair,
Tests now run without a care,
FMHA flows smoother everywhere! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 36.36%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — the title accurately describes the main change: fixing a barrier deadlock specific to fp8 with head_dim=256 in the transpose_v_tile function, which is the core issue addressed by the PR.
  • Description check ✅ — the PR description is comprehensive and well-structured, covering root causes, changes, and validation results.


gemini-code-assist Bot left a comment

Code Review

This pull request addresses a phase-flip race in the FMHA v2 DMA logic by adding a named_barrier_wait to prevent deadlocks. In the test suite, a general skip was removed, but a new skip was added for FP8 configurations with a head dimension of 256. Feedback indicates that this new skip contradicts the PR's objective of fixing the deadlock and should likely be removed if the fix is verified.

Comment thread on tests/attention/test_fmha_v2_prefill.py (outdated)
@bobboli bobboli marked this pull request as ready for review April 6, 2026 16:55
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from b76c689 to 4a775fb Compare April 6, 2026 16:55
bobboli (author) commented Apr 6, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been created, and the CI pipeline #47843767 is currently running. I'll report back once the pipeline job completes.

bobboli (author) commented Apr 7, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #47888948 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot (Collaborator):

[FAILED] Pipeline #47888948: 8/20 passed

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch 3 times, most recently from 794b3f8 to 81a6653 Compare April 13, 2026 17:28
bobboli (author) commented Apr 13, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48421201 is currently running. I'll report back once the pipeline job completes.

coderabbitai Bot left a comment

🧹 Nitpick comments (1)
csrc/fmha_v2/fmha/warpspec/dma.h (1)

615-622: Consider turning reserve+sync into a single API.

This bug class came from forgetting the barrier after a reserve-like operation. load_q(), load_kv(), and now transpose_v_tile() all depend on the same “reserve, then named-barrier sync” invariant, so a small helper would make that ordering much harder to miss again.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fmha_v2/fmha/warpspec/dma.h` around lines 615 - 622, The code relies on
the repeated pattern "reserve then named_barrier_wait" (seen at
cbw_v.threadReserve() followed by named_barrier_wait(...)) which is easy to
forget in functions like load_q(), load_kv(), and transpose_v_tile(); introduce
a single helper API (e.g., threadReserveAndSync() or cbw_v.reserve_and_sync())
that calls threadReserve() and immediately performs
named_barrier_wait(SYNC_BARRIER, NUM_THREADS_IN_DMA_GROUP) internally, replace
the separate call sites (the current cbw_v.threadReserve();
named_barrier_wait(...) pair) with the new single call to ensure the
reserve+sync invariant cannot be missed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f8b2b9e-6928-453d-a4a6-0713f5814892

📥 Commits

Reviewing files that changed from the base of the PR and between 7143fd466ba28ff7f7ea4679be6de7aedf84bb83 and 794b3f827c20c21c52f055e13323bf3416edee5a.

📒 Files selected for processing (3)
  • csrc/fmha_v2/fmha/hopper/arrive_wait.h
  • csrc/fmha_v2/fmha/warpspec/compute.h
  • csrc/fmha_v2/fmha/warpspec/dma.h

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 09f361f to 42c9b4d Compare April 14, 2026 03:17
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 42c9b4d to 0c26b04 Compare April 14, 2026 03:28
@bobboli bobboli requested a review from qsang-nv as a code owner April 14, 2026 14:42
bobboli (author) commented Apr 14, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48504914 is currently running. I'll report back once the pipeline job completes.

@jimmyzho jimmyzho mentioned this pull request Apr 14, 2026
5 tasks
bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48571675 is currently running. I'll report back once the pipeline job completes.

bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48577898 is currently running. I'll report back once the pipeline job completes.

@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from ede6e2e to 4b24359 Compare April 15, 2026 16:39
@bobboli bobboli force-pushed the fix/fmha-v2-fp8-head256-barrier-deadlock branch from 4b24359 to 050cbad Compare April 15, 2026 16:46
bobboli (author) commented Apr 15, 2026

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been updated with latest changes, and the CI pipeline #48611098 is currently running. I'll report back once the pipeline job completes.

jimmyzho (Contributor):

/bot run

flashinfer-bot (Collaborator):

GitLab MR !511 has been created, and the CI pipeline #48806541 is currently running. I'll report back once the pipeline job completes.

#pragma unroll 1
for (int batch_idx = 0; batch_idx < params.b; ++batch_idx) {
int const actual_q_seqlen =
params.cu_q_seqlens[batch_idx + 1] - params.cu_q_seqlens[batch_idx];
A contributor commented on the lines above:

There will be B accesses for every DMA iteration, will this be a concern?

// which is UB per PTX spec (bar.sync = barrier.sync.aligned requires all CTA threads
// to execute the same instruction) and was flagged by compute-sanitizer synccheck.
mutex.arrive();
mutex.wait();
A contributor commented on the lines above:

Does this have any functional impact?
