[fix] Fix barrier deadlock in fmha_v2 fp8+head_dim=256 transpose_v_tile #2957
bobboli wants to merge 3 commits into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough
The PR refines FMHA v2 synchronization mechanisms by simplifying mutex coordination logic and moving barrier assembly mnemonics off the legacy bar.sync form.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 2 passed ✅, 1 failed ❌
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Code Review
This pull request addresses a phase-flip race in the FMHA v2 DMA logic by adding a named_barrier_wait to prevent deadlocks. In the test suite, a general skip was removed, but a new skip was added for FP8 configurations with a head dimension of 256. Feedback indicates that this new skip contradicts the PR's objective of fixing the deadlock and should likely be removed if the fix is verified.
Force-pushed from b76c689 to 4a775fb
/bot run
/bot run
[FAILED] Pipeline #47888948: 8/20 passed
Force-pushed from 794b3f8 to 81a6653
/bot run
🧹 Nitpick comments (1)
csrc/fmha_v2/fmha/warpspec/dma.h (1)
615-622: Consider turning reserve+sync into a single API. This bug class came from forgetting the barrier after a reserve-like operation.
load_q(), load_kv(), and now transpose_v_tile() all depend on the same “reserve, then named-barrier sync” invariant, so a small helper would make that ordering much harder to miss again.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fmha_v2/fmha/warpspec/dma.h` around lines 615-622: the code relies on the repeated pattern "reserve then named_barrier_wait" (seen at cbw_v.threadReserve() followed by named_barrier_wait(...)), which is easy to forget in functions like load_q(), load_kv(), and transpose_v_tile(). Introduce a single helper API (e.g., threadReserveAndSync() or cbw_v.reserve_and_sync()) that calls threadReserve() and immediately performs named_barrier_wait(SYNC_BARRIER, NUM_THREADS_IN_DMA_GROUP) internally, and replace the separate call sites (the current cbw_v.threadReserve(); named_barrier_wait(...) pair) with the new single call so the reserve+sync invariant cannot be missed.
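As a sketch of what such a helper could look like (illustrative only: CircularBufferWriter, threadReserve(), and the barrier-id/thread-count parameters stand in for the real dma.h types and the SYNC_BARRIER / NUM_THREADS_IN_DMA_GROUP constants, whose exact signatures may differ):

```cpp
// Sketch: fuse "reserve a pipeline slot" with the named-barrier sync so the
// reserve+sync invariant holds by construction at every call site.
template <typename CircularBufferWriter>
__device__ inline void reserve_and_sync(CircularBufferWriter& cbw,
                                        int barrier_id, int num_threads) {
    cbw.threadReserve();  // reserve first (hypothetical API from the comment)...
    // ...then immediately sync the group on the named barrier, so no thread
    // can race past the phase flip before its peers have reserved.
    asm volatile("bar.sync %0, %1;" : : "r"(barrier_id), "r"(num_threads));
}
```

Call sites in load_q(), load_kv(), and transpose_v_tile() would then replace each threadReserve(); named_barrier_wait(...) pair with a single reserve_and_sync(cbw_v, SYNC_BARRIER, NUM_THREADS_IN_DMA_GROUP) call.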
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f8b2b9e-6928-453d-a4a6-0713f5814892
📥 Commits
Reviewing files that changed from the base of the PR and between 7143fd466ba28ff7f7ea4679be6de7aedf84bb83 and 794b3f827c20c21c52f055e13323bf3416edee5a.
📒 Files selected for processing (3)
csrc/fmha_v2/fmha/hopper/arrive_wait.h
csrc/fmha_v2/fmha/warpspec/compute.h
csrc/fmha_v2/fmha/warpspec/dma.h
Force-pushed from 09f361f to 42c9b4d
Force-pushed from 42c9b4d to 0c26b04
/bot run
/bot run
/bot run
Force-pushed from ede6e2e to 4b24359
Force-pushed from 4b24359 to 050cbad
/bot run
/bot run
```cpp
#pragma unroll 1
for (int batch_idx = 0; batch_idx < params.b; ++batch_idx) {
    int const actual_q_seqlen =
        params.cu_q_seqlens[batch_idx + 1] - params.cu_q_seqlens[batch_idx];
```
There will be B accesses to cu_q_seqlens for every DMA iteration; will this be a concern?
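One possible mitigation, sketched under assumptions (kMaxBatch and smem_q_seqlens are hypothetical names, and whether the warp-specialized DMA group can afford the extra shared memory is unverified here): stage the per-batch lengths into shared memory once, so each DMA iteration scans smem instead of issuing B global loads.

```cpp
// Hypothetical sketch: cache the cu_q_seqlens differences once up front.
constexpr int kMaxBatch = 512;  // assumed upper bound, not an fmha_v2 constant
__shared__ int smem_q_seqlens[kMaxBatch];
if (threadIdx.x == 0) {
    for (int b = 0; b < params.b; ++b) {
        smem_q_seqlens[b] = params.cu_q_seqlens[b + 1] - params.cu_q_seqlens[b];
    }
}
__syncthreads();  // the real code would use the DMA group's own barrier

#pragma unroll 1
for (int batch_idx = 0; batch_idx < params.b; ++batch_idx) {
    int const actual_q_seqlen = smem_q_seqlens[batch_idx];
    // ... tile-validity check and scheduling as in the diff above ...
}
```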
```cpp
// which is UB per PTX spec (bar.sync = barrier.sync.aligned requires all CTA threads
// to execute the same instruction) and was flagged by compute-sanitizer synccheck.
mutex.arrive();
mutex.wait();
```
Does this have any functional impact?
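For context on the distinction the comment above draws: bar.sync is defined by the PTX ISA as barrier.sync.aligned, which requires every thread in the CTA to execute the same instruction, whereas the non-aligned barrier.arrive / barrier.sync pair is legal for divergent producer/consumer groups. A self-contained sketch of that split (illustrative only; fmha_v2's actual mutex type, phase handling, and barrier-id management may differ):

```cpp
// Illustrative named-barrier rendezvous. Producers signal with barrier.arrive
// (non-blocking, allowed in divergent code); consumers block in barrier.sync
// until the named total thread count has arrived.
struct NamedBarrierMutex {
    int id;           // hardware named-barrier id
    int num_threads;  // total participating threads (multiple of 32)

    __device__ void arrive() const {
        asm volatile("barrier.arrive %0, %1;" : : "r"(id), "r"(num_threads));
    }
    __device__ void wait() const {
        asm volatile("barrier.sync %0, %1;" : : "r"(id), "r"(num_threads));
    }
};
```

In principle the arrive/wait split reaches the same rendezvous as the old bar.sync, just without the .aligned requirement that synccheck was flagging, so it should be a legality fix rather than a behavioral change.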
Summary
This PR fixes the SM90 FP8 fmha_v2 deadlock on the warp-specialized QGMMA path and re-enables FP8-output prefill coverage.
Cleaned commit stack:
- fix(fmha_v2): fix fp8 transpose barrier pipeline on SM90
- fix(fmha_v2): fix fp8 persistent scheduler for ragged q-tiles
- test(fmha_v2): enable fp8 output prefill coverage
Root Causes
There were three separate issues:
- a barrier phase-flip race in transpose_v_tile(), where the buffer reserve was not followed by the named-barrier sync
- the persistent scheduler computing num_tiles_per_head up front and skipping invalid tiles later, which mishandles ragged q-tiles
- a buffering problem specific to FP8 output with head_dim=256
What Changed
- add the missing named-barrier wait after the buffer reserve in transpose_v_tile()
- compute actual per-batch q sequence lengths from cu_q_seqlens in the persistent scheduler loop
- reduce kv_tile_buffers to 1 for SM90 FP8-output with head_dim > 128
Validation
Validated locally on H100 / SM90:
- PACKED_QKV repro: racecheck clean, synccheck clean
- PACKED_QKV repro: racecheck clean, synccheck clean
- CONTIGUOUS_Q_KV repro: racecheck clean, synccheck clean
- racecheck clean, 50000 rounds completed without hanging
- 96 passed, 16 skipped (SEPARATE_Q_K_V unsupported cases only)