
feat: add SM120 fmha_v2 kernels to AOT pip wheel builds #2885

Merged
jimmyzho merged 2 commits into flashinfer-ai:main from blake-snc:feat/enable-sm120-default
May 11, 2026

Conversation

@blake-snc
Contributor

@blake-snc blake-snc commented Mar 24, 2026

Summary

gen_trtllm_fmha_v2_sm120_module() exists in jit/attention/modules.py and the JIT runtime path (generate_kernels.py) already dispatches to it correctly. However, aot.py's gen_all_modules() — which drives the pip wheel AOT build — was missing it from the has_sm120 or has_sm121 section.

This means SM120/SM121 devices using a pip wheel would never get the fmha_v2 SM120 kernels compiled into the wheel, and would have to fall back to slower paths.
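
For context, whether a device falls in this range can be checked with PyTorch's standard capability query; SM120 corresponds to compute capability 12.0. A minimal check (not part of this PR):

    import torch

    # SM120 / SM121 devices report compute capability 12.x,
    # e.g. (12, 0) for SM120.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Detected sm{major}{minor}")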

Fix: Add gen_trtllm_fmha_v2_sm120_module() to the has_sm120 or has_sm121 block in aot.py, alongside the other SM120 modules (fused MOE, GEMM, FP4 quantization).
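
For reference, a sketch of the resulting has_sm120 / has_sm121 block in flashinfer/aot.py, based on the review diff later in this thread (surrounding code omitted):

    if has_sm120 or has_sm121:
        # SM120 and SM121 share the same CUTLASS kernels for fused MOE and GEMM.
        # The SM120 module generators use supported_major_versions=[12] which
        # compiles for all SM12x targets.
        jit_specs.append(gen_cutlass_fused_moe_sm120_module())
        jit_specs.append(gen_gemm_sm120_module())
        jit_specs.append(gen_gemm_sm120_module_cutlass_fp4())
        jit_specs.append(gen_trtllm_fmha_v2_sm120_module())  # added by this PR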

No behavior change for JIT users; only affects AOT pip wheel builds.

Addresses the AOT gap noted in #2555.

Contributed by Second Nature Computing (https://joinsecondnature.com)

Summary by CodeRabbit

  • Chores
    • Expanded optimized inference module support for SM120 and SM121 GPUs to include attention kernels in addition to existing fused MoE and GEMM optimizations.
    • Increased runtime coverage and readiness for attention-heavy workloads on those architectures, improving performance consistency for models using attention.


@coderabbitai
Contributor

coderabbitai Bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1e3df6a-92cd-43b3-b5f9-5ab3a6d22224

📥 Commits

Reviewing files that changed from the base of the PR and between b3d8f49 and 15f100e.

📒 Files selected for processing (1)
  • flashinfer/aot.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/aot.py

📝 Walkthrough


This change imports gen_trtllm_fmha_v2_sm120_module and appends its JIT spec in gen_all_modules() when has_sm120 or has_sm121 is true, extending SM12x shared-kernel coverage to include attention alongside existing fused MoE/GEMM modules.

Changes

FMHA_V2 SM120 Module Integration

  • Import: attention module generator (flashinfer/aot.py): Added the gen_trtllm_fmha_v2_sm120_module import used by the AOT JIT-spec generator.
  • gen_all_modules: include SM120 attention JIT spec (flashinfer/aot.py): When has_sm120 or has_sm121 is true, appended gen_trtllm_fmha_v2_sm120_module() to the generated JIT spec list and updated the inline comment to include attention with fused MoE/GEMM.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes


Suggested labels

run-ci, op: moe

Suggested reviewers

  • yzh119
  • jimmyzho
  • cyx-6
  • bkryu
  • nv-yunzheq

Poem

🐇 I hop through imports, nibble a line,
I tuck attention kernels into the vine,
SM12x hums now, modules align,
A tiny change — the forest sings, divine. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description provides a detailed summary explaining what was missing, why it matters, and what the fix does. However, it does not follow the repository's description template structure with the required sections like 'Description', 'Related Issues', 'Pre-commit Checks', and 'Tests'. Resolution: reorganize the description to match the template structure: add a 'Description' section with the summary, include a 'Related Issues' section (already mentions #2555), and confirm completion of checklist items for pre-commit hooks and tests.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check (✅ Passed): The title clearly and specifically describes the main change: adding SM120 fmha_v2 kernels to AOT pip wheel builds, which is the primary purpose of this PR.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.





@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the AOT compilation process for FlashInfer's pip wheel builds by integrating the fmha_v2 kernels specifically designed for SM120/SM121 GPU architectures. This ensures that users deploying with pre-compiled wheels on compatible hardware will benefit from optimized performance, addressing a previously identified gap where these kernels were not included, leading to suboptimal execution paths. The change is isolated to AOT builds and does not impact JIT compilation workflows.

Highlights

  • AOT Compilation for SM120/SM121: Added the fmha_v2 kernels for SM120/SM121 architectures to the Ahead-Of-Time (AOT) pip wheel build process.
  • Performance Improvement: Resolved an issue where SM120/SM121 devices using pip wheels would fall back to slower paths due to missing compiled fmha_v2 kernels.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • /gemini review (Code Review): Performs a code review for the current pull request in its current state.
  • /gemini summary (Pull Request Summary): Provides a summary of the current pull request in its current state.
  • @gemini-code-assist (Comment): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • /gemini help (Help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates a new TensorRT-LLM Flash Attention v2 module for SM120 architectures into the AOT compilation process. A review comment suggests updating an existing code comment to accurately reflect the inclusion of attention kernels alongside fused MOE and GEMM, improving clarity and maintainability.

Comment thread: flashinfer/aot.py

@@ -527,6 +528,7 @@ def gen_all_modules(
             jit_specs.append(gen_cutlass_fused_moe_sm120_module())
             jit_specs.append(gen_gemm_sm120_module())
             jit_specs.append(gen_gemm_sm120_module_cutlass_fp4())
+            jit_specs.append(gen_trtllm_fmha_v2_sm120_module())
Contributor


Severity: medium

With the addition of this fmha_v2 module, the comment on lines 525-527 is now slightly outdated as it only mentions 'fused MOE and GEMM'. For better maintainability, please consider updating it to include attention kernels for clarity.

For example:

-            # SM120 and SM121 share the same CUTLASS kernels for fused MOE and GEMM.
+            # SM120 and SM121 share the same kernels for fused MOE, GEMM, and attention.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
flashinfer/aot.py (1)

524-531: Consider decoupling FMHA v2 from the add_moe gate.

The module appended at line 531 is an attention kernel, but it is only emitted when add_moe is True. For custom AOT configs (--add-moe false), that can unexpectedly drop FMHA v2.

♻️ Suggested placement change
@@
-    if add_moe:
+    if has_sm120 or has_sm121:
+        jit_specs.append(gen_trtllm_fmha_v2_sm120_module())
+
+    if add_moe:
@@
         if has_sm120 or has_sm121:
             # SM120 and SM121 share the same CUTLASS kernels for fused MOE and GEMM.
             # The SM120 module generators use supported_major_versions=[12] which
             # compiles for all SM12x targets.
             jit_specs.append(gen_cutlass_fused_moe_sm120_module())
             jit_specs.append(gen_gemm_sm120_module())
             jit_specs.append(gen_gemm_sm120_module_cutlass_fp4())
-            jit_specs.append(gen_trtllm_fmha_v2_sm120_module())
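
For illustration, a sketch of how the SM12x handling would look with the suggested placement applied (the surrounding add_moe structure is assumed from the diff above):

    # FMHA v2 is emitted for all SM12x targets, independent of the MoE flag.
    if has_sm120 or has_sm121:
        jit_specs.append(gen_trtllm_fmha_v2_sm120_module())

    if add_moe:
        ...
        if has_sm120 or has_sm121:
            # SM120 and SM121 share the same CUTLASS kernels for fused MOE and GEMM.
            jit_specs.append(gen_cutlass_fused_moe_sm120_module())
            jit_specs.append(gen_gemm_sm120_module())
            jit_specs.append(gen_gemm_sm120_module_cutlass_fp4())
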
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/aot.py` around lines 524-531, the FMHA v2 module
gen_trtllm_fmha_v2_sm120_module() is currently gated by the add_moe flag (inside
the has_sm120/has_sm121 block) which causes FMHA v2 to be omitted when --add-moe
false; update the logic so that gen_trtllm_fmha_v2_sm120_module() is appended to
jit_specs independently of add_moe (i.e., move or duplicate the call out of the
add_moe-specific branch in the SM120/SM121 handling code), or replace the
add_moe check with a dedicated attention-kernel condition so FMHA v2 is always
emitted for SM12x targets regardless of the MOE flag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06c41597-0a60-4df3-961f-79d2f7163cd4

📥 Commits

Reviewing files that changed from the base of the PR and between 6d34eba and b4ae6490ee29cb3056f8b1103d097b82724867a7.

📒 Files selected for processing (1)
  • flashinfer/aot.py

blake-snc and others added 2 commits May 11, 2026 10:44
`gen_trtllm_fmha_v2_sm120_module()` was already callable via JIT
(generate_kernels.py dispatches to it at runtime), but was never
registered in gen_all_modules() in aot.py. SM120/SM121 devices
getting flashinfer from a pip wheel would skip the fmha_v2 SM120
kernels entirely during the AOT build step, falling back to slower
paths or missing support.

Add it to the `has_sm120 or has_sm121` section alongside the other
SM120 modules (fused MOE, GEMM, FP4 quantization).

Contributed by Second Nature Computing (https://joinsecondnature.com)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@saltyminty saltyminty force-pushed the feat/enable-sm120-default branch from b3d8f49 to 15f100e Compare May 11, 2026 17:45
@jimmyzho jimmyzho merged commit 4dba29f into flashinfer-ai:main May 11, 2026
54 of 56 checks passed
