unroll 4x4 for gemm and sdpa vulkan, vectorize a and b loading, avoid bank conflict #6524
nihui merged 9 commits into Tencent:master from
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #6524      +/-   ##
==========================================
- Coverage   93.08%   93.04%    -0.04%
==========================================
  Files         809      809
  Lines      256380   256310       -70
==========================================
- Hits       238651   238493      -158
- Misses      17729    17817       +88
```
☔ View full report in Codecov by Sentry.
Pull request overview
This PR optimizes the GEMM and SDPA Vulkan compute shaders by implementing 4x4 unrolling instead of 2x2, vectorizing data loading using vec4 operations, and adding padding to shared memory arrays to avoid bank conflicts. According to the performance results, these optimizations provide a significant 21% speedup (from 1m42s to 1m20s) for end-to-end processing on 1024x1024 images.
Changes:
- Refactored GEMM and SDPA shaders from 2x2 to 4x4 unrolling with vectorized loads/stores
- Added PAD=1 to shared memory arrays to avoid bank conflicts (see the sketch after this list)
- Updated dispatcher calculations to use (N+3)/4 and (M+3)/4 instead of (N+1)/2 and (M+1)/2
- Added new conversion macros (afp2lfp and afp2lfpvec4) for bidirectional conversions between arithmetic and local memory precision types
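For the PAD=1 item above, here is a minimal sketch of the idea, assuming a 16x16 tile, 32 four-byte shared-memory banks, and the lfp local-memory type implied by the new macros; the actual tile sizes and array names live in gemm.comp and sdpa_cross.comp:

```glsl
// With an unpadded row length of 16, threads walking down a column
// (tmp_a[i][c] for i = 0..15) touch addresses 16 words apart, which map to
// only two distinct banks out of 32, so the reads serialize. Padding each
// row by one element makes the row stride 17, coprime with 32, so every
// row lands on a different bank and the column walk is conflict-free.
#define PAD 1
shared lfp tmp_a[16][16 + PAD];
shared lfp tmp_b[16][16 + PAD];
```

The (N+3)/4 and (M+3)/4 dispatcher changes are plain ceiling division, so each invocation now covers a 4x4 output tile instead of the previous 2x2.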
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/layer/vulkan/shader/sdpa_cross.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, and updated all load/store operations |
| src/layer/vulkan/shader/gemm.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, optimized alpha/beta multiplications, and updated all load/store operations |
| src/layer/vulkan/sdpa_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/layer/vulkan/gemm_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/gpu.cpp | Added conversion macros afp2lfp and afp2lfpvec4 for all precision modes (see the sketch below the table), but contains critical typos |
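For context on that last row, afp2lfp and afp2lfpvec4 convert a value from arithmetic precision (afp) to the precision used for the shared-memory tiles. Purely as an illustrative guess, the all-fp32 variant of the injected preamble could be identity conversions like the following; the real per-mode definitions are in src/gpu.cpp and are what the typo remark refers to:

```glsl
// Hypothetical fp32-only preamble (an assumption for illustration only);
// gpu.cpp emits different definitions for each fp16 storage/arithmetic mode.
#define lfp float
#define lfpvec4 vec4
#define afp2lfp(v) (v)        // arithmetic precision -> local (shared) memory precision
#define afp2lfpvec4(v) (v)
```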
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
```diff
 {
+    const uvec4 gy4 = gy * 4 + uvec4(0, 1, 2, 3);
+    const uvec4 ai4 = gz * psc(A_cstep) + gy4 * psc(K) + k;
+
+    const uvec4 ai4d4 = ai4 / 4;
+    const uvec4 ai4m4 = ai4 % 4;
+
+    a.r = buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r];
+    a.g = buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g];
+    a.b = buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b];
+    a.a = buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a];
 }

-afp k0;
-afp k1;
+afpvec4 b;
 if (transB == 0)
 {
-    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx;
-    k0 = buffer_ld1(B_blob_data, bi);
-    k1 = buffer_ld1(B_blob_data, bi + 1);
+    if (psc(N) % 4 == 0)
+    {
+        const uint bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) / 4 + k * (psc(N) / 4) + gx;
+        b = buffer_ld4(B_blob_data, bi);
+    }
+    else
+    {
+        const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx * 4 + uvec4(0, 1, 2, 3);
+
+        const uvec4 bi4d4 = bi4 / 4;
+        const uvec4 bi4m4 = bi4 % 4;
+
+        b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
+        b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
+        b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
+        b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
+    }
 }
 else
 {
-    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx * psc(K) + k;
-    k0 = buffer_ld1(B_blob_data, bi);
-    k1 = buffer_ld1(B_blob_data, bi + psc(K));
+    const uvec4 gx4 = gx * 4 + uvec4(0, 1, 2, 3);
+    const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx4 * psc(K) + k;
+
+    const uvec4 bi4d4 = bi4 / 4;
+    const uvec4 bi4m4 = bi4 % 4;
+
+    b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
+    b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
+    b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
+    b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
 }
```
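The ai4 / 4, ai4 % 4 (and bi4 / 4, bi4 % 4) decomposition is what lets the loads stay on the vec4-typed buffer even when the four consecutive elements are not 16-byte aligned: each lane works out which packed vec4 it falls into and which component to extract, at the cost of up to four separate buffer_ld4 calls. The psc(N) % 4 == 0 branch is the aligned fast path where a single buffer_ld4 fetches all four columns at once.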
The loads from A_blob_data and B_blob_data use gy4 and gx * 4 + uvec4(0, 1, 2, 3) without per-lane bounds checks, so on tail tiles where psc(M) or psc(N) is not a multiple of 4, some lanes index rows/columns beyond the actual matrix dimensions. Because buffer_ld4 is a thin wrapper over buf[i] and the buffers are sized only for psc(M) * psc(K) and psc(K) * psc(N) elements respectively, this produces out-of-bounds storage-buffer reads that are then incorporated into the attention scores, potentially exposing or corrupting adjacent GPU memory. This pattern appears in both the main loop and the remainder loop; consider adding per-lane gy4/gx4 bounds checks (or adjusting the dispatch) before calling buffer_ld4/buffer_sm4 so no lane reads past the allocated ranges.
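A minimal sketch of one possible per-lane guard for the A loads, reusing the names from the excerpt above; the clamp-to-last-row strategy is an assumption and relies on the store side already discarding out-of-range rows:

```glsl
// Clamp each lane's row index so no lane addresses past psc(M) rows;
// out-of-range lanes simply re-read the last valid row, and their results
// are assumed to be masked out when the output tile is stored.
const uvec4 gy4 = min(gy * 4 + uvec4(0, 1, 2, 3), uvec4(psc(M) - 1));
const uvec4 ai4 = gz * psc(A_cstep) + gy4 * psc(K) + k;

const uvec4 ai4d4 = ai4 / 4;
const uvec4 ai4m4 = ai4 % 4;

a.r = buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r];
a.g = buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g];
a.b = buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b];
a.a = buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a];
```

The gx4-based B loads in both transB branches would need the same clamp against psc(N).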
[benchmark screenshots: zimage-ncnn-vulkan, 1024x1024]