
unroll 4x4 for gemm and sdpa vulkan, vectorize a and b loading, avoid bank conflict #6524

Merged
nihui merged 9 commits into Tencent:master from nihui:opt-vulkan-gemm-2 on Jan 28, 2026

Conversation

nihui (Member) commented Jan 24, 2026

zimage-ncnn-vulkan 1024x1024

end2end 7900xtx (no cm)

| variant | time |
| --- | --- |
| baseline | 1m42.569s |
| +gemm 4x4/vec | 1m34.738s |
| +sdpa 4x4/vec | 1m31.767s |
| +gemm/sdpa smpad | 1m20.524s |

tencent-adm (Member) commented

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov-commenter commented Jan 24, 2026

Codecov Report

❌ Patch coverage is 24.57627% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.04%. Comparing base (0deee81) to head (c20bff1).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/layer/vulkan/gemm_vulkan.cpp | 15.78% | 80 Missing ⚠️ |
| src/gpu.cpp | 50.00% | 9 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6524      +/-   ##
==========================================
- Coverage   93.08%   93.04%   -0.04%     
==========================================
  Files         809      809              
  Lines      256380   256310      -70     
==========================================
- Hits       238651   238493     -158     
- Misses      17729    17817      +88     

☔ View full report in Codecov by Sentry.

Copilot AI (Contributor) left a comment

Pull request overview

This PR optimizes the GEMM and SDPA Vulkan compute shaders by implementing 4x4 unrolling instead of 2x2, vectorizing data loading using vec4 operations, and adding padding to shared memory arrays to avoid bank conflicts. According to the performance results, these optimizations provide a significant 21% speedup (from 1m42s to 1m20s) for end-to-end processing on 1024x1024 images.
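
To make the three ideas concrete, here is a minimal stand-alone GLSL sketch of the scheme described above. It is not the actual gemm.comp from this PR: it uses plain fp32 instead of ncnn's afp/lfp macro types, made-up buffer and variable names, a fixed 8x8 workgroup, and it assumes M, N and K are multiples of 32 so no tail handling is needed.

```glsl
#version 450

// Stand-alone sketch only, NOT the gemm.comp from this PR. It illustrates:
// vec4 loads of A and B, a 4x4 output tile per invocation, and the +1
// padding column on the shared tiles used to avoid bank conflicts.
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout (binding = 0) readonly buffer a_blob { vec4 a_data[]; };  // M x K, row-major
layout (binding = 1) readonly buffer b_blob { vec4 b_data[]; };  // K x N, row-major
layout (binding = 2) writeonly buffer c_blob { vec4 c_data[]; }; // M x N, row-major

layout (push_constant) uniform parameter { int M; int N; int K; } p;

#define PAD 1

shared vec4 sa[32][8 + PAD]; // 32 rows of A, 8 vec4 chunks along K, padded
shared vec4 sb[32][8 + PAD]; // 32 rows of B (K direction), 8 vec4 chunks along N, padded

void main()
{
    const int lx = int(gl_LocalInvocationID.x); // selects one vec4 chunk of output columns
    const int ly = int(gl_LocalInvocationID.y); // selects a group of 4 output rows
    const int m0 = int(gl_WorkGroupID.y) * 32;  // first output row of this workgroup
    const int n0 = int(gl_WorkGroupID.x) * 32;  // first output column of this workgroup

    // 4x4 output tile: 4 rows, each accumulator vec4 spans 4 columns
    vec4 sum0 = vec4(0.f);
    vec4 sum1 = vec4(0.f);
    vec4 sum2 = vec4(0.f);
    vec4 sum3 = vec4(0.f);

    for (int k0 = 0; k0 < p.K; k0 += 32)
    {
        // cooperative vectorized loads: each invocation fetches 4 vec4s per tile
        for (int i = 0; i < 4; i++)
        {
            const int r = ly * 4 + i;
            sa[r][lx] = a_data[((m0 + r) * p.K + k0) / 4 + lx];
            sb[r][lx] = b_data[((k0 + r) * p.N + n0) / 4 + lx];
        }
        barrier();

        // 4x4 inner product over this K tile
        for (int kk = 0; kk < 8; kk++)
        {
            const vec4 b0 = sb[kk * 4 + 0][lx];
            const vec4 b1 = sb[kk * 4 + 1][lx];
            const vec4 b2 = sb[kk * 4 + 2][lx];
            const vec4 b3 = sb[kk * 4 + 3][lx];

            const vec4 a0 = sa[ly * 4 + 0][kk];
            const vec4 a1 = sa[ly * 4 + 1][kk];
            const vec4 a2 = sa[ly * 4 + 2][kk];
            const vec4 a3 = sa[ly * 4 + 3][kk];

            sum0 += a0.x * b0 + a0.y * b1 + a0.z * b2 + a0.w * b3;
            sum1 += a1.x * b0 + a1.y * b1 + a1.z * b2 + a1.w * b3;
            sum2 += a2.x * b0 + a2.y * b1 + a2.z * b2 + a2.w * b3;
            sum3 += a3.x * b0 + a3.y * b1 + a3.z * b2 + a3.w * b3;
        }
        barrier();
    }

    // each invocation stores its 4x4 block of C as 4 vec4s
    c_data[((m0 + ly * 4 + 0) * p.N + n0) / 4 + lx] = sum0;
    c_data[((m0 + ly * 4 + 1) * p.N + n0) / 4 + lx] = sum1;
    c_data[((m0 + ly * 4 + 2) * p.N + n0) / 4 + lx] = sum2;
    c_data[((m0 + ly * 4 + 3) * p.N + n0) / 4 + lx] = sum3;
}
```

Without the +1 padding column, each shared tile row would span exactly 8 vec4s (32 floats), so column-wise reads of the tile tend to land on the same banks across invocations; the extra column skews consecutive rows so those reads are spread over different banks.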

Changes:

  • Refactored GEMM and SDPA shaders from 2x2 to 4x4 unrolling with vectorized loads/stores
  • Added PAD=1 to shared memory arrays to avoid bank conflicts
  • Updated dispatcher calculations to use (N+3)/4 and (M+3)/4 instead of (N+1)/2 and (M+1)/2
  • Added new conversion macros (afp2lfp and afp2lfpvec4) for bidirectional conversions between arithmetic and local memory precision types (a hypothetical expansion is sketched after this list)
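
For context, a rough illustration of what the new macros could expand to in one precision configuration. This is an assumption made for the example, not the actual definitions emitted from src/gpu.cpp (which cover every precision mode):

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical single-mode expansion: afp/afpvec4 assumed to be fp16
// arithmetic types, lfp/lfpvec4 assumed to be plain fp32 for shared-memory
// tiles. The real per-mode definitions are generated in src/gpu.cpp.
#define afp            float16_t
#define afpvec4        f16vec4
#define lfp            float
#define lfpvec4        vec4
#define afp2lfp(v)     float(v)   // arithmetic -> local (shared) memory
#define afp2lfpvec4(v) vec4(v)
#define lfp2afpvec4(v) f16vec4(v) // reverse direction, defined here for the example

layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

shared lfpvec4 sk[8][8 + 1]; // padded shared tile, as in the PR

void main()
{
    const uint lx = gl_LocalInvocationID.x;
    const uint ly = gl_LocalInvocationID.y;

    afpvec4 k = afpvec4(afp(1.0));        // stand-in for a freshly loaded K vector
    sk[ly][lx] = afp2lfpvec4(k);          // stage through shared memory as lfp
    barrier();
    afpvec4 k2 = lfp2afpvec4(sk[ly][lx]); // read back in arithmetic precision
}
```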

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/layer/vulkan/shader/sdpa_cross.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, and updated all load/store operations |
| src/layer/vulkan/shader/gemm.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, optimized alpha/beta multiplications, and updated all load/store operations |
| src/layer/vulkan/sdpa_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/layer/vulkan/gemm_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/gpu.cpp | Added conversion macros afp2lfp and afp2lfpvec4 for all precision modes, but contains critical typos |


nihui (Member, Author) commented Jan 26, 2026

(4 attached images)

github-actions bot added the doc label on Jan 26, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.



Comment on lines +229 to 275
```glsl
{
    const uvec4 gy4 = gy * 4 + uvec4(0, 1, 2, 3);
    const uvec4 ai4 = gz * psc(A_cstep) + gy4 * psc(K) + k;

    const uvec4 ai4d4 = ai4 / 4;
    const uvec4 ai4m4 = ai4 % 4;

    a.r = buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r];
    a.g = buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g];
    a.b = buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b];
    a.a = buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a];
}

afp k0;
afp k1;
afpvec4 b;
if (transB == 0)
{
    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx;
    k0 = buffer_ld1(B_blob_data, bi);
    k1 = buffer_ld1(B_blob_data, bi + 1);
    if (psc(N) % 4 == 0)
    {
        const uint bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) / 4 + k * (psc(N) / 4) + gx;
        b = buffer_ld4(B_blob_data, bi);
    }
    else
    {
        const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx * 4 + uvec4(0, 1, 2, 3);

        const uvec4 bi4d4 = bi4 / 4;
        const uvec4 bi4m4 = bi4 % 4;

        b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
        b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
        b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
        b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
    }
}
else
{
    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx * psc(K) + k;
    k0 = buffer_ld1(B_blob_data, bi);
    k1 = buffer_ld1(B_blob_data, bi + psc(K));
    const uvec4 gx4 = gx * 4 + uvec4(0, 1, 2, 3);
    const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx4 * psc(K) + k;

    const uvec4 bi4d4 = bi4 / 4;
    const uvec4 bi4m4 = bi4 % 4;

    b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
    b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
    b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
    b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
}
```
Copilot AI commented Jan 27, 2026

The loads from A_blob_data and B_blob_data use gy4 and gx * 4 + uvec4(0, 1, 2, 3) without per-lane bounds checks, so on tail tiles where psc(M) or psc(N) is not a multiple of 4, some lanes index rows or columns beyond the actual matrix dimensions. Because buffer_ld4 is a thin wrapper over buf[i] and the buffers are sized only for psc(M) * psc(K) and psc(K) * psc(N) elements respectively, this produces out-of-bounds storage-buffer reads that are then incorporated into the attention scores, potentially exposing or corrupting adjacent GPU memory. The pattern appears in both the main loop and the remainder loop; consider adding per-lane gy4/gx4 bounds checks (or adjusting the dispatch) before calling buffer_ld4/buffer_sm4 so that no lane reads past the allocated ranges.
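
For reference, one way the suggested per-lane guard could look for the A loads, sketched against the hunk above. It reuses the existing identifiers (gy4, psc(M), buffer_ld4, A_blob_data) and assumes the surrounding declarations in the shader are unchanged; the same pattern would apply to the gx4-based B loads. This is a sketch of the reviewer's suggestion, not code from the PR.

```glsl
// Sketch only: clamp each lane's row index into range, then zero the lanes
// that fall past psc(M) so they cannot contribute to the attention scores.
const uvec4 gy4 = gy * 4 + uvec4(0, 1, 2, 3);
const bvec4 gy4_valid = lessThan(gy4, uvec4(psc(M)));

const uvec4 gy4c = min(gy4, uvec4(psc(M) - 1));
const uvec4 ai4 = gz * psc(A_cstep) + gy4c * psc(K) + k;

const uvec4 ai4d4 = ai4 / 4;
const uvec4 ai4m4 = ai4 % 4;

a.r = gy4_valid.r ? buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r] : afp(0.f);
a.g = gy4_valid.g ? buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g] : afp(0.f);
a.b = gy4_valid.b ? buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b] : afp(0.f);
a.a = gy4_valid.a ? buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a] : afp(0.f);
```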

nihui merged commit 526e5e6 into Tencent:master on Jan 28, 2026
111 of 115 checks passed