unroll 4x4 for gemm and sdpa vulkan, vectorize a and b loading, avoid bank conflict #6524
nihui merged 9 commits into Tencent:master from
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #6524      +/-   ##
==========================================
- Coverage   93.08%   93.04%    -0.04%
==========================================
  Files         809      809
  Lines      256380   256310       -70
==========================================
- Hits       238651   238493      -158
- Misses      17729    17817       +88
```
☔ View full report in Codecov by Sentry.
Pull request overview
This PR optimizes the GEMM and SDPA Vulkan compute shaders by implementing 4x4 unrolling instead of 2x2, vectorizing data loading using vec4 operations, and adding padding to shared memory arrays to avoid bank conflicts. According to the performance results, these optimizations provide a significant 21% speedup (from 1m42s to 1m20s) for end-to-end processing on 1024x1024 images.
Changes:
- Refactored GEMM and SDPA shaders from 2x2 to 4x4 unrolling with vectorized loads/stores
- Added PAD=1 to shared memory arrays to avoid bank conflicts (see the sketch after this list)
- Updated dispatcher calculations to use (N+3)/4 and (M+3)/4 instead of (N+1)/2 and (M+1)/2
- Added new conversion macros (afp2lfp and afp2lfpvec4) for bidirectional conversions between arithmetic and local memory precision types
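For the PAD=1 item above, here is a minimal sketch of the idea, assuming a 16x16 tile, 32 four-byte shared-memory banks, and the lfp local-memory type implied by the new macros; the actual tile sizes and array names live in gemm.comp and sdpa_cross.comp:

```glsl
// With an unpadded row length of 16, threads walking down a column
// (tmp_a[i][c] for i = 0..15) touch addresses 16 words apart, which map to
// only two distinct banks out of 32, so the reads serialize. Padding each
// row by one element makes the row stride 17, coprime with 32, so every
// row lands on a different bank and the column walk is conflict-free.
#define PAD 1
shared lfp tmp_a[16][16 + PAD];
shared lfp tmp_b[16][16 + PAD];
```

The (N+3)/4 and (M+3)/4 dispatcher changes are plain ceiling division, so each invocation now covers a 4x4 output tile instead of the previous 2x2.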
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/layer/vulkan/shader/sdpa_cross.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, and updated all load/store operations |
| src/layer/vulkan/shader/gemm.comp | Refactored from 2x2 to 4x4 unrolling with vec4 buffer types, added shared memory padding, optimized alpha/beta multiplications, and updated all load/store operations |
| src/layer/vulkan/sdpa_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/layer/vulkan/gemm_vulkan.cpp | Updated dispatcher dimensions to divide by 4 instead of 2 to match shader changes |
| src/gpu.cpp | Added conversion macros afp2lfp and afp2lfpvec4 for all precision modes (see the sketch below the table), but contains critical typos |
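For context on that last row, afp2lfp and afp2lfpvec4 convert a value from arithmetic precision (afp) to the precision used for the shared-memory tiles. Purely as an illustrative guess, the all-fp32 variant of the injected preamble could be identity conversions like the following; the real per-mode definitions are in src/gpu.cpp and are what the typo remark refers to:

```glsl
// Hypothetical fp32-only preamble (an assumption for illustration only);
// gpu.cpp emits different definitions for each fp16 storage/arithmetic mode.
#define lfp float
#define lfpvec4 vec4
#define afp2lfp(v) (v)        // arithmetic precision -> local (shared) memory precision
#define afp2lfpvec4(v) (v)
```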
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
```diff
 {
+    const uvec4 gy4 = gy * 4 + uvec4(0, 1, 2, 3);
+    const uvec4 ai4 = gz * psc(A_cstep) + gy4 * psc(K) + k;
+
+    const uvec4 ai4d4 = ai4 / 4;
+    const uvec4 ai4m4 = ai4 % 4;
+
+    a.r = buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r];
+    a.g = buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g];
+    a.b = buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b];
+    a.a = buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a];
 }

-afp k0;
-afp k1;
+afpvec4 b;
 if (transB == 0)
 {
-    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx;
-    k0 = buffer_ld1(B_blob_data, bi);
-    k1 = buffer_ld1(B_blob_data, bi + 1);
+    if (psc(N) % 4 == 0)
+    {
+        const uint bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) / 4 + k * (psc(N) / 4) + gx;
+        b = buffer_ld4(B_blob_data, bi);
+    }
+    else
+    {
+        const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + k * psc(N) + gx * 4 + uvec4(0, 1, 2, 3);
+
+        const uvec4 bi4d4 = bi4 / 4;
+        const uvec4 bi4m4 = bi4 % 4;
+
+        b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
+        b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
+        b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
+        b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
+    }
 }
 else
 {
-    const int bi = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx * psc(K) + k;
-    k0 = buffer_ld1(B_blob_data, bi);
-    k1 = buffer_ld1(B_blob_data, bi + psc(K));
+    const uvec4 gx4 = gx * 4 + uvec4(0, 1, 2, 3);
+    const uvec4 bi4 = (gz / psc(num_heads_per_group)) * psc(B_cstep) + gx4 * psc(K) + k;
+
+    const uvec4 bi4d4 = bi4 / 4;
+    const uvec4 bi4m4 = bi4 % 4;
+
+    b.r = buffer_ld4(B_blob_data, bi4d4.r)[bi4m4.r];
+    b.g = buffer_ld4(B_blob_data, bi4d4.g)[bi4m4.g];
+    b.b = buffer_ld4(B_blob_data, bi4d4.b)[bi4m4.b];
+    b.a = buffer_ld4(B_blob_data, bi4d4.a)[bi4m4.a];
 }
```
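The ai4 / 4, ai4 % 4 (and bi4 / 4, bi4 % 4) decomposition is what lets the loads stay on the vec4-typed buffer even when the four consecutive elements are not 16-byte aligned: each lane works out which packed vec4 it falls into and which component to extract, at the cost of up to four separate buffer_ld4 calls. The psc(N) % 4 == 0 branch is the aligned fast path where a single buffer_ld4 fetches all four columns at once.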
The loads from A_blob_data and B_blob_data use gy4 and gx * 4 + uvec4(0, 1, 2, 3) without per-lane bounds checks, so on tail tiles where psc(M) or psc(N) is not a multiple of 4, some lanes index rows/columns beyond the actual matrix dimensions. Because buffer_ld4 is a thin wrapper over buf[i] and the buffers are sized only for psc(M) * psc(K) and psc(K) * psc(N) elements respectively, this produces out-of-bounds storage-buffer reads that are then incorporated into the attention scores, potentially exposing or corrupting adjacent GPU memory. This pattern appears in both the main loop and the remainder loop; consider adding per-lane gy4/gx4 bounds checks (or adjusting the dispatch) before calling buffer_ld4/buffer_sm4 so no lane reads past the allocated ranges.
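A minimal sketch of one possible per-lane guard for the A loads, reusing the names from the excerpt above; the clamp-to-last-row strategy is an assumption and relies on the store side already discarding out-of-range rows:

```glsl
// Clamp each lane's row index so no lane addresses past psc(M) rows;
// out-of-range lanes simply re-read the last valid row, and their results
// are assumed to be masked out when the output tile is stored.
const uvec4 gy4 = min(gy * 4 + uvec4(0, 1, 2, 3), uvec4(psc(M) - 1));
const uvec4 ai4 = gz * psc(A_cstep) + gy4 * psc(K) + k;

const uvec4 ai4d4 = ai4 / 4;
const uvec4 ai4m4 = ai4 % 4;

a.r = buffer_ld4(A_blob_data, ai4d4.r)[ai4m4.r];
a.g = buffer_ld4(A_blob_data, ai4d4.g)[ai4m4.g];
a.b = buffer_ld4(A_blob_data, ai4d4.b)[ai4m4.b];
a.a = buffer_ld4(A_blob_data, ai4d4.a)[ai4m4.a];
```

The gx4-based B loads in both transB branches would need the same clamp against psc(N).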
[benchmark screenshots: zimage-ncnn-vulkan, 1024x1024]