Commit dc1d2ad

vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
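
For readers unfamiliar with the technique named in the commit message, here is a minimal Python sketch of the online-softmax accumulation that flash attention is generally built on. It is not the Vulkan shader from this commit. The function names (`scalar_flash_attention`, `reference_attention`) are hypothetical, and the sketch handles a single query row in plain floats with no quantized KV, tiling, or row-count selection.

```python
import math


def scalar_flash_attention(q, keys, values, scale=None):
    """Attention output for one query row, streaming over K/V in a single pass.

    Illustrative sketch only: hypothetical helper, not code from the commit.
    """
    if scale is None:
        scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")      # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    out = [0.0] * len(values[0])     # unnormalized output accumulator
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, s)
        # Rescale earlier contributions to the new maximum.
        # On the first iteration this is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        running_sum = running_sum * correction + p
        out = [o * correction + p * vi for o, vi in zip(out, v)]
        running_max = new_max
    return [o / running_sum for o in out]


def reference_attention(q, keys, values, scale=None):
    """Two-pass softmax attention, used only to check the streaming version."""
    if scale is None:
        scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]
```

The streaming version never materializes the full score vector, which is the property that makes the approach attractive on hardware with limited registers or shared memory. Comparing it against `reference_attention` on small inputs is a quick way to confirm that the rescaling step is correct.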
1 parent 7c28a74 commit dc1d2ad

4 files changed: +646 -94 lines changed

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion
@@ -307,7 +307,7 @@ jobs:
         run: |
           cd build
           # This is using llvmpipe and runs slower than other backends
-          ctest -L main --verbose --timeout 2700
+          ctest -L main --verbose --timeout 3600
 
   ubuntu-22-cmake-hip:
     runs-on: ubuntu-22.04
