Commit 808f087

ikawrakow and Iwan Kawrakow authored
AVX2 Flash Attention (ggml-org#48)
* First version of AVX2 Flash attention

  I simply took the Zen4 implementation and converted the platform-specific
  stuff to methods of a struct providing data loading/storing, conversions,
  multiply, add, etc. Most likely not optimal, as the Zen4 strategy was
  designed around having 32 512-bit registers, so basically we can keep 4X
  more data in vector registers compared to AVX2 with 16 x 256-bit.
  It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken via the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

  I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
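To illustrate the approach described in the commit message, here is a minimal sketch of hiding platform-specific SIMD operations behind the methods of a struct, so the same templated kernel compiles for 512-bit and 256-bit targets. It is not the code from this commit: `VecOps`, `dot`, and `register_file_bytes` are hypothetical names chosen for illustration, a simple dot product stands in for the attention kernel, and a scalar fallback is added so the sketch builds on any target, including `__aarch64__`.

```cpp
#include <cassert>
#include <cstddef>

// Register file size in bytes for `count` registers of `bits` bits each.
constexpr std::size_t register_file_bytes(std::size_t count, std::size_t bits) {
    return count * bits / 8;
}
// 32 x 512-bit holds 4x the data of 16 x 256-bit (2048 vs 512 bytes).
static_assert(register_file_bytes(32, 512) == 4 * register_file_bytes(16, 256),
              "register file ratio");

// Hypothetical abstraction: each backend exposes the same static methods.
// The x86 backends are guarded against __aarch64__.
#if defined(__AVX512F__) && !defined(__aarch64__)
#include <immintrin.h>
struct VecOps {
    using Vec = __m512;
    static constexpr int kLanes = 16;  // floats per 512-bit register
    static Vec zero() { return _mm512_setzero_ps(); }
    static Vec load(const float* p) { return _mm512_loadu_ps(p); }
    static void store(float* p, Vec v) { _mm512_storeu_ps(p, v); }
    static Vec fmadd(Vec a, Vec b, Vec c) { return _mm512_fmadd_ps(a, b, c); }
};
#elif defined(__AVX2__) && defined(__FMA__) && !defined(__aarch64__)
#include <immintrin.h>
struct VecOps {
    using Vec = __m256;
    static constexpr int kLanes = 8;   // floats per 256-bit register
    static Vec zero() { return _mm256_setzero_ps(); }
    static Vec load(const float* p) { return _mm256_loadu_ps(p); }
    static void store(float* p, Vec v) { _mm256_storeu_ps(p, v); }
    static Vec fmadd(Vec a, Vec b, Vec c) { return _mm256_fmadd_ps(a, b, c); }
};
#else
// Scalar fallback (assumption for portability; not part of the commit).
struct VecOps {
    using Vec = float;
    static constexpr int kLanes = 1;
    static Vec zero() { return 0.0f; }
    static Vec load(const float* p) { return *p; }
    static void store(float* p, Vec v) { *p = v; }
    static Vec fmadd(Vec a, Vec b, Vec c) { return a * b + c; }
};
#endif

// Kernel written once against the abstraction; the backend is a template
// parameter, so no platform-specific intrinsics appear here.
template <typename Ops>
float dot(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    typename Ops::Vec acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::kLanes <= n; i += Ops::kLanes) {
        acc = Ops::fmadd(Ops::load(a + i), Ops::load(b + i), acc);
    }
    // Horizontal sum of the accumulator lanes.
    float lanes[Ops::kLanes];
    Ops::store(lanes, acc);
    float sum = 0.0f;
    for (int l = 0; l < Ops::kLanes; ++l) sum += lanes[l];
    // Scalar tail for the elements that do not fill a whole register.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

Calling `dot<VecOps>(a, b, n)` selects whichever backend the compiler flags enable. A kernel tuned for the larger register file may still spill on AVX2 even though it compiles unchanged, which is consistent with the modest speedup reported above.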
1 parent 49cbbc9 commit 808f087

1 file changed

Lines changed: 165 additions & 111 deletions
