Commit 808f087

Iwan Kawrakow authored

AVX2 Flash Attention (ggml-org#48)

* First version of AVX2 Flash attention

  I simply took the Zen4 implementation and converted platform specific stuff to methods of a struct providing data loading/storing, conversions, multiply, add, etc. Most likely not optimal, as the Zen4 strategy was designed around having 32 512-bit registers, so basically we can have 4X more data stored in vector registers compared to AVX2 with 16 x 256-bit. It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken via the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

  I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
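The "methods of a struct" approach described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in iqk_mul_mat.cpp: the names `ScalarOps`, `Avx2Ops`, `DefaultOps` and `dot` are hypothetical, and a plain dot product stands in for the much larger Flash Attention kernel. The AVX2 backend is only compiled when the compiler defines `__AVX2__`, so ARM builds fall back to the scalar backend.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical portable backend: one "register" holds a single float.
struct ScalarOps {
    using vec = float;
    static constexpr std::size_t width = 1;
    static vec zero() { return 0.0f; }
    static vec load(const float* p) { return *p; }
    static vec mul_add(vec a, vec b, vec c) { return a * b + c; }
    static float reduce_add(vec v) { return v; }
};

#if defined(__AVX2__)
// Hypothetical AVX2 backend: one register holds 8 floats (256 bits).
struct Avx2Ops {
    using vec = __m256;
    static constexpr std::size_t width = 8;
    static vec zero() { return _mm256_setzero_ps(); }
    static vec load(const float* p) { return _mm256_loadu_ps(p); }
    // Separate multiply and add, so only AVX is required here, not FMA.
    static vec mul_add(vec a, vec b, vec c) {
        return _mm256_add_ps(_mm256_mul_ps(a, b), c);
    }
    // Horizontal sum of the 8 lanes.
    static float reduce_add(vec v) {
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(v),
                              _mm256_extractf128_ps(v, 1));
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_movehdup_ps(s));
        return _mm_cvtss_f32(s);
    }
};
using DefaultOps = Avx2Ops;
#else
using DefaultOps = ScalarOps;
#endif

// The kernel is written once against the backend interface; the
// platform-specific loads and arithmetic live in the Ops struct.
template <typename Ops>
float dot(const float* q, const float* k, std::size_t n) {
    assert(n == 0 || (q != nullptr && k != nullptr));
    typename Ops::vec acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::width <= n; i += Ops::width) {
        acc = Ops::mul_add(Ops::load(q + i), Ops::load(k + i), acc);
    }
    float sum = Ops::reduce_add(acc);
    for (; i < n; ++i) sum += q[i] * k[i];  // scalar tail
    return sum;
}
```

Calling `dot<DefaultOps>(q, k, n)` picks whichever backend the build supports, while `dot<ScalarOps>(q, k, n)` gives a reference result to compare against. A design like this keeps the kernel logic shared, but, as the commit message notes, a blocking strategy tuned for 32 512-bit registers is not necessarily a good fit for 16 256-bit ones.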

1 parent 49cbbc9 commit 808f087

1 file changed

ggml/src/iqk
- iqk_mul_mat.cpp
