Flash Attention 2 support possible? #2257
Dampfinchen started this conversation in Ideas (1 comment, 2 replies)
-
Hello!
So Flash Attention 2 has just been released: https://github.com/Dao-AILab/flash-attention
Apparently, this can also be used to speed up inference and significantly decrease memory consumption for context.
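For reference, this is roughly what using it looks like from the Python side, going by the repo's README (the shapes and sizes below are just my illustration, loosely matching a 7B Llama-style model; a llama.cpp integration would of course call the CUDA kernels from C++ instead):

```python
# Minimal sketch of calling FlashAttention-2 from Python, per the repo README.
# Tensor layout is (batch, seqlen, nheads, headdim), fp16/bf16, on a CUDA device.
# The sizes here are assumptions roughly matching a 7B Llama-style model.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 4096, 32, 128

q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal=True applies the usual decoder (lower-triangular) mask.
# The output has the same (batch, seqlen, nheads, headdim) layout as q,
# and the full seqlen x seqlen score matrix is never materialized.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 4096, 32, 128])
```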
Less memory usage for ctx could definitely be useful for llama.cpp, as ctx does need a significant amount of memory regardless of whether you are using partial or full CUDA GPU offloading.
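To put some rough numbers on that (back-of-the-envelope only, assuming 32 heads and fp16 scores like a 7B Llama-style model): the seq x seq score matrix that standard attention materializes when processing a whole prompt at once grows quadratically with the context, while FlashAttention computes the same result block by block and never allocates it:

```python
# Back-of-the-envelope size of the (nheads, seqlen, seqlen) fp16 score matrix
# that standard attention materializes per layer when attending over a full prompt.
# FlashAttention computes the same output in blocks and never allocates this matrix.
# 32 heads / fp16 are assumptions roughly matching a 7B Llama-style model.
BYTES_FP16 = 2
N_HEADS = 32

def score_matrix_mib(seqlen: int) -> float:
    return N_HEADS * seqlen * seqlen * BYTES_FP16 / 2**20

for ctx in (2048, 4096, 8192):
    print(f"n_ctx={ctx:5d}: ~{score_matrix_mib(ctx):6.0f} MiB per layer")
# n_ctx= 2048: ~   256 MiB per layer
# n_ctx= 4096: ~  1024 MiB per layer
# n_ctx= 8192: ~  4096 MiB per layer
```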
Right now it only supports Ampere and newer GPUs (RTX 3000/4000 series, A100, H100), but support for Turing (RTX 2000 series) is coming soon.
I'm interested to hear your opinions about this. Seems like good stuff.
-
And with Llama 2 having a native 4k context, this comes at the perfect time!