Flash Attention 2 support possible? #2257
Dampfinchen started this conversation in Ideas (1 comment, 2 replies)
-
Hello!
So Flash Attention 2 has just been released: https://github.com/Dao-AILab/flash-attention
Apparently, this can also be used to speed up inference and significantly decrease memory consumption for context.
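For reference, this is roughly what using it looks like from the Python side, going by the repo's README (the shapes and sizes below are just my illustration, loosely matching a 7B Llama-style model; a llama.cpp integration would of course call the CUDA kernels from C++ instead):

```python
# Minimal sketch of calling FlashAttention-2 from Python, per the repo README.
# Tensor layout is (batch, seqlen, nheads, headdim), fp16/bf16, on a CUDA device.
# The sizes here are assumptions roughly matching a 7B Llama-style model.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 4096, 32, 128

q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal=True applies the usual decoder (lower-triangular) mask.
# The output has the same (batch, seqlen, nheads, headdim) layout as q,
# and the full seqlen x seqlen score matrix is never materialized.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 4096, 32, 128])
```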
Less memory usage for ctx could definitely be useful for llama.cpp, as ctx does need a significant amount of memory regardless of whether you are using partial or full CUDA GPU offloading.
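To put some rough numbers on that (back-of-the-envelope only, assuming 32 heads and fp16 scores like a 7B Llama-style model): the seq x seq score matrix that standard attention materializes when processing a whole prompt at once grows quadratically with the context, while FlashAttention computes the same result block by block and never allocates it:

```python
# Back-of-the-envelope size of the (nheads, seqlen, seqlen) fp16 score matrix
# that standard attention materializes per layer when attending over a full prompt.
# FlashAttention computes the same output in blocks and never allocates this matrix.
# 32 heads / fp16 are assumptions roughly matching a 7B Llama-style model.
BYTES_FP16 = 2
N_HEADS = 32

def score_matrix_mib(seqlen: int) -> float:
    return N_HEADS * seqlen * seqlen * BYTES_FP16 / 2**20

for ctx in (2048, 4096, 8192):
    print(f"n_ctx={ctx:5d}: ~{score_matrix_mib(ctx):6.0f} MiB per layer")
# n_ctx= 2048: ~   256 MiB per layer
# n_ctx= 4096: ~  1024 MiB per layer
# n_ctx= 8192: ~  4096 MiB per layer
```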
Right now it only supports Ampere and newer GPUs (RTX 3000/4000 series, A100, H100), but support for Turing (RTX 2000 series) is coming soon.
I'm interested to hear your opinions about this. Seems like good stuff.
-
And with Llama 2 having a native 4k context, this comes at the perfect time!