Add Flash Decoding #1151
Comments
+1! It looks like this might be included in FlashAttention v2.2. It's not clear from the blog whether any inference code needs to be changed to see the benefits of this.
+1
It's not clear whether it is superior to PagedAttention: all the tests I saw were against native Transformers, which we know is not optimised.
@OlivierDehaene The PagedAttention announcement states that PagedAttention V2 implements a similar idea to boost performance when the batch size or the number of attention heads per GPU is small.
See #1183 instead.
Feature request
See https://pytorch.org/blog/flash-decoding/#:~:text=Flash%2DDecoding%20works%20in%203,exp%20of%20the%20attention%20values.
Motivation
Flash-Decoding further improves the attention mechanism over FlashAttention V2 for long-context inference.
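For context, here is a minimal NumPy sketch (my own illustration, not TGI or FlashAttention code) of the split-and-reduce idea described in the linked blog post: the KV cache is split along the sequence dimension, each chunk's attention output and log-sum-exp are computed independently (in parallel on GPU in the real kernel), and the partial results are combined into the exact attention output. All names are illustrative.

```python
import numpy as np

def flash_decoding_reference(q, K, V, num_chunks=4):
    """Non-fused reference of the Flash-Decoding split-KV reduction.

    q: (d,) single decode-step query
    K, V: (seq_len, d) cached keys and values
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    partial_out, partial_lse = [], []
    # Step 1: split the KV cache along the sequence dimension.
    for Kc, Vc in zip(np.array_split(K, num_chunks), np.array_split(V, num_chunks)):
        # Step 2: per-chunk softmax attention plus its log-sum-exp.
        scores = (Kc @ q) * scale
        m = scores.max()
        p = np.exp(scores - m)
        z = p.sum()
        partial_out.append((p @ Vc) / z)       # chunk-local attention output
        partial_lse.append(m + np.log(z))      # chunk log-sum-exp of scores
    # Step 3: combine chunks by softmax-weighting with their log-sum-exps.
    lse = np.array(partial_lse)
    w = np.exp(lse - lse.max())
    w /= w.sum()
    return np.einsum("c,cd->d", w, np.stack(partial_out))

# Sanity check against naive full-softmax attention.
rng = np.random.default_rng(0)
d, seq_len = 64, 1024
q = rng.standard_normal(d)
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))
scores = (K @ q) / np.sqrt(d)
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ V
assert np.allclose(flash_decoding_reference(q, K, V), ref)
```

The real kernels fuse these steps, but the reduction math is the same, which is why long-context decode can be parallelized across the sequence even at small batch sizes.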
Your contribution
None