Paged attention #425
base: main
Conversation
Co-authored-by: Jiong Gong <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/425
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
torchao/kv_cache.py
Outdated
HANDLED_FUNCTIONS = {}


class PagedTensor(object):
Shall we make it a tensor subclass and inherit torch.Tensor here?
Done
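For reference, a minimal sketch of the tensor-subclass pattern being discussed, using the __torch_function__ protocol with the HANDLED_FUNCTIONS registry shown above; the constructor fields mirror the cache/block_tables/context_lens arguments that appear later in this review, but the code is an illustration, not the PR's implementation:

import torch

HANDLED_FUNCTIONS = {}

class PagedTensor(torch.Tensor):
    """Sketch: a wrapper tensor subclass holding paged KV-cache metadata."""

    @staticmethod
    def __new__(cls, cache, block_tables, context_lens):
        # Wrapper subclass: carries metadata, no storage of its own.
        return torch.Tensor._make_wrapper_subclass(
            cls, cache.shape, dtype=cache.dtype, device=cache.device
        )

    def __init__(self, cache, block_tables, context_lens):
        self.cache = cache
        self.block_tables = block_tables
        self.context_lens = context_lens

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Route ops with a registered paged implementation; fall back otherwise.
        if func in HANDLED_FUNCTIONS:
            return HANDLED_FUNCTIONS[func](*args, **kwargs)
        return super().__torch_function__(func, types, args, kwargs)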
torchao/kv_cache.py
Outdated
self,
cache: torch.Tensor,  # The cache tensor from the PagedAttentionCache object, which is shared across iterations.
block_tables: torch.Tensor,  # The block tables for each sequence in the batch, used to map logical blocks to physical blocks.
context_lens: torch.Tensor,  # The context lengths for each sequence in the batch.
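To illustrate the logical-to-physical mapping described in the block_tables comment above, a sketch with an assumed block size of 16 and a hypothetical helper name; none of this is taken from the PR:

import torch

BLOCK_SIZE = 16  # assumed number of tokens stored per physical block

def lookup_physical_slot(block_tables: torch.Tensor, seq_idx: int, token_pos: int):
    """Map a logical token position in sequence seq_idx to (physical_block, offset).

    block_tables has shape (batch, max_blocks_per_seq); row i lists the
    physical block ids allocated to sequence i, in logical order.
    """
    logical_block = token_pos // BLOCK_SIZE
    offset = token_pos % BLOCK_SIZE
    physical_block = int(block_tables[seq_idx, logical_block])
    return physical_block, offset

# Example: sequence 0 owns physical blocks [7, 2]; token 20 falls in
# logical block 1 -> physical block 2, at offset 4 within that block.
tables = torch.tensor([[7, 2]])
assert lookup_physical_slot(tables, 0, 20) == (2, 4)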
Context length is a concept from text generation; it seems not generic enough to describe a tensor?
We have used `size`, which represents the real cache tensor size (bs, num_key_value_heads, seq_lens, head_dim) like the dynamic cache, to replace `context_lens`.
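For comparison, a sketch of reporting that size in the DynamicCache-like convention; the class and attribute names below are assumptions, not the PR's code:

import torch

class PagedAttentionCacheSketch:
    """Illustrative only: attribute names are assumptions."""

    def __init__(self, batch_size, num_key_value_heads, seq_lens, head_dim):
        self.batch_size = batch_size
        self.num_key_value_heads = num_key_value_heads
        self.seq_lens = seq_lens
        self.head_dim = head_dim

    def size(self) -> torch.Size:
        # Logical cache size, where seq_lens counts valid cached tokens,
        # not the length of the physical paged buffer.
        return torch.Size((self.batch_size, self.num_key_value_heads,
                           self.seq_lens, self.head_dim))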
"""Returns the maximum sequence length of the cached states. PagedAttentionCache does not have a maximum length.""" | ||
RuntimeError("PagedAttentionCache does not have a maximum sequence length.") | ||
|
||
def update( |
This function will be called inside the model forward. With a Python implementation with conditionals and loops here, would it work with torch.compile?
Good question. We need more effort to integrate this PR with Hugging Face to validate the end-to-end functionality with torch.compile. I suggest we review this PR in parallel; I will refine it if needed.
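To make the torch.compile concern concrete, a sketch (shapes and names assumed, not taken from the PR) of an update step that scatters one new token per sequence into its physical block; the per-sequence Python loop and data-dependent int(...) lookups are the kind of code that typically triggers graph breaks under torch.compile unless rewritten with batched tensor indexing:

import torch

def update_key_cache(key_cache, block_tables, context_lens, new_keys, block_size=16):
    """Sketch: append one new token's key states per sequence.

    key_cache:    (num_blocks, num_heads, block_size, head_dim) shared pool
    block_tables: (batch, max_blocks_per_seq) physical block ids per sequence
    context_lens: (batch,) number of tokens already cached per sequence
    new_keys:     (batch, num_heads, head_dim) keys for the new token
    """
    for seq_idx in range(new_keys.shape[0]):
        pos = int(context_lens[seq_idx])                  # data-dependent scalar read
        block = int(block_tables[seq_idx, pos // block_size])
        key_cache[block, :, pos % block_size] = new_keys[seq_idx]
    context_lens += 1  # all sequences advance by one token
    return key_cache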
cc @liangan1 @HDCharles what's the status of this PR - do we need additional work to land?
kv_cache.py should probably be in a prototype or experimental folder, or at the very least in kernel or something similar, rather than at the top level, which is reserved and kept clean.
A further question is what the goal of this PR is as far as usage. It looks like it only enables this for CPU, but 90% of our techniques are CUDA based; is CUDA support intended as a next step? I think it's fine to add a CPU-only technique, but I just want clarity on the plan and who we expect to use it.
In other comments you mention wanting to integrate Hugging Face for e2e tests, but you can do that directly in torchao; see torchao/_models/llama, where a bunch of techniques are being tested and benchmarked. Generally the first step towards getting someone to use a technique like this would be a benchmark demo of how it helps; without that, it's a large burden on the user to figure out what it's for.
Finally, there's no information on how this is supposed to be used, just code enabling a feature. Something like an e2e demo in the llama benchmark would make it easier for someone to understand how to use it, but there should probably also be a .md file explaining what this does, what the intended use case is, and the basic API a user is expected to apply. The RFC link is useful in the PR but not to a random user stumbling across this kernel who most likely isn't going to check the PR notes.
If you want to move this into an experimental/prototype folder, I think that's the minimum needed to land this, though it would be good to understand the path this is expected to take towards usage, since I think this feature is interesting but I don't know who is going to use it given that almost everything in the repo is CUDA related.
Related RFC