[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync #38460
```diff
@@ -680,6 +680,12 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
       " int block_size_in_bytes, Tensor block_mapping) -> ()");
   cache_ops.impl("swap_blocks", torch::kCUDA, &swap_blocks);
 
+  // Batch swap: submit all block copies in a single driver call.
+  cache_ops.def(
+      "swap_blocks_batch(Tensor src_ptrs, Tensor dst_ptrs,"
+      " Tensor sizes) -> ()");
+  cache_ops.impl("swap_blocks_batch", torch::kCPU, &swap_blocks_batch);
+
```
Contributor

The `swap_blocks_batch` impl is registered with `torch::kCPU`. Shouldn't this be `torch::kCUDA`, matching the `swap_blocks` registration above?

Contributor (Author)

The input tensors (`src_ptrs`, `dst_ptrs`, `sizes`) are CPU tensors: numpy arrays of raw pointers/sizes converted via `torch.from_numpy()`. PyTorch dispatches based on the input tensor device, so `kCPU` is correct here. The existing `swap_blocks` uses `kCUDA` because its inputs are the actual GPU KV cache tensors. Registering with `kCUDA` would actually break dispatch, since no input tensor lives on the GPU.
```diff
   // Reshape the key and value tensors and cache them.
   cache_ops.def(
       "reshape_and_cache(Tensor key, Tensor value,"
```
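To make the dispatch argument above concrete, here is a minimal sketch (not this PR's actual implementation) of what a `swap_blocks_batch` body registered under `torch::kCPU` could look like. The op receives int64 CPU tensors of raw device pointers and byte counts, so the dispatcher keys on CPU inputs even though every copy touches GPU memory. The function body, error handling, and single-attribute setup below are illustrative assumptions.

```cpp
#include <c10/cuda/CUDAStream.h>
#include <cuda.h>
#include <torch/all.h>

// Hypothetical impl matching the schema
//   swap_blocks_batch(Tensor src_ptrs, Tensor dst_ptrs, Tensor sizes) -> ()
// All three inputs are int64 CPU tensors, so PyTorch selects the kCPU
// registration; the pointer values themselves address GPU (or pinned host)
// memory.
void swap_blocks_batch(torch::Tensor& src_ptrs, torch::Tensor& dst_ptrs,
                       torch::Tensor& sizes) {
  TORCH_CHECK(src_ptrs.device().is_cpu() && dst_ptrs.device().is_cpu() &&
                  sizes.device().is_cpu(),
              "pointer/size tensors must live on CPU");
  const size_t count = static_cast<size_t>(src_ptrs.numel());

  // Reinterpret the int64 arrays as the driver API's pointer/size types.
  auto* srcs = reinterpret_cast<CUdeviceptr*>(src_ptrs.data_ptr<int64_t>());
  auto* dsts = reinterpret_cast<CUdeviceptr*>(dst_ptrs.data_ptr<int64_t>());
  auto* nbytes = reinterpret_cast<size_t*>(sizes.data_ptr<int64_t>());

  // One attribute record covering every copy in the batch: plain stream
  // ordering, i.e. sources are read in stream order like cudaMemcpyAsync.
  CUmemcpyAttributes attrs = {};
  attrs.srcAccessOrder = CU_MEMCPY_SRC_ACCESS_ORDER_STREAM;
  size_t attr_idxs = 0;  // attribute 0 applies starting at copy 0
  size_t fail_idx = 0;   // set by the driver to the first failing copy

  // Submit all copies in a single driver call (CUDA 12.8+).
  const CUresult res = cuMemcpyBatchAsync(
      dsts, srcs, nbytes, count, &attrs, &attr_idxs, /*numAttrs=*/1,
      &fail_idx, at::cuda::getCurrentCUDAStream().stream());
  TORCH_CHECK(res == CUDA_SUCCESS, "cuMemcpyBatchAsync failed at copy ",
              fail_idx);
}
```

A caller would then build `src_ptrs`/`dst_ptrs`/`sizes` with numpy and `torch.from_numpy()`, as described above, and invoke the op through the `_cache_ops` namespace.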
Minor comment: curious, have you tried [CU_MEMCPY_SRC_ACCESS_ORDER_ANY](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g6f1ff58e3065df3eb4b573dba77ad31f)? I found it gives better CPU->GPU bandwidth on Grace Blackwell nodes.
I see you also applied this parameter to GPU srcs. According to the documentation this means access to the srcs can happen out of stream order, so potentially without waiting for the compute (default) stream to complete? @Etelis Anyhow, for CPU->GPU this seems safe. Let's test it in a follow-up.

Thanks @ivanium for this suggestion!
Right, it won't wait for previous ops in the stream. Since we typically call this API on a separate copy stream, I guess we cannot rely on this AccessOrder param anyway. If we want to stay safe as a general-purpose API, maybe we can expose a configurable param to users.
I'll test it as a follow-up.
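For that follow-up, a small sketch of the configurable knob floated above; the helper name and plumbing are assumptions, not part of this PR. Per the driver docs linked in the thread, `CU_MEMCPY_SRC_ACCESS_ORDER_ANY` lets the driver read sources without waiting for prior stream work, so it is only safe when nothing in flight can still write them, e.g. fully populated pinned host staging buffers.

```cpp
#include <cuda.h>

// Hypothetical configuration point for the batched copy path. Defaulting to
// STREAM keeps normal stream ordering; callers that guarantee their source
// buffers are quiescent (typical for CPU->GPU swap-in from pinned memory)
// could opt into ANY for the higher bandwidth reported on Grace Blackwell.
CUmemcpyAttributes make_batch_copy_attrs(bool relaxed_src_order) {
  CUmemcpyAttributes attrs = {};
  attrs.srcAccessOrder = relaxed_src_order
                             ? CU_MEMCPY_SRC_ACCESS_ORDER_ANY
                             : CU_MEMCPY_SRC_ACCESS_ORDER_STREAM;
  return attrs;
}
```

Keeping STREAM as the default means the relaxed ordering stays strictly opt-in, which matches the "stay safe as a general-purpose API" point above.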