Open
Changes from 4 commits
Commits
98 commits
4f51705
Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps
EtelisIBM Apr 7, 2026
87e5072
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 20, 2026
2306690
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 20, 2026
156d180
swap_blocks_batch: make srcAccessOrder a per-call parameter
EtelisIBM Apr 20, 2026
7d5883f
swap_blocks_batch: address review comments
EtelisIBM Apr 23, 2026
d34e022
Merge remote-tracking branch 'upstream/main' into try-memcpy-access-o…
EtelisIBM Apr 23, 2026
a46f2e7
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 23, 2026
7c209d4
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 23, 2026
51b05d9
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 23, 2026
a23dea7
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 23, 2026
543af59
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 25, 2026
1acf1a0
Merge main into try-memcpy-access-order-any to retrigger CI
EtelisIBM Apr 25, 2026
12d59d8
Merge main into try-memcpy-access-order-any to retrigger CI
EtelisIBM Apr 25, 2026
bbaea08
Merge main into try-memcpy-access-order-any to retrigger CI
EtelisIBM Apr 26, 2026
8e4d45d
Retrigger CI (flaky distributed jobs)
EtelisIBM Apr 26, 2026
1243a88
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 26, 2026
0c968ea
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 27, 2026
68aa841
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 27, 2026
82cc29c
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 27, 2026
9b8221f
Merge branch 'main' into try-memcpy-access-order-any
orozery Apr 27, 2026
3289183
Merge branch 'main' into try-memcpy-access-order-any
orozery Apr 28, 2026
5d91dbc
Merge branch 'main' into try-memcpy-access-order-any
orozery Apr 28, 2026
533ec29
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 28, 2026
78c23e5
Merge remote-tracking branch 'upstream/main' into try-memcpy-access-o…
EtelisIBM Apr 28, 2026
1e7c07e
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 28, 2026
5255b65
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-main
EtelisIBM Apr 28, 2026
aae1340
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-main-r3
EtelisIBM Apr 28, 2026
5a9a397
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r6
EtelisIBM Apr 28, 2026
98f0f2c
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r11
EtelisIBM Apr 28, 2026
b40f274
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r13
EtelisIBM Apr 28, 2026
340be60
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r15
EtelisIBM Apr 29, 2026
23075d0
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r17
EtelisIBM Apr 29, 2026
f8f9d87
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r19
EtelisIBM Apr 29, 2026
ecc6e93
Merge remote-tracking branch 'upstream/main' into pr-39306-merge-r21
EtelisIBM Apr 29, 2026
a4022a3
Merge branch 'main' into try-memcpy-access-order-any
orozery Apr 29, 2026
d9a4249
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
0ea0a97
Merge remote-tracking branch 'upstream/main' into try-memcpy-access-o…
EtelisIBM Apr 29, 2026
d8e87be
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 29, 2026
ab9be2e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
c7d1e17
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
23f970e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
08035a6
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
eafd48f
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 29, 2026
4e59324
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
e19d7c9
Merge remote-tracking branch 'upstream/main' into try-memcpy-access-o…
EtelisIBM Apr 30, 2026
ae84b73
Merge branch 'main' into try-memcpy-access-order-any
Etelis Apr 30, 2026
173a6d0
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
f31219c
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
a82bb5d
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
38825af
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
92ebf4d
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
2975168
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM Apr 30, 2026
909291b
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
2caed15
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
cc65b53
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
06881d0
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
2a6d081
Merge branch 'main' into try-memcpy-access-order-any
Etelis May 1, 2026
394578c
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
813f0bf
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
fdaefe8
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
efd42bf
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
940801e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
6336b8e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
1046e5a
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
3a3b344
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
293964e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
eff804c
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
819d1a5
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
20c4ea6
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
bf35538
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
1ba1667
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
4f97664
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 1, 2026
7649aca
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
ee0c260
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
7b079c1
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
576374e
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
c8fa5b0
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
36af9ee
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
5e9239b
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 2, 2026
521d5d7
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
d12dcbe
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
99ebd51
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
a87c9b1
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
1b27b7a
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
c86ea93
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
5c22ce1
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 3, 2026
17f8343
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 4, 2026
a3689a4
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 4, 2026
b3766e7
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 4, 2026
3888561
Merge branch 'main' into try-memcpy-access-order-any
EtelisIBM May 4, 2026
4451912
Merge branch 'main' into try-memcpy-access-order-any
orozery May 4, 2026
011c298
Merge branch 'main' into try-memcpy-access-order-any
Etelis May 4, 2026
d6ff483
Merge branch 'main' into try-memcpy-access-order-any
orozery May 5, 2026
4084b0b
Merge branch 'main' into try-memcpy-access-order-any
Etelis May 5, 2026
b6040cf
Merge branch 'main' into try-memcpy-access-order-any
orozery May 5, 2026
719fbd3
Merge branch 'main' into try-memcpy-access-order-any
orozery May 5, 2026
587bea5
Merge branch 'main' into try-memcpy-access-order-any
Etelis May 5, 2026
1400acf
Merge branch 'main' into try-memcpy-access-order-any
Etelis May 5, 2026
2 changes: 1 addition & 1 deletion csrc/cache.h
@@ -12,7 +12,7 @@ void swap_blocks(torch::Tensor& src, torch::Tensor& dst,

 void swap_blocks_batch(const torch::Tensor& src_ptrs,
 const torch::Tensor& dst_ptrs,
-const torch::Tensor& sizes);
+const torch::Tensor& sizes, bool src_access_order_any);

 void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
 torch::Tensor& key_cache, torch::Tensor& value_cache,
11 changes: 9 additions & 2 deletions csrc/cache_kernels.cu
@@ -77,7 +77,7 @@ void swap_blocks(torch::Tensor& src, torch::Tensor& dst,

 void swap_blocks_batch(const torch::Tensor& src_ptrs,
 const torch::Tensor& dst_ptrs,
-const torch::Tensor& sizes) {
+const torch::Tensor& sizes, bool src_access_order_any) {
 TORCH_CHECK(src_ptrs.device().is_cpu(), "src_ptrs must be on CPU");
 TORCH_CHECK(dst_ptrs.device().is_cpu(), "dst_ptrs must be on CPU");
 TORCH_CHECK(sizes.device().is_cpu(), "sizes must be on CPU");
@@ -124,7 +124,14 @@ void swap_blocks_batch(const torch::Tensor& src_ptrs,

 if (batch_fn != nullptr) {
 CUmemcpyAttributes attr = {};
-attr.srcAccessOrder = CU_MEMCPY_SRC_ACCESS_ORDER_STREAM;
+// ANY lets the DMA engine prefetch source bytes out of stream order,
+// which is only safe when no GPU stream is concurrently writing the
+// source (e.g. CPU->GPU reads from host-owned pinned memory). For
+// GPU->CPU we must keep STREAM so source reads are gated by the
+// transfer stream's wait_stream(compute) / wait_event barriers.
Collaborator
The rest is specific to the offloading connector implementation.

Suggested change
-// source (e.g. CPU->GPU reads from host-owned pinned memory). For
-// GPU->CPU we must keep STREAM so source reads are gated by the
-// transfer stream's wait_stream(compute) / wait_event barriers.
+// source.

+attr.srcAccessOrder = src_access_order_any
+? CU_MEMCPY_SRC_ACCESS_ORDER_ANY
+: CU_MEMCPY_SRC_ACCESS_ORDER_STREAM;
 size_t attrs_idx = 0;
 size_t fail_idx = 0;
 CUresult result = batch_fn(reinterpret_cast<CUdeviceptr*>(dst_data),
3 changes: 2 additions & 1 deletion csrc/torch_bindings.cpp
@@ -535,7 +535,8 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
 // Batch swap: submit all block copies in a single driver call.
 cache_ops.def(
 "swap_blocks_batch(Tensor src_ptrs, Tensor dst_ptrs,"
-" Tensor sizes) -> ()");
+" Tensor sizes,"
+" bool src_access_order_any=False) -> ()");
 cache_ops.impl("swap_blocks_batch", torch::kCPU, &swap_blocks_batch);

 // Reshape the key and value tensors and cache them.
12 changes: 11 additions & 1 deletion vllm/_custom_ops.py
@@ -2782,6 +2782,7 @@ def swap_blocks_batch(
 src_ptrs: torch.Tensor,
 dst_ptrs: torch.Tensor,
 sizes: torch.Tensor,
+src_access_order_any: bool = False,
 ) -> None:
 """
 Batch version of swap_blocks: submit all copies in a single driver call.
@@ -2790,8 +2791,17 @@ def swap_blocks_batch(
 of sizes[i] bytes. All three tensors must be int64 CPU tensors.
 On CUDA 12.8+ this uses cuMemcpyBatchAsync for minimal submission
 overhead; on older CUDA it falls back to a loop of cudaMemcpyAsync.
+
+src_access_order_any: if True, pass CU_MEMCPY_SRC_ACCESS_ORDER_ANY to
+cuMemcpyBatchAsync, letting the DMA engine prefetch source bytes
+out of stream order. Only safe when no GPU stream is concurrently
+writing to the source (e.g. CPU->GPU, where the source is host
+pinned memory). Defaults to False (STREAM ordering), which is
Collaborator
the e.g. part is specific to the offloading connector implementation. Let's remove it.

+always safe.
 """
-torch.ops._C_cache_ops.swap_blocks_batch(src_ptrs, dst_ptrs, sizes)
+torch.ops._C_cache_ops.swap_blocks_batch(
+src_ptrs, dst_ptrs, sizes, src_access_order_any
Collaborator
src_access_order_any is a bit confusing.
Maybe rename it to is_src_access_order_any?

+)


 def convert_fp8(
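The docstring in this hunk describes three parallel arrays, where entry i gives a source address, a destination address, and a byte count. As a rough illustration only, the sketch below assembles such arrays from base addresses and block indices using plain Python integers. The helper `build_batch_lists`, the base addresses, and the block size are all hypothetical and not part of vLLM; in real use the values would be placed in int64 CPU tensors before calling `swap_blocks_batch`.

```python
from typing import List, Tuple


def build_batch_lists(
    src_base: int,
    dst_base: int,
    block_bytes: int,
    block_pairs: List[Tuple[int, int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Return (src_ptrs, dst_ptrs, sizes) for (src_block, dst_block) pairs.

    Hypothetical helper: assumes every block is `block_bytes` long and that
    blocks are laid out contiguously from each base address.
    """
    src_ptrs = [src_base + s * block_bytes for s, _ in block_pairs]
    dst_ptrs = [dst_base + d * block_bytes for _, d in block_pairs]
    sizes = [block_bytes] * len(block_pairs)
    return src_ptrs, dst_ptrs, sizes


# Made-up addresses: copy source block 0 -> dest block 2, source 3 -> dest 1.
src, dst, sizes = build_batch_lists(0x1000, 0x9000, 256, [(0, 2), (3, 1)])
print(src, dst, sizes)
```

Entry i of each list describes one copy, which matches the "copy i" wording of the docstring.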
14 changes: 13 additions & 1 deletion vllm/v1/kv_offload/worker/cpu_gpu.py
@@ -261,10 +261,22 @@ def transfer_async(self, job_id: int, transfer_spec: TransferSpec) -> bool:
 last_event = last_transfer.end_event
 # assure job will start only after the previous one completes
 stream.wait_event(last_event)
+# CPU->GPU reads from host pinned memory, which is never written
+# by a concurrent GPU stream, so CU_MEMCPY_SRC_ACCESS_ORDER_ANY is
+# safe and lets the driver pipeline source reads. GPU->CPU reads
+# from the live GPU KV cache, which the compute stream keeps
+# writing; we must keep STREAM ordering so source reads are gated
+# by the transfer stream's wait_stream(compute) barrier.
+src_access_order_any = not self.gpu_to_cpu
 with torch.cuda.stream(stream):
 start_event.record(stream)
 if total > 0:
-ops.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
+ops.swap_blocks_batch(
+batch_src,
+batch_dst,
+batch_sizes,
+src_access_order_any=src_access_order_any,
+)
 end_event.record(stream)

 self._transfer_events[job_id] = end_event
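The direction-dependent choice described in the cpu_gpu.py comment, combined with the ternary in cache_kernels.cu, can be modelled in a few lines. This is a sketch under stated assumptions: the function name `src_access_order_for` is hypothetical, and the string constants only stand in for the CUDA driver enum names; they are not the real enum values.

```python
# Placeholder strings mirroring the driver enum names used in the diff.
SRC_ACCESS_ORDER_STREAM = "CU_MEMCPY_SRC_ACCESS_ORDER_STREAM"
SRC_ACCESS_ORDER_ANY = "CU_MEMCPY_SRC_ACCESS_ORDER_ANY"


def src_access_order_for(gpu_to_cpu: bool) -> str:
    """Model the two-step choice made in this PR (hypothetical helper)."""
    # Python side (cpu_gpu.py): relax ordering only for CPU->GPU transfers,
    # where the source is host pinned memory.
    src_access_order_any = not gpu_to_cpu
    # C++ side (cache_kernels.cu): map the flag to the copy attribute.
    if src_access_order_any:
        return SRC_ACCESS_ORDER_ANY
    return SRC_ACCESS_ORDER_STREAM


for direction, gpu_to_cpu in (("CPU->GPU", False), ("GPU->CPU", True)):
    print(direction, src_access_order_for(gpu_to_cpu))
```

Keeping the flag a per-call parameter that defaults to False means any caller that does not opt in retains STREAM ordering.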