Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps#39306

Open
Etelis wants to merge 98 commits intovllm-project:mainfrom
Etelis:try-memcpy-access-order-any

Conversation

@Etelis
Contributor

@Etelis Etelis commented Apr 8, 2026

Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY instead of CU_MEMCPY_SRC_ACCESS_ORDER_STREAM in the cuMemcpyBatchAsync call used for batched KV cache swap copies.

This relaxes the source access ordering constraint, allowing the CUDA driver to pipeline reads more aggressively. The safety of this change relies on the fact that source data is always fully written before the batch copy begins — the offloading handler synchronizes via stream events (stream.wait_event(last_event)) before issuing the copy.
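The safety argument above can be sketched with a plain-Python stand-in, where a threading.Event plays the role of the CUDA stream event (the names and structure here are illustrative, not vllm's actual offloading handler):

```python
# Toy model of the ordering guarantee: the writer fully populates the source
# and records an event; the copy waits on that event before it is issued, so
# any read order -- stream-ordered or relaxed -- sees a complete source.
import threading

src = bytearray(8)               # stands in for the source buffer
last_event = threading.Event()   # stands in for the recorded CUDA event

def producer():
    # "Compute stream": finish writing the source, then record the event.
    src[:] = b"\x2a" * 8
    last_event.set()

def batched_copy(dst):
    # "Transfer stream": wait_event(last_event) before issuing the copy,
    # so no concurrent writer remains when the source is read.
    last_event.wait()
    dst[:] = src

t = threading.Thread(target=producer)
dst = bytearray(8)
t.start()
batched_copy(dst)
t.join()
assert bytes(dst) == b"\x2a" * 8
```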

Motivated by this code review comment from @ivanium on PR #38460, who observed improved CPU->GPU bandwidth on Grace Blackwell nodes with ACCESS_ORDER_ANY.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the CUDA memory copy attributes in csrc/cache_kernels.cu by changing the source access order from CU_MEMCPY_SRC_ACCESS_ORDER_STREAM to CU_MEMCPY_SRC_ACCESS_ORDER_ANY. I have no further feedback to provide.

@ivanium
Contributor

ivanium commented Apr 8, 2026

Thanks for the work. Can you add some benchmark results?

@Etelis
Contributor Author

Etelis commented Apr 9, 2026

Thanks for the work. Can you add some benchmark results?

- NVIDIA GH200 480GB (Grace Hopper, NVLink-C2C, sm_90); driver 580.105.08, CUDA Toolkit 12.8
- Built vllm from source twice: once with STREAM, once with ANY in csrc/cache_kernels.cu
- Called vllm._custom_ops.swap_blocks_batch() directly with pinned host memory <-> device memory
- Timing: CUDA events (GPU-side), 200 iterations, 20 warmup, non-default stream
- Sweep: block counts 64-4096, block sizes 4KB-256KB, both CPU->GPU and GPU->CPU

GPU Copy Time (CUDA Events)

| Dir  | Blocks | BlkSize | STREAM med (ms) | ANY med (ms) | Delta |
|------|-------:|--------:|----------------:|-------------:|------:|
| c->g |     64 |     4KB |          0.0238 |       0.0229 | -3.8% |
| g->c |     64 |     4KB |          0.0229 |       0.0217 | -5.2% |
| c->g |    256 |     4KB |          0.0729 |       0.0713 | -2.2% |
| g->c |    256 |     4KB |          0.0684 |       0.0665 | -2.8% |
| c->g |   1024 |     4KB |          0.2760 |       0.2689 | -2.6% |
| g->c |   1024 |     4KB |          0.2569 |       0.2502 | -2.6% |
| c->g |   4096 |     4KB |          1.0701 |       1.0503 | -1.9% |
| g->c |   4096 |     4KB |          0.9783 |       0.9557 | -2.3% |
| c->g |     64 |    32KB |          0.0256 |       0.0246 | -3.9% |
| g->c |     64 |    32KB |          0.0237 |       0.0224 | -5.5% |
| c->g |    256 |    32KB |          0.0760 |       0.0736 | -3.2% |
| g->c |    256 |    32KB |          0.0706 |       0.0683 | -3.3% |
| c->g |     32 |   256KB |          0.0292 |       0.0293 | +0.3% |
| g->c |     32 |   256KB |          0.0310 |       0.0314 | +1.3% |
| c->g |     64 |   128KB |          0.0294 |       0.0294 |  0.0% |
| g->c |     64 |   128KB |          0.0313 |       0.0313 |  0.0% |

Host Submission Time (no sync)

| Dir  | Blocks | BlkSize | STREAM med (us) | ANY med (us) | Delta |
|------|-------:|--------:|----------------:|-------------:|------:|
| c->g |   1024 |     4KB |           270.2 |        263.1 | -2.6% |
| g->c |   1024 |     4KB |           251.3 |        244.8 | -2.6% |
| c->g |   4096 |     4KB |          1061.5 |       1043.5 | -1.7% |
| g->c |   4096 |     4KB |           971.9 |        948.6 | -2.4% |
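For reference, the medians above come from a loop of this shape (a hedged sketch; `bench` is a hypothetical helper with a host timer standing in for CUDA events, not part of vllm):

```python
# Warmup iterations are discarded, then the median of the timed samples is
# reported, matching the 200-iteration / 20-warmup methodology above.
import statistics
import time

def bench(fn, iters=200, warmup=20):
    """Run fn, discarding warmup runs; return the median sample time in seconds."""
    for _ in range(warmup):      # warmup runs are not recorded
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

median_s = bench(lambda: sum(range(1000)), iters=50, warmup=5)
```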

I couldn't get a Grace Blackwell; the closest I could get my hands on is a Grace Hopper.
@ivanium you might be able to test it yourself as well.

@Etelis
Contributor Author

Etelis commented Apr 9, 2026

cc @orozery , @mgoin

@orozery
Collaborator

orozery commented Apr 9, 2026

@Etelis I think we want to make this a parameter.
For GPU->CPU I believe we still want stream order.

@Etelis
Contributor Author

Etelis commented Apr 9, 2026

@Etelis I think we want to make this a parameter. For GPU->CPU I believe we still want stream order.

The only place we're using swap_blocks_batch today is in cpu_gpu.py, where we already have stream.wait_event(last_event) in place (for GPU->CPU).

So that would just be an extra precaution?

@orozery
Collaborator

orozery commented Apr 9, 2026

The only place we're using swap_blocks_batch today is in cpu_gpu.py, where we already have stream.wait_event(last_event) in place (for GPU->CPU).

So that would just be an extra precaution?

I'm not sure.
Can you then give me an example where ORDER_ANY behaves differently than ORDER_STREAM?

@Etelis
Contributor Author

Etelis commented Apr 9, 2026

The only place we're using swap_blocks_batch today is in cpu_gpu.py, where we already have stream.wait_event(last_event) in place (for GPU->CPU).
So that would just be an extra precaution?

I'm not sure. Can you then give me an example where ORDER_ANY behaves differently than ORDER_STREAM?

From CUDA docs:

If the source access order is set to CU_MEMCPY_SRC_ACCESS_ORDER_STREAM, then the source will be accessed in stream order. ... If the source access order is set to CU_MEMCPY_SRC_ACCESS_ORDER_ANY then it indicates that access to the source pointer can be out of stream order and the accesses can happen even after the API call returns.

So if I understand correctly, since we're creating a new stream for the offloading operations, CU_MEMCPY_SRC_ACCESS_ORDER_STREAM will wait on any operation on the copy stream prior to the DMA batch before it starts reading source memory.

CU_MEMCPY_SRC_ACCESS_ORDER_ANY skips that — it lets the DMA engine start reading immediately without checking for prior work on the stream. But since we already do stream.wait_stream(torch.cuda.current_stream()) on the compute stream before GPU->CPU copies, there shouldn't be any case where we're still writing to those GPU blocks. The copy stream itself never writes to the source buffers — it only runs DMA copies and sync barriers.

So in this case, CU_MEMCPY_SRC_ACCESS_ORDER_STREAM is just an extra precaution.
Am I missing something?

@orozery
Collaborator

orozery commented Apr 9, 2026

What if it reads at call time, and writes at "stream time"?
I think this is the difference between STREAM and ANY:
both guarantee the write happens in stream order, but with ANY the read can happen out of stream order.

@Etelis
Contributor Author

Etelis commented Apr 9, 2026

What if it reads at call time, and writes at "stream time"? I think this is the difference between STREAM and ANY: both guarantee the write happens in stream order, but with ANY the read can happen out of stream order.

I see what you mean.
I'm speculating here, but I think that ordering overhead is where the 2-5% improvement on small blocks comes from:
the driver likely takes a lighter codepath when it knows it doesn't need to enforce source-read ordering at all.

@mergify
Contributor

mergify Bot commented Apr 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Etelis.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 11, 2026
Relax source access ordering in cuMemcpyBatchAsync from STREAM to ANY.
The source data is always fully written before copies start (ensured by
stream event synchronization), so strict stream ordering is redundant.
ANY gives the driver freedom to pipeline reads more aggressively, which
may improve CPU<->GPU bandwidth on NVLink-C2C interconnects (Grace
Hopper/Blackwell).

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the try-memcpy-access-order-any branch from 1df9e61 to 4f51705 Compare April 12, 2026 16:48
@mergify mergify Bot removed the needs-rebase label Apr 12, 2026
@Etelis
Contributor Author

Etelis commented Apr 20, 2026

@orozery you were right, thanks for the pushback.

The subtlety is that CU_MEMCPY_SRC_ACCESS_ORDER_ANY only relaxes the source-read ordering — the destination write is still stream-ordered. So our stream.wait_stream(compute) and stream.wait_event(last_event) gate when the copy's write commits, but under ANY the DMA engine is free to prefetch the source bytes before those barriers fire. For GPU→CPU that means the DMA can start reading the KV cache before compute has finished writing it.

CPU→GPU doesn't have this problem.
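A toy, purely illustrative model of that hazard (no CUDA involved; "submission time" vs "stream time" are modeled explicitly, and all names here are hypothetical):

```python
# Under ANY, the DMA may snapshot the source as soon as the copy is
# *submitted*; under STREAM, the read happens only when the copy *executes*,
# i.e. after the wait_stream(compute) barrier has fired.
def run(access_order):
    kv = ["old"]          # live GPU KV cache (source for a GPU->CPU copy)
    timeline = []         # ops replayed later, in "stream time" order
    snapshot = None

    # Submission time: host enqueues the barrier and the copy.
    if access_order == "ANY":
        snapshot = list(kv)           # prefetch allowed at/after the API call
    timeline.append("compute_write")  # compute stream is still writing...
    timeline.append("barrier")        # ...then the transfer stream's wait fires

    # Stream time: replay the timeline, then perform the gated read.
    for op in timeline:
        if op == "compute_write":
            kv[0] = "new"
    if access_order == "STREAM":
        snapshot = list(kv)           # read stays behind the barrier
    return snapshot[0]

assert run("STREAM") == "new"   # sees the finished write
assert run("ANY") == "old"      # may observe stale data: the hazard
```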

GPU->CPU source is the live GPU KV cache, which the compute stream
keeps writing; ANY would let the DMA prefetch source bytes before
wait_stream(compute) fires. Keep STREAM there; only the CPU->GPU
handler (host pinned source) opts into ANY.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
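The resulting per-direction policy can be sketched as a small predicate (a hypothetical helper; the real change threads a boolean flag through the op instead):

```python
# Hypothetical sketch of the policy from the commit above: only CPU->GPU
# (host pinned source, fully written before submission) opts into ANY;
# GPU->CPU keeps STREAM so source reads stay behind the stream barriers.
def src_order_any_for(direction: str) -> bool:
    if direction == "h2d":   # CPU->GPU: no concurrent writer of the source
        return True
    if direction == "d2h":   # GPU->CPU: source is the live KV cache
        return False
    raise ValueError(f"unknown direction: {direction!r}")

assert src_order_any_for("h2d") is True
assert src_order_any_for("d2h") is False
```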
@Etelis Etelis requested review from ApostaC and orozery as code owners April 20, 2026 15:33
@mergify mergify Bot added the v1 label Apr 20, 2026
Comment thread csrc/cache_kernels.cu Outdated
Comment on lines +129 to +131
// source (e.g. CPU->GPU reads from host-owned pinned memory). For
// GPU->CPU we must keep STREAM so source reads are gated by the
// transfer stream's wait_stream(compute) / wait_event barriers.
Collaborator

The rest is specific to the offloading connector implementation.

Suggested change
// source (e.g. CPU->GPU reads from host-owned pinned memory). For
// GPU->CPU we must keep STREAM so source reads are gated by the
// transfer stream's wait_stream(compute) / wait_event barriers.
// source.

Comment thread vllm/_custom_ops.py Outdated
Comment on lines +2798 to +2799
writing to the source (e.g. CPU->GPU, where the source is host
pinned memory). Defaults to False (STREAM ordering), which is
Collaborator

the e.g. part is specific to the offloading connector implementation. Let's remove it.

Comment thread vllm/_custom_ops.py Outdated
"""
torch.ops._C_cache_ops.swap_blocks_batch(src_ptrs, dst_ptrs, sizes)
torch.ops._C_cache_ops.swap_blocks_batch(
src_ptrs, dst_ptrs, sizes, src_access_order_any
Collaborator

src_access_order_any is a bit confusing.
Maybe rename it to is_src_access_order_any?

Rename src_access_order_any -> is_src_access_order_any and keep the
op-level comment/docstring generic; the offloader-specific rationale
stays at the cpu_gpu.py call site where it belongs.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mergify
Contributor

mergify Bot commented Apr 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Etelis.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 23, 2026
@Etelis
Contributor Author

Etelis commented Apr 23, 2026

@orozery Addressed all three: the op-level comment and docstring are now generic, is_src_access_order_any is used across all layers, and the offloader-specific rationale stays at the cpu_gpu.py call site where it belongs.

…rder-any

# Conflicts:
#	vllm/v1/kv_offload/worker/cpu_gpu.py

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mergify mergify Bot removed the needs-rebase label Apr 23, 2026
Collaborator

@orozery orozery left a comment

Thanks @Etelis !

EtelisIBM added 21 commits May 1, 2026 23:06
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mergify
Contributor

mergify Bot commented May 4, 2026

Hi @Etelis, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
