Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps #39306

Etelis wants to merge 98 commits into vllm-project:main from
Conversation
Thanks for the work. Can you add some benchmark results?
Benchmarked on an NVIDIA GH200 480GB (Grace Hopper, NVLink-C2C, sm_90). Built vLLM from source twice, once per access-order setting, and timed with CUDA events (GPU-side): 200 iterations, 20 warmup, on a non-default stream. Reported metrics: GPU copy time (CUDA events) and host submission time (no sync).
I didn't get a Grace Blackwell; the closest I can put my hands on is a Grace Hopper.
@Etelis I think we want to make this a parameter.
The only place we're using it is … So that would be as an extra precautionary measure?
I'm not sure.
From the CUDA docs: with STREAM ordering, the source may only be read in stream order; with ANY, the driver may read the source out of stream order, at any time.

So if I understand correctly, since we're creating a new stream for the offloading operations, …

So in this case, …
What if it reads at call time, and writes at "stream time"?
I see what you mean.
This pull request has merge conflicts that must be resolved before it can be merged.
Relax source access ordering in cuMemcpyBatchAsync from STREAM to ANY. The source data is always fully written before copies start (ensured by stream event synchronization), so strict stream ordering is redundant. ANY gives the driver freedom to pipeline reads more aggressively, which may improve CPU<->GPU bandwidth on NVLink-C2C interconnects (Grace Hopper/Blackwell).

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
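The invariant this commit relies on can be illustrated with a plain-Python analogue. This is a sketch only: `threading.Event` stands in for the CUDA event recorded after the last source write, and none of the names below are vLLM code.

```python
import threading

src_ready = threading.Event()  # stands in for the CUDA event recorded
                               # after the final write to the source

def produce(src):
    # All writes to the source buffer finish before the event is set,
    # mirroring "record event after the last KV cache write".
    src[:] = [1, 2, 3]
    src_ready.set()

def batch_copy(src):
    # Analogue of stream.wait_event(last_event): the copy may not read
    # the source until the event fires. Given this barrier, relaxing
    # *how* the driver orders its reads afterwards (STREAM -> ANY)
    # can never observe a partially written source.
    src_ready.wait()
    return list(src)

src = [0, 0, 0]
writer = threading.Thread(target=produce, args=(src,))
writer.start()
copied = batch_copy(src)
writer.join()
```

The point of the analogue: safety comes from the happens-before edge established by the event, not from stream ordering of the reads themselves.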
Force-pushed 1df9e61 to 4f51705
@orozery you were right, thanks for the pushback. The subtlety is that CPU→GPU doesn't have this problem.
The GPU->CPU source is the live GPU KV cache, which the compute stream keeps writing; ANY would let the DMA engine prefetch source bytes before wait_stream(compute) fires. Keep STREAM there; only the CPU->GPU handler (host pinned source) opts into ANY.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
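The direction rule in this commit can be condensed into a tiny helper. The function below is purely illustrative (it is not part of vLLM); it just encodes the decision stated above.

```python
def pick_src_access_order(src_is_host_pinned: bool) -> str:
    """Choose the cuMemcpyBatchAsync source access order for a KV swap.

    Illustrative helper, not vLLM code. Only copies whose source is
    host pinned memory (CPU->GPU) may relax to ANY, because nothing
    writes that buffer while the copy is in flight. The GPU->CPU
    source is the live KV cache, which the compute stream may still be
    writing, so it must stay STREAM-ordered behind the wait_stream /
    wait_event barriers.
    """
    if src_is_host_pinned:
        return "CU_MEMCPY_SRC_ACCESS_ORDER_ANY"
    return "CU_MEMCPY_SRC_ACCESS_ORDER_STREAM"
```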
// source (e.g. CPU->GPU reads from host-owned pinned memory). For
// GPU->CPU we must keep STREAM so source reads are gated by the
// transfer stream's wait_stream(compute) / wait_event barriers.
The rest is specific to the offloading connector implementation.
Suggested change:
-// source (e.g. CPU->GPU reads from host-owned pinned memory). For
-// GPU->CPU we must keep STREAM so source reads are gated by the
-// transfer stream's wait_stream(compute) / wait_event barriers.
+// source.
writing to the source (e.g. CPU->GPU, where the source is host
pinned memory). Defaults to False (STREAM ordering), which is
The "e.g." part is specific to the offloading connector implementation. Let's remove it.
| """ | ||
| torch.ops._C_cache_ops.swap_blocks_batch(src_ptrs, dst_ptrs, sizes) | ||
| torch.ops._C_cache_ops.swap_blocks_batch( | ||
| src_ptrs, dst_ptrs, sizes, src_access_order_any |
src_access_order_any is a bit confusing.
Maybe rename it to is_src_access_order_any?
Rename src_access_order_any -> is_src_access_order_any and keep the op-level comment/docstring generic; the offloader-specific rationale stays at the cpu_gpu.py call site where it belongs.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
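After this rename, the Python-level wrapper might look roughly like the stub below. The signature is assumed from the review thread; the actual C++ op call is stubbed out (and shown only in a comment), so this is a sketch, not vLLM's implementation.

```python
def swap_blocks_batch(src_ptrs, dst_ptrs, sizes,
                      is_src_access_order_any: bool = False):
    """Batched KV block copy (illustrative stub, not vLLM code).

    is_src_access_order_any=True may only be passed when nothing can
    write the source while the copy is in flight; False (STREAM
    ordering) is always safe and is therefore the default.
    """
    if not (len(src_ptrs) == len(dst_ptrs) == len(sizes)):
        raise ValueError("src_ptrs, dst_ptrs, sizes must be the same length")
    # The real wrapper would forward to the C++ op, e.g.:
    #   torch.ops._C_cache_ops.swap_blocks_batch(
    #       src_ptrs, dst_ptrs, sizes, is_src_access_order_any)
    return is_src_access_order_any
```

Keeping the flag off by default matches the review's intent: the relaxed ordering is an explicit opt-in at call sites that can prove the source is quiescent.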
This pull request has merge conflicts that must be resolved before it can be merged.

@orozery Addressed all three: the op-level comment/docstring are now generic, is_src_access_order_any is used across all layers, and the offloader-specific rationale stays at the cpu_gpu.py call site where it belongs.
…rder-any

# Conflicts:
#	vllm/v1/kv_offload/worker/cpu_gpu.py

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Hi @Etelis, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY instead of CU_MEMCPY_SRC_ACCESS_ORDER_STREAM in the cuMemcpyBatchAsync call used for batched KV cache swap copies.

This relaxes the source access ordering constraint, allowing the CUDA driver to pipeline reads more aggressively. The safety of this change relies on the fact that source data is always fully written before the batch copy begins: the offloading handler synchronizes via stream events (stream.wait_event(last_event)) before issuing the copy.

Motivated by a code review comment from @ivanium on PR #38460, who observed improved CPU->GPU bandwidth on Grace Blackwell nodes with ACCESS_ORDER_ANY.