[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch by Etelis · Pull Request #41737 · vllm-project/vllm

Etelis · 2026-05-05T13:52:39Z

Summary

#38460 introduced swap_blocks_batch on top of CUDA 12.8's cuMemcpyBatchAsync. ROCm was out of scope there, so ROCm builds fall back to a per-element cudaMemcpyAsync loop. ROCm 7.1 added hipMemcpyBatchAsync, which has the same shape as the CUDA function, so this PR routes ROCm 7.1+ builds through it. ROCm <7.1 and the CUDA path are unchanged.
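To make the routing described above concrete, here is a minimal host-only sketch of a compile-time choice between a single batched call and a per-element fallback loop. It is not the code in csrc/cache_kernels.cu: `MockDriver`, `copy_blocks_sketch`, and the `SKETCH_HAS_BATCH_COPY` macro are hypothetical names, and plain `std::memcpy` stands in for the real driver calls so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical compile-time gate. In a real ROCm build this role would be
// played by a HIP version check (for example on HIP_VERSION_MAJOR and
// HIP_VERSION_MINOR); the exact condition used by the PR is not shown here.
#ifndef SKETCH_HAS_BATCH_COPY
#define SKETCH_HAS_BATCH_COPY 1
#endif

// Mock "driver" that counts calls so the two paths can be compared.
struct MockDriver {
  int calls = 0;

  // Stand-in for a batched copy: one call moves every block.
  void copy_batch(void* const* dsts, const void* const* srcs,
                  const std::size_t* sizes, std::size_t count) {
    ++calls;
    for (std::size_t i = 0; i < count; ++i) {
      std::memcpy(dsts[i], srcs[i], sizes[i]);
    }
  }

  // Stand-in for a single copy: one call per block.
  void copy_one(void* dst, const void* src, std::size_t size) {
    ++calls;
    std::memcpy(dst, src, size);
  }
};

// Copies every block and returns how many driver calls were issued.
inline int copy_blocks_sketch(MockDriver& drv,
                              const std::vector<void*>& dsts,
                              const std::vector<const void*>& srcs,
                              const std::vector<std::size_t>& sizes) {
  assert(dsts.size() == srcs.size() && srcs.size() == sizes.size());
  const int before = drv.calls;
#if SKETCH_HAS_BATCH_COPY
  // Fast path: the whole batch goes through a single call.
  drv.copy_batch(dsts.data(), srcs.data(), sizes.data(), dsts.size());
#else
  // Fallback path: one call per element.
  for (std::size_t i = 0; i < dsts.size(); ++i) {
    drv.copy_one(dsts[i], srcs[i], sizes[i]);
  }
#endif
  return drv.calls - before;
}
```

With the gate left at its default of 1, a batch of any size costs one mock driver call; compiling with `-DSKETCH_HAS_BATCH_COPY=0` selects the loop, which issues one call per block.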

Test plan

Compiles on ROCm 7.1+ and ROCm <7.1 (fallback path).
E2E OffloadingConnector on MI300/MI325 — KV-transfer bandwidth vs main.
Decoded-token correctness vs the per-element fallback under sustained load.

Test results

Notes

ROCm 7.1's hipMemcpyBatchAsync is HIP_PARTIALLY_SUPPORTED per AMD's HIPIFY tables and silently ignores hipMemcpyAttributes (per AMD's own functional test). We pass nullptr rather than srcAccessOrder=Stream. The offloading flow already ensures source coherence via stream events before invoking swap_blocks_batch, so the missing access-order hint is safe in practice.

ROCm 7.1+ exposes the analog of cuMemcpyBatchAsync; route AMD builds through it for the same single-driver-call fast path.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

gemini-code-assist

Code Review

This pull request introduces support for hipMemcpyBatchAsync on ROCm 7.1+ within the swap_blocks_batch function, providing an optimized batch copy path similar to the existing CUDA 12.8+ implementation. The review feedback suggests adding a static_assert to verify pointer size parity during casting, which would enhance the safety and portability of the new ROCm code path.

The ROCm branch reinterpret_casts int64_t* to void**; make the size assumption explicit alongside the existing CUdeviceptr / size_t asserts.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
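The size assumption behind that cast can be illustrated with a small host-only sketch. The helper names `as_pointer_array` and `pointer_at` are hypothetical and are not taken from the PR; the assert message is likewise illustrative. The point is only that treating an array of int64_t addresses as an array of pointers is meaningful when both types have the same size, and that a static_assert turns a silent mismatch into a build error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fail the build if a pointer and an int64_t differ in size, since the
// helpers below treat int64_t address slots as pointer slots.
static_assert(sizeof(void*) == sizeof(std::int64_t),
              "int64_t address arrays are reinterpreted as pointer arrays");

// Hypothetical helper showing the kind of cast the commit message describes:
// the same memory, viewed as an array of pointers for a batch-copy API.
inline void** as_pointer_array(std::int64_t* addrs) {
  return reinterpret_cast<void**>(addrs);
}

// Hypothetical helper that reads slot i as a pointer by copying its bytes,
// which is only valid because of the size check above.
inline void* pointer_at(const std::int64_t* addrs, std::size_t i) {
  void* p = nullptr;
  std::memcpy(&p, addrs + i, sizeof(p));
  return p;
}
```

On a platform where the two sizes differ, the translation unit stops compiling at the static_assert instead of handing the driver a misaligned view of the address array.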

mawong-amd · 2026-05-06T17:55:28Z

FYI, #40549 also implements hipMemcpyBatchAsync

Etelis · 2026-05-06T22:00:10Z

Already handled unfortunately

[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch

384fac5

mergify Bot added the rocm label (Related to AMD ROCm) May 5, 2026

Merge branch 'main' into rocm-swap-blocks-batch

fc92b7d

github-project-automation Bot added this to AMD May 5, 2026

github-project-automation Bot moved this to Todo in AMD May 5, 2026

gemini-code-assist Bot reviewed May 5, 2026

Comment thread csrc/cache_kernels.cu

Assert sizeof(void*) parity in swap_blocks_batch

fea5520

Etelis mentioned this pull request May 6, 2026

[ROCm] Enable SimpleCPUOffloadConnector on ROCm #40549

Merged

4 tasks

Etelis closed this May 6, 2026

github-project-automation Bot moved this from Todo to Done in AMD May 6, 2026

Etelis wants to merge 3 commits into vllm-project:main from Etelis:rocm-swap-blocks-batch
