
[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch #41737

Closed
Etelis wants to merge 3 commits into vllm-project:main from Etelis:rocm-swap-blocks-batch

Conversation

@Etelis
Contributor

Etelis commented May 5, 2026

Summary

#38460 introduced swap_blocks_batch using CUDA 12.8's cuMemcpyBatchAsync. ROCm was out of scope and falls back to a per-element cudaMemcpyAsync loop. ROCm 7.1 added hipMemcpyBatchAsync — same shape as the CUDA function — so this PR routes ROCm 7.1+ builds through it. ROCm <7.1 and the CUDA path are unchanged.

Test plan

  • Compiles on ROCm 7.1+ and ROCm <7.1 (fallback path).
  • E2E OffloadingConnector on MI300/MI325 — KV-transfer bandwidth vs main.
  • Decoded-token correctness vs the per-element fallback under sustained load.

Test results

Notes

ROCm 7.1's hipMemcpyBatchAsync is HIP_PARTIALLY_SUPPORTED per AMD's HIPIFY tables and silently ignores hipMemcpyAttributes (per AMD's own functional test). We pass nullptr rather than srcAccessOrder=Stream. The offloading flow already ensures source coherence via stream events before invoking swap_blocks_batch, so the missing access-order hint is safe in practice.

ROCm 7.1+ exposes the analog of cuMemcpyBatchAsync; route AMD builds
through it for the same single-driver-call fast path.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mergify Bot added the rocm (Related to AMD ROCm) label May 5, 2026
Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for hipMemcpyBatchAsync on ROCm 7.1+ within the swap_blocks_batch function, providing an optimized batch copy path similar to the existing CUDA 12.8+ implementation. The review feedback suggests adding a static_assert to verify pointer size parity during casting, which would enhance the safety and portability of the new ROCm code path.

Comment thread csrc/cache_kernels.cu
The ROCm branch reinterpret_casts int64_t* to void**; make the size
assumption explicit alongside the existing CUdeviceptr / size_t asserts.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mawong-amd
Contributor

FYI, #40549 also implements hipMemcpyBatchAsync

@Etelis
Contributor Author

Etelis commented May 6, 2026

Already handled there, unfortunately.

@Etelis closed this May 6, 2026
@github-project-automation Bot moved this from Todo to Done in AMD May 6, 2026

Labels

rocm Related to AMD ROCm

Projects

Status: Done


3 participants