[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch#41737
Closed
Etelis wants to merge 3 commits intovllm-project:mainfrom
Closed
[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch#41737Etelis wants to merge 3 commits intovllm-project:mainfrom
Etelis wants to merge 3 commits intovllm-project:mainfrom
Conversation
ROCm 7.1+ exposes the analog of cuMemcpyBatchAsync; route AMD builds through it for the same single-driver-call fast path. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces support for hipMemcpyBatchAsync on ROCm 7.1+ within the swap_blocks_batch function, providing an optimized batch copy path similar to the existing CUDA 12.8+ implementation. The review feedback suggests adding a static_assert to verify pointer size parity during casting, which would enhance the safety and portability of the new ROCm code path.
The ROCm branch reinterpret_casts int64_t* to void**; make the size assumption explicit alongside the existing CUdeviceptr / size_t asserts. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Contributor
|
FYI, #40549 also implements |
4 tasks
Contributor
Author
|
Already handled unfortunately |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
#38460 introduced
swap_blocks_batchusing CUDA 12.8'scuMemcpyBatchAsync. ROCm was out of scope and falls back to a per-elementcudaMemcpyAsyncloop. ROCm 7.1 addedhipMemcpyBatchAsync— same shape as the CUDA function — so this PR routes ROCm 7.1+ builds through it. ROCm <7.1 and the CUDA path are unchanged.Test plan
OffloadingConnectoron MI300/MI325 — KV-transfer bandwidth vsmain.Test results
Notes
ROCm 7.1's
hipMemcpyBatchAsyncisHIP_PARTIALLY_SUPPORTEDper AMD's HIPIFY tables and silently ignoreshipMemcpyAttributes(per AMD's own functional test). We passnullptrrather thansrcAccessOrder=Stream. The offloading flow already ensures source coherence via stream events before invokingswap_blocks_batch, so the missing access-order hint is safe in practice.