[Bugfix] Runtime driver check for cuMemcpyBatchAsync in swap_blocks_batch#38919
…ks_batch

Replace the compile-time-only `#ifdef` guard for `cuMemcpyBatchAsync` with runtime resolution via `cuGetProcAddress`. Pre-built wheels compiled with CUDA 12.8+ would fail with "undefined symbol: cuMemcpyBatchAsync" on systems with older CUDA drivers (e.g. driver 12.1). The function pointer is now resolved lazily and cached, falling back to individual `cudaMemcpyAsync` calls when the driver lacks support.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Code Review
This pull request updates the swap_blocks_batch function in csrc/cache_kernels.cu to resolve cuMemcpyBatchAsync at runtime using cuGetProcAddress. This change ensures that binaries compiled with CUDA 12.8+ remain compatible with older drivers by falling back to individual async copies if the batch function is unavailable. I have no feedback to provide as the implementation correctly handles the dynamic loading and fallback logic.
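The "resolve lazily, cache, fall back" pattern described above can be sketched in plain C++. This is a simulation under stated assumptions, not the code in `csrc/cache_kernels.cu`: `BatchCopyFn`, `fake_batch_copy`, `lookup_batch_copy`, `BlockCopier`, and `CopyPath` are hypothetical names, the lookup function only stands in for `cuGetProcAddress`, and `std::memcpy` stands in for the asynchronous CUDA copies so the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a batched driver entry point. This is NOT the
// real cuMemcpyBatchAsync signature; it only mimics "copy N buffers at once".
using BatchCopyFn = int (*)(void* const* dsts, const void* const* srcs,
                            const std::size_t* sizes, std::size_t count);

static int fake_batch_copy(void* const* dsts, const void* const* srcs,
                           const std::size_t* sizes, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
  return 0;
}

// Stand-in for a cuGetProcAddress-style lookup: the symbol is only handed
// out when the (simulated) driver is at least min_version.
static BatchCopyFn lookup_batch_copy(int driver_version, int min_version) {
  return driver_version >= min_version ? &fake_batch_copy : nullptr;
}

enum class CopyPath { Batched, PerBlock };

class BlockCopier {
 public:
  explicit BlockCopier(int driver_version) : driver_version_(driver_version) {}

  // Copies every (src, dst) pair and reports which path was taken.
  CopyPath copy(const std::vector<void*>& dsts,
                const std::vector<const void*>& srcs,
                const std::vector<std::size_t>& sizes) {
    assert(dsts.size() == srcs.size() && srcs.size() == sizes.size());
    if (BatchCopyFn fn = resolve()) {
      int rc = fn(dsts.data(), srcs.data(), sizes.data(), sizes.size());
      assert(rc == 0);
      (void)rc;
      return CopyPath::Batched;
    }
    // Fallback: one copy per block when the batched symbol is unavailable.
    for (std::size_t i = 0; i < sizes.size(); ++i) {
      std::memcpy(dsts[i], srcs[i], sizes[i]);
    }
    return CopyPath::PerBlock;
  }

  int lookups() const { return lookups_; }

 private:
  // Resolve on first use and cache the result, including a null result, so
  // the lookup runs once. Not thread-safe; a real implementation would need
  // std::call_once or a function-local static.
  BatchCopyFn resolve() {
    if (!resolved_) {
      fn_ = lookup_batch_copy(driver_version_, 12080);
      resolved_ = true;
      ++lookups_;
    }
    return fn_;
  }

  int driver_version_;
  bool resolved_ = false;
  BatchCopyFn fn_ = nullptr;
  int lookups_ = 0;
};
```

Constructing `BlockCopier(12010)` exercises the per-block fallback, while `BlockCopier(12080)` takes the batched path; in both cases the lookup happens once however many times `copy` is called.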
cc @mgoin
Thanks, building with this PR now.
After applying this patch on top of latest main, I was able to build vLLM from source again with CUDA 13 on my DGX Spark. So I'm hopeful that eugr will report success as well.
The rebuild was successful and the regression test pipeline is halfway through now; so far so good.
I will also rerun it.
All checks passed, everything is good! Thanks for the quick turnaround!
Resolve the conflicts
@orozery cc
…ch-runtime-driver-check # Conflicts: # csrc/cache_kernels.cu Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
mgoin left a comment
Seems reasonable to have the fallback, and thanks folks for confirming the fix
@Etelis How does this work for CUDA 13 if it expects 8 arguments instead of 9?
Tested with both.
@mgoin Can we merge this?
Needed this on RHEL 9 with nvidia-driver-550.163.01 and CUDA 13; seems to work fine.
Hi, any update? @mgoin
Thanks for the ping!
…atch (vllm-project#38919) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
…atch (vllm-project#38919) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Fixes two issues introduced by `swap_blocks_batch` (#38460):

1. `undefined symbol: cuMemcpyBatchAsync` on CUDA drivers < 12.8 (@JaheimLee): pre-built wheels hard-link the symbol, crashing at `import vllm._C` time on older drivers.
2. Under CUDA 13, `#define cuMemcpyBatchAsync cuMemcpyBatchAsync_v2` (8 params) is in effect, breaking the original 9-param call.

#38915 fixed problem 2 with compile-time `#ifdef` branching but left problem 1 open. This PR supersedes that approach by resolving the function at runtime via `cuGetProcAddress("cuMemcpyBatchAsync", ..., 12080)`:

`#define` remapping
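One plausible reading of why a by-name, versioned lookup sidesteps the `#define` remapping is that the preprocessor never rewrites string literals, and the requested version pins which ABI comes back. The sketch below simulates that idea in plain C++. Everything in it is hypothetical (`copy_v1`, `copy_v2`, `copy_fn`, `lookup_versioned`, and the shortened signatures); it is not the CUDA driver API and is not taken from the PR's diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins: the "v1" entry point carries a trailing fail_idx
// out-parameter and the "v2" entry point drops it, mirroring 9 vs 8 params.
using CopyV1Fn = int (*)(void* dst, const void* src, std::size_t n,
                         std::size_t* fail_idx);
using CopyV2Fn = int (*)(void* dst, const void* src, std::size_t n);
using AnyFn = void (*)();  // generic function-pointer type for the lookup

static int copy_v1(void* dst, const void* src, std::size_t n,
                   std::size_t* fail_idx) {
  std::memcpy(dst, src, n);
  if (fail_idx != nullptr) *fail_idx = 0;
  return 0;
}

static int copy_v2(void* dst, const void* src, std::size_t n) {
  std::memcpy(dst, src, n);
  return 0;
}

// A newer "header" remaps the plain name to the v2 symbol, so a direct call
// written against the v1 parameter list would no longer compile.
#define copy_fn copy_v2

// Stand-in for a versioned lookup. The name is a string literal, which the
// macro above cannot touch, and requested_version selects the ABI.
static AnyFn lookup_versioned(const char* name, int requested_version) {
  if (std::strcmp(name, "copy_fn") != 0) return nullptr;
  if (requested_version >= 13000) return reinterpret_cast<AnyFn>(&copy_v2);
  if (requested_version >= 12080) return reinterpret_cast<AnyFn>(&copy_v1);
  return nullptr;
}

// Caller-owned typedef plus a pinned version: the v1 parameter list keeps
// working regardless of what the header macro says.
int copy_with_pinned_abi(void* dst, const void* src, std::size_t n) {
  auto fn = reinterpret_cast<CopyV1Fn>(lookup_versioned("copy_fn", 12080));
  if (fn == nullptr) return -1;
  std::size_t fail_idx = 0;
  return fn(dst, src, n, &fail_idx);
}

// For contrast: a direct call through the macro name resolves to copy_v2.
int copy_via_macro_name(void* dst, const void* src, std::size_t n) {
  CopyV2Fn fn = &copy_fn;
  return fn(dst, src, n);
}
```

The design point is that the caller owns both the function-pointer typedef and the version number, so one source file can build against headers on either side of the signature change.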