[Performance] Batch kvcache offloading via aclrtMemcpyBatchAsync #7819
wangxiyuan merged 26 commits into vllm-project:main
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Summary of Changes (Gemini Code Assist): This pull request optimizes KV cache offloading performance by introducing batched memory copy operations via `aclrtMemcpyBatchAsync`.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a batched block-swapping mechanism for Ascend NPU to optimize KV cache offloading by utilizing `aclrtMemcpyBatchAsync`. Review feedback identified critical issues, including a C++ function signature mismatch with the operator registration, the use of undefined attributes in the Python `__init__` method, and several logic errors in `transfer_async`, such as the use of undefined variables.

Suggested PR Title:
[Ops][Feature] Implement batched block swapping for Ascend NPU

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements `swap_blocks_batch` for Ascend NPU to optimize KV cache offloading. It introduces a C++ implementation for batched memory copies and updates the Python offloading handler to aggregate transfer pointers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No testing details provided; the implementation contains blocking bugs.

```cpp
void swap_blocks_batch(const torch::Tensor& src_ptrs,
                       const torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {
```
The function signature for `swap_blocks_batch` uses `const torch::Tensor&` for `src_ptrs` and `dst_ptrs`. However, the operator registration on line 881 (`ops.def("swap_blocks_batch(Tensor! x, Tensor! y, Tensor z) -> ()");`) marks these tensors as mutable (`!`). This mismatch forces the use of `const_cast` on lines 167-172, which is unsafe and breaks the const contract.
To fix this, the function signature should be updated to match the registration, which also allows removing the `const_cast`s. Additionally, the `const_cast` for `size_data` is unnecessary, as the `aclrtMemcpyBatchAsync` API expects `const size_t*` for size-related arguments.
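One way to double-check the registered schema from Python before applying the fix (a sketch; it assumes importing `vllm_ascend` is what loads and registers the `_C_ascend` extension):

```python
import torch
import vllm_ascend  # noqa: F401 -- assumed to register the _C_ascend ops on import

# Each registered overload exposes its TorchScript schema; the printed
# string should carry the mutable (!) annotations declared in ops.def.
print(torch.ops._C_ascend.swap_blocks_batch.default._schema)
```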
Suggested change:

```cpp
void swap_blocks_batch(torch::Tensor& src_ptrs,
                       torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {
```

```python
# Pre-compute base pointers and block sizes for batch copies.
self._src_base_ptrs = np.array(
    [t.data_ptr() for t in self.src_tensors], dtype=np.int64
)
self._dst_base_ptrs = np.array(
    [t.data_ptr() for t in self.dst_tensors], dtype=np.int64
)
self._block_size_in_bytes_arr = np.array(
    self.tensor_block_size_in_bytes, dtype=np.int64
)
```
The attributes `self.src_tensors`, `self.dst_tensors`, and `self.tensor_block_size_in_bytes` are used here in `__init__`, but they are not defined on the class instance. The transfer direction, and therefore which tensors are source or destination, is only determined within the `transfer_async` method. This code will raise an `AttributeError` when `CpuNpuOffloadingHandler` is initialized.
This logic for preparing base pointers and sizes should be moved into `transfer_async`, where `src_tensors` and `dst_tensors` are defined; see the sketch below.
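A minimal sketch of that move (the helper name is hypothetical; it also assumes `src_tensors`/`dst_tensors` have already been flattened to plain tensor lists, since the review notes they are really lists of tensor tuples):

```python
import numpy as np

def _prepare_batch_meta(src_tensors, dst_tensors, tensor_block_size_in_bytes):
    # Compute the batch-copy metadata once per transfer, after the
    # direction is known. data_ptr() gives each cache tensor's base
    # address; per-block offsets are added later from block ids.
    src_base_ptrs = np.array([t.data_ptr() for t in src_tensors], dtype=np.int64)
    dst_base_ptrs = np.array([t.data_ptr() for t in dst_tensors], dtype=np.int64)
    block_sizes = np.array(tensor_block_size_in_bytes, dtype=np.int64)
    return src_base_ptrs, dst_base_ptrs, block_sizes
```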
```python
torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)
```
The calls to `torch.ops._C_ascend.swap_blocks_batch` are using undefined variables from the old, removed code path (`src_key_cache`, `dst_key_cache`, etc.).
The new batched implementation prepares `batch_src`, `batch_dst`, and `batch_sizes` to handle all copies in one go. There should be a single call to `swap_blocks_batch` with these tensors.
Additionally, there are other errors in this block:

- `self.dst_block_size_factor` on line 142 should be the local variable `dst_block_size_factor`.
- `self.src_tensors` on line 146 should be the local variable `src_tensors`.
- The logic for preparing `all_src`, `all_dst`, `all_sizes` depends on attributes (`_src_base_ptrs`, etc.) that are not correctly initialized. This logic needs to be self-contained within `transfer_async` and correctly handle the structure of `src_tensors` (a list of tensor tuples); a sketch follows the suggested change below.
Suggested change:

```diff
- torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
- torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)
+ torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
```
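A hedged sketch of how `batch_src`, `batch_dst`, and `batch_sizes` might be assembled before that single call (the block-addressing layout and helper name are assumptions for illustration; only the final op call comes from the suggestion above):

```python
import numpy as np
import torch

def build_batch(src_base_ptrs, dst_base_ptrs, block_sizes, src_blocks, dst_blocks):
    """Expand per-tensor base pointers and (src, dst) block ids into flat
    pointer/size arrays so every copy lands in one batched op call."""
    srcs, dsts, sizes = [], [], []
    for base_src, base_dst, size in zip(src_base_ptrs, dst_base_ptrs, block_sizes):
        for s, d in zip(src_blocks, dst_blocks):
            srcs.append(base_src + s * size)   # byte address of source block
            dsts.append(base_dst + d * size)   # byte address of destination block
            sizes.append(size)                 # bytes to copy for this entry
    batch_src = torch.from_numpy(np.asarray(srcs, dtype=np.int64))
    batch_dst = torch.from_numpy(np.asarray(dsts, dtype=np.int64))
    batch_sizes = torch.from_numpy(np.asarray(sizes, dtype=np.int64))
    return batch_src, batch_dst, batch_sizes

# One call then covers every (tensor, block) pair:
# torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
```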
Signed-off-by: HF-001 <1670186653@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
@wangxiyuan hi, this PR is ready. Would it be convenient for you to take a look? The CI error is fixed by this PR: #8181
@wangxiyuan hi, this PR is ready and has passed all CI tests. Would it be convenient for you to take a look?
### What this PR does / why we need it?
Referring to vllm-project/vllm#38460 and vllm-project/vllm#38915: CANN 8.5.0+ uses `aclrtMemcpyBatchAsync`, while older CANN versions use `aclrtMemcpyAsync` for KV cache offloading.
The build automatically compiles and selects the appropriate transfer function based on the CANN environment, and manual selection of the transfer function is also supported.
Manual selection:

1. Batch memcpy (requires CANN ≥ 8.5): `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=1`, then `pip install -e .`
2. Normal memcpy: `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=0`, then `pip install -e .`
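A minimal sketch of that build-time decision (the function name and version-probe interface are hypothetical; only the `VLLM_ASCEND_ENABLE_BATCH_MEMCPY` variable and the CANN ≥ 8.5 threshold come from the PR description):

```python
import os

def enable_batch_memcpy(detected_cann_version: tuple) -> bool:
    # Manual override via the env var wins; otherwise compile the
    # aclrtMemcpyBatchAsync path only when CANN is 8.5 or newer.
    flag = os.environ.get("VLLM_ASCEND_ENABLE_BATCH_MEMCPY")
    if flag is not None:
        return flag == "1"
    return detected_cann_version >= (8, 5)
```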
### How was this patch tested?
Test results:

| | TTFT | TPOT |
| --- | --- | --- |
| main | 307 ms | 49.96 ms |
| this PR | 272.82 ms | 41.04 ms |
model script:

```bash
export TP=1
export MODEL_PATH=/nas/disk1/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10113
export CUDA_VISIBLE_DEVICES=3
export ASCEND_RT_VISIBLE_DEVICES=3
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} \
    --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} \
    --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7 \
    --no-enable-prefix-caching --max-model-len 32768 --trust-remote-code \
    --block-size 128 \
    --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'
```
test script:

```bash
export MODEL_NAME=/nas/disk1/Qwen3-14B
python /model/xk/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --url http://127.0.0.1:10113 --model $MODEL_NAME --served-model-name Qwen3-14B \
    --seed 1234 --input-file /model/xk/vllm/benchmarks/multi_turn/generate_multi_turn.json \
    --num-clients 8 --max-active-conversations 24
```