
[Performance] Batch kvcache offloading via aclrtMemcpyBatchAsync #7819

Merged
wangxiyuan merged 26 commits into vllm-project:main from HF-001:batch_kv_offload on Apr 21, 2026
Conversation

HF-001 (Contributor) commented Mar 30, 2026

What this PR does / why we need it?

Refer to vllm-project/vllm#38460 and vllm-project/vllm#38915: on CANN 8.5.0+, KV cache offloading uses aclrtMemcpyBatchAsync, while older CANN versions fall back to aclrtMemcpyAsync.

The build automatically compiles and selects the appropriate copy function based on the CANN environment, and the choice can also be forced manually at build time (see the sketch after the list below):

  1. batch memcpy (requires CANN ≥ 8.5): export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=1; pip install -e .
  2. normal memcpy: export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=0; pip install -e .
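As a minimal, illustrative Python sketch (not the PR's actual code): the selection itself happens at build time via VLLM_ASCEND_ENABLE_BATCH_MEMCPY, but conceptually the offloading handler ends up on one of the two paths below. Only `swap_blocks_batch` is the custom op added by this PR; `copy_single_block` is a hypothetical stand-in for the per-block aclrtMemcpyAsync fallback.

```python
import torch


def copy_single_block(src_ptr: int, dst_ptr: int, num_bytes: int) -> None:
    # Hypothetical stand-in: the real fallback issues one aclrtMemcpyAsync per block in C++.
    raise NotImplementedError


def offload_kv_blocks(batch_src: torch.Tensor,    # int64 source addresses, one per copy
                      batch_dst: torch.Tensor,    # int64 destination addresses
                      batch_sizes: torch.Tensor,  # int64 copy sizes in bytes
                      batch_op_compiled: bool) -> None:
    if batch_op_compiled:
        # CANN >= 8.5 build: one call submits every copy via aclrtMemcpyBatchAsync.
        torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
    else:
        # Older CANN build: fall back to copying block by block.
        for src, dst, size in zip(batch_src.tolist(), batch_dst.tolist(),
                                  batch_sizes.tolist()):
            copy_single_block(src, dst, size)
```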

How was this patch tested?

Test results:
main:    TTFT 307 ms,    TPOT 49.96 ms
this PR: TTFT 272.82 ms, TPOT 41.04 ms

model script:
export TP=1
export MODEL_PATH=/nas/disk1/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10113
export CUDA_VISIBLE_DEVICES=3
export ASCEND_RT_VISIBLE_DEVICES=3
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7 --no-enable-prefix-caching --max-model-len 32768 --trust-remote-code \
    --block-size 128 \
    --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'

test script:
export MODEL_NAME=/nas/disk1/Qwen3-14B
python /model/xk/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py --url http://127.0.0.1:10113 --model $MODEL_NAME --served-model-name Qwen3-14B --seed 1234 --input-file /model/xk/vllm/benchmarks/multi_turn/generate_multi_turn.json \
    --num-clients 8 --max-active-conversations 24

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes KV cache offloading performance by introducing batched memory copy operations. By utilizing aclrtMemcpyBatchAsync on supported CANN versions, the system can process multiple block transfers in a single asynchronous call, significantly reducing the overhead compared to sequential individual transfers. The changes include both the necessary C++ bindings and the corresponding Python logic to prepare and execute these batched operations.

Highlights

  • Batch KV Cache Copying: Introduced swap_blocks_batch to leverage aclrtMemcpyBatchAsync for more efficient KV cache offloading on CANN 8.5+.
  • Performance Optimization: Replaced individual aclrtMemcpyAsync calls with a batched approach to reduce overhead during block transfers.
  • Compatibility: Implemented a fallback mechanism to ensure compatibility with older CANN versions and all memory copy directions.


github-actions (Bot) commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@HF-001 HF-001 changed the title [wip][feat]Batch kvcache swap copies via aclrtMemcpyBatchAsync [wip][perf]Batch kvcache swap copies via aclrtMemcpyBatchAsync Mar 30, 2026
@HF-001 HF-001 changed the title [wip][perf]Batch kvcache swap copies via aclrtMemcpyBatchAsync [wip][perf]Batch kvcache offloading via aclrtMemcpyBatchAsync Mar 30, 2026
@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a batched block swapping mechanism for Ascend NPU to optimize KV cache offloading by utilizing aclrtMemcpyBatchAsync. Review feedback identified critical issues, including a C++ function signature mismatch with the operator registration, the use of undefined attributes in the Python __init__ method, and several logic errors in transfer_async such as the use of undefined variables.

Suggested PR Title:

[Ops][Feature] Implement batched block swapping for Ascend NPU

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements `swap_blocks_batch` for Ascend NPU to optimize KV cache offloading. It introduces a C++ implementation for batched memory copies and updates the Python offloading handler to aggregate transfer pointers.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No testing details provided; implementation contains blocking bugs.

Comment thread csrc/torch_binding.cpp Outdated
Comment on lines +134 to +136
void swap_blocks_batch(const torch::Tensor& src_ptrs,
                       const torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {

critical

The function signature for swap_blocks_batch uses const torch::Tensor& for src_ptrs and dst_ptrs. However, the operator registration on line 881 (ops.def("swap_blocks_batch(Tensor! x, Tensor! y, Tensor z) -> ()");) marks these tensors as mutable (!). This mismatch forces the use of const_cast on lines 167-172, which is unsafe and breaks the const contract.

To fix this, the function signature should be updated to match the registration. This will also allow removing the const_casts. Additionally, the const_cast for size_data is unnecessary as the aclrtMemcpyBatchAsync API expects const size_t* for size-related arguments.

void swap_blocks_batch(torch::Tensor& src_ptrs,
                       torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {

Comment thread vllm_ascend/kv_offload/cpu_npu.py Outdated
Comment on lines +97 to +106
# Pre-compute base pointers and block sizes for batch copies.
self._src_base_ptrs = np.array(
    [t.data_ptr() for t in self.src_tensors], dtype=np.int64
)
self._dst_base_ptrs = np.array(
    [t.data_ptr() for t in self.dst_tensors], dtype=np.int64
)
self._block_size_in_bytes_arr = np.array(
    self.tensor_block_size_in_bytes, dtype=np.int64
)

critical

The attributes self.src_tensors, self.dst_tensors, and self.tensor_block_size_in_bytes are used here in __init__, but they are not defined on the class instance. The transfer direction, and therefore which tensors are source or destination, is only determined within the transfer_async method. This code will raise an AttributeError when CpuNpuOffloadingHandler is initialized.

This logic for preparing base pointers and sizes should be moved into transfer_async where src_tensors and dst_tensors are defined.
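As a hedged sketch of that move (helper and attribute names assumed, not the merged code), the base pointers and per-tensor block sizes can be computed inside transfer_async once the direction-dependent source and destination tensor lists are known. For simplicity this assumes flat tensor lists, whereas src_tensors is described above as a list of tensor tuples.

```python
import numpy as np
import torch


def prepare_batch_pointers(src_tensors: list[torch.Tensor],
                           dst_tensors: list[torch.Tensor],
                           block_size_in_bytes: list[int]):
    """Return int64 base-pointer arrays and per-tensor block sizes for batching."""
    src_base_ptrs = np.array([t.data_ptr() for t in src_tensors], dtype=np.int64)
    dst_base_ptrs = np.array([t.data_ptr() for t in dst_tensors], dtype=np.int64)
    block_sizes = np.array(block_size_in_bytes, dtype=np.int64)
    return src_base_ptrs, dst_base_ptrs, block_sizes
```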

Comment thread vllm_ascend/kv_offload/cpu_npu.py Outdated
Comment on lines +167 to +168
torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)

critical

The calls to torch.ops._C_ascend.swap_blocks_batch are using undefined variables from the old, removed code path (src_key_cache, dst_key_cache, etc.).

The new batched implementation prepares batch_src, batch_dst, and batch_sizes to handle all copies in one go. There should be a single call to swap_blocks_batch with these tensors.

Additionally, there are other errors in this block:

  • self.dst_block_size_factor on line 142 should be the local variable dst_block_size_factor.
  • self.src_tensors on line 146 should be the local variable src_tensors.
  • The logic for preparing all_src, all_dst, all_sizes depends on attributes (_src_base_ptrs, etc.) that are not correctly initialized. This logic needs to be self-contained within transfer_async and correctly handle the structure of src_tensors (a list of tensor tuples); see the sketch after the suggested change below.
Suggested change
torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)
torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
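Building on the points above, here is a minimal sketch of one way to assemble the flat batch inside transfer_async and submit it with a single call. The block layout (one contiguous block per (tensor, block_id) pair) and all helper names are assumptions for illustration, not the merged code.

```python
import numpy as np
import torch


def batch_swap(src_base_ptrs: np.ndarray,        # int64 base address per cache tensor
               dst_base_ptrs: np.ndarray,        # int64 base address per cache tensor
               block_sizes: np.ndarray,          # int64 block size in bytes per tensor
               src_to_dst: np.ndarray) -> None:  # shape (num_copies, 2) block-id pairs
    src_blocks, dst_blocks = src_to_dst[:, 0], src_to_dst[:, 1]
    # Every cache tensor contributes one copy per (src, dst) block pair.
    batch_src = (src_base_ptrs[:, None] + block_sizes[:, None] * src_blocks[None, :]).ravel()
    batch_dst = (dst_base_ptrs[:, None] + block_sizes[:, None] * dst_blocks[None, :]).ravel()
    batch_sizes = np.repeat(block_sizes, len(src_blocks))
    # One call hands the whole batch to aclrtMemcpyBatchAsync via the custom op.
    torch.ops._C_ascend.swap_blocks_batch(
        torch.from_numpy(batch_src),
        torch.from_numpy(batch_dst),
        torch.from_numpy(batch_sizes),
    )
```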

01267596 and others added 2 commits April 1, 2026 06:26
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: HF-001 <1670186653@qq.com>
@HF-001 HF-001 changed the title [wip][perf]Batch kvcache offloading via aclrtMemcpyBatchAsync [Performance]Batch kvcache offloading via aclrtMemcpyBatchAsync Apr 9, 2026
Signed-off-by: HF-001 <1670186653@qq.com>
HF-001 force-pushed the batch_kv_offload branch from 7ef1c77 to 5c17e15 on April 9, 2026 03:06
HF-001 added 2 commits April 9, 2026 11:15
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
github-actions (Bot) commented Apr 9, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: kx <1670186653@qq.com>
HF-001 added 4 commits April 10, 2026 08:54
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
github-actions (Bot) commented:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

HF-001 added 3 commits April 10, 2026 17:31
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: kx <1670186653@qq.com>
HF-001 (Contributor, Author) commented Apr 13, 2026

@wangxiyuan Hi, this PR is ready. Would it be convenient for you to take a look? The CI error is fixed by this PR: #8181.

HF-001 (Contributor, Author) commented Apr 14, 2026

@wangxiyuan Hi, this PR is ready and has passed all CI tests. Would it be convenient for you to take a look?

Signed-off-by: HF-001 <1670186653@qq.com>
@HF-001 HF-001 requested a review from wangxiyuan as a code owner April 17, 2026 14:09
@wangxiyuan wangxiyuan merged commit aa8ac74 into vllm-project:main Apr 21, 2026
49 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request May 7, 2026