[Performance] Batch kvcache offloading via aclrtMemcpyBatchAsync #7819
wangxiyuan merged 26 commits into vllm-project:main
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Summary of Changes (Gemini Code Assist): This pull request optimizes KV cache offloading performance by introducing batched memory copy operations via `aclrtMemcpyBatchAsync`.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a batched block-swapping mechanism for Ascend NPU to optimize KV cache offloading by utilizing `aclrtMemcpyBatchAsync`. Review feedback identified critical issues, including a C++ function signature mismatch with the operator registration, the use of undefined attributes in the Python `__init__` method, and several logic errors in `transfer_async`, such as the use of undefined variables.

Suggested PR Title:
[Ops][Feature] Implement batched block swapping for Ascend NPU

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements `swap_blocks_batch` for Ascend NPU to optimize KV cache offloading. It introduces a C++ implementation for batched memory copies and updates the Python offloading handler to aggregate transfer pointers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No testing details provided; the implementation contains blocking bugs.

```cpp
void swap_blocks_batch(const torch::Tensor& src_ptrs,
                       const torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {
```
The function signature for `swap_blocks_batch` uses `const torch::Tensor&` for `src_ptrs` and `dst_ptrs`. However, the operator registration on line 881 (`ops.def("swap_blocks_batch(Tensor! x, Tensor! y, Tensor z) -> ()");`) marks these tensors as mutable (`!`). This mismatch forces the use of `const_cast` on lines 167-172, which is unsafe and breaks the const contract.
To fix this, the function signature should be updated to match the registration, which also allows removing the `const_cast`s. Additionally, the `const_cast` for `size_data` is unnecessary, as the `aclrtMemcpyBatchAsync` API expects `const size_t*` for size-related arguments.
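One way to double-check the registered schema from Python before applying the fix (a sketch; it assumes importing `vllm_ascend` is what loads and registers the `_C_ascend` extension):

```python
import torch
import vllm_ascend  # noqa: F401 -- assumed to register the _C_ascend ops on import

# Each registered overload exposes its TorchScript schema; the printed
# string should carry the mutable (!) annotations declared in ops.def.
print(torch.ops._C_ascend.swap_blocks_batch.default._schema)
```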
Suggested change:

```cpp
void swap_blocks_batch(torch::Tensor& src_ptrs,
                       torch::Tensor& dst_ptrs,
                       const torch::Tensor& sizes) {
```

```python
# Pre-compute base pointers and block sizes for batch copies.
self._src_base_ptrs = np.array(
    [t.data_ptr() for t in self.src_tensors], dtype=np.int64
)
self._dst_base_ptrs = np.array(
    [t.data_ptr() for t in self.dst_tensors], dtype=np.int64
)
self._block_size_in_bytes_arr = np.array(
    self.tensor_block_size_in_bytes, dtype=np.int64
)
```
The attributes `self.src_tensors`, `self.dst_tensors`, and `self.tensor_block_size_in_bytes` are used here in `__init__`, but they are not defined on the class instance. The transfer direction, and therefore which tensors are source or destination, is only determined within the `transfer_async` method. This code will raise an `AttributeError` when `CpuNpuOffloadingHandler` is initialized.
This logic for preparing base pointers and sizes should be moved into `transfer_async`, where `src_tensors` and `dst_tensors` are defined; see the sketch below.
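A minimal sketch of that move (the helper name is hypothetical; it also assumes `src_tensors`/`dst_tensors` have already been flattened to plain tensor lists, since the review notes they are really lists of tensor tuples):

```python
import numpy as np

def _prepare_batch_meta(src_tensors, dst_tensors, tensor_block_size_in_bytes):
    # Compute the batch-copy metadata once per transfer, after the
    # direction is known. data_ptr() gives each cache tensor's base
    # address; per-block offsets are added later from block ids.
    src_base_ptrs = np.array([t.data_ptr() for t in src_tensors], dtype=np.int64)
    dst_base_ptrs = np.array([t.data_ptr() for t in dst_tensors], dtype=np.int64)
    block_sizes = np.array(tensor_block_size_in_bytes, dtype=np.int64)
    return src_base_ptrs, dst_base_ptrs, block_sizes
```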
```python
torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)
```
The calls to `torch.ops._C_ascend.swap_blocks_batch` are using undefined variables from the old, removed code path (`src_key_cache`, `dst_key_cache`, etc.).
The new batched implementation prepares `batch_src`, `batch_dst`, and `batch_sizes` to handle all copies in one go. There should be a single call to `swap_blocks_batch` with these tensors.
Additionally, there are other errors in this block:

- `self.dst_block_size_factor` on line 142 should be the local variable `dst_block_size_factor`.
- `self.src_tensors` on line 146 should be the local variable `src_tensors`.
- The logic for preparing `all_src`, `all_dst`, `all_sizes` depends on attributes (`_src_base_ptrs`, etc.) that are not correctly initialized. This logic needs to be self-contained within `transfer_async` and correctly handle the structure of `src_tensors` (a list of tensor tuples); a sketch follows the suggested change below.
Suggested change:

```diff
- torch.ops._C_ascend.swap_blocks_batch(src_key_cache, dst_key_cache, src_to_dst_tensor)
- torch.ops._C_ascend.swap_blocks_batch(src_value_cache, dst_value_cache, src_to_dst_tensor)
+ torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
```
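A hedged sketch of how `batch_src`, `batch_dst`, and `batch_sizes` might be assembled before that single call (the block-addressing layout and helper name are assumptions for illustration; only the final op call comes from the suggestion above):

```python
import numpy as np
import torch

def build_batch(src_base_ptrs, dst_base_ptrs, block_sizes, src_blocks, dst_blocks):
    """Expand per-tensor base pointers and (src, dst) block ids into flat
    pointer/size arrays so every copy lands in one batched op call."""
    srcs, dsts, sizes = [], [], []
    for base_src, base_dst, size in zip(src_base_ptrs, dst_base_ptrs, block_sizes):
        for s, d in zip(src_blocks, dst_blocks):
            srcs.append(base_src + s * size)   # byte address of source block
            dsts.append(base_dst + d * size)   # byte address of destination block
            sizes.append(size)                 # bytes to copy for this entry
    batch_src = torch.from_numpy(np.asarray(srcs, dtype=np.int64))
    batch_dst = torch.from_numpy(np.asarray(dsts, dtype=np.int64))
    batch_sizes = torch.from_numpy(np.asarray(sizes, dtype=np.int64))
    return batch_src, batch_dst, batch_sizes

# One call then covers every (tensor, block) pair:
# torch.ops._C_ascend.swap_blocks_batch(batch_src, batch_dst, batch_sizes)
```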
Signed-off-by: HF-001 <1670186653@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
@wangxiyuan hi, this PR is ready. Would it be convenient for you to take a look? The CI error is fixed by this PR: #8181
@wangxiyuan hi, this PR is ready and has passed all CI tests. Would it be convenient for you to take a look?
### What this PR does / why we need it?
Referring to vllm-project/vllm#38460 and vllm-project/vllm#38915: CANN 8.5.0+ uses `aclrtMemcpyBatchAsync`, while older CANN versions use `aclrtMemcpyAsync` for KV cache offloading.
The build automatically compiles and selects the appropriate transfer function based on the CANN environment, and manual selection of the transfer function is also supported.
Manual selection:

1. Batch memcpy (requires CANN ≥ 8.5): `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=1`, then `pip install -e .`
2. Normal memcpy: `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=0`, then `pip install -e .`
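A minimal sketch of that build-time decision (the function name and version-probe interface are hypothetical; only the `VLLM_ASCEND_ENABLE_BATCH_MEMCPY` variable and the CANN ≥ 8.5 threshold come from the PR description):

```python
import os

def enable_batch_memcpy(detected_cann_version: tuple) -> bool:
    # Manual override via the env var wins; otherwise compile the
    # aclrtMemcpyBatchAsync path only when CANN is 8.5 or newer.
    flag = os.environ.get("VLLM_ASCEND_ENABLE_BATCH_MEMCPY")
    if flag is not None:
        return flag == "1"
    return detected_cann_version >= (8, 5)
```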
### How was this patch tested?
Test results:

| | TTFT | TPOT |
| --- | --- | --- |
| main | 307 ms | 49.96 ms |
| this PR | 272.82 ms | 41.04 ms |
model script:

```bash
export TP=1
export MODEL_PATH=/nas/disk1/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10113
export CUDA_VISIBLE_DEVICES=3
export ASCEND_RT_VISIBLE_DEVICES=3
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} \
    --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} \
    --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7 \
    --no-enable-prefix-caching --max-model-len 32768 --trust-remote-code \
    --block-size 128 \
    --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'
```
test script:

```bash
export MODEL_NAME=/nas/disk1/Qwen3-14B
python /model/xk/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --url http://127.0.0.1:10113 --model $MODEL_NAME --served-model-name Qwen3-14B \
    --seed 1234 --input-file /model/xk/vllm/benchmarks/multi_turn/generate_multi_turn.json \
    --num-clients 8 --max-active-conversations 24
```