[ROCm] Enable SimpleCPUOffloadConnector on ROCm by hongxiayang · Pull Request #40549 · vllm-project/vllm

hongxiayang · 2026-04-21T20:00:46Z

Purpose

Enable SimpleCPUOffloadConnector on ROCm backend.
Also enabled the related tests on ROCm.

Test Plan

Using the upstream nightly docker as the base (where rocm is v7.2.1) and build vllm from source.
Tested on MI350

(1) unit test:

vllm/tests/v1/simple_kv_offload# pytest-

(2) integration test:

vllm/tests/v1/simple_kv_offload# pytest test_integration.py

(3) model serve and lm-eval

Test Result

(1) unit test: pass

root@xxx:/app/vllm/tests/v1/simple_kv_offload# pytest test_scheduler.py                                                    
================================================================================================================== test session starts ===================================================================================================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /app/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 12 items                                                                                                                                                                                                                                       

test_scheduler.py ............                                                                                                                                                                                                                     [100%]

==================================================================================================================== warnings summary ====================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

tests/v1/simple_kv_offload/test_scheduler.py: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================================ 12 passed, 16 warnings in 12.12s ============================================================================================================

(2) integration test: by default, skipped.

pytest test_integration.py                                                               
======================================================================================================================== test session starts ========================================================================================================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /app/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 8 items                                                                                                                                                                                                                                                   

test_integration.py ssssssss                                                                                                                                                                                                                                  [100%]

========================================================================================================================= warnings summary ==========================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================================================== 8 skipped, 2 warnings in 0.04s ===================================================================================================================

(3) model test using the command from the associated issue

export VLLM_USE_SIMPLE_KV_OFFLOAD=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=0

vllm serve amd/DeepSeek-R1-0528-MXFP4-Preview --host 0.0.0.0 --port 8889  --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --kv_offloading_backend native --kv_offloading_size 600 --no-disable-hybrid-kv-cache-manager

server started without exception:

...
(APIServer pid=28377) INFO:     Started server process [28377]
(APIServer pid=28377) INFO:     Waiting for application startup.
(APIServer pid=28377) INFO:     Application startup complete.

(4) lm-eval on the above model config:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.955|±  |0.0147|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

gemini-code-assist

Code Review

This pull request adds ROCm support to the simple KV offload system by implementing hipMemcpyBatchAsync and a fallback to per-op hipMemcpyAsync. It also includes logic to handle ROCm 7.2 stubs by clearing sticky errors and disabling the batch API if it returns NotSupported. Review feedback identifies potential crashes during the lazy resolution of memory functions if libraries or symbols are missing, and suggests adding null checks before calling the resolved batch memcpy function.

…ipErrorNotSupported Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify · 2026-04-21T22:48:10Z

Hi @hongxiayang, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

hongxiayang · 2026-04-22T15:30:04Z

lm-eval result using cpu-offload:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.955|±  |0.0147|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|

baseline without using cpu-offload:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.945|±  |0.0162|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

hongxiayang · 2026-04-23T12:43:19Z

Code Review

This pull request adds ROCm support to the simple KV offload system by implementing hipMemcpyBatchAsync and a fallback to per-op hipMemcpyAsync. It also includes logic to handle ROCm 7.2 stubs by clearing sticky errors and disabling the batch API if it returns NotSupported. Review feedback identifies potential crashes during the lazy resolution of memory functions if libraries or symbols are missing, and suggests adding null checks before calling the resolved batch memcpy function.

this review comment is obsolete.

kenroche · 2026-04-24T17:58:15Z

hi Hongxia - taking a look, ty.

mawong-amd

Worth noting that #41386 bumps the ROCm base image to 7.2.2, so after it is merged we may be able to use a unified num_attrs > 0 invocation. This would be relevant for say #39306.

Also worth noting that #41474 has increased relevance on ROCm after this PR is merged.

hongxiayang · 2026-05-04T13:56:20Z

@tjtanaa @DarkLight1337 : Can you help on merging this PR? thanks! please let me know if you need anything else.

hongxiayang · 2026-05-04T13:57:11Z

Worth noting that #41386 bumps the ROCm base image to 7.2.2, so after it is merged we may be able to use a unified num_attrs > 0 invocation. This would be relevant for say #39306.

Also worth noting that #41474 has increased relevance on ROCm after this PR is merged.

thanks for the comment and verification. We will watch and follow up with those updates

functionstackx · 2026-05-05T01:19:07Z

hi @hongxiayang can u help get this merged? this is blocker to agentx that @andyluo7 is working with us on

tjtanaa

LGTM.

tjtanaa · 2026-04-23T13:26:28Z

        raise RuntimeError(f"cudaHostRegister failed: {err}")


+# NOTE: ``CUmemcpyAttributes`` and ``hipMemcpyAttributes`` share the same


We don't need to specify this as it is known that CUmemcpyAttributes and hipMemcpyAttributes are compatible if we don't specify custom code path. We don't need to be this verbose.

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

[ROCm] Enable SimpleCPUOffloadConnector on ROCm

3413e73

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

mergify Bot added nvidia rocm Related to AMD ROCm v1 labels Apr 21, 2026

github-project-automation Bot added this to NVIDIA and AMD Apr 21, 2026

github-project-automation Bot moved this to Todo in AMD Apr 21, 2026

gemini-code-assist Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated

hongxiayang added 3 commits April 21, 2026 20:56

set numAttrs == 0 to let hipMemcpyBatchAsync work without returning h…

d484b33

…ipErrorNotSupported Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

further cleanup

4f47b8e

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

gemini feedback

1753878

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

hongxiayang marked this pull request as ready for review April 21, 2026 22:48

claude Bot reviewed Apr 21, 2026

View reviewed changes

hongxiayang requested review from tjtanaa April 21, 2026 22:53

hongxiayang mentioned this pull request Apr 21, 2026

[Feature]: Add ROCm support for simple offload connector #40397

Open

1 task

pre-commit

f63d614

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

hongxiayang moved this from Todo to In Progress in AMD Apr 22, 2026

hongxiayang requested a review from DarkLight1337 April 23, 2026 12:36

tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 1, 2026

mawong-amd approved these changes May 1, 2026

View reviewed changes

tjtanaa approved these changes May 5, 2026

View reviewed changes

github-project-automation Bot moved this to Ready in NVIDIA May 5, 2026

Merge branch 'main' into offload-cpu

21f875b

tjtanaa enabled auto-merge (squash) May 5, 2026 11:41

tjtanaa disabled auto-merge May 5, 2026 11:41

updated some comments also added 7.2.2 after verification

b5497fd

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

		raise RuntimeError(f"cudaHostRegister failed: {err}")


		# NOTE: ``CUmemcpyAttributes`` and ``hipMemcpyAttributes`` share the same

Uh oh!

Conversation

hongxiayang commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

mergify Bot commented Apr 21, 2026

Uh oh!

hongxiayang commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hongxiayang commented Apr 23, 2026

Code Review

Uh oh!

kenroche commented Apr 24, 2026

Uh oh!

mawong-amd left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hongxiayang commented May 4, 2026

Uh oh!

hongxiayang commented May 4, 2026

Uh oh!

functionstackx commented May 5, 2026

Uh oh!

tjtanaa left a comment

Choose a reason for hiding this comment

Uh oh!

tjtanaa Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

hongxiayang May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

hongxiayang commented Apr 21, 2026 •

edited

Loading

hongxiayang commented Apr 22, 2026 •

edited

Loading

mawong-amd left a comment •

edited

Loading