Skip to content

[ROCm] Enable SimpleCPUOffloadConnector on ROCm#40549

Open
hongxiayang wants to merge 7 commits intovllm-project:mainfrom
hongxiayang:offload-cpu
Open

[ROCm] Enable SimpleCPUOffloadConnector on ROCm#40549
hongxiayang wants to merge 7 commits intovllm-project:mainfrom
hongxiayang:offload-cpu

Conversation

@hongxiayang
Copy link
Copy Markdown
Collaborator

@hongxiayang hongxiayang commented Apr 21, 2026

Purpose

Fix #40397

Enable SimpleCPUOffloadConnector on ROCm backend.
Also enabled the related tests on ROCm.

Test Plan

Using the upstream nightly docker as the base (where rocm is v7.2.1) and build vllm from source.
Tested on MI350

(1) unit test:

vllm/tests/v1/simple_kv_offload# pytest-

(2) integration test:

vllm/tests/v1/simple_kv_offload# pytest test_integration.py

(3) model serve and lm-eval

Test Result

(1) unit test: pass

root@xxx:/app/vllm/tests/v1/simple_kv_offload# pytest test_scheduler.py                                                    
================================================================================================================== test session starts ===================================================================================================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /app/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 12 items                                                                                                                                                                                                                                       

test_scheduler.py ............                                                                                                                                                                                                                     [100%]

==================================================================================================================== warnings summary ====================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

tests/v1/simple_kv_offload/test_scheduler.py: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================================ 12 passed, 16 warnings in 12.12s ============================================================================================================

(2) integration test: by default, skipped.

pytest test_integration.py                                                               
======================================================================================================================== test session starts ========================================================================================================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /app/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 8 items                                                                                                                                                                                                                                                   

test_integration.py ssssssss                                                                                                                                                                                                                                  [100%]

========================================================================================================================= warnings summary ==========================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================================================== 8 skipped, 2 warnings in 0.04s ===================================================================================================================

(3) model test using the command from the associated issue

export VLLM_USE_SIMPLE_KV_OFFLOAD=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=0

vllm serve amd/DeepSeek-R1-0528-MXFP4-Preview --host 0.0.0.0 --port 8889  --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --kv_offloading_backend native --kv_offloading_size 600 --no-disable-hybrid-kv-cache-manager

server started without exception:

...
(APIServer pid=28377) INFO:     Started server process [28377]
(APIServer pid=28377) INFO:     Waiting for application startup.
(APIServer pid=28377) INFO:     Application startup complete.

(4) lm-eval on the above model config:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.955|±  |0.0147|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
@mergify mergify Bot added nvidia rocm Related to AMD ROCm v1 labels Apr 21, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 21, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds ROCm support to the simple KV offload system by implementing hipMemcpyBatchAsync and a fallback to per-op hipMemcpyAsync. It also includes logic to handle ROCm 7.2 stubs by clearing sticky errors and disabling the batch API if it returns NotSupported. Review feedback identifies potential crashes during the lazy resolution of memory functions if libraries or symbols are missing, and suggests adding null checks before calling the resolved batch memcpy function.

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py
Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated
…ipErrorNotSupported

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
@hongxiayang hongxiayang marked this pull request as ready for review April 21, 2026 22:48
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 21, 2026

Hi @hongxiayang, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@hongxiayang
Copy link
Copy Markdown
Collaborator Author

hongxiayang commented Apr 22, 2026

lm-eval result using cpu-offload:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.955|±  |0.0147|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|

baseline without using cpu-offload:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.945|±  |0.0162|
|     |       |strict-match    |     5|exact_match|↑  |0.940|±  |0.0168|

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
@hongxiayang hongxiayang moved this from Todo to In Progress in AMD Apr 22, 2026
@hongxiayang
Copy link
Copy Markdown
Collaborator Author

Code Review

This pull request adds ROCm support to the simple KV offload system by implementing hipMemcpyBatchAsync and a fallback to per-op hipMemcpyAsync. It also includes logic to handle ROCm 7.2 stubs by clearing sticky errors and disabling the batch API if it returns NotSupported. Review feedback identifies potential crashes during the lazy resolution of memory functions if libraries or symbols are missing, and suggests adding null checks before calling the resolved batch memcpy function.

this review comment is obsolete.

@kenroche
Copy link
Copy Markdown
Collaborator

hi Hongxia - taking a look, ty.

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 1, 2026
Copy link
Copy Markdown
Contributor

@mawong-amd mawong-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that #41386 bumps the ROCm base image to 7.2.2, so after it is merged we may be able to use a unified num_attrs > 0 invocation. This would be relevant for say #39306.

Also worth noting that #41474 has increased relevance on ROCm after this PR is merged.

@hongxiayang
Copy link
Copy Markdown
Collaborator Author

@tjtanaa @DarkLight1337 : Can you help on merging this PR? thanks! please let me know if you need anything else.

@hongxiayang
Copy link
Copy Markdown
Collaborator Author

Worth noting that #41386 bumps the ROCm base image to 7.2.2, so after it is merged we may be able to use a unified num_attrs > 0 invocation. This would be relevant for say #39306.

Also worth noting that #41474 has increased relevance on ROCm after this PR is merged.

thanks for the comment and verification. We will watch and follow up with those updates

@functionstackx
Copy link
Copy Markdown

hi @hongxiayang can u help get this merged? this is blocker to agentx that @andyluo7 is working with us on

Copy link
Copy Markdown
Collaborator

@tjtanaa tjtanaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

raise RuntimeError(f"cudaHostRegister failed: {err}")


# NOTE: ``CUmemcpyAttributes`` and ``hipMemcpyAttributes`` share the same
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to specify this as it is known that CUmemcpyAttributes and hipMemcpyAttributes are compatible if we don't specify custom code path. We don't need to be this verbose.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 5, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) May 5, 2026 11:41
@tjtanaa tjtanaa disabled auto-merge May 5, 2026 11:41
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: In Progress
Status: Ready

Development

Successfully merging this pull request may close these issues.

[Feature]: Add ROCm support for simple offload connector

5 participants