[ROCm] Enable SimpleCPUOffloadConnector on ROCm #40549
hongxiayang wants to merge 7 commits into vllm-project:main
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Code Review
This pull request adds ROCm support to the simple KV offload system by implementing hipMemcpyBatchAsync and a fallback to per-op hipMemcpyAsync. It also includes logic to handle ROCm 7.2 stubs by clearing sticky errors and disabling the batch API if it returns NotSupported. Review feedback identifies potential crashes during the lazy resolution of memory functions if libraries or symbols are missing, and suggests adding null checks before calling the resolved batch memcpy function.
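The fallback behavior summarized above can be sketched without any GPU. This is a minimal illustration under stated assumptions, not the PR's implementation: `CopyEngine`, the `SUCCESS` and `NOT_SUPPORTED` constants, and the injected callables are hypothetical stand-ins for the lazily resolved batch and per-op memcpy entry points. It shows the two points raised in the review: check that a resolved function is not `None` before calling it, and disable the batch path for good once it reports "not supported".

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical status codes for this sketch; real HIP error values differ.
SUCCESS = 0
NOT_SUPPORTED = 1

CopyOp = Tuple[int, int, int]  # (src, dst, nbytes), purely illustrative


class CopyEngine:
    """Try one batched copy first, then fall back to one copy per op."""

    def __init__(
        self,
        batch_copy: Optional[Callable[[Sequence[CopyOp]], int]],
        single_copy: Optional[Callable[[CopyOp], int]],
        clear_last_error: Callable[[], None] = lambda: None,
    ) -> None:
        # Either callable may be None if symbol lookup failed at load time.
        self.batch_copy = batch_copy
        self.single_copy = single_copy
        self.clear_last_error = clear_last_error
        self.batch_enabled = batch_copy is not None

    def copy(self, ops: Sequence[CopyOp]) -> str:
        """Run all ops and return which path was used."""
        if self.batch_enabled and self.batch_copy is not None:
            status = self.batch_copy(ops)
            if status == SUCCESS:
                return "batch"
            if status != NOT_SUPPORTED:
                raise RuntimeError(f"batch copy failed: {status}")
            # The batch entry point is a stub: clear the sticky error so
            # later calls are not poisoned, and never try the batch path again.
            self.clear_last_error()
            self.batch_enabled = False

        # Null check before using the per-op function.
        if self.single_copy is None:
            raise RuntimeError("no usable memcpy function was resolved")
        for op in ops:
            status = self.single_copy(op)
            if status != SUCCESS:
                raise RuntimeError(f"copy failed: {status}")
        return "per-op"


if __name__ == "__main__":
    # Simulate a runtime whose batch API exists but is only a stub.
    engine = CopyEngine(
        batch_copy=lambda ops: NOT_SUPPORTED,
        single_copy=lambda op: SUCCESS,
    )
    print(engine.copy([(0, 1, 4096)]), engine.batch_enabled)
```

Injecting the callables keeps the control flow testable on a machine without ROCm; in a real binding they would wrap the resolved library symbols.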
…ipErrorNotSupported Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Hi @hongxiayang, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
lm-eval result using cpu-offload:
baseline without using cpu-offload:
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
This review comment is obsolete.
hi Hongxia - taking a look, ty.
@tjtanaa @DarkLight1337: Can you help merge this PR? Thanks! Please let me know if you need anything else.
Thanks for the comment and verification. We will watch and follow up with those updates.
hi @hongxiayang can u help get this merged? this is a blocker for agentx that @andyluo7 is working with us on
raise RuntimeError(f"cudaHostRegister failed: {err}")
# NOTE: ``CUmemcpyAttributes`` and ``hipMemcpyAttributes`` share the same
We don't need to specify this: CUmemcpyAttributes and hipMemcpyAttributes are known to be compatible as long as we don't add a custom code path. We don't need to be this verbose.
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Purpose
Fix #40397
Enable SimpleCPUOffloadConnector on the ROCm backend. Also enable the related tests on ROCm.
Test Plan
Use the upstream nightly Docker image as the base (ROCm v7.2.1) and build vLLM from source.
Tested on MI350
(1) unit test:
(2) integration test:
(3) model serve and lm-eval
Test Result
(1) unit test: pass
(2) integration test: by default, skipped.
(3) model test using the command from the associated issue
server started without exception:
(4) lm-eval on the above model config: