[offloader] v2: Hide weight onloading latency via prefetching #29941
vllm-bot merged 45 commits into vllm-project:main
Conversation
Signed-off-by: Ming Yang <minos.future@gmail.com>
IIUC this is mostly an RL feature, correct? Maybe @youkaichao can take a look?
Why are TPOT and ITL 0 in the benchmark result? EDIT: Ah, I see, you probably set the number of output tokens to 1.
Mostly for fitting the model onto fewer GPUs at the moment, but it could be useful for weight updating in RL.
mgoin left a comment:
LGTM, thanks for the changes. I think it's obvious that the get_offloader().sync_prev_onload() and get_offloader().join_after_forward() insertions in the model runner are fragile and prone to breaking in the future, but I'm not sure how to structure this better. Maybe @benchislett or @LucasWilkinson have strong opinions against this, but we do need this performant feature to land sooner or later.
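For context, here is a minimal sketch of where those two hooks sit relative to the forward pass, assuming a heavily simplified runner loop. The hook names come from the comment above; everything else (function name, arguments, the way the offloader is passed in) is illustrative, not vLLM's actual runner code:

```python
from typing import Any


def execute_model(model: Any, model_input: dict, offloader: Any) -> Any:
    # Wait for the weight onload kicked off during the previous step, so the
    # forward pass never reads half-copied parameters.
    offloader.sync_prev_onload()

    output = model(**model_input)

    # Hand control back to the offloader once the forward pass is issued,
    # e.g. to queue offloads or start prefetching for the next step.
    offloader.join_after_forward()
    return output
```

The fragility mgoin mentions is that nothing enforces this bracketing: any new code path in the runner that calls the model without the surrounding hooks would silently break the offloader's assumptions.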
Set _instance default to NoopOffloader() so get_offloader() always returns a valid instance. Log the offloader type in set_offloader() for visibility into which backend is active. Signed-off-by: Ming Yang <minos.future@gmail.com>
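A minimal sketch of the pattern this commit describes. The names NoopOffloader, get_offloader, and set_offloader come from the commit message itself; the method set and module layout are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


class NoopOffloader:
    """Safe default: every hook is a no-op when offloading is disabled."""

    def sync_prev_onload(self) -> None:
        pass

    def join_after_forward(self) -> None:
        pass


_instance = NoopOffloader()


def get_offloader():
    # Always returns a valid instance, never None, so callers need no guards.
    return _instance


def set_offloader(offloader) -> None:
    global _instance
    _instance = offloader
    # Log which offloader backend is active, for visibility.
    logger.info("Offloader set to %s", type(offloader).__name__)
```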
…roject#29941) Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
Purpose
This PR adds a CPU weight offloader that hides weight onloading latency by prefetching weights, avoiding the performance cost of zero-copy UVA access. The technique was first developed in SGLang for GB200 (https://lmsys.org/blog/2025-09-25-gb200-part-2/) and is adapted in this PR to work with torch.compile and CUDA graphs in vLLM.
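For readers unfamiliar with the technique, here is a self-contained sketch of the core idea (double-buffered device weights, a side copy stream, and CUDA events), not the code from this PR. It assumes all layer weights share a shape and that each layer is a callable taking the activations and its weight buffer:

```python
import torch

copy_stream = torch.cuda.Stream()


def prefetch(cpu_weight: torch.Tensor, gpu_buf: torch.Tensor) -> torch.cuda.Event:
    """Start an async host-to-device copy; return an event marking completion."""
    event = torch.cuda.Event()
    # Don't overwrite the buffer until compute already enqueued on the
    # default stream (which may still be reading it) has finished.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        gpu_buf.copy_(cpu_weight, non_blocking=True)  # cpu_weight must be pinned
        event.record()
    return event


def run_layers(layers, cpu_weights, x):
    # Two device-side buffers reused alternately (double buffering).
    bufs = [torch.empty_like(cpu_weights[0], device="cuda") for _ in range(2)]
    event = prefetch(cpu_weights[0], bufs[0])
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # Overlap the next layer's H2D copy with this layer's compute.
            next_event = prefetch(cpu_weights[i + 1], bufs[(i + 1) % 2])
        # Compute must not start before this layer's weights are resident.
        torch.cuda.current_stream().wait_event(event)
        x = layer(x, bufs[i % 2])
        if i + 1 < len(layers):
            event = next_event
    return x
```

When the copy finishes before the previous layer's compute does, the onload latency is fully hidden; the grouping and prefetch-step flags below control how far ahead the copies run.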
It also refactors the offloading code to be extensible.
Demonstrated in the profiler trace attached to the original PR.
Test Plan
Example serving recipe for GB200:
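The recipe attachment is not reproduced here. An invocation along these lines, using the model and offload flags quoted elsewhere in this PR (any additional flags, such as tensor-parallel size, are omitted rather than guessed), would look like:

```bash
vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 \
  --offload-group-size 2 \
  --offload-num-in-group 1 \
  --offload-prefetch-step 1
```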
Test Result
| Configuration | QPS per GPU |
| --- | --- |
| this PR, with `--offload-group-size 2 --offload-num-in-group 1 --offload-prefetch-step 1` | 18.04 (2 GPUs) |
| without `--offload-group-size 2 --offload-num-in-group 1 --offload-prefetch-step 1` | 16.9 (4 GPUs) |

Accuracy:
local-completions (model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=32), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1