
[Perf] feat: (OffloaderV2, >5x thrpt↑) Support CUDA Graph and all weights offloading #24531

Open
xiaobao520123 wants to merge 1 commit into sgl-project:main from xiaobao520123:feature/offloader_v2_full

Conversation


xiaobao520123 commented May 6, 2026

Motivation

The current implementation of OffloaderV2 has the following issues and limitations:

  1. DeepSeekV2 and MoE weights only. This is because models must explicitly pass parameter names to whitelist_param_names_creator to select which weights to offload.
  2. CUDA Graph and torch.compile are either not supported or not correctly supported.
  3. (OOM!) Because the current offloader requires disabling CUDA Graph (and piecewise CUDA Graph), torch profiling accumulates thousands of CUDA kernels and PyTorch ops, and the process often gets killed during benchmarks.

To resolve these issues, I added a helper function that finds the names of all parameters to offload; the enhanced naming function lets more types of models be offloaded, such as Dense models (a rough sketch of the idea follows). Moreover, inspired by [offloader] v2: Hide weight onloading latency via prefetching #29941, I implemented an N-buffer pool plus a cross-event synchronization mechanism to support CUDA Graph and torch.compile.
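
As a rough sketch of the helper idea (the function name here is made up for illustration, not the PR's actual code), it amounts to walking named_parameters() so nothing has to be whitelisted by hand:

from torch import nn

def collect_offloadable_param_names(model: nn.Module) -> list[str]:
    # Hypothetical helper: enumerate every parameter name, dots included,
    # so Dense models (not only MoE expert weights) can be offloaded.
    return [name for name, _ in model.named_parameters()]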

Briefly speaking, this PR contributes:

  1. >5x token throughput increase compared with the baseline.
  2. Support offloading more types of models (Dense, MoE, ...) and support all-weight offloading.
    • After this PR, users can enable all-weight offloading by simply setting offloader_kwargs=ALL_MODEL_PARAMS when loading models (a hedged usage sketch follows this list).
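
For illustration only: ALL_MODEL_PARAMS comes from this PR, but the import path and loader call below are assumptions, not the actual sglang API.

# Hypothetical usage sketch; the import path and load_model() are assumed.
from sglang.srt.offloader import ALL_MODEL_PARAMS  # path assumed

model = load_model(                     # hypothetical loader entry point
    "Qwen/Qwen3-8B",
    offloader_kwargs=ALL_MODEL_PARAMS,  # offload every model weight
)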

Modifications

  1. Allow dots (.) in parameter names to support models other than MoE.
  2. Add ALL_MODEL_PARAMS to support full model weights offloading.
  3. Add a unified offload memory pool to support CUDA Graph and allow graphs to be replayed. It also saves memory by allocating only as many buffers as prefetching needs.
  4. Add cross-event synchronization to support CUDA Graph capture and replay, and support torch.compile for OffloaderV2 (see the sketch after this list).
  5. Update docs.
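
A minimal sketch of the buffer-pool-plus-events mechanism, assuming plain PyTorch (this illustrates the idea, not the PR's actual classes). Because the buffers are allocated once, their device addresses stay fixed, so a captured CUDA Graph keeps pointing at valid memory on every replay; only the data in the buffers changes between replays.

import torch

class OffloadBufferPool:
    # Hypothetical N-buffer pool: H2D copies run on a side stream, and
    # CUDA events hand each buffer over to the compute stream.
    def __init__(self, num_buffers: int, numel: int, dtype=torch.bfloat16):
        self.buffers = [
            torch.empty(numel, dtype=dtype, device="cuda")
            for _ in range(num_buffers)
        ]
        self.events = [torch.cuda.Event() for _ in range(num_buffers)]
        self.copy_stream = torch.cuda.Stream()
        self.slot = 0

    def prefetch(self, cpu_weight: torch.Tensor) -> int:
        # Round-robin over the fixed-address buffers.
        slot, self.slot = self.slot, (self.slot + 1) % len(self.buffers)
        with torch.cuda.stream(self.copy_stream):
            # Async H2D copy; cpu_weight should live in pinned memory.
            self.buffers[slot].copy_(cpu_weight.reshape(-1), non_blocking=True)
            self.events[slot].record(self.copy_stream)
        return slot

    def wait(self, slot: int) -> torch.Tensor:
        # Make the compute stream wait for the copy without blocking the
        # host; call this before launching (or replaying) kernels that
        # read buffers[slot].
        torch.cuda.current_stream().wait_event(self.events[slot])
        return self.buffers[slot]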

Accuracy Tests

Serve command:

  • Baseline:
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --enable-metrics
  • With offloading (25%):
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 1
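
Reading the flags (an inference from the "(25%)" label above, not documented semantics): the offloaded fraction appears to be num_in_group / group_size, so --offload-group-size 4 with --offload-num-in-group 1 offloads 1 / 4 = 25% of the weights.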

Test command:

python -m sglang.test.run_eval --eval-name gsm8k --num-examples 200 --port 8000

Result:

Configuration     Score  Latency (s)  Throughput (tok/s)
Baseline          0.930  55.71        2707.19
With offloading   0.935  358.26       422.47

Speed Tests and Profiling

Test environment

Hardware

  • GPU: 1× NVIDIA A100-PCIE-40GB (PCIe Gen4 ×16, ~25 GB/s effective)
  • CPU: 2× Intel Xeon Gold 5318Y @ 2.10GHz (24c/48t each, 96 logical CPUs total)
  • RAM: 188 GiB (NUMA node0 = 64 GB, node1 = 128 GB)
  • NUMA topology: 2 nodes (cross-node penalty: 20 vs local 10)

Software

  • OS: Ubuntu 22.04.5 LTS, kernel 6.12.0
  • CUDA toolkit: 13.0 (driver 595.58.03)
  • Python: 3.11.15
  • PyTorch: 2.11.0+cu130
  • sglang: 0.5.10.post2.dev974+ge5df44d5f

Offload Configurations

  • Model: deepseek-ai/DeepSeek-V2-Lite (MoE, 27 layers, 1 Dense, 26 MoE)
  • Offload Mode: cpu
  • --offload-group-size 4 --offload-num-in-group 1 --offload-prefetch-step 2

Baseline

(NOTICE) CUDA Graph, as well as piecewise CUDA Graph, is disabled here because the existing OffloaderV2 doesn't support them.

python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --host 0.0.0.0 --port 8000 \
    --tp 1 \
    --context-length 4096 \
    --chunked-prefill-size 2048 \
    --piecewise-cuda-graph-max-tokens 2048 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 2 \
    --offload-mode cpu \
    --disable-cuda-graph \
    --disable-piecewise-cuda-graph

This PR

python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --context-length 4096 \
    --chunked-prefill-size 2048 \
    --piecewise-cuda-graph-max-tokens 2048 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 2 \
    --offload-mode cpu

Benchmark

python -m sglang.bench_serving \
    --backend sglang \
    --port 8000 \
    --num-prompts 16 \
    --max-concurrency {1, 2, 4, 8, 16} \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --random-range-ratio 1.0 \
    --profile \
    --profile-output-dir [PLACEHOLDER] \
    --output-file [PLACEHOLDER]

Evaluation

[benchmark results figure]
  • Token throughput increases >5x.
  • Negligible impact on TTFT and TPOT.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions bot added the documentation, npu, and piecewise-cuda-graph labels May 6, 2026
@xiaobao520123
Author

xiaobao520123 commented May 6, 2026

@fzyzcjy, @Hide-on-bushsh, @ping1jing2 Please take a look, thanks!

xiaobao520123 force-pushed the feature/offloader_v2_full branch 3 times, most recently from 1ddd97b to 5989c69 on May 7, 2026 06:33
xiaobao520123 force-pushed the feature/offloader_v2_full branch from 5989c69 to 14d74b5 on May 7, 2026 07:41