
[Perf] feat: (OffloaderV2, >5x thrpt↑) Support CUDA Graph and all weights offloading #24531

Open
xiaobao520123 wants to merge 1 commit into sgl-project:main from xiaobao520123:feature/offloader_v2_full

Conversation


xiaobao520123 commented May 6, 2026

Motivation

The current implementation of OffloaderV2 has the following issues and limitations:

  1. DeepSeekV2 and MoE weights only. This is because models must explicitly pass parameter names to whitelist_param_names_creator to select which weights to offload.
  2. CUDA Graph and torch.compile are either not supported or not correctly supported.
  3. (OOM!) Because the current offloader requires disabling CUDA Graph (and piecewise CUDA Graph), torch profiling accumulates thousands of CUDA kernels and PyTorch ops, and the process often gets killed during benchmarks.

To resolve these issues, I added a helper function that finds the names of all parameters to offload; the enhanced naming function lets more types of models be offloaded, such as Dense models (a rough sketch of the idea follows). Moreover, inspired by [offloader] v2: Hide weight onloading latency via prefetching #29941, I implemented an N-buffer pool plus a cross-event synchronization mechanism to support CUDA Graph and torch.compile.
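
As a rough sketch of the helper idea (the function name here is made up for illustration, not the PR's actual code), it amounts to walking named_parameters() so nothing has to be whitelisted by hand:

from torch import nn

def collect_offloadable_param_names(model: nn.Module) -> list[str]:
    # Hypothetical helper: enumerate every parameter name, dots included,
    # so Dense models (not only MoE expert weights) can be offloaded.
    return [name for name, _ in model.named_parameters()]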

Briefly speaking, this PR contributes:

  1. >5x token throughput increase compared with the baseline.
  2. Support offloading more types of models (Dense, MoE, ...) and support all-weight offloading.
    • After this PR, users can enable all-weight offloading by simply setting offloader_kwargs=ALL_MODEL_PARAMS when loading models (a hedged usage sketch follows this list).
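
For illustration only: ALL_MODEL_PARAMS comes from this PR, but the import path and loader call below are assumptions, not the actual sglang API.

# Hypothetical usage sketch; the import path and load_model() are assumed.
from sglang.srt.offloader import ALL_MODEL_PARAMS  # path assumed

model = load_model(                     # hypothetical loader entry point
    "Qwen/Qwen3-8B",
    offloader_kwargs=ALL_MODEL_PARAMS,  # offload every model weight
)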

Modifications

  1. Allow dots (.) in parameter names to support models other than MoE.
  2. Add ALL_MODEL_PARAMS to support full model weights offloading.
  3. Add a unified offload memory pool to support CUDA Graph and allow graphs to be replayed. It also saves memory by allocating only as many buffers as prefetching needs.
  4. Add cross-event synchronization to support CUDA Graph capture and replay, and support torch.compile for OffloaderV2 (see the sketch after this list).
  5. Update docs.
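
A minimal sketch of the buffer-pool-plus-events mechanism, assuming plain PyTorch (this illustrates the idea, not the PR's actual classes). Because the buffers are allocated once, their device addresses stay fixed, so a captured CUDA Graph keeps pointing at valid memory on every replay; only the data in the buffers changes between replays.

import torch

class OffloadBufferPool:
    # Hypothetical N-buffer pool: H2D copies run on a side stream, and
    # CUDA events hand each buffer over to the compute stream.
    def __init__(self, num_buffers: int, numel: int, dtype=torch.bfloat16):
        self.buffers = [
            torch.empty(numel, dtype=dtype, device="cuda")
            for _ in range(num_buffers)
        ]
        self.events = [torch.cuda.Event() for _ in range(num_buffers)]
        self.copy_stream = torch.cuda.Stream()
        self.slot = 0

    def prefetch(self, cpu_weight: torch.Tensor) -> int:
        # Round-robin over the fixed-address buffers.
        slot, self.slot = self.slot, (self.slot + 1) % len(self.buffers)
        with torch.cuda.stream(self.copy_stream):
            # Async H2D copy; cpu_weight should live in pinned memory.
            self.buffers[slot].copy_(cpu_weight.reshape(-1), non_blocking=True)
            self.events[slot].record(self.copy_stream)
        return slot

    def wait(self, slot: int) -> torch.Tensor:
        # Make the compute stream wait for the copy without blocking the
        # host; call this before launching (or replaying) kernels that
        # read buffers[slot].
        torch.cuda.current_stream().wait_event(self.events[slot])
        return self.buffers[slot]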

Accuracy Tests

Serve command:

  • Baseline:
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --enable-metrics
  • With offloading (25%):
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 1
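
Reading the flags (an inference from the "(25%)" label above, not documented semantics): the offloaded fraction appears to be num_in_group / group_size, so --offload-group-size 4 with --offload-num-in-group 1 offloads 1 / 4 = 25% of the weights.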

Test command:

python -m sglang.test.run_eval --eval-name gsm8k --num-examples 200 --port 8000

Result:

Configuration     Score  Latency (s)  Throughput (tok/s)
Baseline          0.930  55.71        2707.19
With offloading   0.935  358.26       422.47

Speed Tests and Profiling

Test environment

Hardware

  • GPU: 1× NVIDIA A100-PCIE-40GB (PCIe Gen4 ×16, ~25 GB/s effective)
  • CPU: 2× Intel Xeon Gold 5318Y @ 2.10GHz (24c/48t each, 96 logical CPUs total)
  • RAM: 188 GiB (NUMA node0 = 64 GB, node1 = 128 GB)
  • NUMA topology: 2 nodes (cross-node penalty: 20 vs local 10)

Software

  • OS: Ubuntu 22.04.5 LTS, kernel 6.12.0
  • CUDA toolkit: 13.0 (driver 595.58.03)
  • Python: 3.11.15
  • PyTorch: 2.11.0+cu130
  • sglang: 0.5.10.post2.dev974+ge5df44d5f

Offload Configurations

  • Model: deepseek-ai/DeepSeek-V2-Lite (MoE, 27 layers, 1 Dense, 26 MoE)
  • Offload Mode: cpu
  • --offload-group-size 4 --offload-num-in-group 1 --offload-prefetch-step 2

Baseline

(NOTICE) CUDA Graph, as well as piecewise CUDA Graph, is disabled here because the existing OffloaderV2 doesn't support them.

python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --host 0.0.0.0 --port 8000 \
    --tp 1 \
    --context-length 4096 \
    --chunked-prefill-size 2048 \
    --piecewise-cuda-graph-max-tokens 2048 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 2 \
    --offload-mode cpu \
    --disable-cuda-graph \
    --disable-piecewise-cuda-graph

This PR

python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 1 \
    --context-length 4096 \
    --chunked-prefill-size 2048 \
    --piecewise-cuda-graph-max-tokens 2048 \
    --enable-metrics \
    --offload-group-size 4 \
    --offload-num-in-group 1 \
    --offload-prefetch-step 2 \
    --offload-mode cpu

Benchmark

python -m sglang.bench_serving \
    --backend sglang \
    --port 8000 \
    --num-prompts 16 \
    --max-concurrency {1, 2, 4, 8, 16} \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --random-range-ratio 1.0 \
    --profile \
    --profile-output-dir [PLACEHOLDER] \
    --output-file [PLACEHOLDER]

Evaluation

[benchmark results figure]
  • Token throughput increases >5x.
  • Negligible impact on TTFT and TPOT.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions bot added the documentation, npu, and piecewise-cuda-graph labels May 6, 2026
@xiaobao520123
Author

xiaobao520123 commented May 6, 2026

@fzyzcjy, @Hide-on-bushsh, @ping1jing2 Please take a look, thanks!

xiaobao520123 force-pushed the feature/offloader_v2_full branch 3 times, most recently from 1ddd97b to 5989c69 on May 7, 2026 06:33
xiaobao520123 force-pushed the feature/offloader_v2_full branch from 5989c69 to 14d74b5 on May 7, 2026 07:41