add tuned config for qwen3.5 fp8, a8w8 blockscale gemm #2324

Open
zovonoir wants to merge 1 commit into `main` from `qwen3.5_gemm_a8w8_tuned`

Conversation

@zovonoir

Add tuned config for Qwen3.5-397B-A17B-FP8 CK blockscale GEMM

Motivation

According to the data on https://inferencex.semianalysis.com/, the throughput performance of Qwen3.5 FP8 is significantly worse than BF16. Here are our benchmark results (docker image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260316):

| DTYPE | CONC | IN | OUT | PROMPTS | Input Tput (tok/s) | Output Tput (tok/s) | Total Tput (tok/s) |
|-------|------|------|------|---------|--------------------|---------------------|---------------------|
| bf16 | 32 | 1024 | 1024 | 128 | 1244.21 | 1259.29 | 2503.49 |
| fp8 | 32 | 1024 | 1024 | 128 | 466.32 | 471.97 | 938.28 |
| bf16 | 32 | 8192 | 1024 | 128 | 8280.12 | 1025.66 | 9305.77 |
| fp8 | 32 | 8192 | 1024 | 128 | 3515.85 | 435.51 | 3951.35 |
| bf16 | 32 | 1024 | 1024 | 32 | 1330.38 | 1298.37 | 2628.75 |
| fp8 | 32 | 1024 | 1024 | 32 | 162.51 | 158.60 | 321.11 |
| bf16 | 32 | 8192 | 1024 | 32 | 8299.27 | 1009.76 | 9309.04 |
| fp8 | 32 | 8192 | 1024 | 32 | 1249.79 | 152.06 | 1401.85 |
| bf16 | 32 | 1024 | 1024 | 64 | 1269.08 | 1261.13 | 2530.21 |
| fp8 | 32 | 1024 | 1024 | 64 | 287.22 | 285.43 | 572.65 |
| bf16 | 32 | 8192 | 1024 | 64 | 8352.56 | 1022.88 | 9375.43 |
| fp8 | 32 | 8192 | 1024 | 64 | 2198.10 | 269.19 | 2467.29 |
| bf16 | 64 | 1024 | 1024 | 128 | 1967.29 | 1991.13 | 3958.43 |
| fp8 | 64 | 1024 | 1024 | 128 | 533.33 | 539.80 | 1073.13 |
| bf16 | 64 | 8192 | 1024 | 128 | 11605.85 | 1437.61 | 13043.47 |
| fp8 | 64 | 8192 | 1024 | 128 | 3870.51 | 479.44 | 4349.95 |
| bf16 | 64 | 1024 | 1024 | 32 | 1328.27 | 1296.31 | 2624.57 |
| fp8 | 64 | 1024 | 1024 | 32 | 161.48 | 157.59 | 319.07 |
| bf16 | 64 | 8192 | 1024 | 32 | 8259.75 | 1004.95 | 9264.70 |
| fp8 | 64 | 8192 | 1024 | 32 | 1264.76 | 153.88 | 1418.65 |
| bf16 | 64 | 1024 | 1024 | 64 | 2124.91 | 2111.61 | 4236.52 |
| fp8 | 64 | 1024 | 1024 | 64 | 311.12 | 309.17 | 620.28 |
| bf16 | 64 | 8192 | 1024 | 64 | 11684.25 | 1430.88 | 13115.13 |
| fp8 | 64 | 8192 | 1024 | 64 | 2355.39 | 288.45 | 2643.83 |

Root Cause

The root cause is that the FP8 model uses CK blockscale GEMM, which queries tuning configurations at runtime. Most GEMM shapes have no matching tuned config, so the query overhead is wasted and a large number of warning messages is printed. All of this time is counted by sglang, so the FP8 throughput in sglang's performance report falls far behind BF16, even though this does not reflect the actual compute performance. Below are the runtime logs from identical configurations (concurrency=32, prompts=32, input=1024, output=1024), comparing BF16 and FP8:
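To illustrate why per-launch config misses are expensive, here is a minimal Python sketch of a runtime shape lookup. The names and table layout are hypothetical, not aiter's actual API; the point is that every unmatched shape pays the miss path and emits a warning, which adds up when the query happens per GEMM launch during decode:

```python
import logging
from functools import lru_cache

logger = logging.getLogger("a8w8_blockscale_gemm")

# Hypothetical tuned-config table keyed by GEMM shape (M, N, K).
TUNED_CONFIGS = {
    (128, 4096, 1280): "a8w8_blockscale_256x16x128x256_..._intrawave_v1",
}

@lru_cache(maxsize=None)
def query_tuned_config(m: int, n: int, k: int):
    """Return the tuned kernel name for a shape, or None to use a fallback kernel."""
    cfg = TUNED_CONFIGS.get((m, n, k))
    if cfg is None:
        # Every missing shape hits this path; without caching it would run
        # (and log) on every GEMM launch during decode.
        logger.warning("no tuned config for (M=%d, N=%d, K=%d)", m, n, k)
    return cfg
```

Adding tuned entries for the shapes a model actually produces (this PR's approach) removes both the wasted lookup time and the warning spam.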

1. bf16

```
[2026-03-17 08:16:26 TP0] Decode batch, #running-req: 32, #full token: 52307, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1502.25, #queue-req: 0
[2026-03-17 08:16:27 TP0] Decode batch, #running-req: 32, #full token: 53587, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1507.70, #queue-req: 0
[2026-03-17 08:16:28 TP0] Decode batch, #running-req: 32, #full token: 54867, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1503.64, #queue-req: 0
[2026-03-17 08:16:28 TP0] Decode batch, #running-req: 30, #full token: 52574, full token usage: 0.01, mamba num: 60, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1493.49, #queue-req: 0
[2026-03-17 08:16:29 TP0] Decode batch, #running-req: 22, #full token: 39301, full token usage: 0.01, mamba num: 44, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 1184.83, #queue-req: 0
[2026-03-17 08:16:30 TP0] Decode batch, #running-req: 13, #full token: 23915, full token usage: 0.00, mamba num: 26, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 893.73, #queue-req: 0
[2026-03-17 08:16:31 TP0] Decode batch, #running-req: 9, #full token: 16936, full token usage: 0.00, mamba num: 18, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 675.51, #queue-req: 0
[2026-03-17 08:16:31 TP0] Decode batch, #running-req: 5, #full token: 9565, full token usage: 0.00, mamba num: 10, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 359.44, #queue-req: 0
#Input tokens: 29678
#Output tokens: 28964
#Input tokens: 4096
#Output tokens: 256

====== Offline Throughput Benchmark Result =======
Backend:                                 engine
Successful requests:                     32
Benchmark duration (s):                  22.31
Total input tokens:                      29678
Total generated tokens:                  28964
Last generation throughput (tok/s):      359.44
Request throughput (req/s):              1.43
Input token throughput (tok/s):          1330.38
Output token throughput (tok/s):         1298.37
Total token throughput (tok/s):          2628.75
==================================================
```
2. fp8

```
[2026-03-17 07:53:39 TP0] Decode batch, #running-req: 32, #full token: 51027, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1634.73, #queue-req: 0
[2026-03-17 07:53:40 TP0] Decode batch, #running-req: 32, #full token: 52307, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1640.24, #queue-req: 0
[2026-03-17 07:53:41 TP0] Decode batch, #running-req: 32, #full token: 53587, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1640.73, #queue-req: 0
[2026-03-17 07:53:41 TP0] Decode batch, #running-req: 32, #full token: 54867, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1638.80, #queue-req: 0
[2026-03-17 07:53:42 TP0] Decode batch, #running-req: 30, #full token: 52574, full token usage: 0.01, mamba num: 60, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1618.10, #queue-req: 0
[2026-03-17 07:53:43 TP0] Decode batch, #running-req: 22, #full token: 39301, full token usage: 0.01, mamba num: 44, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 1251.72, #queue-req: 0
[2026-03-17 07:53:44 TP0] Decode batch, #running-req: 13, #full token: 23915, full token usage: 0.00, mamba num: 26, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 902.71, #queue-req: 0
[2026-03-17 07:53:45 TP0] Decode batch, #running-req: 9, #full token: 16936, full token usage: 0.00, mamba num: 18, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 665.28, #queue-req: 0
[2026-03-17 07:53:45 TP0] Decode batch, #running-req: 5, #full token: 9565, full token usage: 0.00, mamba num: 10, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 346.18, #queue-req: 0
#Input tokens: 29678
#Output tokens: 28964
#Input tokens: 4096
#Output tokens: 256

====== Offline Throughput Benchmark Result =======
Backend:                                 engine
Successful requests:                     32
Benchmark duration (s):                  182.62
Total input tokens:                      29678
Total generated tokens:                  28964
Last generation throughput (tok/s):      346.18
Request throughput (req/s):              0.18
Input token throughput (tok/s):          162.51
Output token throughput (tok/s):         158.60
Total token throughput (tok/s):          321.11
==================================================
```

As shown above, while maintaining 32 concurrent requests, FP8 actually outperforms BF16 in generation throughput (1640 vs. 1503 tok/s). However, sglang's final performance report shows FP8 far behind BF16 (158 vs. 1298 tok/s) due to the overhead from missing tuned GEMM configs.
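The gap is a wall-clock effect: the reported output throughput is simply total generated tokens divided by benchmark duration, so the 182.62 s FP8 run (vs. 22.31 s for BF16) drags the reported number down even though per-step decode throughput is higher. A quick check against the two summaries above reproduces the reported figures to within rounding of the printed duration:

```python
# Reported output throughput = total generated tokens / benchmark duration,
# using the numbers from the two benchmark summaries above.
def output_tput(total_generated: int, duration_s: float) -> float:
    return total_generated / duration_s

bf16 = output_tput(28964, 22.31)    # reported as 1298.37 tok/s
fp8 = output_tput(28964, 182.62)    # reported as 158.60 tok/s
print(f"{bf16:.2f} {fp8:.2f}")
```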

Solution

This PR provides CK blockscale GEMM tuning configurations for the following common benchmark scenarios:

| CONCURRENT | PROMPTS | INPUT | OUTPUT |
|------------|---------|-------|--------|
| 32 | 32 | 1024 | 1024 |
| 32 | 64 | 1024 | 1024 |
| 32 | 128 | 1024 | 1024 |
| 64 | 32 | 1024 | 1024 |
| 64 | 64 | 1024 | 1024 |
| 64 | 128 | 1024 | 1024 |
| 32 | 32 | 8192 | 1024 |
| 32 | 64 | 8192 | 1024 |
| 32 | 128 | 8192 | 1024 |
| 64 | 32 | 8192 | 1024 |
| 64 | 64 | 8192 | 1024 |
| 64 | 128 | 8192 | 1024 |
| 32 | 32 | 1024 | 8192 |
| 32 | 64 | 1024 | 8192 |

Performance

Below are the FP8 benchmark results after applying the tuning configurations:

| CONC | IN | OUT | PROMPTS | Input Tput Not Tuned (tok/s) | Input Tput Tuned (tok/s) | Output Tput Not Tuned (tok/s) | Output Tput Tuned (tok/s) | Total Tput Not Tuned (tok/s) | Total Tput Tuned (tok/s) | Total Tput Diff (%) |
|------|------|------|---------|------|------|------|------|------|------|------|
| 32 | 1024 | 1024 | 32 | 162.51 | 1459.79 | 158.60 | 1424.67 | 321.11 | 2884.46 | +798.3% |
| 32 | 1024 | 1024 | 64 | 287.22 | 1391.59 | 285.43 | 1382.88 | 572.65 | 2774.46 | +384.5% |
| 32 | 1024 | 1024 | 128 | 466.32 | 1371.08 | 471.97 | 1387.70 | 938.28 | 2758.78 | +194.0% |
| 32 | 1024 | 8192 | 32 | 90.96 | 187.06 | 726.90 | 1494.93 | 817.86 | 1681.99 | +105.7% |
| 32 | 1024 | 8192 | 64 | 120.63 | 185.36 | 985.52 | 1514.33 | 1106.15 | 1699.68 | +53.7% |
| 32 | 8192 | 1024 | 32 | 1249.79 | 8728.92 | 152.06 | 1062.04 | 1401.85 | 9790.96 | +598.4% |
| 32 | 8192 | 1024 | 64 | 2198.10 | 9027.97 | 269.19 | 1105.59 | 2467.29 | 10133.55 | +310.7% |
| 32 | 8192 | 1024 | 128 | 3515.85 | 8848.01 | 435.51 | 1096.00 | 3951.35 | 9944.01 | +151.7% |
| 64 | 1024 | 1024 | 32 | 161.48 | 1460.76 | 157.59 | 1425.62 | 319.07 | 2886.39 | +804.6% |
| 64 | 1024 | 1024 | 64 | 311.12 | 2219.75 | 309.17 | 2205.86 | 620.28 | 4425.62 | +613.5% |
| 64 | 1024 | 1024 | 128 | 533.33 | 2022.49 | 539.80 | 2047.00 | 1073.13 | 4069.50 | +279.2% |
| 64 | 8192 | 1024 | 32 | 1264.76 | 8882.65 | 153.88 | 1080.74 | 1418.65 | 9963.40 | +602.3% |
| 64 | 8192 | 1024 | 64 | 2355.39 | 12042.77 | 288.45 | 1474.79 | 2643.83 | 13517.56 | +411.3% |
| 64 | 8192 | 1024 | 128 | 3870.51 | 11891.75 | 479.44 | 1473.03 | 4349.95 | 13364.78 | +207.2% |
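The Diff(%) column is the relative gain in total throughput, (tuned - untuned) / untuned * 100. Checking it against two rows of the table:

```python
# Relative total-throughput gain, as in the Diff(%) column above.
def pct_gain(untuned: float, tuned: float) -> float:
    return (tuned - untuned) / untuned * 100

# First row (CONC=32, IN=1024, OUT=1024, PROMPTS=32).
print(f"{pct_gain(321.11, 2884.46):+.1f}%")   # +798.3%
# CONC=32, IN=8192, OUT=1024, PROMPTS=32 row.
print(f"{pct_gain(1401.85, 9790.96):+.1f}%")  # +598.4%
```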

After applying the tuning config, sglang now reports correct performance numbers for FP8. The runtime logs also show that FP8 achieves slightly higher generation throughput than BF16. Using the same configuration (concurrency=32, prompts=32, input=1024, output=1024) as an example, here are the logs after tuning:

```
cuda graph: True, gen throughput (token/s): 1687.39, #queue-req: 0
[2026-03-17 13:35:54 TP0] Decode batch, #running-req: 32, #full token: 53587, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1686.10, #queue-req: 0
[2026-03-17 13:35:55 TP0] Decode batch, #running-req: 32, #full token: 54867, full token usage: 0.01, mamba num: 64, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1689.84, #queue-req: 0
[2026-03-17 13:35:56 TP0] Decode batch, #running-req: 30, #full token: 52574, full token usage: 0.01, mamba num: 60, mamba usage: 0.02, cuda graph: True, gen throughput (token/s): 1669.94, #queue-req: 0
[2026-03-17 13:35:57 TP0] Decode batch, #running-req: 22, #full token: 39301, full token usage: 0.01, mamba num: 44, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 1279.35, #queue-req: 0
[2026-03-17 13:35:57 TP0] Decode batch, #running-req: 13, #full token: 23915, full token usage: 0.00, mamba num: 26, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 931.22, #queue-req: 0
[2026-03-17 13:35:58 TP0] Decode batch, #running-req: 9, #full token: 16936, full token usage: 0.00, mamba num: 18, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 691.06, #queue-req: 0
[2026-03-17 13:35:59 TP0] Decode batch, #running-req: 5, #full token: 9565, full token usage: 0.00, mamba num: 10, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 357.70, #queue-req: 0
#Input tokens: 29678
#Output tokens: 28964
#Input tokens: 4096
#Output tokens: 256

====== Offline Throughput Benchmark Result =======
Backend:                                 engine
Successful requests:                     32
Benchmark duration (s):                  20.33
Total input tokens:                      29678
Total generated tokens:                  28964
Last generation throughput (tok/s):      357.70
Request throughput (req/s):              1.57
Input token throughput (tok/s):          1459.79
Output token throughput (tok/s):         1424.67
Total token throughput (tok/s):          2884.46
==================================================
```

@zovonoir zovonoir requested review from a team and Copilot March 18, 2026 06:53
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|-------|-------|
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2324 --add-label <label>`

@zovonoir
Author

`gh pr edit 2324 --add-label ci:sglang`

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a much larger set of tuned CK blockscale GEMM configurations intended to eliminate “missing tuned config” lookup/logging overhead for common Qwen3.5 FP8 benchmark shapes, so sglang’s reported throughput better reflects actual compute performance.

Changes:

  • Expands the a8w8_blockscale_tuned_gemm.csv tuned-shape database from a handful of entries to a comprehensive set covering many small/medium M values and common N/K combinations.
  • Adds both ck and cktile libtype entries for various shapes to improve hit rate in the runtime config lookup.


```
256,128,4096,1280,ck,7,0,7.4194,a8w8_blockscale_1x128x128_256x16x128x256_16x16_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8_1x2_intrawave_v1,180.9,870.06,0.0
256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.18,90.41,0.0
256,2,256,4096,ck,8,0,11.5873,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.36,91.29,0.0
256,3,256,4096,ck,8,0,11.6965,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.54,90.83,0.0
```
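For reference, these CSV rows can be loaded into a shape-keyed lookup table with a few lines of Python. The column reading used here (cu_num, M, N, K, libtype, kernelId, splitK, time_us, kernel_name, ...) is an assumption inferred from the row layout, not a documented schema:

```python
import csv
import io

# Two of the rows from the diff above, verbatim.
ROWS = """\
256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.18,90.41,0.0
256,2,256,4096,ck,8,0,11.5873,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.36,91.29,0.0
"""

def index_by_shape(text: str) -> dict:
    """Index tuned-config CSV rows by (M, N, K), assuming columns 1-3 are the shape."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        m, n, k = int(row[1]), int(row[2]), int(row[3])
        table[(m, n, k)] = {"libtype": row[4], "kernel": row[8]}
    return table

configs = index_by_shape(ROWS)
print(sorted(configs))  # [(1, 256, 4096), (2, 256, 4096)]
```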
Contributor
We have padded M to match in the configs, so you can just tune base shapes like these: [image]

Author

OK, let me check the shapes and delete the redundant config items.

