Add tuned config for Qwen3.5 FP8 a8w8 blockscale GEMM #2324
Open
Conversation
Contributor
🏷️ CI Guide
Runs automatically on every PR:
Extended tests (opt-in via labels):
Author
`gh pr edit 2324 --add-label ci:sglang`
Contributor
Pull request overview
This PR adds a much larger set of tuned CK blockscale GEMM configurations intended to eliminate “missing tuned config” lookup/logging overhead for common Qwen3.5 FP8 benchmark shapes, so sglang’s reported throughput better reflects actual compute performance.
Changes:
- Expands the `a8w8_blockscale_tuned_gemm.csv` tuned-shape database from a handful of entries to a comprehensive set covering many small/medium M values and common N/K combinations.
- Adds both `ck` and `cktile` libtype entries for various shapes to improve the hit rate of the runtime config lookup.
yzhou103 reviewed on Mar 18, 2026
256,128,4096,1280,ck,7,0,7.4194,a8w8_blockscale_1x128x128_256x16x128x256_16x16_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8_1x2_intrawave_v1,180.9,870.06,0.0
256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.18,90.41,0.0
256,2,256,4096,ck,8,0,11.5873,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.36,91.29,0.0
256,3,256,4096,ck,8,0,11.6965,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.54,90.83,0.0
Contributor
Author
OK, let me check the shapes and delete the redundant config items.
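The rows flagged above map several consecutive M values (1, 2, 3) to the same kernel, which is what makes them look redundant. A small script can surface such candidates before deleting anything. This is a hedged sketch, not part of the PR: the column layout (cu_num, M, N, K, libtype, ..., kernel name at index 8) is inferred from the diff rows, and whether the runtime does exact-match lookup on M (in which case the rows are not strictly redundant) is an assumption worth checking first.

```python
import csv
from collections import defaultdict

def find_redundant_rows(path):
    """Report groups of M values that resolve to the same tuned kernel.

    Assumed column layout (inferred from the diff, not confirmed):
    cu_num, M, N, K, libtype, ..., kernel name at index 8.
    Only reports candidates; it deletes nothing, since an exact-match
    M lookup at runtime would still need every row.
    """
    groups = defaultdict(list)
    with open(path) as f:
        for row in csv.reader(f):
            cu_num, m, n, k = (int(x) for x in row[:4])
            # Rows with the same device/shape key and the same kernel
            # choice across different M values are dedup candidates.
            groups[(cu_num, n, k, row[4], row[8])].append(m)
    return {key: sorted(ms) for key, ms in groups.items() if len(ms) > 1}
```

Running it on the four rows above would report a single candidate group covering M = 1, 2, 3 for the (256, 256, 4096) shape.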

Add tuned config for Qwen3.5-397B-A17B-FP8 CK blockscale GEMM
Motivation
According to the data on https://inferencex.semianalysis.com/, the throughput performance of Qwen3.5 FP8 is significantly worse than BF16. Here are our benchmark results (docker image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260316):

Root Cause
The root cause is that the FP8 model uses CK blockscale GEMM, which queries tuning configurations at runtime. However, most GEMM shapes have no matching tuned config, so the query overhead is wasted and a large number of warning messages are printed. All of this time is counted by sglang, so the FP8 throughput in sglang's performance report lags far behind BF16, even though this does not reflect the actual compute performance. Below are the runtime logs from identical configurations (concurrency=32, prompts=32, input=1024, output=1024), comparing BF16 and FP8:
As shown above, while maintaining 32 concurrent requests, FP8 actually outperforms BF16 in generation throughput (1640 vs. 1503 tok/s). However, sglang's final performance report shows FP8 far behind BF16 (158 vs. 1298 tok/s) due to the overhead from missing tuned GEMM configs.
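The failure mode described above, where every GEMM call pays a lookup that misses and logs a warning, can be sketched as follows. This is a minimal illustration under stated assumptions, not aiter's actual implementation: the function names, the CSV path, and the interpretation of the first four columns as a (cu_num, M, N, K) key are all hypothetical.

```python
import csv
import functools
import logging

logger = logging.getLogger("blockscale_gemm")

@functools.lru_cache(maxsize=1)
def load_tuned_configs(path="a8w8_blockscale_tuned_gemm.csv"):
    # Key each row by its first four columns (assumed: cu_num, M, N, K);
    # the rest of the row (libtype, kernel id, timings, kernel name, ...)
    # is kept as the tuned-config payload.
    table = {}
    with open(path) as f:
        for row in csv.reader(f):
            key = tuple(int(x) for x in row[:4])
            table[key] = row[4:]
    return table

def get_tuned_config(cu_num, m, n, k, path="a8w8_blockscale_tuned_gemm.csv"):
    cfg = load_tuned_configs(path).get((cu_num, m, n, k))
    if cfg is None:
        # On a miss the lookup cost is wasted and a warning is emitted.
        # When most benchmark shapes miss, this per-call overhead is what
        # sglang ends up measuring instead of the GEMM itself.
        logger.warning("no tuned config for shape (%d, %d, %d, %d)",
                       cu_num, m, n, k)
    return cfg
```

Under this reading, the fix in this PR is simply to make the table dense enough that the common Qwen3.5 benchmark shapes hit on the first lookup.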
Solution
This PR provides CK blockscale GEMM tuning configurations for the following common benchmark scenarios:
Performance
Below are the FP8 benchmark results after applying the tuning configurations:
After applying the tuning config, sglang now reports correct performance numbers for FP8. The runtime logs also show that FP8 achieves slightly higher generation throughput than BF16. Using the same configuration (concurrency=32, prompts=32, input=1024, output=1024) as an example, here are the logs after tuning: