Support GPT-OSS-BF16 #4240
Merged
Jiang-Jia-Jun merged 25 commits into PaddlePaddle:develop on Oct 20, 2025
Conversation
Thanks for your contribution!
yuanlehome reviewed Oct 15, 2025
Comment on lines +222 to +228
```python
if (
    hasattr(self.fd_config.model_config, "layer_types")
    and self.fd_config.model_config.layer_types[layer.layer_id] == "sliding_attention"
):
    sliding_window = self.fd_config.model_config.sliding_window
else:
    sliding_window = 0
```
Collaborator
Currently only append_attention supports SWA. We can submit a follow-up PR to move this into attention.py; the other backends will then need to raise NotImplementedError.
qingqing01 previously approved these changes Oct 15, 2025
DDDivano previously approved these changes Oct 16, 2025
DDDivano approved these changes Oct 20, 2025
qingqing01 approved these changes Oct 20, 2025
This PR adds support for the GPT-OSS bf16 model. Compared to vLLM, this PR implements wint8 quantization and achieves roughly a 15% lead in metrics such as QPS, TPS, and TTFT. It also introduces several new features that improve model flexibility and performance: sinks in append attention, sliding window attention, bias support for MoE layers, and the swigluoai activation function.
New Features
Feature 1: Support Sinks in Append Attention
This feature introduces sinks in append attention, allowing certain tokens to remain visible across all decoding steps. This enhances the control and stability of the attention mechanism, especially in long-context or multi-turn scenarios. A rough sketch of the idea follows.
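The sketch below is illustrative, not the actual append-attention kernel: it shows one common formulation in which a learned per-head sink logit joins the softmax so it can absorb probability mass without contributing a value. The names `attention_with_sink` and `sink_logit` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sink(q, k, v, sink_logit):
    """Single-head attention with one learned 'sink' logit (illustrative).

    q, k, v: (T, d); sink_logit: scalar learned per head.
    The sink participates in the softmax denominator but contributes
    no value, so probability mass can drain to it instead of being
    forced onto real tokens.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (T, T)
    sink = np.full((scores.shape[0], 1), sink_logit)   # (T, 1)
    probs = softmax(np.concatenate([scores, sink], axis=-1))
    return probs[:, :-1] @ v                           # drop the sink column

q = k = v = np.random.randn(4, 8).astype(np.float32)
print(attention_with_sink(q, k, v, sink_logit=0.5).shape)  # (4, 8)
```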
Feature 2: Support Sliding Window Attention (SWA)
This feature implements Sliding Window Attention, an efficient mechanism for handling long sequences by limiting the attention scope of each token. The sliding window constrains the visible key-value pairs during decoding, improving memory usage and efficiency in long-sequence inference; see the mask sketch below.
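As a NumPy sketch (the `sliding_window_mask` helper is hypothetical), the window can be pictured as an extra constraint on the causal mask; a window of 0 is treated as plain causal attention, matching the `sliding_window = 0` convention in the diff above.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where token i attends only to keys j with i - window < j <= i.

    window <= 0 means no sliding window, i.e. full causal attention.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window <= 0:
        return causal
    return causal & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
```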
Feature 3: Implement "swigluoai" activation function
This adds support for the SwigluOAI activation, a variant of SwiGLU with optimized scaling. It provides configurable scaling factors (1.702, 7.0) and supports an interleaved mode.
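A minimal sketch of what such an activation could look like, assuming the two factors map to a sigmoid scale `alpha = 1.702` and a clamping `limit = 7.0` as in the public GPT-OSS reference; the clamping and the `+1` shift are my reading of that reference, not necessarily this PR's kernel.

```python
import numpy as np

def swiglu_oai(x, alpha=1.702, limit=7.0, interleaved=True):
    """Sketch of a SwiGLU-OAI-style activation (assumed semantics).

    The gate half passes through a scaled sigmoid (alpha = 1.702
    approximates GELU); both halves are clamped by `limit`, and the
    linear half is shifted by +1 before the elementwise product.
    """
    if interleaved:                       # gate/linear lanes alternate
        gate, linear = x[..., ::2], x[..., 1::2]
    else:                                 # halves are concatenated
        gate, linear = np.split(x, 2, axis=-1)
    gate = np.clip(gate, None, limit)
    linear = np.clip(linear, -limit, limit)
    return gate * (1.0 / (1.0 + np.exp(-alpha * gate))) * (linear + 1.0)

print(swiglu_oai(np.random.randn(2, 8)).shape)  # (2, 4)
```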
Feature 4: Add Bias support for MoE layers
This extends the MoE feed-forward path to correctly apply expert-specific bias during the down projection, ensuring each token is routed to the correct expert together with its associated bias term, as in the toy sketch below.
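A toy sketch of a per-expert bias applied in the down projection; all names (`moe_down_proj_with_bias`, `w_down`, `b_down`) are hypothetical and this is not FastDeploy's actual MoE code.

```python
import numpy as np

def moe_down_proj_with_bias(h, expert_ids, w_down, b_down):
    """Illustrative MoE down projection with expert-specific bias.

    h:          (T, d_ff)            per-token hidden states from the expert FFN
    expert_ids: (T,)                 expert chosen for each token by the router
    w_down:     (E, d_ff, d_model)   per-expert down-projection weights
    b_down:     (E, d_model)         per-expert bias (the new part)
    """
    out = np.einsum("tf,tfm->tm", h, w_down[expert_ids])
    return out + b_down[expert_ids]  # each token gets its own expert's bias

T, d_ff, d_model, E = 4, 16, 8, 2
h = np.random.randn(T, d_ff)
ids = np.random.randint(0, E, size=T)
w = np.random.randn(E, d_ff, d_model)
b = np.random.randn(E, d_model)
print(moe_down_proj_with_bias(h, ids, w, b).shape)  # (4, 8)
```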
Usage Example
Start online service
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/gpt-oss-20b-bf16 \
    --port 8188 \
    --engine-worker-queue-port 51001 \
    --cache-queue-port 51002 \
    --host 0.0.0.0 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --quantization wint8
```
Send a request
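A minimal request sketch against the OpenAI-compatible server started above; the endpoint path follows the standard chat-completions convention, and the payload fields are illustrative.

```python
import requests

# Illustrative request to the server launched on port 8188 above.
resp = requests.post(
    "http://0.0.0.0:8188/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json())
```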