Overlap two kernels in DeepSeek with communication #6711

Merged (2 commits) on May 28, 2025
2 changes: 1 addition & 1 deletion python/sglang/srt/operations_strategy.py
@@ -127,9 +127,9 @@ def _compute_moe_deepseek_blog_decode(layer):
layer.mlp.op_combine_a,
operations.YieldOperation(),
layer.mlp.op_combine_b,
+ operations.YieldOperation(),
layer.mlp.op_output,
layer.op_comm_postprocess_layer,
Comment on lines 129 to 132

medium

This change refines the operation staging for the DeepSeek decode path:

  • A YieldOperation is introduced after layer.mlp.op_combine_b.
  • The final YieldOperation after layer.op_comm_postprocess_layer is removed.

This effectively isolates op_combine_b into its own stage and groups op_output with op_comm_postprocess_layer in the new final stage. The total number of stages for decode operations increases from 5 to 6.

Could you provide more details on the specific kernels being overlapped and the expected performance benefits from this new staging? For instance, is op_combine_b (which involves deepep_dispatcher.combine_b) a communication-heavy step where yielding immediately after offers significant overlap opportunities with other batch processing?

Understanding the rationale will help in assessing the impact, especially since tbo_delta_stages remains 2.
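
The staging described in this comment can be illustrated with a small, self-contained sketch. It is not sglang's implementation: `YIELD` and `split_into_stages` are hypothetical stand-ins, the operations are plain strings, and only the tail of the operation list visible in this diff is modeled. The sketch assumes that each yield marker closes the current stage and that a trailing yield does not create an extra empty stage; whether sglang counts stages exactly this way is an assumption.

```python
# Hypothetical stand-in for operations.YieldOperation(); not the sglang class.
YIELD = object()


def split_into_stages(ops):
    """Group ops into stages, closing the current stage at each YIELD marker."""
    stages, current = [], []
    for op in ops:
        if op is YIELD:
            stages.append(current)
            current = []
        else:
            current.append(op)
    if current:  # after a trailing YIELD there is nothing left to flush
        stages.append(current)
    return stages


# Tail of the decode operation list as shown in the diff, before the change.
before_tail = [
    "op_combine_a",
    YIELD,
    "op_combine_b",
    "op_output",
    "op_comm_postprocess_layer",
    YIELD,
]

# The same tail after the change: a yield follows op_combine_b, none at the end.
after_tail = [
    "op_combine_a",
    YIELD,
    "op_combine_b",
    YIELD,
    "op_output",
    "op_comm_postprocess_layer",
]

stages_before = split_into_stages(before_tail)
stages_after = split_into_stages(after_tail)

for label, stages in (("before", stages_before), ("after", stages_after)):
    print(label, len(stages), stages)
```

Under these assumptions the modeled tail goes from two stages to three, with op_combine_b alone in its own stage, which matches the one-stage increase the comment describes.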

- operations.YieldOperation(),
],
)
