feat: add thinking_budget (version 2) #6208
base: main
Conversation
@zhyncs Could you provide more information about #6181 (comment)? I cannot reproduce it.
There are only two references of
@minleminzui Could you help review this? thx
@CatherineSue, @ispobock, @sleepcoo please help review this PR, and please ensure it doesn't affect your internal services.
It seems we allow users to pass this parameter with a non-thinking model? I think it is better to isolate this from non-thinking models to prevent potential side effects in the future.
In

```python
if any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
    think_end_ids = torch.tensor(
        [getattr(r.tokenizer, "think_end_id", -1) for r in reqs],
        dtype=torch.int64,
    ).to(device, non_blocking=True)
    num_thinking_tokens = torch.tensor([0 for _ in reqs], dtype=torch.int64).to(
        device, non_blocking=True
    )
```

for a non-thinking model, r.tokenizer has no think_end_id attribute, so nothing will happen. Better isolation introduces more code. Is silently ignoring the parameter acceptable? @CatherineSue
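As a standalone illustration of that behavior (not the PR code itself), the guard below builds the budget tensors only when at least one tokenizer exposes think_end_id; the object names and the token id are made up for the example:

```python
from types import SimpleNamespace

import torch

def build_thinking_fields(reqs, device="cpu"):
    # Only build the tensors when at least one tokenizer exposes think_end_id.
    if not any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
        return None, None  # non-thinking models: the parameter is silently ignored
    think_end_ids = torch.tensor(
        [getattr(r.tokenizer, "think_end_id", -1) for r in reqs], dtype=torch.int64
    ).to(device)
    num_thinking_tokens = torch.zeros(len(reqs), dtype=torch.int64).to(device)
    return think_end_ids, num_thinking_tokens

plain = SimpleNamespace(tokenizer=SimpleNamespace())                   # no think_end_id
thinking = SimpleNamespace(tokenizer=SimpleNamespace(think_end_id=7))  # hypothetical id
print(build_thinking_fields([plain]))            # (None, None) -> nothing happens
print(build_thinking_fields([plain, thinking]))  # tensors built, -1 for the plain request
```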
```
@@ -76,7 +81,22 @@ def from_schedule_batch(cls, batch: ScheduleBatch, vocab_size: int):
        min_ps = torch.tensor(
            [r.sampling_params.min_p for r in reqs], dtype=torch.float
        ).to(device, non_blocking=True)

        if any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
```
For a non-reasoning model, can we skip this check? Do we need to identify whether the model is a reasoning model from the architecture?
I am thinking about it. Right now I don't know how to decide whether a model is a reasoning model from a request.
As far as I can tell, there is no better way. The information that can be obtained here is very limited: this code is just doing sampling and does not know the specific model architecture.
```
@@ -1127,6 +1128,7 @@ def v1_chat_generate_request(
            "temperature": request.temperature,
            "max_new_tokens": request.max_tokens or request.max_completion_tokens,
            "min_new_tokens": request.min_tokens,
            "thinking_budget": request.thinking_budget,
```
It seems this is only valid when enable_thinking = True. We need to do the validation.
done in 1c741a6
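The exact check in 1c741a6 is not reproduced here; a minimal sketch of this kind of validation, assuming the chat request carries a chat_template_kwargs dict with an enable_thinking flag, could look like:

```python
# Hedged sketch of the validation discussed above; the actual check in commit
# 1c741a6 may differ. Assumes the chat request exposes an optional
# chat_template_kwargs dict containing an "enable_thinking" flag.
def validate_thinking_budget(request):
    if request.thinking_budget is None:
        return
    kwargs = getattr(request, "chat_template_kwargs", None) or {}
    if not kwargs.get("enable_thinking", False):
        raise ValueError(
            "thinking_budget is only valid when enable_thinking is set to True"
        )
    if request.thinking_budget < 0:
        raise ValueError("thinking_budget must be >= 0")
```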
```
@@ -172,6 +172,7 @@ class CompletionRequest(BaseModel):
    top_k: int = -1
    min_p: float = 0.0
    min_tokens: int = 0
    thinking_budget: Optional[int] = None
```
Will it be used in a non-chat scenario? Is there any usage example or reference for that?
I am not familiar with CompletionRequest. Removed thinking_budget from it in 186b1aa.
```python
    def apply_thinking_budgets(self, next_token_logits: torch.Tensor):
        if self.thinking_budgets is None:
            return
        has_budget = self.thinking_budgets > 0
```
This doesn't support the thinking_budget=0 scenario?
Hi, we now support thinking_budget=0 as of 30aa15f.
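For reference, here is a rough sketch of how forcing the end-of-thinking token once the budget is reached (including budget = 0) could work. This is an illustration, not necessarily the code landed in 30aa15f, and it assumes per-batch think_end_ids and num_thinking_tokens tensors as discussed above:

```python
import torch

# Rough sketch only: once a request has generated at least `thinking_budget`
# thinking tokens (including the budget == 0 case), force the think-end token
# by masking the logits.
def apply_thinking_budgets(next_token_logits, thinking_budgets, think_end_ids,
                           num_thinking_tokens):
    if thinking_budgets is None:
        return
    has_budget = thinking_budgets >= 0           # >= 0 so that budget == 0 is honored
    exhausted = num_thinking_tokens >= thinking_budgets
    rows = torch.nonzero(has_budget & exhausted, as_tuple=True)[0]
    if rows.numel() == 0:
        return
    # Give the end-of-thinking token all of the probability mass for those rows.
    next_token_logits[rows] = float("-inf")
    next_token_logits[rows, think_end_ids[rows]] = 0.0
```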
```python
            next_token_logits[batch_indices, end_token_indices] = 0.0

    def update_thinking_budgets(self, next_token_ids: torch.Tensor):
        if self.thinking_budgets is None or not torch.any(self.thinking_budgets > 0):
```
This doesn't support the thinking_budget=0 scenario?
Hi, we now support thinking_budget=0 as of 30aa15f.
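A companion sketch for the update step, again an assumption about the shape of the fix rather than the exact code in 30aa15f:

```python
import torch

# Sketch: after each sampling step, count one more thinking token per request
# and clear the budget once the think-end token has actually been emitted, so
# the logits are no longer forced on later steps.
def update_thinking_budgets(next_token_ids, thinking_budgets, think_end_ids,
                            num_thinking_tokens):
    if thinking_budgets is None or not torch.any(thinking_budgets >= 0):
        return
    num_thinking_tokens += 1
    finished = next_token_ids == think_end_ids
    thinking_budgets[finished] = -1
```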
```python
                device, non_blocking=True
            )
            thinking_budgets = torch.tensor(
                [r.sampling_params.thinking_budget or -1 for r in reqs],
```
When thinking_budget=0, this will be assigned a value of -1. It is recommended to modify this; otherwise thinking_budget=0 will not be fully supported.
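The underlying pitfall is that Python's or operator treats 0 as falsy; a tiny illustration (the actual fix in 200c598 may be written differently):

```python
budgets = [None, 0, 512]

# `x or -1` treats 0 as falsy, so a zero budget collapses into -1 ("no budget"):
print([b or -1 for b in budgets])                     # [-1, -1, 512]

# An explicit None check keeps thinking_budget=0 intact:
print([b if b is not None else -1 for b in budgets])  # [-1, 0, 512]
```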
Updated in 200c598. Thanks for pointing that out.
Motivation
See #6089
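For context, a hypothetical client-side usage of the new parameter against an SGLang OpenAI-compatible endpoint; the exact field placement (the extra_body keys and the enable_thinking flag) is an assumption based on this discussion rather than final documentation:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget": 256,  # cap reasoning tokens before forcing think-end
    },
)
print(response.choices[0].message.content)
```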
Modifications
Removed the incorrect diff in protocol.py. See commit "version 2".
Checklist