feat: add thinking_budget (version 2) #6208
base: main
Conversation
@zhyncs Could you provide more information about #6181 (comment)? I cannot reproduce it.
There are only two references of
@minleminzui Could you help review this? thx
@CatherineSue, @ispobock, @sleepcoo please help review this PR, and please ensure it doesn't affect your internal services.
It seems we allow users to pass this parameter with a non-thinking model? I think it is better to isolate this from non-thinking models to prevent potential side effects in the future.
In

```python
if any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
    think_end_ids = torch.tensor(
        [getattr(r.tokenizer, "think_end_id", -1) for r in reqs],
        dtype=torch.int64,
    ).to(device, non_blocking=True)
    num_thinking_tokens = torch.tensor([0 for _ in reqs], dtype=torch.int64).to(
        device, non_blocking=True
    )
```

for a non-thinking model, r.tokenizer has no think_end_id attribute, so nothing will happen. Better isolation introduces more code. Is silently ignoring the parameter acceptable? @CatherineSue
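As a standalone illustration of that behavior (not the PR code itself), the guard below builds the budget tensors only when at least one tokenizer exposes think_end_id; the object names and the token id are made up for the example:

```python
from types import SimpleNamespace

import torch

def build_thinking_fields(reqs, device="cpu"):
    # Only build the tensors when at least one tokenizer exposes think_end_id.
    if not any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
        return None, None  # non-thinking models: the parameter is silently ignored
    think_end_ids = torch.tensor(
        [getattr(r.tokenizer, "think_end_id", -1) for r in reqs], dtype=torch.int64
    ).to(device)
    num_thinking_tokens = torch.zeros(len(reqs), dtype=torch.int64).to(device)
    return think_end_ids, num_thinking_tokens

plain = SimpleNamespace(tokenizer=SimpleNamespace())                   # no think_end_id
thinking = SimpleNamespace(tokenizer=SimpleNamespace(think_end_id=7))  # hypothetical id
print(build_thinking_fields([plain]))            # (None, None) -> nothing happens
print(build_thinking_fields([plain, thinking]))  # tensors built, -1 for the plain request
```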
```
@@ -76,7 +81,22 @@ def from_schedule_batch(cls, batch: ScheduleBatch, vocab_size: int):
        min_ps = torch.tensor(
            [r.sampling_params.min_p for r in reqs], dtype=torch.float
        ).to(device, non_blocking=True)

        if any(hasattr(r.tokenizer, "think_end_id") for r in reqs):
```
For a non-reasoning model, can we skip this check? Do we need to identify whether the model is a reasoning model from the architecture?
I am thinking about it. Right now I don't know how to decide whether a model is a reasoning model from a request.
As far as I can tell, there is no better way. The information that can be obtained here is very limited: this code is just doing sampling and does not know the specific model architecture.
```
@@ -1127,6 +1128,7 @@ def v1_chat_generate_request(
            "temperature": request.temperature,
            "max_new_tokens": request.max_tokens or request.max_completion_tokens,
            "min_new_tokens": request.min_tokens,
            "thinking_budget": request.thinking_budget,
```
It seems this is only valid when enable_thinking = True. We need to do the validation.
done in 1c741a6
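The exact check in 1c741a6 is not reproduced here; a minimal sketch of this kind of validation, assuming the chat request carries a chat_template_kwargs dict with an enable_thinking flag, could look like:

```python
# Hedged sketch of the validation discussed above; the actual check in commit
# 1c741a6 may differ. Assumes the chat request exposes an optional
# chat_template_kwargs dict containing an "enable_thinking" flag.
def validate_thinking_budget(request):
    if request.thinking_budget is None:
        return
    kwargs = getattr(request, "chat_template_kwargs", None) or {}
    if not kwargs.get("enable_thinking", False):
        raise ValueError(
            "thinking_budget is only valid when enable_thinking is set to True"
        )
    if request.thinking_budget < 0:
        raise ValueError("thinking_budget must be >= 0")
```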
```
@@ -172,6 +172,7 @@ class CompletionRequest(BaseModel):
    top_k: int = -1
    min_p: float = 0.0
    min_tokens: int = 0
    thinking_budget: Optional[int] = None
```
Will it be used in a non-chat scenario? Is there any usage example or reference for that?
I am not familiar with CompletionRequest. Removed thinking_budget from it in 186b1aa.
```python
    def apply_thinking_budgets(self, next_token_logits: torch.Tensor):
        if self.thinking_budgets is None:
            return
        has_budget = self.thinking_budgets > 0
```
This doesn't support the thinking_budget=0 scenario?
Hi, we now support thinking_budget=0 as of 30aa15f.
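For reference, here is a rough sketch of how forcing the end-of-thinking token once the budget is reached (including budget = 0) could work. This is an illustration, not necessarily the code landed in 30aa15f, and it assumes per-batch think_end_ids and num_thinking_tokens tensors as discussed above:

```python
import torch

# Rough sketch only: once a request has generated at least `thinking_budget`
# thinking tokens (including the budget == 0 case), force the think-end token
# by masking the logits.
def apply_thinking_budgets(next_token_logits, thinking_budgets, think_end_ids,
                           num_thinking_tokens):
    if thinking_budgets is None:
        return
    has_budget = thinking_budgets >= 0           # >= 0 so that budget == 0 is honored
    exhausted = num_thinking_tokens >= thinking_budgets
    rows = torch.nonzero(has_budget & exhausted, as_tuple=True)[0]
    if rows.numel() == 0:
        return
    # Give the end-of-thinking token all of the probability mass for those rows.
    next_token_logits[rows] = float("-inf")
    next_token_logits[rows, think_end_ids[rows]] = 0.0
```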
```python
            next_token_logits[batch_indices, end_token_indices] = 0.0

    def update_thinking_budgets(self, next_token_ids: torch.Tensor):
        if self.thinking_budgets is None or not torch.any(self.thinking_budgets > 0):
```
This doesn't support the thinking_budget=0 scenario?
Hi, we now support thinking_budget=0 as of 30aa15f.
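A companion sketch for the update step, again an assumption about the shape of the fix rather than the exact code in 30aa15f:

```python
import torch

# Sketch: after each sampling step, count one more thinking token per request
# and clear the budget once the think-end token has actually been emitted, so
# the logits are no longer forced on later steps.
def update_thinking_budgets(next_token_ids, thinking_budgets, think_end_ids,
                            num_thinking_tokens):
    if thinking_budgets is None or not torch.any(thinking_budgets >= 0):
        return
    num_thinking_tokens += 1
    finished = next_token_ids == think_end_ids
    thinking_budgets[finished] = -1
```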
```python
                device, non_blocking=True
            )
            thinking_budgets = torch.tensor(
                [r.sampling_params.thinking_budget or -1 for r in reqs],
```
When thinking_budget=0, this will be assigned a value of -1. It is recommended to modify this; otherwise thinking_budget=0 will not be fully supported.
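The underlying pitfall is that Python's or operator treats 0 as falsy; a tiny illustration (the actual fix in 200c598 may be written differently):

```python
budgets = [None, 0, 512]

# `x or -1` treats 0 as falsy, so a zero budget collapses into -1 ("no budget"):
print([b or -1 for b in budgets])                     # [-1, -1, 512]

# An explicit None check keeps thinking_budget=0 intact:
print([b if b is not None else -1 for b in budgets])  # [-1, 0, 512]
```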
Updated in 200c598. Thanks for pointing that out.
Motivation
See #6089
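For context, a hypothetical client-side usage of the new parameter against an SGLang OpenAI-compatible endpoint; the exact field placement (the extra_body keys and the enable_thinking flag) is an assumption based on this discussion rather than final documentation:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget": 256,  # cap reasoning tokens before forcing think-end
    },
)
print(response.choices[0].message.content)
```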
Modifications
Removed the incorrect diff in protocol.py. See commit "version 2".
Checklist