[Frontend] Support chat_template_kwargs in LLM.chat #17356
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of fast checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀 |
@fyabc it would be great if you could get your team to add this to the Qwen docs! |
Can you also support chat_template_kwargs in LLM.generate? |
I made a small modification based on the official example; it looks roughly like this:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

if __name__ == '__main__':
    # Placeholder: set this to your local model path or Hub name.
    model_dir = "Qwen/Qwen3-8B"

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    messages_list = []
    for prompt in prompts:
        messages = [
            {"role": "user", "content": prompt}
        ]
        messages_list.append(messages)

    enable_thinking = False
    # Apply the chat template manually so that enable_thinking can be passed.
    # It switches between thinking and non-thinking modes; the default is True.
    texts = tokenizer.apply_chat_template(
        messages_list,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

    llm = LLM(model=model_dir, tensor_parallel_size=4, enable_prefix_caching=True)

    # Generate texts from the formatted prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(texts, sampling_params)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60) |
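Note: the snippet above pre-applies the chat template with the tokenizer and feeds the formatted strings to llm.generate, which works but bypasses vLLM's own chat handling. With this PR, the same effect should be achievable by calling llm.chat(...) and passing chat_template_kwargs={"enable_thinking": False} directly (see the example in the PR description below).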
Hi @DarkLight1337, thank you for your contribution! |
Sounds good. @simon-mo are we planning on another release soon once the bugfixes for Qwen3 are in? |
Yes, the v0.9.0 release milestone is still open. We just merged the torch upgrade. I'm looking at next Monday as a good checkpoint for a release, maybe? |
@DarkLight1337 So there's no straightforward way to disable thinking via the chat completions endpoint, then? Thanks. (I'm also confused about why --enable-thinking needs to be passed for thinking if omitting it doesn't disable thinking. Unless I've misunderstood; I did test without the flag and still saw thinking.) |
You can already disable it when calling the endpoint, even without this PR. See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes |
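For reference, the linked Qwen docs pass chat_template_kwargs through the OpenAI-compatible server. A minimal sketch (the model name and local server URL are assumptions, e.g. a server started with vllm serve Qwen/Qwen3-8B):

from openai import OpenAI

# Point the client at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # assumed model name; match whatever the server is serving
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    # extra_body forwards fields outside the OpenAI schema, so the server's
    # chat template receives enable_thinking=False for this request.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)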
@DarkLight1337 The Qwen docs have already been updated with the latest information. |
This PR enables disabling thinking mode per request in offline inference by adding a chat_template_kwargs argument to LLM.chat.

Note: The keyword argument for disabling thinking mode depends on the chat template. For example, Granite 3.2 uses thinking while Qwen 3 uses enable_thinking.

Example using Qwen 3:
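(The original example is not preserved on this page; a minimal sketch of the intended usage, with the model name assumed for illustration, might look like this:)

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # assumed model name for illustration
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# chat_template_kwargs is forwarded to the chat template, so the Qwen 3
# template renders the prompt with thinking mode turned off for this request.
outputs = llm.chat(
    messages,
    sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)
print(outputs[0].outputs[0].text)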
Resolves #17327 (comment)