[Frontend] Support `chat_template_kwargs` in `LLM.chat` #17356

DarkLight1337 · 2025-04-29T07:01:58Z

This PR makes it possible to disable thinking mode per request in offline inference by adding a chat_template_kwargs argument to LLM.chat.

Note: The keyword argument for disabling thinking mode depends on the chat template. For example, Granite 3.2 uses thinking while Qwen 3 uses enable_thinking.
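To make the note above concrete, here is a minimal sketch of a lookup helper that builds the right chat_template_kwargs for a model family. The table name, the family keys, and the helper function are hypothetical and are not part of vLLM; only the two keyword names (thinking for Granite 3.2, enable_thinking for Qwen 3) come from the note.

```python
# Hypothetical lookup table; only these two keyword names come from the note above.
THINKING_KWARG_BY_FAMILY = {
    "granite-3.2": "thinking",
    "qwen3": "enable_thinking",
}


def thinking_kwargs(family: str, enabled: bool) -> dict[str, bool]:
    """Build a chat_template_kwargs dict for the given (hypothetical) family key."""
    try:
        key = THINKING_KWARG_BY_FAMILY[family]
    except KeyError:
        raise ValueError(
            f"Unknown model family {family!r}; check its chat template"
        ) from None
    return {key: enabled}


print(thinking_kwargs("qwen3", False))        # {'enable_thinking': False}
print(thinking_kwargs("granite-3.2", False))  # {'thinking': False}
```

The resulting dict can then be passed as chat_template_kwargs to LLM.chat. For any other model, the keyword has to be read from that model's chat template.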

Example using Qwen 3:

from vllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-8B")
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=8192,
    presence_penalty=1.5,
)

outputs = llm.chat(
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    sampling_params=sampling_params,
    chat_template_kwargs={"enable_thinking": False},  # or True
)

print("-" * 50)
for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
    print("-" * 50)

Resolves #17327 (comment)

Signed-off-by: DarkLight1337 <[email protected]>

github-actions · 2025-04-29T07:02:07Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

DarkLight1337 · 2025-04-29T07:10:19Z

@fyabc would be great if you could get your team to add this to the Qwen docs!

Signed-off-by: DarkLight1337 <[email protected]>

ad1192214879 · 2025-04-29T14:16:52Z

Can you support chat_template_kwargs also in LLM.generate?

DarkLight1337 · 2025-04-29T14:17:41Z

Can you support chat_template_kwargs also in LLM.generate?

LLM.generate doesn't use chat template so it doesn't make sense to support it there
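The toy renderer below illustrates why the keyword belongs to the chat template step. It is a simplified imitation written for this thread, not the real Qwen 3 template, and the function name and marker strings are assumptions. The point is that a flag such as enable_thinking is consumed while turning messages into a prompt string, so a function that receives an already-rendered string has nothing to apply it to.

```python
def render_chat(messages, add_generation_prompt=True, enable_thinking=True):
    """Toy stand-in for a chat template (NOT the real Qwen 3 template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # Assumed behaviour: pre-fill an empty think block to skip reasoning.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hello"}]
print(repr(render_chat(messages, enable_thinking=True)))
print(repr(render_chat(messages, enable_thinking=False)))
```

Either rendered string could be handed to a raw-prompt API, but the choice between them has to be made before that call.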

ad1192214879 · 2025-04-29T14:29:42Z

Can you support chat_template_kwargs also in LLM.generate?

LLM.generate doesn't use chat template so it doesn't make sense to support it there
Since we need to use LLM.generate, does that mean we can't use non-thinking mode for offline batched inference?

DarkLight1337 · 2025-04-29T15:14:10Z

LLM.chat can also be used in offline inference, so I'm not really sure why you couldn't just use that.
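As a sketch of what batched offline inference through LLM.chat could look like, the snippet below wraps each prompt in a single-turn conversation and passes the whole list in one call. It assumes that LLM.chat accepts a list of conversations and that the installed vLLM version includes this PR; the helper names are hypothetical, and the sampling values are copied from the example in the PR description.

```python
def build_conversations(prompts):
    """Wrap each raw prompt in a single-turn chat conversation."""
    return [[{"role": "user", "content": p}] for p in prompts]


def run_batched_chat(model_name, prompts, enable_thinking=False):
    """Hypothetical helper: batched LLM.chat with thinking mode toggled."""
    # Imported lazily so build_conversations can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model_name)
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=8192)
    # Assumption: a list of conversations is processed as one batch.
    return llm.chat(
        build_conversations(prompts),
        sampling_params=sampling_params,
        chat_template_kwargs={"enable_thinking": enable_thinking},
    )


print(build_conversations(["Hello, my name is", "The capital of France is"]))
# outputs = run_batched_chat("Qwen/Qwen3-8B", ["Hello, my name is"])  # needs a GPU and vLLM
```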

rangehow · 2025-04-30T09:51:20Z

Can you support chat_template_kwargs also in LLM.generate?

LLM.generate doesn't use chat template so it doesn't make sense to support it there
Since we need to use LLM.generate, does that mean we can't use non-thinking mode for offline batched inference?

I made a few simple modifications to the official example; it looks roughly like this:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

if __name__ == '__main__':
    # Placeholder (assumption): the original snippet left the model path blank.
    model_dir = "Qwen/Qwen3-8B"

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Wrap each prompt in a single-turn conversation.
    messages_list = [[{"role": "user", "content": prompt}] for prompt in prompts]

    enable_thinking = False
    # Render every conversation to a prompt string with the chat template.
    texts = tokenizer.apply_chat_template(
        messages_list,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,  # Switches between thinking and non-thinking modes. Default is True.
    )

    llm = LLM(model=model_dir, tensor_parallel_size=4, enable_prefix_caching=True)
    # Generate from the templated prompts, not the raw ones. The output is a list
    # of RequestOutput objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(texts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)

fyabc · 2025-04-30T16:19:34Z

@fyabc would be great if you could get your team to add this to the Qwen docs!

Hi @DarkLight1337, thank you for your contribution!
Since vllm-0.8.5 does not include this PR, I think we can update doc when the new version is released?

DarkLight1337 · 2025-04-30T16:22:16Z

@fyabc would be great if you could get your team to add this to the Qwen docs!

Hi @DarkLight1337, thank you for your contribution! Since vllm-0.8.5 does not include this PR, I think we can update doc when the new version is released?

Sounds good. @simon-mo are we planning on another release soon once the bugfixes for Qwen3 are in?

simon-mo · 2025-04-30T22:00:40Z

Yes v0.9.0 release milestone is still open. We just merged the torch upgrade. I'm looking at next Monday as a good checkpoint to release maybe?

RonanKMcGovern · 2025-05-01T08:53:35Z

@DarkLight1337 so there's no straightforward approach then to disable thinking via the chat completions endpoint? Thanks

(I'm also confused why --enable-thinking needs to be passed for thinking, if omitting it doesn't disable thinking. Unless I've misunderstood. I did test without the flag and still saw thinking.)

DarkLight1337 · 2025-05-01T08:58:12Z

You can already disable it when calling the endpoint, even without this PR. See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
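For the endpoint case, a minimal sketch of the request body is shown below, using only the standard library. It assumes a vLLM OpenAI-compatible server that accepts a top-level chat_template_kwargs field in the chat completions request; the helper names, the base URL, and the example message are illustrative assumptions.

```python
import json
import urllib.request


def build_chat_request(model, user_text, enable_thinking):
    """Build a chat completions payload with chat_template_kwargs set."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(base_url, payload):
    """POST the payload to a running server (hypothetical helper)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "Qwen/Qwen3-8B",
    "Give me a short introduction to large language models.",
    enable_thinking=False,
)
print(json.dumps(payload, indent=2))
# send_chat_request("http://localhost:8000", payload)  # requires a running server
```

With the OpenAI Python client, the same field would typically be supplied through extra_body, as described in the linked Qwen documentation.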

fyabc · 2025-05-20T09:44:27Z

@fyabc would be great if you could get your team to add this to the Qwen docs!

Hi @DarkLight1337, thank you for your contribution! Since vllm-0.8.5 does not include this PR, I think we can update doc when the new version is released?

@DarkLight1337 The Qwen docs have been updated with the latest llm.chat usage:
https://qwen.readthedocs.io/en/latest/deployment/vllm.html#python-library

[Frontend] Support chat_template_kwargs in LLM.chat

49c1b75

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Apr 29, 2025

DarkLight1337 requested review from njhill and Isotr0py April 29, 2025 07:01

DarkLight1337 requested review from robertgshaw2-redhat and simon-mo as code owners April 29, 2025 07:01

mergify bot added the frontend label Apr 29, 2025

DarkLight1337 mentioned this pull request Apr 29, 2025

[Usage] Qwen3 Usage Guide #17327

Open

DarkLight1337 requested a review from mgoin April 29, 2025 07:31

DarkLight1337 added 2 commits April 29, 2025 07:44

Fix

1c88d69

Signed-off-by: DarkLight1337 <[email protected]>

Avoid OOM

212977e

Signed-off-by: DarkLight1337 <[email protected]>

sins921 approved these changes Apr 29, 2025

Isotr0py approved these changes Apr 29, 2025

DarkLight1337 merged commit 88ad9ec into vllm-project:main Apr 29, 2025
43 checks passed

DarkLight1337 deleted the offline-chat-kwargs branch April 29, 2025 14:03

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

[Frontend] Support chat_template_kwargs in LLM.chat (vllm-project…

6f9fccd

…#17356) Signed-off-by: DarkLight1337 <[email protected]>

radeksm pushed a commit to radeksm/vllm that referenced this pull request May 2, 2025

[Frontend] Support chat_template_kwargs in LLM.chat (vllm-project…

3bd589f

…#17356) Signed-off-by: DarkLight1337 <[email protected]>

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

[Frontend] Support chat_template_kwargs in LLM.chat (vllm-project…

a0bd1ec

…#17356) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Mu Huai <[email protected]>

chaunceyjiang mentioned this pull request May 13, 2025

[Usage]: Disable Qwen-3 Thinking in LLM.chat #18066

Closed

1 task

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

[Frontend] Support chat_template_kwargs in LLM.chat (vllm-project…

d77b971

…#17356) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>

minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025

[Frontend] Support chat_template_kwargs in LLM.chat (vllm-project…

65db6de

…#17356) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: minpeter <[email protected]>
