Commit 67bade5

Support chat template and echo for chat API (vllm-project#1756)
1 parent aa7bc61 commit 67bade5

File tree

7 files changed: +439 −180 lines changed


docs/source/getting_started/quickstart.rst

Lines changed: 53 additions & 1 deletion
@@ -107,6 +107,7 @@ OpenAI-Compatible Server
 ------------------------
 
 vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
+By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the above command) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.
 
 Start the server:
 
@@ -122,14 +123,23 @@ Use model from www.modelscope.cn
     $ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
     $     --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
 
-By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the above command) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_ and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.
+By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
+
+.. code-block:: console
+
+    $ python -m vllm.entrypoints.openai.api_server \
+    $     --model facebook/opt-125m \
+    $     --chat-template ./examples/template_chatml.jinja
 
 This server can be queried in the same format as OpenAI API. For example, list the models:
 
 .. code-block:: console
 
     $ curl http://localhost:8000/v1/models
 
+Using OpenAI Completions API with vLLM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 Query the model with input prompts:
 
 .. code-block:: console
@@ -156,3 +166,45 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
     print("Completion result:", completion)
 
 For a more detailed client example, refer to `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
+
+Using OpenAI Chat API with vLLM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
+
+Querying the model using OpenAI Chat API:
+
+You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to communicate with the model in a chat-like interface:
+
+.. code-block:: console
+
+    $ curl http://localhost:8000/v1/chat/completions \
+    $     -H "Content-Type: application/json" \
+    $     -d '{
+    $         "model": "facebook/opt-125m",
+    $         "messages": [
+    $             {"role": "system", "content": "You are a helpful assistant."},
+    $             {"role": "user", "content": "Who won the world series in 2020?"}
+    $         ]
+    $     }'
+
+Python Client Example:
+
+Using the `openai` python package, you can also communicate with the model in a chat-like manner:
+
+.. code-block:: python
+
+    import openai
+    # Set OpenAI's API key and API base to use vLLM's API server.
+    openai.api_key = "EMPTY"
+    openai.api_base = "http://localhost:8000/v1"
+    chat_response = openai.ChatCompletion.create(
+        model="facebook/opt-125m",
+        messages=[
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "Tell me a joke."},
+        ]
+    )
+    print("Chat response:", chat_response)
+
+For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.

examples/template_alpaca.jinja

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}

{% for message in messages %}
{% if message['role'] == 'user' %}
### Instruction:
{{ message['content']|trim -}}
{% if not loop.last %}


{% endif %}
{% elif message['role'] == 'assistant' %}
### Response:
{{ message['content']|trim -}}
{% if not loop.last %}


{% endif %}
{% elif message['role'] == 'user_context' %}
### Input:
{{ message['content']|trim -}}
{% if not loop.last %}


{% endif %}
{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}
### Response:
{% endif %}

examples/template_chatml.jinja

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}
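
For illustration only (not part of the commit): a minimal sketch of how a template file like this one gets applied, assuming the Hugging Face ``transformers`` package, the ``facebook/opt-125m`` tokenizer used in the tests below, and that the script is run from the repository root so the relative path resolves.

    # Minimal sketch: load the ChatML template onto a tokenizer and render a
    # conversation, roughly what the server's --chat-template option does.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

    # Override whatever chat template the tokenizer ships with.
    with open("examples/template_chatml.jinja") as f:
        tokenizer.chat_template = f.read()

    messages = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What is the capital of"},
    ]

    # add_generation_prompt=True appends the trailing "<|im_start|>assistant\n"
    # turn for the model to complete.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    print(prompt)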

examples/template_inkbot.jinja

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
<#meta#>
- Date: {{ (messages|selectattr('role', 'equalto', 'meta-current_date')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'meta-current_date')|list) else '' }}
- Task: {{ (messages|selectattr('role', 'equalto', 'meta-task_name')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'meta-task_name')|list) else '' }}
<#system#>
{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }}
<#chat#>
{% for message in messages %}
{% if message['role'] == 'user' %}
<#user#>
{{ message['content']|trim -}}
{% if not loop.last %}

{% endif %}
{% elif message['role'] == 'assistant' %}
<#bot#>
{{ message['content']|trim -}}
{% if not loop.last %}

{% endif %}
{% elif message['role'] == 'user_context' %}
<#user_context#>
{{ message['content']|trim -}}
{% if not loop.last %}

{% endif %}
{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}
<#bot#>
{% endif %}
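
For illustration only (not part of the commit): a hypothetical ``messages`` payload shaped for this Inkbot template, showing the extra roles it reads. The role names come from the template above; the content values are made up for the example.

    # Hypothetical conversation for the Inkbot template. The meta-* roles fill
    # the <#meta#> header, 'system' fills <#system#>, and 'user_context' is
    # rendered as a <#user_context#> turn inside <#chat#>.
    messages = [
        {"role": "meta-current_date", "content": "2023-11-20"},
        {"role": "meta-task_name", "content": "general_qa"},
        {"role": "system", "content": "You are Inkbot, a helpful assistant."},
        {"role": "user_context", "content": "The user is deploying the OpenAI-compatible server."},
        {"role": "user", "content": "How do I override the chat template?"},
    ]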
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
from argparse import Namespace
from dataclasses import dataclass

import pytest
from fastapi.testclient import TestClient

from vllm.entrypoints.openai.api_server import *

# Define models, templates, and their corresponding expected outputs
MODEL_TEMPLATE_GENERATON_OUTPUT = [
    ("facebook/opt-125m", None, True,
     "Hello</s>Hi there!</s>What is the capital of</s>"),
    ("facebook/opt-125m", None, False,
     "Hello</s>Hi there!</s>What is the capital of</s>"),
    ("facebook/opt-125m", "../../examples/template_chatml.jinja", True,
     """<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there!<|im_end|>
<|im_start|>user
What is the capital of<|im_end|>
<|im_start|>assistant
"""),
    ("facebook/opt-125m", "../../examples/template_chatml.jinja", False,
     """<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there!<|im_end|>
<|im_start|>user
What is the capital of""")
]

TEST_MESSAGES = [
    {
        'role': 'user',
        'content': 'Hello'
    },
    {
        'role': 'assistant',
        'content': 'Hi there!'
    },
    {
        'role': 'user',
        'content': 'What is the capital of'
    },
]
client = TestClient(app)


@dataclass
class MockTokenizer:
    chat_template = None


def test_load_chat_template():
    # Testing chatml template
    template = "../../examples/template_chatml.jinja"
    mock_args = Namespace(chat_template=template)
    tokenizer = MockTokenizer()

    # Call the function with the mocked args
    load_chat_template(mock_args, tokenizer)

    template_content = tokenizer.chat_template

    # Test assertions
    assert template_content is not None
    # Hard coded value for template_chatml.jinja
    assert template_content == """{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\\n'}}{% endif %}{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\\n' }}{% endif %}"""


def test_no_load_chat_template():
    # Testing a template path that does not exist
    template = "../../examples/does_not_exist"
    mock_args = Namespace(chat_template=template)
    tokenizer = MockTokenizer()

    # Call the function with the mocked args
    load_chat_template(mock_args, tokenizer=tokenizer)
    template_content = tokenizer.chat_template

    # Test assertions
    assert template_content is not None
    # A nonexistent path is stored verbatim as the template string
    assert template_content == """../../examples/does_not_exist"""


@pytest.mark.asyncio
@pytest.mark.parametrize(
    "model,template,add_generation_prompt,expected_output",
    MODEL_TEMPLATE_GENERATON_OUTPUT)
async def test_get_gen_prompt(model, template, add_generation_prompt,
                              expected_output):
    # Initialize the tokenizer
    tokenizer = get_tokenizer(tokenizer_name=model)

    mock_args = Namespace(chat_template=template)
    load_chat_template(mock_args, tokenizer)

    # Create a mock request object using keyword arguments
    mock_request = ChatCompletionRequest(
        model=model,
        messages=TEST_MESSAGES,
        add_generation_prompt=add_generation_prompt)

    # Call the function and get the result
    result = tokenizer.apply_chat_template(
        conversation=mock_request.messages,
        tokenize=False,
        add_generation_prompt=mock_request.add_generation_prompt)

    # Test assertion
    assert result == expected_output, f"The generated prompt does not match the expected output for model {model} and template {template}"


def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
