[Frontend] Chat-based Embeddings API #9759
Changes from 41 commits
@@ -0,0 +1,5 @@
+Pooling Parameters
+==================
+
+.. autoclass:: vllm.PoolingParams
+    :members:
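For context (not part of this diff), ``vllm.PoolingParams`` is the embedding-side counterpart of ``SamplingParams``. Below is a minimal offline sketch of where it plugs in, assuming the ``intfloat/e5-mistral-7b-instruct`` embedding model and a vLLM version whose ``LLM.encode`` accepts a ``pooling_params`` argument:

.. code-block:: python

    from vllm import LLM, PoolingParams

    # Load an embedding model offline (the model choice here is illustrative only).
    llm = LLM(model="intfloat/e5-mistral-7b-instruct")

    # Default-constructed PoolingParams, since its fields vary across vLLM versions.
    outputs = llm.encode(
        ["Hello, my name is", "The capital of France is"],
        pooling_params=PoolingParams(),
    )

    for output in outputs:
        # Each output carries the pooled embedding vector for one prompt.
        print(len(output.outputs.embedding))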
@@ -185,7 +185,7 @@ Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruc
       --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

 .. important::
-    Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
+    Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
     a chat template is **required** to launch the API server.

 Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
@@ -243,6 +243,9 @@ To consume the server, you can use the OpenAI client like in the example below:

 A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

+.. tip::
+    There is no need to format the prompt in the API request since it will be handled by the server.
+
 .. note::

     By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
@@ -251,5 +254,50 @@ A full code example can be found in `examples/openai_api_client_for_multimodal.p

         $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

-.. note::
-    There is no need to format the prompt in the API request since it will be handled by the server.
+Chat Embeddings API
+^^^^^^^^^^^^^^^^^^^
+
+vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
+where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
+
+.. tip::
+    The schema of ``messages`` is exactly the same as in Chat Completions API.
+
+In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.
+
+.. code-block:: bash
+
+    vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
+      --trust-remote-code --max-model-len 4096
+
+.. important::
+
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
+    to run this model in embedding mode instead of text generation mode.
+
+Since this schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
|
Contributor:
    Just leaving this as a thought here: should we perhaps have a fork of the openai client that supports our extensions explicitly?

Member (Author):
    This sounds good, but not sure whether we have the bandwidth to maintain it 😅

Member (Author):
    I suggest opening an issue for this.
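As an aside on the thread above (not part of the diff): a client fork is not needed for the standard path, since the section describes the Chat Embeddings API as a superset of OpenAI's Embeddings API. Under that assumption, the stock ``openai`` Python client can still target the same server for plain text ``input``; only the ``messages`` extension requires the raw ``requests`` call shown below. A minimal sketch:

.. code-block:: python

    from openai import OpenAI

    # Point the official client at vLLM's OpenAI-compatible server;
    # the api_key value is a placeholder required by the client.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # Standard Embeddings API call with text-only input (no client fork needed).
    result = client.embeddings.create(
        model="TIGER-Lab/VLM2Vec-Full",
        input=["A photo of a boardwalk in a nature reserve."],
        encoding_format="float",
    )
    print(len(result.data[0].embedding))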
+.. code-block:: python
+
+    import requests
+
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+
+    response = requests.post(
+        "http://localhost:8000/v1/embeddings",
+        json={
+            "model": "TIGER-Lab/VLM2Vec-Full",
+            "messages": [{
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                    {"type": "text", "text": "Represent the given image."},
+                ],
+            }],
+            "encoding_format": "float",
+        },
+    )
+    response.raise_for_status()
+
+    embedding_json = response.json()
+    print("Embedding output:", embedding_json["data"][0]["embedding"])