
Commit 7dbad07

openvino support in vllm (#65)
Signed-off-by: Zahidul Haque <[email protected]>
1 parent 3d134d2 commit 7dbad07

File tree

3 files changed: 128 additions, 0 deletions

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@

# Use vLLM with OpenVINO

## Build Docker Image

To build the Docker image, run:

```bash
bash ./build_vllm_openvino.sh
```

Once the build succeeds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI-compatible API endpoint, or you can work with it interactively via a bash shell.
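
For example, to explore the image interactively (a minimal sketch; it assumes the image's default entrypoint can be overridden):

```bash
# Sketch: open an interactive shell in the freshly built image.
# Adjust flags to your environment.
docker run --rm -it --entrypoint bash vllm:openvino
```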

## Use vLLM Serving with the OpenAI API

### Start the Server

For gated models such as `Llama-2`, you must pass `-e HUGGING_FACE_HUB_TOKEN=<token>` with a valid Hugging Face Hub read token to the `docker run` command that starts the serving container.

Please follow [Hugging Face user access tokens](https://huggingface.co/docs/hub/security-tokens) to get an access token, then export it as the `HUGGINGFACEHUB_API_TOKEN` environment variable:

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

To start the model server:

```bash
bash launch_model_server.sh
```
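
For gated models, the token exported above still has to reach the container. The stock `launch_model_server.sh` does not forward it, so one option (a sketch mirroring that script's `docker run` command; model and port are examples) is to add the `-e HUGGING_FACE_HUB_TOKEN` flag:

```bash
# Sketch: forward the Hugging Face token into the serving container for gated models.
docker run --rm --name="vllm-openvino-server" \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm:openvino --model meta-llama/Llama-2-7b-hf --port 8000 --disable-log-requests --swap-space 50
```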

### Request Completion with curl

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "What is the key advantage of the OpenVINO framework?",
        "max_tokens": 300,
        "temperature": 0.7
      }'
```

#### Customize vLLM-OpenVINO Service

The `launch_model_server.sh` script accepts two parameters:

- `port` (`-p`): the port number assigned to the vLLM CPU endpoint (default: 8000).
- `model` (`-m`): the model name used for the LLM (default: `meta-llama/Llama-2-7b-hf`).

You can customize both parameters to suit your needs. For example, to serve a different model on a different port:

`bash launch_model_server.sh -m meta-llama/Llama-2-7b-chat-hf -p 8123`

Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:

```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8000"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="meta-llama/Llama-2-7b-hf"
```

## Use Int-8 Weights Compression

Int-8 weights compression is disabled by default. For better performance and lower memory consumption, enable it by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
To pass the variable in Docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` to the `docker run` command in the examples above.

The variable enables the weights compression logic described in [optimum-intel 8-bit weights quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
Note that even when the variable is set, compression is applied only to models above a certain size; smaller models are left uncompressed because compressing them would cause a significant accuracy drop.
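
For illustration, here is a sketch based on the `docker run` command from `launch_model_server.sh` with the flag added (model and port are examples, not requirements):

```bash
# Sketch: serving command with int-8 weight compression enabled.
docker run --rm --name="vllm-openvino-server" \
  -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1 \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm:openvino --model meta-llama/Llama-2-7b-hf --port 8000
```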

## Use UInt-8 KV Cache Compression

KV cache uint-8 compression is disabled by default. For better performance and lower memory consumption, enable it by setting the environment variable `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`.
To pass the variable in Docker, add `-e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` to the `docker run` command in the examples above.
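
Both compression options can be set on the same container. A combined sketch, again modeled on the `launch_model_server.sh` command with illustrative model and port values:

```bash
# Sketch: enable int-8 weight compression and uint-8 KV cache compression together.
docker run --rm --name="vllm-openvino-server" \
  -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1 \
  -e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm:openvino --model meta-llama/Llama-2-7b-hf --port 8000
```
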
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@

#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Clone the vLLM fork that carries the OpenVINO model executor and build the image.
git clone --branch openvino-model-executor https://github.com/ilya-lavrenov/vllm.git
cd ./vllm/
docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@

#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Set default values
default_port=8000
default_model="meta-llama/Llama-2-7b-hf"
swap_space=50

while getopts ":hm:p:" opt; do
  case $opt in
    h)
      echo "Usage: $0 [-h] [-m model] [-p port]"
      echo "Options:"
      echo "  -h         Display this help message"
      echo "  -m model   Model (default: meta-llama/Llama-2-7b-hf)"
      echo "  -p port    Port (default: 8000)"
      exit 0
      ;;
    m)
      model=$OPTARG
      ;;
    p)
      port=$OPTARG
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done

# Assign arguments to variables, falling back to the defaults
model_name=${model:-$default_model}
port_number=${port:-$default_port}

# Set the Hugging Face cache directory variable
HF_CACHE_DIR=$HOME/.cache/huggingface

# Start the model server using OpenVINO as the backend inference engine.
# Give the container a unique, meaningful name, typically one that includes the model name.
docker run --rm --name="vllm-openvino-server" \
  -p $port_number:$port_number \
  -v $HF_CACHE_DIR:/root/.cache/huggingface \
  vllm:openvino --model $model_name --port $port_number --disable-log-requests --swap-space $swap_space
