
Commit 8e3f553

Support Llama index for vLLM (#665)
Signed-off-by: Xinyao Wang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 267fb02 commit 8e3f553

18 files changed: +720 −1 lines

.github/workflows/docker/compose/llms-compose-cd.yaml

Lines changed: 8 additions & 0 deletions
@@ -11,3 +11,11 @@ services:
       context: vllm-openvino
       dockerfile: Dockerfile.openvino
     image: ${REGISTRY:-opea}/vllm-openvino:${TAG:-latest}
+  llm-vllm-llamaindex:
+    build:
+      dockerfile: comps/llms/text-generation/vllm/llama_index/Dockerfile
+    image: ${REGISTRY:-opea}/llm-vllm-llamaindex:${TAG:-latest}
+  llm-vllm-llamaindex-hpu:
+    build:
+      dockerfile: comps/llms/text-generation/vllm/llama_index/dependency/Dockerfile.intel_hpu
+    image: ${REGISTRY:-opea}/llm-vllm-llamaindex-hpu:${TAG:-latest}

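For reference, a manual build of the two images registered above might look like the sketch below; the explicit `docker build` commands and the assumption that the repository root is the build context are ours, not part of this commit (the first command matches `build_docker_microservice.sh` further down).

```bash
# Manual equivalents of the two new CD entries, run from the repository root;
# the tags mirror the compose defaults (REGISTRY=opea, TAG=latest).
docker build -t opea/llm-vllm-llamaindex:latest \
  -f comps/llms/text-generation/vllm/llama_index/Dockerfile .
docker build -t opea/llm-vllm-llamaindex-hpu:latest \
  -f comps/llms/text-generation/vllm/llama_index/dependency/Dockerfile.intel_hpu .
```
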
comps/llms/text-generation/vllm/langchain/build_docker_microservice.sh

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ docker build \
   -t opea/llm-vllm:latest \
   --build-arg https_proxy=$https_proxy \
   --build-arg http_proxy=$http_proxy \
-  -f comps/llms/text-generation/vllm/docker/Dockerfile .
+  -f comps/llms/text-generation/vllm/langchain/Dockerfile .

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM ubuntu:22.04

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev \
    python3 \
    python3-pip

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm/llama_index/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/vllm/llama_index

ENTRYPOINT ["bash", "entrypoint.sh"]
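Once the image is built (the `build_docker_microservice.sh` script further down does exactly that), a run might look like the sketch below; the container name, the 9000 port mapping, and the specific `-e` variables are assumptions drawn from the README in this commit, not something the Dockerfile itself pins down.

```bash
# Hypothetical run of the image produced by this Dockerfile; port 9000 matches the
# microservice query example in the README, and the env vars come from its section 1.
docker run -d --name llm-vllm-llamaindex -p 9000:9000 \
  -e vLLM_ENDPOINT=${vLLM_ENDPOINT} \
  -e LLM_MODEL=${LLM_MODEL} \
  -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
  opea/llm-vllm-llamaindex:latest
```
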
Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@

# vLLM Endpoint Service

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.

## 🚀1. Set up Environment Variables

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
```

For gated models such as `LLAMA-2`, you will have to pass the `HUGGINGFACEHUB_API_TOKEN` environment variable. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get an access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with it.

## 🚀2. Set up vLLM Service

First of all, go to the server folder for vLLM.

```bash
cd dependency
```

### 2.1 vLLM on CPU

First, let's enable vLLM on CPU.

#### Build docker

```bash
bash ./build_docker_vllm.sh
```

The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, which specifies the hardware mode of the service; the default is `cpu`, and the alternative is `hpu`.

#### Launch vLLM service

```bash
bash ./launch_vllm_service.sh
```

If you want to customize the port or model name, you can run:

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name}
```

### 2.2 vLLM on Gaudi

Next, we show how to enable vLLM on Gaudi.

#### Build docker

```bash
bash ./build_docker_vllm.sh hpu
```

Set `hw_mode` to `hpu`.

Note: If you want to enable tensor parallelism, please pin `setuptools==69.5.1` in Dockerfile.hpu before building the docker image, using the following command.

```bash
sed -i "s/RUN pip install setuptools/RUN pip install setuptools==69.5.1/g" docker/Dockerfile.hpu
```

#### Launch vLLM service on single node

For small models, a single node is sufficient.

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu 1
```

Set `hw_mode` to `hpu` and `parallel_number` to 1.

The `launch_vllm_service.sh` script accepts 7 parameters (see the example invocation after this list):

- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu"; the alternative is "hpu".
- parallel_number: The number of parallel nodes for 'hpu' mode.
- block_size: Default set to 128 for better performance on HPU.
- max_num_seqs: Default set to 256 for better performance on HPU.
- max_seq_len_to_capture: Default set to 2048 for better performance on HPU.

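For reference, a fully spelled-out invocation with all seven positional arguments could look like the following sketch; the values are simply the defaults documented above, with `hpu` selected as the hardware mode.

```bash
# Positional order: port, model, hw_mode, parallel_number, block_size, max_num_seqs, max_seq_len_to_capture.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct hpu 1 128 256 2048
```
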
For more performance tuning tips, refer to [Performance tuning](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#performance-tips).

#### Launch vLLM service on multiple nodes

For large models such as `meta-llama/Meta-Llama-3-70b`, we need to launch the service on multiple nodes.

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu ${parallel_number}
```

For example, to run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use the following command.

```bash
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
```

### 2.3 vLLM with OpenVINO

vLLM powered by OpenVINO supports all LLM models from the [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support. The OpenVINO vLLM backend supports the following advanced vLLM features:

- Prefix caching (`--enable-prefix-caching`)
- Chunked prefill (`--enable-chunked-prefill`)

#### Build Docker Image

To build the docker image, run the command

```bash
bash ./build_docker_vllm_openvino.sh
```

Once it builds successfully, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI API endpoint, or you can work with it interactively via a bash shell.

#### Launch vLLM service

For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command with a valid Hugging Face Hub read token.

Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get an access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with it.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

To start the model server:

```bash
bash launch_vllm_service_openvino.sh
```

#### Performance tips

The vLLM OpenVINO backend uses the following environment variables to control its behavior (see the sketch after this list):

- `VLLM_OPENVINO_KVCACHE_SPACE` specifies the KV cache size (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=40` means 40 GB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the user's memory management pattern.
- `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` controls the KV cache precision. By default, FP16/BF16 is used, depending on the platform.
- `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON` enables U8 weight compression during the model loading stage. By default, compression is turned off.

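A minimal sketch of applying these variables when starting the service is shown below; it assumes that `launch_vllm_service_openvino.sh` forwards the caller's environment to its `docker run` invocation, which is an assumption about that script rather than something documented here.

```bash
# Assumption: the launch script passes these variables through to the serving container.
# Values are illustrative; size VLLM_OPENVINO_KVCACHE_SPACE to your host's memory.
export VLLM_OPENVINO_KVCACHE_SPACE=40
export VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8
export VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
bash launch_vllm_service_openvino.sh
```
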
To improve TPOT / TTFT latency, you can use vLLM's chunked prefill feature (`--enable-chunked-prefill`). Based on our experiments, the recommended batch size is `256` (`--max-num-batched-tokens`).

The OpenVINO best-known configuration is:

```bash
VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
```

### 2.4 Query the service

You can then make requests like the one below to check the service status:

```bash
curl http://${your_ip}:8008/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "prompt": "What is Deep Learning?",
  "max_tokens": 32,
  "temperature": 0
  }'
```

## 🚀3. Set up LLM microservice

Then we wrap the vLLM service into an LLM microservice.

### Build docker

```bash
bash build_docker_microservice.sh
```

### Launch the microservice

```bash
bash launch_microservice.sh
```

### Query the microservice

```bash
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_p":0.95,"temperature":0.01,"streaming":false}' \
  -H 'Content-Type: application/json'
```

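For comparison, a streaming variant of the same request only flips the documented `streaming` flag; everything else is unchanged.

```bash
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_p":0.95,"temperature":0.01,"streaming":true}' \
  -H 'Content-Type: application/json'
```
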
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
  -t opea/llm-vllm-llamaindex:latest \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  -f comps/llms/text-generation/vllm/llama_index/Dockerfile .

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# FROM vault.habana.ai/gaudi-docker/1.16.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest as hpu
FROM opea/habanalabs:1.16.1-pytorch-installer-2.2.2 as hpu

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/
ENV LANG=en_US.UTF-8
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    service ssh restart
USER user
WORKDIR /root

RUN pip install --no-cache-dir --upgrade-strategy eager optimum[habana]

RUN pip install --no-cache-dir -v git+https://github.com/HabanaAI/vllm-fork.git@cf6952d

RUN pip install --no-cache-dir setuptools

ENV no_proxy=localhost,127.0.0.1

ENV PT_HPU_LAZY_ACC_PAR_MODE=0

ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true

CMD ["/bin/bash"]

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
#!/bin/bash

# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set default values
default_hw_mode="cpu"

# Assign arguments to variable
hw_mode=${1:-$default_hw_mode}

# Check if all required arguments are provided
if [ "$#" -lt 0 ] || [ "$#" -gt 1 ]; then
    echo "Usage: $0 [hw_mode]"
    echo "Please customize the arguments you want to use.
    - hw_mode: The hardware mode for the Ray Gaudi endpoint, with the default being 'cpu', and the optional selection can be 'cpu' and 'hpu'."
    exit 1
fi

# Build the docker image for vLLM based on the hardware mode
if [ "$hw_mode" = "hpu" ]; then
    docker build -f docker/Dockerfile.intel_hpu -t opea/vllm:hpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
else
    git clone https://github.com/vllm-project/vllm.git
    cd ./vllm/
    docker build -f Dockerfile.cpu -t opea/vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
fi

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
git clone https://github.com/vllm-project/vllm.git vllm
cd ./vllm/
docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
cd $BASEDIR && rm -rf vllm

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Set default values
default_port=8008
default_model=$LLM_MODEL
default_hw_mode="cpu"
default_parallel_number=1
default_block_size=128
default_max_num_seqs=256
default_max_seq_len_to_capture=2048

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
hw_mode=${3:-$default_hw_mode}
parallel_number=${4:-$default_parallel_number}
block_size=${5:-$default_block_size}
max_num_seqs=${6:-$default_max_num_seqs}
max_seq_len_to_capture=${7:-$default_max_seq_len_to_capture}

# Check if all required arguments are provided
if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
    echo "Usage: $0 [port_number] [model_name] [hw_mode] [parallel_number]"
    echo "port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080."
    echo "model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'."
    echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu', and the optional selection can be 'hpu'"
    echo "parallel_number: parallel nodes number for 'hpu' mode"
    echo "block_size: default set to 128 for better performance on HPU"
    echo "max_num_seqs: default set to 256 for better performance on HPU"
    echo "max_seq_len_to_capture: default set to 2048 for better performance on HPU"
    exit 1
fi

# Set the volume variable
volume=$PWD/data

# Build the Docker run command based on hardware mode
if [ "$hw_mode" = "hpu" ]; then
    docker run -d --rm --runtime=habana --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} opea/vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80 --block-size $block_size --max-num-seqs $max_num_seqs --max-seq_len-to-capture $max_seq_len_to_capture "
else
    docker run -d --rm --name="vllm-service" -p $port_number:80 --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e VLLM_CPU_KVCACHE_SPACE=40 opea/vllm:cpu --model $model_name --host 0.0.0.0 --port 80
fi
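
As a quick post-launch sanity check (a sketch, not part of this commit): the container name comes from the script above, `/v1/models` is the standard model-listing route of vLLM's OpenAI-compatible server, and the port assumes the README's default mapping of 8008.

```bash
# Tail the serving container started by the script above.
docker logs vllm-service
# List the served model(s) via the OpenAI-compatible API; adjust the port if you changed it.
curl http://${your_ip}:8008/v1/models
```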
