Commit 6e2c28b

refine vllm instruction (#272)
* refine vllm instruction

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1 parent 9637553 · commit 6e2c28b

2 files changed: +67 -27 lines changed

comps/llms/text-generation/vllm/README.md

Lines changed: 63 additions & 23 deletions
````diff
@@ -2,26 +2,81 @@
 
 [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM also supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.
 
-## Getting Started
+## vLLM on CPU
 
-### Launch vLLM Service
+First, let's enable vLLM on CPU.
 
-#### Launch a local server instance:
+### Build docker
 
 ```bash
-bash ./serving/vllm/launch_vllm_service.sh
+bash ./build_docker_vllm.sh
 ```
 
-The `./serving/vllm/launch_vllm_service.sh` accepts one parameter `hw_mode` to specify the hardware mode of the service, with the default being `cpu`, and the optional selection can be `hpu`.
+The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, which specifies the hardware mode of the service; the default is `cpu`, and the alternative is `hpu`.
 
-For gated models such as `LLAMA-2`, you will have to pass -e HF_TOKEN=\<token\> to the docker run command above with a valid Hugging Face Hub read token.
+### Launch vLLM service
 
-Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export `HF_TOKEN` environment with the token.
+```bash
+bash ./launch_vllm_service.sh
+```
+
+The `launch_vllm_service.sh` script accepts four parameters:
+
+- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
+- model_name: The model name used for LLM serving, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
+- hw_mode: The hardware mode used for LLM serving, with the default set to "cpu"; the alternative is "hpu".
+- parallel_number: The number of parallel nodes for 'hpu' mode.
+
+If you want to customize the port or model name, you can run:
+
+```bash
+bash ./launch_vllm_service.sh ${port_number} ${model_name}
+```
+
+For gated models such as `LLAMA-2`, you have to set the HUGGINGFACEHUB_API_TOKEN environment variable. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get an access token and export it as `HUGGINGFACEHUB_API_TOKEN`.
+
+```bash
+export HUGGINGFACEHUB_API_TOKEN=<token>
+```
+
+## vLLM on Gaudi
+
+Next, we show how to enable vLLM on Gaudi.
+
+### Build docker
 
 ```bash
-export HF_TOKEN=<token>
+bash ./build_docker_vllm.sh hpu
 ```
 
+This sets `hw_mode` to `hpu`.
+
+### Launch vLLM service on a single node
+
+For a small model, a single node is enough.
+
+```bash
+bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu 1
+```
+
+This sets `hw_mode` to `hpu` and `parallel_number` to 1.
+
+### Launch vLLM service on multiple nodes
+
+For a large model such as `meta-llama/Meta-Llama-3-70b`, we need to launch it on multiple nodes.
+
+```bash
+bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu ${parallel_number}
+```
+
+For example, to run `meta-llama/Meta-Llama-3-70b` with 8 cards, use the following command.
+
+```bash
+bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
+```
+
+## Query the service
+
 And then you can make requests like below to check the service status:
 
 ```bash
@@ -34,18 +89,3 @@ curl http://127.0.0.1:8008/v1/completions \
 "temperature": 0
 }'
 ```
-
-#### Customize vLLM Service
-
-The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:
-
-- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080.
-- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
-- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu"
-
-You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
-
-```bash
-export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8008"
-export LLM_MODEL=<model_name> # example: export LLM_MODEL="Intel/neural-chat-7b-v3-3"
-```
````
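Since the endpoint is served by `vllm.entrypoints.openai.api_server`, it speaks the OpenAI-compatible API, so a chat-style request should also work alongside the `/v1/completions` example above. A minimal sketch, assuming the default port 8008 and the default `meta-llama/Meta-Llama-3-8B-Instruct` model from the README; the prompt is only an illustration:

```bash
# Chat-style request against the OpenAI-compatible /v1/chat/completions route.
# Meta-Llama-3-8B-Instruct ships a chat template, so no extra flags are needed.
curl http://127.0.0.1:8008/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 32,
    "temperature": 0
  }'
```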

comps/llms/text-generation/vllm/launch_vllm_service.sh

Lines changed: 4 additions & 4 deletions
````diff
@@ -4,8 +4,8 @@
 
 # Set default values
 default_port=8008
+default_model="meta-llama/Meta-Llama-3-8B-Instruct"
 default_hw_mode="cpu"
-default_model=${LLM_MODEL_ID}
 default_parallel_number=1
 
 # Assign arguments to variables
@@ -18,7 +18,7 @@ parallel_number=${4:-$default_parallel_number}
 if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
 echo "Usage: $0 [port_number] [model_name] [hw_mode] [parallel_number]"
 echo "port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080."
-echo "model_name: The model name utilized for LLM, with the default set to 'Intel/neural-chat-7b-v3-3'."
+echo "model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'."
 echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu', and the optional selection can be 'hpu'"
 echo "parallel_number: parallel nodes number for 'hpu' mode"
 exit 1
@@ -29,7 +29,7 @@ volume=$PWD/data
 
 # Build the Docker run command based on hardware mode
 if [ "$hw_mode" = "hpu" ]; then
-docker run -it --runtime=habana --rm --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80"
+docker run -d --rm --runtime=habana --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80"
 else
-docker run -it --rm --name="vllm-service" -p $port_number:80 --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port 80"
+docker run -d --rm --name="vllm-service" -p $port_number:80 --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port 80"
 fi
````

Note: the new hpu branch as committed reads `docker run -d --rm--runtime=habana --rm ...`, i.e. a missing space and a duplicated `--rm`; the added line above shows the corrected form `docker run -d --rm --runtime=habana ...`.
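Both `docker run` commands now start the container detached (`-d`) instead of interactive (`-it`), so the server output no longer appears in the terminal. One way to confirm the service came up, assuming the `vllm-service` container name and the default port 8008 used by the script:

```bash
# Follow the container logs until vLLM reports the model is loaded and the
# OpenAI-compatible server is listening on port 80 inside the container
docker logs -f vllm-service

# From another shell, confirm the endpoint answers by listing the served models
curl http://127.0.0.1:8008/v1/models
```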
