comps/llms/text-generation/vllm/README.md
63 additions & 23 deletions
@@ -2,26 +2,81 @@

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.

-## Getting Started
+## vLLM on CPU

-### Launch vLLM Service
+First, let's enable vLLM on CPU.

-#### Launch a local server instance:
+### Build Docker

```bash
-bash ./serving/vllm/launch_vllm_service.sh
+bash ./build_docker_vllm.sh
```

-The `./serving/vllm/launch_vllm_service.sh` script accepts one parameter, `hw_mode`, which specifies the hardware mode of the service; the default is `cpu`, and the optional selection is `hpu`.
+The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, which specifies the hardware mode of the service; the default is `cpu`, and the optional selection is `hpu`.
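
For example, a minimal sketch of building for Gaudi instead of the default CPU target, assuming the single positional `hw_mode` argument described above:

```bash
# Illustrative only: build the vLLM serving image for Gaudi (HPU) rather than CPU.
bash ./build_docker_vllm.sh hpu
```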

-For gated models such as `LLAMA-2`, you will have to pass `-e HF_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.
+### Launch vLLM Service

-Please follow this link ([huggingface token](https://huggingface.co/docs/hub/security-tokens)) to get an access token, and export the `HF_TOKEN` environment variable with the token.
+```bash
+bash ./launch_vllm_service.sh
+```
+
+The `launch_vllm_service.sh` script accepts four parameters:
+
+- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
+- model_name: The model name used for the LLM, with the default set to `meta-llama/Meta-Llama-3-8B-Instruct`.
+- hw_mode: The hardware mode used for the LLM, with the default set to `cpu`; the optional selection is `hpu`.
+- parallel_number: The number of parallel nodes for `hpu` mode.
+
+If you want to customize the port or model_name, you can run the script with your own arguments, as sketched below.
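
A minimal sketch of such a customized invocation, assuming the four parameters listed above are passed positionally (port, model name, hardware mode, parallel number):

```bash
# Illustrative only: serve Meta-Llama-3-8B-Instruct on port 8008 on Gaudi with 8 parallel nodes.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct hpu 8
```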
For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link ([huggingface token](https://huggingface.co/docs/hub/security-tokens)) to get an access token, and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.
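
For instance, the token can be exported before launching the service (replace the placeholder with your own Hugging Face Hub read token):

```bash
# Make the Hugging Face Hub token available so gated models such as LLAMA-2 can be downloaded.
export HUGGINGFACEHUB_API_TOKEN=<your_token>
```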
-The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:
-
-- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080.
-- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
-- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu".
-
-You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
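
The removed paragraph above ends by referring to `vLLM_LLM_ENDPOINT`; a minimal sketch of that export, assuming the service listens on the host and port chosen earlier (`host_ip` is a placeholder):

```bash
# Illustrative only: point clients at the running vLLM endpoint; adjust host and port to your deployment.
export vLLM_LLM_ENDPOINT="http://${host_ip}:8080"
```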