# Use vLLM with OpenVINO

## Build Docker Image

To build the Docker image, run:

```bash
bash ./build_vllm_openvino.sh
```

Once the build succeeds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI-compatible API endpoint, or you can work with it interactively via a bash shell.
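
As a quick illustration, you can drop into an interactive shell inside the image like this (a minimal sketch; the `--entrypoint` override is an assumption about how the image is configured):

```bash
# Open a bash shell inside the freshly built vllm:openvino image
docker run -it --rm --entrypoint bash vllm:openvino
```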

## Use vLLM Serving with the OpenAI API

### Start the Server

For gated models such as `LLAMA-2`, you must pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command that starts the serving container, with a valid Hugging Face Hub read token.

Follow [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) to create an access token, then export it via the `HUGGINGFACEHUB_API_TOKEN` environment variable:

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```
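
If you start the container with `docker run` directly instead of the launch script, the token can be forwarded like this (a hypothetical invocation; the port mapping and remaining flags depend on your setup):

```bash
# Forward the Hugging Face token into the serving container
docker run -d --rm \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
  vllm:openvino
```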

To start the model server:

```bash
bash launch_model_server.sh
```
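
Once the container is up, a quick sanity check against the OpenAI-compatible API can look like this (assuming the default port 8000):

```bash
# List the models the server is currently serving
curl http://localhost:8000/v1/models
```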

### Request a Completion with curl

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "meta-llama/Llama-2-7b-hf",
  "prompt": "What is the key advantage of the OpenVINO framework?",
  "max_tokens": 300,
  "temperature": 0.7
  }'
```
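
The same endpoint also supports token streaming; here is a sketch of the request above with streaming enabled (the server then returns Server-Sent Events instead of a single JSON body):

```bash
# Stream the completion token by token
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "meta-llama/Llama-2-7b-hf",
  "prompt": "What is the key advantage of the OpenVINO framework?",
  "max_tokens": 300,
  "temperature": 0.7,
  "stream": true
  }'
```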

#### Customize the vLLM-OpenVINO Service

The `launch_model_server.sh` script accepts two parameters:

- `port`: the port number of the vLLM CPU endpoint; defaults to 8000.
- `model`: the model name served by the LLM endpoint; defaults to `meta-llama/Llama-2-7b-hf`.

You can customize both parameters as needed. For example, to serve a different model on a different port:

`bash launch_model_server.sh -m meta-llama/Llama-2-7b-chat-hf -p 8123`

Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:

```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8000"
export LLM_MODEL=<model_name>  # example: export LLM_MODEL="meta-llama/Llama-2-7b-hf"
```
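
The exported variables can then be reused when composing requests (illustrative only; the endpoint IP above is a placeholder):

```bash
# Send a completion request using the exported endpoint and model name
curl ${vLLM_LLM_ENDPOINT}/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${LLM_MODEL}\", \"prompt\": \"Hello\", \"max_tokens\": 32}"
```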

## Use Int-8 Weights Compression

Int-8 weight compression is disabled by default. For better performance and lower memory consumption, enable it by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
To pass the variable into Docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` to the `docker run` command in the examples above.

The variable enables the weight-compression logic described in [optimum-intel 8-bit weights quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
Hence, even with the variable set, compression is applied only to models above a certain size; smaller models are left uncompressed because the accuracy drop would be significant.
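
A hypothetical invocation with weight compression enabled (the port mapping and other flags depend on how you normally start the container):

```bash
# Start the serving container with int-8 weight compression turned on
docker run -d --rm \
  -p 8000:8000 \
  -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1 \
  vllm:openvino
```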

## Use UInt-8 KV Cache Compression

KV cache uint-8 compression is disabled by default. For better performance and lower memory consumption, enable it by setting the environment variable `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`.
To pass the variable into Docker, add `-e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` to the `docker run` command in the examples above.
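
Similarly, a hypothetical invocation with KV cache compression enabled (again, the remaining flags depend on your setup; both compression variables can be combined in one command):

```bash
# Start the serving container with uint-8 KV cache compression turned on
docker run -d --rm \
  -p 8000:8000 \
  -e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 \
  vllm:openvino
```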