
Commit 78b94fc

text generation, embedding and reranking with ovms (opea-project#1318)
Signed-off-by: Dariusz Trawinski <[email protected]>
1 parent d5cd3e4 commit 78b94fc

File tree

23 files changed: +1228 -16 lines changed

comps/embeddings/deployment/docker_compose/compose.yaml

Lines changed: 16 additions & 0 deletions
@@ -3,6 +3,7 @@

 include:
   - ../../../third_parties/tei/deployment/docker_compose/compose.yaml
+  - ../../../third_parties/ovms/deployment/docker_compose/compose.yaml
   - ../../../third_parties/bridgetower/deployment/docker_compose/compose.yaml
   - ../../../third_parties/clip/deployment/docker_compose/compose_intel_cpu.yaml

@@ -40,6 +41,21 @@ services:
         condition: service_healthy
     restart: unless-stopped

+  ovms-embedding-server:
+    image: ${REGISTRY:-opea}/embedding:${TAG:-latest}
+    container_name: ovms-embedding-server
+    ports:
+      - "${EMBEDDER_PORT:-10201}:6000"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      OVMS_EMBEDDING_ENDPOINT: ${OVMS_EMBEDDING_ENDPOINT}
+      EMBEDDING_COMPONENT_NAME: "OPEA_OVMS_EMBEDDING"
+      MODEL_ID: ${MODEL_ID}
+    restart: unless-stopped
+
   pg-embedding-server:
     image: ${REGISTRY:-opea}/embedding:${TAG:-latest}
     container_name: pg-embedding-server
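Once this wrapper and its OVMS backend are running, a quick probe of the mapped host port confirms the service is healthy. A minimal Python sketch, assuming the default `EMBEDDER_PORT` of 10201 from the mapping above and the `requests` package:

```python
# Probe the ovms-embedding-server wrapper through the host port mapped above.
# Assumes the compose default EMBEDDER_PORT=10201; adjust if you override it.
import requests

base = "http://localhost:10201"
health = requests.get(f"{base}/v1/health_check", timeout=10)
print(health.status_code)  # expect 200 when the wrapper is up

resp = requests.post(f"{base}/v1/embeddings", json={"input": "Hello, world!"}, timeout=30)
resp.raise_for_status()
print(resp.json()["data"][0]["embedding"][:5])  # first few values of the embedding vector
```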

comps/embeddings/src/README.md

Lines changed: 5 additions & 1 deletion
@@ -12,7 +12,11 @@ Key Features:

 **Customizable**: Supports configuration and customization to meet specific use case requirements, including different embedding models and preprocessing techniques.

-Users are albe to configure and build embedding-related services according to their actual needs.
+Users are able to configure and build embedding-related services according to their actual needs.
+
+## Embeddings Microservice with OVMS
+
+For details, please refer to [readme](./README_ovms.md).

 ## Embeddings Microservice with TEI


comps/embeddings/src/README_ovms.md

Lines changed: 149 additions & 0 deletions
# 🌟 Embedding Microservice with OpenVINO Model Server

This guide walks you through starting, deploying, and consuming the **OVMS Embeddings Microservice**. 🚀
It is a highly optimized Intel serving solution that uses the OpenVINO Runtime for fast inference on CPU.

---

## 📦 1. Start Microservice with `docker run`

### 🔹 1.1 Start Embedding Service with OVMS

1. Prepare the model in the model repository.
   This step exports the model from the HuggingFace Hub to the local model repository, converting it to IR format and optionally quantizing it along the way.
   It speeds up starting the service and avoids downloading the model from the Internet each time the container starts.

   ```
   pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/requirements.txt
   curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/export_model.py -o export_model.py
   mkdir models
   python export_model.py embeddings --source_model BAAI/bge-large-en-v1.5 --weight-format int8 --config_file_path models/config_embeddings.json --model_repository_path models --target_device CPU
   ```

2. **Start the OVMS service**:
   Run the following command to start the OVMS container.

   ```bash
   your_port=8090
   docker run -p $your_port:8000 -v ./models:/models --name ovms-embedding-serving \
   openvino/model_server:2025.0 --port 8000 --config_path /models/config_embeddings.json
   ```

3. **Test the OVMS service**:
   Run the following command to check if the service is up and running; an equivalent Python call is sketched after this list.

   ```bash
   curl http://localhost:$your_port/v3/embeddings \
     -X POST \
     -H 'Content-Type: application/json' \
     -d '{
       "model": "BAAI/bge-large-en-v1.5",
       "input": "What is Deep Learning?"
     }'
   ```
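The same check can be scripted. A minimal Python sketch, assuming the port and model name exported above and the `requests` package:

```python
# Query the OVMS OpenAI-compatible embeddings endpoint directly.
# Assumes OVMS listens on localhost:8090 and serves BAAI/bge-large-en-v1.5.
import requests

payload = {"model": "BAAI/bge-large-en-v1.5", "input": "What is Deep Learning?"}
resp = requests.post("http://localhost:8090/v3/embeddings", json=payload, timeout=30)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(f"Got an embedding vector of length {len(embedding)}")
```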
### 🔹 1.2 Build Docker Image and Run Docker with CLI

1. Build the Docker image for the embedding microservice:

   ```bash
   cd ../../../
   docker build -t opea/embedding:latest \
     --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy \
     -f comps/embeddings/src/Dockerfile .
   ```

2. Run the embedding microservice and connect it to the OVMS service:

   ```bash
   docker run -d --name="embedding-ovms-server" \
     -p 6000:6000 \
     --ipc=host \
     -e OVMS_EMBEDDING_ENDPOINT=$OVMS_EMBEDDING_ENDPOINT \
     -e MODEL_ID=$MODEL_ID \
     -e EMBEDDING_COMPONENT_NAME="OPEA_OVMS_EMBEDDING" \
     opea/embedding:latest
   ```

## 📦 2. Start Microservice with docker compose

Deploy both the OVMS Embedding Service and the Embedding Microservice using Docker Compose.

🔹 Steps:

1. Set environment variables:

   ```bash
   export host_ip=${your_ip_address}
   export MODEL_ID="BAAI/bge-large-en-v1.5"
   export OVMS_EMBEDDER_PORT=8090
   export EMBEDDER_PORT=6000
   export OVMS_EMBEDDING_ENDPOINT="http://${host_ip}:${OVMS_EMBEDDER_PORT}"
   ```

2. Navigate to the Docker Compose directory:

   ```bash
   cd comps/embeddings/deployment/docker_compose/
   ```

3. Start the services:

   ```bash
   docker compose up ovms-embedding-server -d
   ```

## 📦 3. Consume Embedding Service

### 🔹 3.1 Check Service Status

Verify the embedding service is running:

```bash
curl http://localhost:6000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

### 🔹 3.2 Use the Embedding Service API

The API is compatible with the [OpenAI API](https://platform.openai.com/docs/api-reference/embeddings); a Python sketch using the same endpoint follows the examples below.

1. Single Text Input

   ```bash
   curl http://localhost:6000/v1/embeddings \
     -X POST \
     -d '{"input":"Hello, world!"}' \
     -H 'Content-Type: application/json'
   ```

2. Multiple Text Inputs with Parameters

   ```bash
   curl http://localhost:6000/v1/embeddings \
     -X POST \
     -d '{"input":["Hello, world!","How are you?"], "dimensions":100}' \
     -H 'Content-Type: application/json'
   ```
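The same requests can be issued from Python. A minimal sketch, assuming the microservice is reachable on `localhost:6000` as configured above and `requests` is installed:

```python
# Call the OPEA embedding microservice's OpenAI-compatible endpoint.
import requests

resp = requests.post(
    "http://localhost:6000/v1/embeddings",
    json={"input": ["Hello, world!", "How are you?"], "dimensions": 100},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["data"]:
    print(len(item["embedding"]))  # one vector per input text
```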
## ✨ Tips for Better Understanding:

1. Port Mapping:
   Ensure the ports are correctly mapped to avoid conflicts with other services.

2. Model Selection:
   Choose a model appropriate for your use case, such as "BAAI/bge-large-en-v1.5" or "BAAI/bge-base-en-v1.5".
   The model should be exported to the model repository and set in the `MODEL_ID` environment variable of the OPEA API wrapper deployment.

3. Model Repository Volume:
   The `-v ./models:/models` flag ensures the model repository is correctly mounted.

4. Select the Correct Configuration JSON File:
   A model repository can host multiple models. Choose the models to be served by selecting the right configuration file, `config_embeddings.json` in the example above.

5. Upload the Models to a Persistent Volume Claim in Kubernetes:
   The model repository, including the configuration JSON file, is mounted into the OVMS containers when deployed via the [helm chart](../../third_parties/ovms/deployment/kubernetes/README.md).

6. Learn more about the [OVMS embeddings API](https://docs.openvino.ai/2025/model-server/ovms_docs_rest_api_embeddings.html).
comps/embeddings/src/integrations/ovms.py

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
# Copyright (C) 2024 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
import json
5+
import os
6+
from typing import List, Union
7+
8+
import requests
9+
from huggingface_hub import AsyncInferenceClient
10+
11+
from comps import CustomLogger, OpeaComponent, OpeaComponentRegistry, ServiceType
12+
from comps.cores.mega.utils import get_access_token
13+
from comps.cores.proto.api_protocol import EmbeddingRequest, EmbeddingResponse
14+
15+
logger = CustomLogger("opea_ovms_embedding")
16+
logflag = os.getenv("LOGFLAG", False)
17+
TOKEN_URL = os.getenv("TOKEN_URL")
18+
CLIENTID = os.getenv("CLIENTID")
19+
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
20+
MODEL_ID = os.getenv("MODEL_ID")
21+
22+
23+
@OpeaComponentRegistry.register("OPEA_OVMS_EMBEDDING")
24+
class OpeaOVMSEmbedding(OpeaComponent):
25+
"""A specialized embedding component derived from OpeaComponent for OVMS embedding services.
26+
27+
Attributes:
28+
client (AsyncInferenceClient): An instance of the async client for embedding generation.
29+
model_name (str): The name of the embedding model used.
30+
"""
31+
32+
def __init__(self, name: str, description: str, config: dict = None):
33+
super().__init__(name, ServiceType.EMBEDDING.name.lower(), description, config)
34+
self.base_url = os.getenv("OVMS_EMBEDDING_ENDPOINT", "http://localhost:8080")
35+
self.client = self._initialize_client()
36+
37+
health_status = self.check_health()
38+
if not health_status:
39+
logger.error("OpeaOVMSEmbedding health check failed.")
40+
41+
def _initialize_client(self) -> AsyncInferenceClient:
42+
"""Initializes the AsyncInferenceClient."""
43+
access_token = (
44+
get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
45+
)
46+
headers = {"Authorization": f"Bearer {access_token}"} if access_token else {}
47+
return AsyncInferenceClient(
48+
model=MODEL_ID,
49+
token=os.getenv("HUGGINGFACEHUB_API_TOKEN"),
50+
headers=headers,
51+
)
52+
53+
async def invoke(self, input: EmbeddingRequest) -> EmbeddingResponse:
54+
"""Invokes the embedding service to generate embeddings for the provided input.
55+
56+
Args:
57+
input (EmbeddingRequest): The input in OpenAI embedding format, including text(s) and optional parameters like model.
58+
59+
Returns:
60+
EmbeddingResponse: The response in OpenAI embedding format, including embeddings, model, and usage information.
61+
"""
62+
# Parse input according to the EmbeddingRequest format
63+
if isinstance(input.input, str):
64+
texts = [input.input.replace("\n", " ")]
65+
elif isinstance(input.input, list):
66+
if all(isinstance(item, str) for item in input.input):
67+
texts = [text.replace("\n", " ") for text in input.input]
68+
else:
69+
raise ValueError("Invalid input format: Only string or list of strings are supported.")
70+
else:
71+
raise TypeError("Unsupported input type: input must be a string or list of strings.")
72+
response = await self.client.post(
73+
json={
74+
"input": texts,
75+
"encoding_format": input.encoding_format,
76+
"model": self.client.model,
77+
"user": input.user,
78+
},
79+
model=f"{self.base_url}/v3/embeddings",
80+
task="text-embedding",
81+
)
82+
embeddings = json.loads(response.decode())
83+
return EmbeddingResponse(**embeddings)
84+
85+
def check_health(self) -> bool:
86+
"""Checks the health of the embedding service.
87+
88+
Returns:
89+
bool: True if the service is reachable and healthy, False otherwise.
90+
"""
91+
try:
92+
response = requests.get(f"{self.base_url}/v2/health/ready")
93+
if response.status_code == 200:
94+
return True
95+
else:
96+
return False
97+
except Exception as e:
98+
# Handle connection errors, timeouts, etc.
99+
logger.error(f"Health check failed: {e}")
100+
return False
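For reference, a minimal sketch of exercising this component standalone, outside the registry-driven microservice. It assumes `OVMS_EMBEDDING_ENDPOINT` and `MODEL_ID` are exported, an OVMS instance is already serving the model, the script runs from `comps/embeddings/src` (so the `integrations` package is importable), and that `EmbeddingRequest` only requires `input`:

```python
# Hypothetical standalone usage of OpeaOVMSEmbedding; in the microservice this
# component is created through OpeaComponentRegistry instead.
import asyncio

from comps.cores.proto.api_protocol import EmbeddingRequest
from integrations.ovms import OpeaOVMSEmbedding


async def main():
    embedder = OpeaOVMSEmbedding(
        name="OPEA_OVMS_EMBEDDING",
        description="OVMS embedding component",
        config={},
    )
    request = EmbeddingRequest(input=["What is Deep Learning?", "Hello, world!"])
    response = await embedder.invoke(request)
    # Each entry in response.data carries one embedding vector.
    print(len(response.data), len(response.data[0].embedding))


asyncio.run(main())
```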

comps/embeddings/src/opea_embedding_microservice.py

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 import time

 from integrations.clip import OpeaClipEmbedding
+from integrations.ovms import OpeaOVMSEmbedding
 from integrations.predictionguard import PredictionguardEmbedding
 from integrations.tei import OpeaTEIEmbedding


comps/llms/deployment/docker_compose/compose_text-generation.yaml

Lines changed: 12 additions & 0 deletions
@@ -5,6 +5,7 @@ include:
   - ../../../third_parties/tgi/deployment/docker_compose/compose.yaml
   - ../../../third_parties/vllm/deployment/docker_compose/compose.yaml
   - ../../../third_parties/ollama/deployment/docker_compose/compose.yaml
+  - ../../../third_parties/ovms/deployment/docker_compose/compose.yaml

 services:
   textgen:
@@ -100,6 +101,17 @@ services:
     environment:
       LLM_COMPONENT_NAME: ${LLM_COMPONENT_NAME:-OpeaTextGenNative}

+  textgen-service-ovms:
+    extends: textgen
+    container_name: textgen-service-ovms
+    environment:
+      LLM_COMPONENT_NAME: ${LLM_COMPONENT_NAME:-OpeaTextGenService}
+      OVMS_LLM_ENDPOINT: ${OVMS_LLM_ENDPOINT}
+      MODEL_ID: ${MODEL_ID}
+    depends_on:
+      ovms-llm-serving:
+        condition: service_healthy
+
 networks:
   default:
     driver: bridge
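Once `textgen-service-ovms` is up, it can be queried like the other text-generation backends. A hypothetical smoke test, assuming the textgen wrapper's usual port 9000 and OpenAI-style `/v1/chat/completions` route, with the model name matching the `MODEL_ID` served by OVMS:

```python
# Hypothetical request to the textgen-service-ovms wrapper; port and route are
# the usual textgen defaults and may differ in your deployment.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # must match the MODEL_ID exported to OVMS
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 64,
    "stream": False,
}
resp = requests.post("http://localhost:9000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```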

comps/llms/src/text-generation/Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -11,11 +11,12 @@ RUN useradd -m -s /bin/bash user && \
     mkdir -p /home/user && \
     chown -R user /home/user/

-COPY comps /home/user/comps
+COPY comps/llms/src/text-generation/requirements.txt /home/user/comps/llms/src/text-generation/requirements.txt

 RUN pip install --no-cache-dir --upgrade pip setuptools && \
     pip install --no-cache-dir -r /home/user/comps/llms/src/text-generation/requirements.txt

+COPY comps /home/user/comps
 ENV PYTHONPATH=$PYTHONPATH:/home/user

 USER user

comps/llms/src/text-generation/README.md

Lines changed: 12 additions & 12 deletions
@@ -8,18 +8,18 @@ Overall, this microservice offers a streamlined way to integrate large language

 ## Validated LLM Models

-| Model                                       | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi |
-| ------------------------------------------- | --------- | -------- | ---------- |
-| [Intel/neural-chat-7b-v3-3]                 | ✓         | ✓        | ✓          |
-| [meta-llama/Llama-2-7b-chat-hf]             | ✓         | ✓        | ✓          |
-| [meta-llama/Llama-2-70b-chat-hf]            | ✓         | -        | ✓          |
-| [meta-llama/Meta-Llama-3-8B-Instruct]       | ✓         | ✓        | ✓          |
-| [meta-llama/Meta-Llama-3-70B-Instruct]      | ✓         | -        | ✓          |
-| [Phi-3]                                     | x         | Limit 4K | Limit 4K   |
-| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓         | -        | ✓          |
-| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B]  | ✓         | -        | ✓          |
-| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓         | -        | ✓          |
-| [mistralai/Mistral-Large-Instruct-2411]     | x         | -        | ✓          |
+| Model                                       | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | OVMS     |
+| ------------------------------------------- | --------- | -------- | ---------- | -------- |
+| [Intel/neural-chat-7b-v3-3]                 | ✓         | ✓        | ✓          | ✓        |
+| [meta-llama/Llama-2-7b-chat-hf]             | ✓         | ✓        | ✓          | ✓        |
+| [meta-llama/Llama-2-70b-chat-hf]            | ✓         | -        | ✓          | -        |
+| [meta-llama/Meta-Llama-3-8B-Instruct]       | ✓         | ✓        | ✓          | ✓        |
+| [meta-llama/Meta-Llama-3-70B-Instruct]      | ✓         | -        | ✓          | -        |
+| [Phi-3]                                     | x         | Limit 4K | Limit 4K   | Limit 4K |
+| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓         | -        | ✓          | -        |
+| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B]  | ✓         | -        | ✓          | -        |
+| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓         | -        | ✓          | -        |
+| [mistralai/Mistral-Large-Instruct-2411]     | x         | -        | ✓          | -        |

 ### System Requirements for LLM Models
0 commit comments
