Commit 484b69a

[ChatQnA] Support the replica tuning for ChatQnA (#116)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent cf8bd83 commit 484b69a

22 files changed: +3117 -0 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions

@@ -10,6 +10,7 @@ repos:
       files: (.*\.(py|md|rst|yaml|yml|json|ts|js|html|svelte|sh))$
     - id: check-json
     - id: check-yaml
+      args: [--allow-multiple-documents]
     - id: debug-statements
     - id: requirements-txt-fixer
     - id: trailing-whitespace

evals/auto_tuning/README.md

Lines changed: 151 additions & 0 deletions
# Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes

This document describes the Auto-Tuning framework, a tool designed to streamline deployment strategies for resource-intensive services, particularly in ChatQnA environments. It leverages Kubernetes for container orchestration and integrates experimental data with our prior knowledge to fine-tune deployments for optimal performance.

## Key Features

* Hardware Efficiency: Focuses on adjusting replica counts and maximizing the utilization of CPU and HPU (Habana Processing Unit) resources.

* Theoretical and Experimental Optimization: Combines theoretical best practices with our prior knowledge to ensure optimal resource allocation for services.

## Usage

To generate the `strategy.json` configuration file for deployment, use the following command:

```bash
# Kubernetes Deployment
python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml

# Note: Add --config_only to output deployment configs only.
```

## Configuration Files

1. `hardware_info_gaudi.json`: Specifies the hardware details (CPU, HPU, etc.).

2. `chatqna_neuralchat_rerank_latest.yaml`: Contains service deployment information.

3. `replica_tuning_config.json`: Customizes tuning parameters for replica counts and granularity.

### hardware_info_gaudi.json

This file lists only the hardware devices to be used in deployment.

```json
{
    "device_0": {
        "ip": ["10.239.1.5", "10.239.10.6"],
        "type": "hpu",
        "sockets": 2,
        "cores_per_socket": 64,
        "num_cards": 8
    }
}
```

Please refer to `hardware_info_gaudi.json` for more details.
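As a quick illustration of how these fields combine, the sketch below totals the capacity described by such an entry. This is a hypothetical helper, not part of `tuning.py`; only the field names come from the example above.

```python
# Hypothetical helper (not part of tuning.py): summarize the capacity
# described by a hardware_info-style entry. Field names come from the
# example above; summarize() itself is illustrative only.
hardware = {
    "device_0": {
        "ip": ["10.239.1.5", "10.239.10.6"],
        "type": "hpu",
        "sockets": 2,
        "cores_per_socket": 64,
        "num_cards": 8,
    }
}

def summarize(hw: dict) -> dict:
    """Per-device totals: node count, physical cores per node, HPU cards per node."""
    return {
        name: {
            "nodes": len(dev["ip"]),
            "cores_per_node": dev["sockets"] * dev["cores_per_socket"],
            "cards_per_node": dev.get("num_cards", 0),
        }
        for name, dev in hw.items()
    }

print(summarize(hardware))
# {'device_0': {'nodes': 2, 'cores_per_node': 128, 'cards_per_node': 8}}
```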

### chatqna_neuralchat_rerank_latest.yaml

This file includes all services that will be deployed.

```yaml
opea_micro_services:
    data_prep:
        ... ...
    embedding:
        ... ...

    reranking:
        ... ...

    llm:
        opea/llm-tgi:
            tag: latest
            type: cpu
        dependency:
            ghcr.io/huggingface/tgi-gaudi:
                tag: 2.0.4
                type: hpu
                requirements:
                    model_id: "Intel/neural-chat-7b-v3-3"

opea_mega_service:
    opea/chatqna:
        tag: latest
        type: cpu
```

Please refer to `chatqna_neuralchat_rerank_latest.yaml` for more details.
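To make the pairing concrete, here is a hedged sketch of how an entry couples a CPU microservice image with an optional accelerator-side dependency. The nested-dict structure mirrors the `llm` excerpt above; the `images` helper itself is hypothetical, not code from this commit.

```python
# Illustrative only: a dict mirroring the `llm` entry in the service YAML,
# pairing a CPU-side image with an HPU-side dependency image.
llm_service = {
    "opea/llm-tgi": {"tag": "latest", "type": "cpu"},
    "dependency": {
        "ghcr.io/huggingface/tgi-gaudi": {
            "tag": "2.0.4",
            "type": "hpu",
            "requirements": {"model_id": "Intel/neural-chat-7b-v3-3"},
        }
    },
}

def images(service: dict) -> list[str]:
    """Collect image:tag strings, descending into the dependency block."""
    out = []
    for name, spec in service.items():
        if name == "dependency":
            out.extend(images(spec))  # dependency holds further image entries
        else:
            out.append(f"{name}:{spec['tag']}")
    return out

print(images(llm_service))
# ['opea/llm-tgi:latest', 'ghcr.io/huggingface/tgi-gaudi:2.0.4']
```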

### Tuning Config Parameters

`embedding_replicas_granularity = 1`: Defines the step size for scaling the number of replicas for the embedding server.
* Value (1): Each scaling operation increases or decreases the number of replicas by 1 at a time.

`embedding_replicas_min = 1`: Sets the minimum number of replicas allowed for the embedding server.
* Value (1): The service always has at least 1 replica running, ensuring it is available for deployment.

`embedding_replicas_max = 4`: Defines the maximum number of replicas allowed for the embedding server.
* Value (4): The service can be scaled up to at most 4 replicas, limiting resource consumption and avoiding over-provisioning.

`microservice_replicas_granularity = 1`: Specifies the scaling step size for the other microservices (such as retrieval, dataprep, etc.).
* Value (1): As with `embedding_replicas_granularity`, the number of replicas for these microservices scales by 1 replica at a time.

`microservice_replicas_min = 1`: Sets the minimum number of replicas for these microservices.
* Value (1): Ensures that each microservice always has at least 1 replica running.

`microservice_replicas_max = 4`: Defines the upper limit for scaling replicas for these microservices.
* Value (4): The maximum number of replicas allowed for these microservices is 4.

If you want to adjust the default tuning parameters, create a `replica_tuning_config.json` file. For example:

```json
{
    "embedding_replicas_granularity": 1,
    "embedding_replicas_min": 1,
    "embedding_replicas_max": 4,

    "microservice_replicas_granularity": 1,
    "microservice_replicas_min": 1,
    "microservice_replicas_max": 4
}
```

Please refer to `replica_tuning_config.json` for more details.
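For intuition, the min/max/granularity knobs together define a small search space of candidate replica counts per service. The sketch below is illustrative only; it is not how `tuning.py` is implemented internally.

```python
# Illustrative only (not the tuning.py internals): the min/max/granularity
# knobs define the candidate replica counts the tuner may consider.
config = {
    "embedding_replicas_granularity": 1,
    "embedding_replicas_min": 1,
    "embedding_replicas_max": 4,
    "microservice_replicas_granularity": 1,
    "microservice_replicas_min": 1,
    "microservice_replicas_max": 4,
}

def replica_candidates(cfg: dict, prefix: str) -> list[int]:
    """Enumerate replica counts from <prefix>_min to <prefix>_max
    (inclusive) in steps of <prefix>_granularity."""
    lo = cfg[f"{prefix}_min"]
    hi = cfg[f"{prefix}_max"]
    step = cfg[f"{prefix}_granularity"]
    return list(range(lo, hi + 1, step))

print(replica_candidates(config, "embedding_replicas"))  # [1, 2, 3, 4]
```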

## Output

The output of the auto-tuning process includes two key components:

1. strategy_files: Contain optimized configurations for deploying services, such as replica counts and hardware resource allocations.

2. K8S manifests: Provide the Kubernetes deployment specifications, including pod definitions and resource limits, ready for deployment.

Example of a strategy file:
```json
{
    "embedding-dependency": {
        "type": "cpu",
        "image": "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5",
        "model_id": "BAAI/bge-base-en-v1.5",
        "replica": 1
    },
    "llm-microservice": {
        "type": "cpu",
        "image": "opea/llm-tgi:latest",
        "replica": 4
    },

    ... ...

    "reranking-dependency": {
        "type": "hpu",
        "image": "opea/tei-gaudi:latest",
        "model_id": "BAAI/bge-reranker-base",
        "replica": 1,
        "cards": 1
    },
    "chatqna_mega_service": {
        "image": "opea/chatqna:latest",
        "type": "cpu",
        "replica": 4
    }
}
```

Both the K8S manifests and strategy files are generated in the current directory, providing everything needed for deployment.

Deployment: run `kubectl apply -f` on the newly generated `*_run.yaml` files and the `chatqna_config_map`.
Lines changed: 23 additions & 0 deletions

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: ConfigMap
metadata:
  name: qna-config
  namespace: default
data:
  EMBEDDING_MODEL_ID: BAAI/bge-base-en-v1.5
  RERANK_MODEL_ID: BAAI/bge-reranker-base
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  TEI_EMBEDDING_ENDPOINT: http://embedding-dependency-svc.default.svc.cluster.local:6006
  TEI_RERANKING_ENDPOINT: http://reranking-dependency-svc.default.svc.cluster.local:8808
  TGI_LLM_ENDPOINT: http://llm-dependency-svc.default.svc.cluster.local:9009
  REDIS_URL: redis://vector-db.default.svc.cluster.local:6379
  INDEX_NAME: rag-redis
  HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
  EMBEDDING_SERVICE_HOST_IP: embedding-svc
  RETRIEVER_SERVICE_HOST_IP: retriever-svc
  RERANK_SERVICE_HOST_IP: reranking-svc
  NODE_SELECTOR: chatqna-opea
  LLM_SERVICE_HOST_IP: llm-svc
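Note that `HUGGINGFACEHUB_API_TOKEN` is left as the `${HF_TOKEN}` placeholder. One way to resolve such placeholders from your environment before applying the manifest is sketched below; this is an assumption about the workflow, not a step the commit prescribes (`string.Template` in the Python standard library handles the `${VAR}` syntax).

```python
import string

# Hedged sketch: resolve ${VAR}-style placeholders (as in the ConfigMap
# above) from a mapping before `kubectl apply`. safe_substitute leaves
# unknown placeholders untouched instead of raising.
manifest_line = "HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}"

def resolve(text: str, env: dict) -> str:
    """Substitute ${VAR} placeholders from `env`, leaving unknowns intact."""
    return string.Template(text).safe_substitute(env)

print(resolve(manifest_line, {"HF_TOKEN": "hf_example_token"}))
# HUGGINGFACEHUB_API_TOKEN: hf_example_token
```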
Lines changed: 55 additions & 0 deletions

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatqna-backend-server-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chatqna-backend-server-deploy
  template:
    metadata:
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
      labels:
        app: chatqna-backend-server-deploy
    spec:
      nodeSelector:
        node-type: chatqna-opea
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: chatqna-backend-server-deploy
      hostIPC: true
      containers:
        - envFrom:
            - configMapRef:
                name: qna-config
          image: opea/chatqna:latest
          imagePullPolicy: IfNotPresent
          name: chatqna-backend-server-deploy
          args: null
          ports:
            - containerPort: 8888
      serviceAccountName: default
---
kind: Service
apiVersion: v1
metadata:
  name: chatqna-backend-server-svc
spec:
  type: NodePort
  selector:
    app: chatqna-backend-server-deploy
  ports:
    - name: service
      port: 8888
      targetPort: 8888
      nodePort: 30888
Lines changed: 76 additions & 0 deletions

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dataprep-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dataprep-deploy
  template:
    metadata:
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
      labels:
        app: dataprep-deploy
    spec:
      nodeSelector:
        node-type: chatqna-opea
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: dataprep-deploy
      hostIPC: true
      containers:
        - env:
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: qna-config
                  key: REDIS_URL
            - name: TEI_ENDPOINT
              valueFrom:
                configMapKeyRef:
                  name: qna-config
                  key: TEI_EMBEDDING_ENDPOINT
            - name: INDEX_NAME
              valueFrom:
                configMapKeyRef:
                  name: qna-config
                  key: INDEX_NAME
          image: opea/dataprep-redis:latest
          imagePullPolicy: IfNotPresent
          name: dataprep-deploy
          args: null
          ports:
            - containerPort: 6007
            - containerPort: 6008
            - containerPort: 6009
      serviceAccountName: default
---
kind: Service
apiVersion: v1
metadata:
  name: dataprep-svc
spec:
  type: ClusterIP
  selector:
    app: dataprep-deploy
  ports:
    - name: port1
      port: 6007
      targetPort: 6007
    - name: port2
      port: 6008
      targetPort: 6008
    - name: port3
      port: 6009
      targetPort: 6009
Lines changed: 63 additions & 0 deletions

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-dependency-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: embedding-dependency-deploy
  template:
    metadata:
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
      labels:
        app: embedding-dependency-deploy
    spec:
      nodeSelector:
        node-type: chatqna-opea
      containers:
        - envFrom:
            - configMapRef:
                name: qna-config
          image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
          name: embedding-dependency-deploy
          args:
            - --model-id
            - $(EMBEDDING_MODEL_ID)
            - --auto-truncate
          volumeMounts:
            - mountPath: /data
              name: model-volume
            - mountPath: /dev/shm
              name: shm
          ports:
            - containerPort: 80
      serviceAccountName: default
      volumes:
        - name: model-volume
          hostPath:
            path: /mnt/models
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
---
kind: Service
apiVersion: v1
metadata:
  name: embedding-dependency-svc
spec:
  type: ClusterIP
  selector:
    app: embedding-dependency-deploy
  ports:
    - name: service
      port: 6006
      targetPort: 80
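A detail worth noting in the manifests above: each Service routes traffic to its Deployment's pods only because the Service `selector` matches the pod template labels (here the Service also remaps cluster port 6006 to the container's port 80). The check below is an illustrative sketch of Kubernetes equality-based label matching, not code from this commit.

```python
# Illustrative sketch of Kubernetes equality-based selector matching:
# a Service routes to a pod only when every selector key/value pair is
# present in the pod's labels. Values mirror the manifests above.
deployment_labels = {"app": "embedding-dependency-deploy"}
service_selector = {"app": "embedding-dependency-deploy"}

def selector_matches(selector: dict, labels: dict) -> bool:
    """True when every selector key/value pair appears in `labels`."""
    return all(labels.get(k) == v for k, v in selector.items())

print(selector_matches(service_selector, deployment_labels))  # True
```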
