
Commit 8d304ac

add Observability for OPEA (#393)
* add Observability for OPEA

Signed-off-by: leslieluyu <[email protected]>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 43adcc6 commit 8d304ac

23 files changed (+19424 / -0 lines)
Lines changed: 151 additions & 0 deletions

# How-To Setup Observability for OPEA Workload in Kubernetes

This guide provides a step-by-step approach to setting up observability for the OPEA workload in a Kubernetes environment. We will cover the setup of Prometheus and Grafana, as well as the collection of metrics for Gaudi hardware, OPEA/ChatQnA (including TGI, TEI-Embedding, TEI-Reranking, and other microservices), and PCM.

## Prepare

```
git clone https://github.com/opea-project/GenAIInfra.git
cd GenAIInfra/kubernetes-addons/Observability
```

## 1. Setup Prometheus & Grafana

Setting up Prometheus and Grafana is essential for monitoring and visualizing your workloads. Follow these steps to get started:

### Step 1: Install Prometheus & Grafana

```
kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 55.5.1 -n monitoring
```

### Step 2: Verify the installation

```
kubectl get pods -n monitoring
```
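
If the pods come up, you can additionally confirm that the Prometheus Operator CRDs used by the ServiceMonitors later in this guide were registered. A minimal check, assuming the default contents of the kube-prometheus-stack chart:

```
# ServiceMonitor/PodMonitor and the other Prometheus Operator CRDs should be listed
kubectl get crd | grep monitoring.coreos.com

# Optionally wait until every pod in the monitoring namespace is Ready
kubectl -n monitoring wait --for=condition=Ready pods --all --timeout=300s
```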

### Step 3: Port-forward to access Grafana

With the release name used above, the kube-prometheus-stack chart names the Grafana service `prometheus-stack-grafana` in the `monitoring` namespace:

```
kubectl port-forward -n monitoring service/prometheus-stack-grafana 3000:80
```
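
If you also want to inspect Prometheus itself (useful later for checking that the ServiceMonitors added below are being scraped), the Prometheus Operator exposes every Prometheus instance through a headless service named `prometheus-operated`; a small sketch:

```
# Forward the Prometheus UI, then open http://localhost:9090/targets
kubectl port-forward -n monitoring service/prometheus-operated 9090:9090
```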

### Step 4: Access Grafana

Open your browser and navigate to http://localhost:3000. Log in with the chart's default credentials: username `admin`, password `prom-operator`.
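
If those defaults do not work (for example, because they were overridden at install time), the actual admin password can be read from the secret the chart creates; a sketch assuming the release name `prometheus-stack`:

```
# Decode the Grafana admin password stored by the chart
kubectl -n monitoring get secret prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo
```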

## 2. Metrics for Gaudi Hardware (v1.16.2)

To monitor Gaudi hardware metrics, use the following steps:

### Step 1: Install the metric-exporter DaemonSet

```
kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-daemonset.yaml
```
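
A quick way to confirm that the exporter pods are running on the Gaudi nodes (the namespace depends on the manifest, so this searches across all namespaces); a minimal check:

```
# The DaemonSet and its pods should appear on every Gaudi node
kubectl get daemonset -A | grep metric-exporter
kubectl get pods -A -o wide | grep metric-exporter
```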

### Step 2: Install the metric-exporter service

```
kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-service.yaml
```

### Step 3: Install the ServiceMonitor

```
kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml
```
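
The ServiceMonitor should now be visible to the Prometheus Operator; a quick check (the namespace is assumed to be `monitoring`, matching the other manifests in this guide):

```
# List ServiceMonitors across all namespaces and look for the metric exporter
kubectl get servicemonitors -A | grep metric-exporter
```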

### Step 4: Verify the metrics

The Habana metric exporter is exposed through a headless service, so we need to look up its endpoints to verify the metrics:

```
# Get the metric endpoint, e.g. the first endpoint, to test
habana_metric_url=`kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[].addresses[0].ip}:{.subsets[].ports[0].port}"`
# Fetch the metrics
curl ${habana_metric_url}/metrics

# You should see Habana metric data like this:
process_resident_memory_bytes 2.9216768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71394960963e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.862641152e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 125
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
```
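
To confirm the data is reaching Prometheus end to end (and not just exposed by the exporter), you can check the active scrape targets through the `prometheus-operated` port-forward from section 1; a sketch:

```
# The target list should include an entry for the metric-exporter ServiceMonitor
curl -s http://localhost:9090/api/v1/targets | grep -o metric-exporter | head -n 1
```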

### Step 5: Import the dashboard into Grafana

Manually import ./habana/Dashboard-Gaudi-HW.json into Grafana.

![Gaudi hardware dashboard](image-1.png)

## 3. Metrics for OPEA/ChatQnA

To monitor ChatQnA metrics, including TGI-Gaudi, TEI-Embedding, TEI-Reranking, and the other microservices, use the following steps:

### Step 1: Install ChatQnA with Helm

Install Helm (version >= 3.15) first. Refer to the [Helm Installation Guide](https://helm.sh/docs/intro/install/) for more information.

Refer to the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna) for instructions on deploying ChatQnA into Kubernetes on Xeon & Gaudi.
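
For illustration only (the chart README linked above is authoritative), a minimal deployment from the repository cloned in the Prepare step might look like the following; the HuggingFace token is a placeholder you must supply, and the exact values keys should be checked against the chart's values.yaml:

```
# Assumes you are currently in GenAIInfra/kubernetes-addons/Observability
cd ../../helm-charts
helm dependency update chatqna
# Values key is an assumption based on the chart at the time of writing; verify in values.yaml
helm install chatqna ./chatqna --set global.HUGGINGFACEHUB_API_TOKEN=<your-huggingface-token>
```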

### Step 2: Install all the ServiceMonitors

###### NOTE:

> If ChatQnA was installed under a release name other than `chatqna` (the default instance name), update the
> `matchLabels` entry `app.kubernetes.io/instance: ${instanceName}` in the ServiceMonitor manifests to your actual instance name (see the sketch after the command below).

```
kubectl apply -f chatqna/
```
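
As noted above, the manifests select targets by the `app.kubernetes.io/instance` label. One way to rewrite them for a different release name before applying is a simple in-place substitution; a sketch using the hypothetical release name `my-chatqna`:

```
# "my-chatqna" is a placeholder; replace it with your actual Helm release name
sed -i 's|app.kubernetes.io/instance: chatqna|app.kubernetes.io/instance: my-chatqna|' chatqna/*.yaml
kubectl apply -f chatqna/
```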

### Step 3: Install the dashboards

- Manually import tgi_grafana.json into Grafana to monitor the TGI-Gaudi utilization.
- Manually import queue_size_embedding_rerank_tgi.json into Grafana to monitor the queue sizes of TGI-Gaudi, TEI-Embedding, and TEI-Reranking.
- Alternatively, you can build your own dashboard to monitor all the services in ChatQnA.

![ChatQnA dashboard](image-2.png)
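
If the dashboards stay empty, the usual cause is that Prometheus is not scraping the ChatQnA services yet. A quick check, assuming the default `chatqna` release name and the `prometheus-operated` port-forward from section 1:

```
# Services the ServiceMonitors select via matchLabels app.kubernetes.io/instance=chatqna
kubectl get svc -l app.kubernetes.io/instance=chatqna

# Count active scrape targets mentioning chatqna (should be non-zero once scraping starts)
curl -s http://localhost:9090/api/v1/targets | grep -c chatqna
```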

## 4. Metrics for PCM (Intel® Performance Counter Monitor)

### Step 1: Install PCM

Refer to the [Intel® PCM](https://github.com/intel/pcm) repository for installation instructions.

### Step 2: Modify & install pcm-service

Modify pcm/pcm-service.yaml to set the addresses, then apply it:

```
kubectl apply -f pcm/pcm-service.yaml
```
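
To confirm the service was created and resolves to the addresses you configured, one way is to list the matching objects (the namespace and exact names depend on the manifest, so this searches everywhere); a sketch:

```
# The pcm service and its endpoints should show the addresses set in pcm-service.yaml
kubectl get svc,endpoints -A | grep pcm
```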

### Step 3: Install the PCM ServiceMonitor

```
kubectl apply -f pcm/pcm-serviceMonitor.yaml
```

### Step 4: Install the PCM dashboard

Manually import pcm/pcm-dashboard.json into Grafana.

![PCM dashboard](image.png)
Lines changed: 22 additions & 0 deletions
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: chatqna-backend-svc-exporter
    app.kubernetes.io/version: v0.0.1
    release: prometheus-stack
  name: chatqna-backend-svc-exporter
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app.kubernetes.io/instance: chatqna
      app.kubernetes.io/name: chatqna
  endpoints:
    - port: chatqna
      interval: 5s
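
One detail worth noting about this manifest: the Prometheus instance created by kube-prometheus-stack only selects ServiceMonitors whose labels match its `serviceMonitorSelector`, which by default matches `release: <helm release name>`; that is why the manifest carries the `release: prometheus-stack` label. A quick way to inspect what your Prometheus instance expects, assuming it lives in the `monitoring` namespace:

```
# Print the label selector the Prometheus instance uses to pick up ServiceMonitors
kubectl -n monitoring get prometheus -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
```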
