| 1 | +# How to Set Up Observability for OPEA Workloads in Kubernetes |
| 2 | + |
| 3 | +This guide walks through setting up observability for OPEA workloads in a Kubernetes environment. It covers installing Prometheus and Grafana and collecting metrics for Gaudi hardware, OPEA/chatqna (including TGI, TEI-Embedding, TEI-Reranking, and other microservices), and PCM. |
| 4 | + |
| 5 | +#### Prepare |
| 6 | + |
| 7 | +``` |
| 8 | +git clone https://github.com/opea-project/GenAIInfra.git |
| 9 | +cd GenAIInfra/kubernetes-addons/Observability |
| 10 | +``` |
| 11 | + |
| 12 | +## 1. Set Up Prometheus & Grafana |
| 13 | + |
| 14 | +Setting up Prometheus and Grafana is essential for monitoring and visualizing your workloads. Follow these steps to get started: |
| 15 | + |
| 16 | +### Step 1: Install Prometheus & Grafana |
| 17 | + |
| 18 | +``` |
| 19 | +kubectl create ns monitoring |
| 20 | +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts |
| 21 | +helm repo update |
| 22 | +helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 55.5.1 -n monitoring |
| 23 | +``` |
| 24 | + |
| 25 | +### Step 2: Verify the installation: |
| 26 | + |
| 27 | +``` |
| 28 | +kubectl get pods -n monitoring |
| 29 | +``` |
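Rather than re-running `kubectl get pods` by hand, the check can be scripted. The following is a minimal sketch, assuming every pod in the `monitoring` namespace is expected to reach the `Ready` condition. The function name `wait_for_monitoring`, the `KUBECTL` override variable, and the 300-second default timeout are illustrative choices, not part of the OPEA repository.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
# This override is a hypothetical convenience, not an OPEA convention.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every pod in the namespace is Ready, or fail after the timeout.
wait_for_monitoring() {
  local namespace="${1:-monitoring}"
  local timeout="${2:-300s}"
  "$KUBECTL" wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="$timeout"
}
```

Calling `wait_for_monitoring` after the `helm install` step returns a non-zero exit code if any pod is still not ready when the timeout expires.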
| 30 | + |
| 31 | +### Step 3: Port-forward to access Grafana: |
| 32 | + |
| 33 | +``` |
| 34 | +kubectl port-forward -n monitoring service/prometheus-stack-grafana 3000:80 |
| 35 | +``` |
| 36 | + |
| 37 | +### Step 4: Access Grafana: |
| 38 | + |
| 39 | +Open your browser and navigate to http://localhost:3000. Log in with the username `admin` and the password `prom-operator`. |
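Before opening the browser, it can help to confirm that the port-forward actually reaches Grafana. This is a small sketch built on Grafana's `/api/health` endpoint, which returns a JSON document with a `database` field. The helper name `grafana_health_ok` is hypothetical, and the pattern match is a deliberately simple stand-in for a real JSON parser.

```shell
#!/usr/bin/env bash
# Succeed if a Grafana /api/health JSON document read from stdin
# reports "database": "ok"; fail otherwise.
grafana_health_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}
```

With the port-forward running, `curl -s http://localhost:3000/api/health | grafana_health_ok && echo "Grafana is up"` prints the message only when Grafana reports a healthy database.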
| 40 | + |
| 41 | +## 2. Metrics for Gaudi Hardware (v1.16.2) |
| 42 | + |
| 43 | +To monitor Gaudi hardware metrics, you can use the following steps: |
| 44 | + |
| 45 | +### Step 1: Install daemonset |
| 46 | + |
| 47 | +``` |
| 48 | +kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-daemonset.yaml |
| 49 | +``` |
| 50 | + |
| 51 | +### Step 2: Install metric-exporter |
| 52 | + |
| 53 | +``` |
| 54 | +kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-service.yaml |
| 55 | +``` |
| 56 | + |
| 57 | +### Step 3: Install service-monitor |
| 58 | + |
| 59 | +``` |
| 60 | +kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml |
| 61 | +``` |
| 62 | + |
| 63 | +### Step 4: Verify the metrics |
| 64 | + |
| 65 | +The Habana metric exporter is exposed through a headless service, so look up one of its endpoints and query it directly to verify the metrics: |
| 66 | + |
| 67 | +``` |
| 68 | +# Get the metric endpoints; this example uses the first endpoint for testing |
| 69 | +habana_metric_url=$(kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[0].addresses[0].ip}:{.subsets[0].ports[0].port}") |
| 70 | +# Fetch the metrics |
| 71 | +curl "${habana_metric_url}/metrics" |
| 72 | + |
| 73 | +# You will see Habana metric data like this: |
| 74 | +process_resident_memory_bytes 2.9216768e+07 |
| 75 | +# HELP process_start_time_seconds Start time of the process since unix epoch in seconds. |
| 76 | +# TYPE process_start_time_seconds gauge |
| 77 | +process_start_time_seconds 1.71394960963e+09 |
| 78 | +# HELP process_virtual_memory_bytes Virtual memory size in bytes. |
| 79 | +# TYPE process_virtual_memory_bytes gauge |
| 80 | +process_virtual_memory_bytes 2.862641152e+09 |
| 81 | +# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. |
| 82 | +# TYPE process_virtual_memory_max_bytes gauge |
| 83 | +process_virtual_memory_max_bytes 1.8446744073709552e+19 |
| 84 | +# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. |
| 85 | +# TYPE promhttp_metric_handler_requests_in_flight gauge |
| 86 | +promhttp_metric_handler_requests_in_flight 1 |
| 87 | +# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. |
| 88 | +# TYPE promhttp_metric_handler_requests_total counter |
| 89 | +promhttp_metric_handler_requests_total{code="200"} 125 |
| 90 | +promhttp_metric_handler_requests_total{code="500"} 0 |
| 91 | +promhttp_metric_handler_requests_total{code="503"} 0 |
| 92 | +``` |
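The raw output mixes process-level metrics with `# HELP` and `# TYPE` comment lines, which makes it hard to see at a glance whether a particular metric family is present. A minimal sketch of a filter follows; the function name `filter_metrics` is hypothetical, and it matches the prefix literally rather than as a regular expression.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and print only the sample
# lines whose metric name starts with the given literal prefix.
# Comment lines (those starting with "#") are skipped.
filter_metrics() {
  local prefix="$1"
  awk -v p="$prefix" '!/^#/ && index($0, p) == 1'
}
```

For example, `curl -s "${habana_metric_url}/metrics" | filter_metrics habanalabs_` would list only device metrics. The `habanalabs_` prefix is an assumption about how the exporter names its metrics, so adjust it to whatever your endpoint returns.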
| 93 | + |
| 94 | +### Step 5: Import the dashboard into Grafana |
| 95 | + |
| 96 | +Manually import `./habana/Dashboard-Gaudi-HW.json` into Grafana. |
| 97 | + |
| 98 | + |
| 99 | +## 3. Metrics for OPEA/chatqna |
| 100 | + |
| 101 | +To monitor ChatQnA metrics, including TGI-gaudi, TEI, TEI-Reranking, and other microservices, use the following steps: |
| 102 | + |
| 103 | +### Step 1: Install ChatQnA with Helm |
| 104 | + |
| 105 | +Install Helm (version >= 3.15) first. Refer to the [Helm Installation Guide](https://helm.sh/docs/intro/install/) for more information. |
| 106 | + |
| 107 | +Refer to the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna) for instructions on deploying ChatQnA into Kubernetes on Xeon & Gaudi. |
| 108 | + |
| 109 | +### Step 2: Install all the ServiceMonitors |
| 110 | + |
| 111 | +###### NOTE: |
| 112 | + |
| 113 | +> If ChatQnA is installed under an instance name other than `chatqna` (the default), update the |
| 114 | +> `matchLabels` entry `app.kubernetes.io/instance: ${instanceName}` in each ServiceMonitor to use the actual instance name. |
| 115 | + |
| 116 | +``` |
| 117 | +kubectl apply -f chatqna/ |
| 118 | +``` |
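If ChatQnA was installed under a different instance name, the label change described in the note can be applied on the fly instead of editing each file. The sketch below assumes each manifest contains the literal text `app.kubernetes.io/instance: chatqna`; the helper name `set_instance_label` and the example name `my-release` are hypothetical.

```shell
#!/usr/bin/env bash
# Rewrite the instance label in a ServiceMonitor manifest read from stdin.
# Assumes the manifest contains "app.kubernetes.io/instance: chatqna" verbatim.
set_instance_label() {
  local instance="$1"
  sed "s|app.kubernetes.io/instance: chatqna|app.kubernetes.io/instance: ${instance}|"
}
```

Assuming the manifests in `chatqna/` use the `.yaml` extension, a loop such as `for f in chatqna/*.yaml; do set_instance_label my-release < "$f" | kubectl apply -f -; done` applies the rewritten copies and leaves the files on disk untouched.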
| 119 | + |
| 120 | +### Step 3: Install the dashboard |
| 121 | + |
| 122 | +- Manually import `tgi_grafana.json` into Grafana to monitor TGI-gaudi utilization. |
| 123 | +- Manually import `queue_size_embedding_rerank_tgi.json` into Grafana to monitor the queue sizes of TGI-gaudi, TEI-Embedding, and TEI-Reranking. |
| 124 | +- Alternatively, create your own dashboard to monitor all the services in ChatQnA. |
| 125 | + |
| 126 | + |
| 127 | + |
| 128 | +## 4. Metrics for PCM (Intel® Performance Counter Monitor) |
| 129 | + |
| 130 | +### Step 1: Install PCM |
| 131 | + |
| 132 | +Refer to the [Intel® PCM](https://github.com/intel/pcm) repository to install PCM. |
| 133 | + |
| 134 | +### Step 2: Modify & Install pcm-service |
| 135 | + |
| 136 | +Modify `pcm/pcm-service.yaml` to set the addresses for your environment, then apply it: |
| 137 | + |
| 138 | +``` |
| 139 | +kubectl apply -f pcm/pcm-service.yaml |
| 140 | +``` |
| 141 | + |
| 142 | +### Step 3: Install the PCM ServiceMonitor |
| 143 | + |
| 144 | +``` |
| 145 | +kubectl apply -f pcm/pcm-serviceMonitor.yaml |
| 146 | +``` |
| 147 | + |
| 148 | +### Step 4: Install the PCM dashboard |
| 149 | + |
| 150 | +Manually import `pcm/pcm-dashboard.json` into Grafana. |
| 151 | + |