## Basic health check
QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not be able to get enough CPU time to respond in a timely manner. Your load
balancer may flag the instance as dead. In such a case, create an isolated
thread pool just for the health check service (the `min` HTTP server), by
setting this configuration option:

```text
http.min.worker.count=1
```
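If a load balancer or orchestrator polls the endpoint for you, it usually only
needs the HTTP status code. A minimal probe sketch, assuming the default port
9003 and a 2-second timeout (both are assumptions, adjust to your deployment):

```shell title="Liveness probe sketch"
#!/usr/bin/env sh
# Exit 0 only if the health endpoint answers without an HTTP error within
# 2 seconds; a timeout, connection failure, or 4xx/5xx response fails the probe.
curl --fail --silent --show-error --max-time 2 http://127.0.0.1:9003 > /dev/null
```
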
## Alert on critical errors
QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```
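Since the alerts travel over a plain TCP/IP socket, it is worth confirming that
Alertmanager is reachable from the QuestDB host before relying on this path. A
quick check, assuming Alertmanager runs locally on its default port 9093 (an
assumption, substitute your own host and port):

```shell title="Check Alertmanager reachability"
# Alertmanager exposes a simple health endpoint; a connection error or non-2xx
# response here suggests the QuestDB host cannot reach Alertmanager at all.
curl --fail --silent http://localhost:9093/-/healthy
```
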
For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).
## Detect suspended tables
QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge shows a value above zero.
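For a one-off spot check outside Prometheus, you can read the gauge straight
from the metrics endpoint. The sketch below assumes metrics are enabled
(typically via `metrics.enabled=true`) and exposed on the min HTTP server's
default port 9003:

```shell title="Spot-check the suspended tables gauge"
# A value above zero means at least one table is currently suspended.
curl -s http://127.0.0.1:9003/metrics | grep questdb_suspended_tables
```
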
## Detect slow ingestion
QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log. This step is optimized for throughput and usually isn't the
bottleneck. The next step is inserting the data into the table, and this can
take longer if the data is out of order, or touches different time partitions.
You can monitor the overall performance of this process of applying the WAL
data to tables. QuestDB exposes two Prometheus counters for this:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these numbers grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being
actively ingested, the second counter will lag behind the first one. A steady
difference between them is a sign of a healthy rate of WAL application, with
the database keeping up with demand. However, if the difference continuously
rises, this indicates that either a table has become suspended and WAL can't be
applied to it, or QuestDB is not able to keep up with the ingestion rate. All
of the data is still safely stored, but a growing portion of it is not yet
visible to queries.

You can create an alert that detects a steadily increasing difference between
these two numbers. It won't tell you which table is experiencing issues, but it
is a low-impact way to detect that there's a problem which needs further
diagnosing.
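In Prometheus terms, the alert condition is the difference
`questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total` growing
over time, not merely being non-zero. For a quick manual look at the raw
counters, a sketch, again assuming metrics are enabled and exposed on port
9003:

```shell title="Inspect the WAL apply counters"
# Print both WAL counters; run this a few times over a minute or two.
# A gap that keeps widening between the two values indicates WAL apply lag.
curl -s http://127.0.0.1:9003/metrics \
  | grep -E 'questdb_wal_apply_(seq|writer)_txn_total'
```
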
## Detect slow queries
QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.
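As a starting point, you can pull recent rows from this table over the HTTP
query API. The sketch below assumes the REST endpoint on its default port 9000;
refine the `SELECT` with the columns documented on the query tracing page to
filter on execution time:

```shell title="Peek at recent query traces"
# Fetch the last 20 rows of the query trace table via the /exec endpoint.
# A negative LIMIT in QuestDB SQL returns the last N rows of the result.
curl -G http://127.0.0.1:9000/exec \
  --data-urlencode "query=SELECT * FROM _query_trace LIMIT -20"
```
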
Read more on query tracing on the
[Concepts page](/docs/concept/query-tracing/).
## Detect potential causes of performance issues
... mention interesting Prometheus metrics ...