
Commit 0644a51

Add Monitoring and Alerting content
1 parent 4a30618 commit 0644a51


2 files changed: +68 -0 lines changed

documentation/operations/monitoring-alerting.md

Lines changed: 67 additions & 0 deletions
@@ -5,12 +5,79 @@ description: Shows you how to set up to monitor your database for potential issu
## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.
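
If all you need is a pass/fail signal, for example in a script or a container
liveness probe, `curl`'s exit code is enough. A minimal sketch, assuming the
default health check port:

```shell title="Scripted health check (sketch)"
# -s silences output, -f makes curl exit non-zero on HTTP errors,
# --max-time bounds how long the probe may hang.
if curl -sf --max-time 2 http://127.0.0.1:9003 > /dev/null; then
  echo "QuestDB is healthy"
else
  echo "QuestDB health check failed"
fi
```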

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not get enough CPU time to respond in a timely manner, and your load balancer
may flag the instance as dead. In such a case, create an isolated thread pool
just for the health check service (the `min` HTTP server) by setting this
configuration option:

```text
http.min.worker.count=1
```

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```
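
The writer defaults to an Alertmanager instance on the local machine. To point
it at a remote one, a sketch assuming the `w.alert.alertTargets` property
covered on the logging page linked below (9093 is Alertmanager's default port):

```ini title="log.conf (remote Alertmanager, sketch)"
# Comma-separated host:port list of Alertmanager instances (assumed property
# name; verify against the logging page for your QuestDB version).
w.alert.alertTargets=alertmanager.example.com:9093
```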

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge shows a value above zero.
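
The alert condition itself is a one-liner; a minimal PromQL sketch, assuming
the gauge is scraped under its exported name:

```text title="PromQL (sketch)"
questdb_suspended_tables > 0
```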

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
the bottleneck. The next step is inserting the data into the table, and this
can take longer if the data is out of order or touches multiple time
partitions. You can monitor the overall performance of applying the WAL data
to tables. QuestDB exposes two Prometheus counters for this:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these numbers grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being
actively ingested, the second counter will lag behind the first one. A steady
difference between them is a sign of a healthy rate of WAL application, with
the database keeping up with demand. However, if the difference continuously
rises, either a table has become suspended and its WAL can't be applied, or
QuestDB is not able to keep up with the ingestion rate. All of the data is
still safely stored, but a growing portion of it is not yet visible to
queries.

You can create an alert that detects a steadily increasing difference between
these two numbers. It won't tell you which table is experiencing issues, but
it is a low-impact way to detect that there's a problem which needs further
diagnosis.
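
One way to express a "steadily increasing difference" in PromQL is to compare
the current lag against the lag a few minutes earlier; a sketch, with an
arbitrary 10-minute window that you would tune to your ingestion pattern:

```text title="PromQL (sketch)"
# Current apply lag minus the lag 10 minutes ago; positive means it is growing.
(questdb_wal_apply_seq_txn_total - questdb_wal_apply_writer_txn_total)
  - (questdb_wal_apply_seq_txn_total offset 10m
     - questdb_wal_apply_writer_txn_total offset 10m)
  > 0
```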

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.
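
For example, a query along these lines surfaces the slowest statements; a
sketch, assuming `query_text` and `execution_micros` columns (verify the exact
schema on the concept page linked below):

```text title="SQL (sketch)"
-- Ten slowest traced queries; column names assumed, check the concept page.
SELECT query_text, execution_micros
FROM _query_trace
ORDER BY execution_micros DESC
LIMIT 10;
```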

Read more on query tracing on the
[Concepts page](/docs/concept/query-tracing/).
## Detect potential causes of performance issues

... mention interesting Prometheus metrics ...

documentation/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -468,6 +468,7 @@ module.exports = {
      ]
    },
    "operations/logging-metrics",
+   "operations/monitoring-alerting",
    "operations/data-retention",
    "operations/design-for-performance",
    "operations/updating-data",
