DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577

Merged
merged 36 commits into from
Jun 26, 2025
Changes from 20 commits
Commits
0173eb8
initial draft.
florence-crl May 1, 2025
6dbaa47
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 1, 2025
5a2c18f
first revision
florence-crl May 2, 2025
c31804e
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 2, 2025
9bd27e8
fixed link
florence-crl May 3, 2025
db76289
fixed summary
florence-crl May 3, 2025
b12f81c
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 20, 2025
4b1cf7a
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 20, 2025
e78be2d
draft 2
florence-crl May 21, 2025
74965b4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 21, 2025
57fa244
draft 3
florence-crl May 22, 2025
8fc6e2c
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 22, 2025
80e592f
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 23, 2025
0aab4d9
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl May 29, 2025
e8411a4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 2, 2025
30fd1c0
full draft
florence-crl Jun 3, 2025
6a5609b
fixed link
florence-crl Jun 3, 2025
c7c0a9e
fix file names
florence-crl Jun 3, 2025
4c3b13f
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 3, 2025
3744bc2
restart deploy-preview
florence-crl Jun 3, 2025
98122e5
incorporated Brian’s feedback
florence-crl Jun 3, 2025
f724d59
Incorporated Brian’s feedback 2. Deleted unused images.
florence-crl Jun 4, 2025
afbd9ff
Added detect-hotspots-workflow.svg.
florence-crl Jun 4, 2025
1886437
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 4, 2025
8fd4609
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 6, 2025
7e7f289
Incorporated Brian’s feedback 3.
florence-crl Jun 6, 2025
7d32151
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 17, 2025
502cc35
Incorporated Kevin’s feedback.
florence-crl Jun 17, 2025
f0370b3
Copied files to v25.3.
florence-crl Jun 17, 2025
96f0169
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 20, 2025
d48a24a
Incorporated Brian’s feedback.
florence-crl Jun 20, 2025
9fd2a63
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 26, 2025
db82336
Incorporated Rich’s feedback.
florence-crl Jun 26, 2025
90b09c4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 26, 2025
42673a1
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl Jun 26, 2025
efad758
Fix links.
florence-crl Jun 26, 2025
6 changes: 6 additions & 0 deletions src/current/_includes/v25.2/sidebar-data/troubleshooting.json
@@ -56,6 +56,12 @@
"/${VERSION}/understand-hotspots.html"
]
},
{
"title": "Detect Hotspots",
"urls": [
"/${VERSION}/detect-hotspots.html"
]
},
{
"title": "Replication Reports",
"urls": [
174 changes: 174 additions & 0 deletions src/current/v25.2/detect-hotspots.md
@@ -0,0 +1,174 @@
---
title: Detect Hotspots
summary: Learn how to detect hotspots using real-time monitoring and historical logs in CockroachDB.
toc: true
---

This page provides practical guidance for identifying common [hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}) in CockroachDB clusters using real-time monitoring and historical logs.

## Before you begin

- Review the [Understand hotspots page]({% link {{ page.version.version }}/understand-hotspots.md %}) for definitions and concepts.
- Ensure you have access to the DB Console Metrics and the relevant logs.
- Confirm that you have the necessary permissions to modify the application or schema.

## Troubleshooting overview

The following flowchart summarizes how to identify a potential hotspot and choose a mitigation, either refactoring queries or changing the schema. The sections after the flowchart provide details for each step.

```
[Start]
   |
[Is there a node outlier in metrics?]
   |
   |-- Yes --> [Is the outlier in the latch conflict wait durations metric?]
   |              |
   |              |-- Yes --> [Does a popular key detected log exist?]
   |                             |
   |                             |-- Yes (write hotspot) --> [Find hot ranges log, find table and index] --> [Mitigate hot key (find queries and refactor app)]
   |                             |
   |                             |-- No --> [Some other reason for latch conflict]
   |
   |-- Yes --> [Is the outlier in the CPU percent or the Runnable Goroutines per CPU metric?]
   |              |
   |              |-- Yes --> [Does a popular key detected log exist?]
   |                             |
   |                             |-- Yes (read hotspot) --> [Find hot ranges log, find table and index] --> [Mitigate hot key (find queries and refactor app)]
   |                             |
   |                             |-- No --> [Does a clear direction detected log exist?]
   |                                           |
   |                                           |-- Yes (index hotspot) --> [Find hot ranges log, find table and index] --> [Mitigate hot index (change schema)]
   |                                           |
   |                                           |-- No --> [Some other reason for CPU skew]
   |
   |-- No --> [Some other reason for metrics outlier]
```
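
The decision flow above can also be written as a small function. The following is a minimal sketch, not part of CockroachDB: the function name `classify_hotspot` and its boolean parameters are hypothetical, and each input is assumed to be the answer you reached manually for the corresponding question in the flowchart.

```python
from typing import Optional, Tuple


def classify_hotspot(
    latch_outlier: bool,
    cpu_or_goroutine_outlier: bool,
    popular_key_log: bool,
    clear_direction_log: bool,
) -> Tuple[str, Optional[str]]:
    """Return (diagnosis, mitigation) by walking the flowchart top to bottom."""
    # Branch 1: outlier in the latch conflict wait durations metric.
    if latch_outlier:
        if popular_key_log:
            return ("write hotspot", "mitigate hot key")
        return ("some other reason for latch conflict", None)
    # Branch 2: outlier in CPU percent or Runnable Goroutines per CPU.
    if cpu_or_goroutine_outlier:
        if popular_key_log:
            return ("read hotspot", "mitigate hot key")
        if clear_direction_log:
            return ("index hotspot", "mitigate hot index")
        return ("some other reason for CPU skew", None)
    # No node outlier matched either branch.
    return ("some other reason for metrics outlier", None)
```

For example, `classify_hotspot(False, True, False, True)` follows the CPU branch, finds no popular key, and reports an index hotspot.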

## Step 1. Check for a node outlier in metrics

To identify a [hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}), monitor the following metrics on the [DB Console **Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics) and the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}). A node with a maximum value that is a clear outlier in the cluster may indicate a potential hotspot.

### A. Latch conflict wait durations

- On the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}), if a virtual cluster dropdown is present in the upper right corner, select `system`.
- Create a custom chart to monitor the `kv.concurrency.latch_conflict_wait_durations-avg` metric, which tracks time spent on [latch acquisition]({% link {{ page.version.version }}/architecture/transaction-layer.md %}#latch-manager) waiting for conflicts with other latches. For example, a [sequence]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-sequence) that writes to the same row must wait to acquire the latch.
- To display the metric per node, select the `PER NODE/STORE` checkbox.

For example:

<img src="{{ 'images/v25.2/detect-hotspots-latch-conflict-wait-durations.png' | relative_url }}" alt="kv.concurrency.latch_conflict_wait_durations-avg" style="border:1px solid #eee;max-width:100%" />

- Is there a node with a maximum value that is a clear outlier in the cluster for the latch conflict wait durations metric?

- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected).
- If **No**, check for a node outlier in the [CPU percent](#b-cpu-percent) metric.

### B. CPU percent

- On the DB Console **Metrics** page **Hardware** dashboard, monitor the [**CPU Percent** graph]({% link {{ page.version.version }}/ui-hardware-dashboard.md %}#cpu-percent).
- CPU usage typically increases with traffic volume.
- Check if the CPU usage of the hottest node is 20% or more above the cluster average. For example, node `n5`, represented by the green line in the following **CPU Percent** graph, hovers at around 87% at time 17:35 compared to other nodes that hover around 20% to 25%.

<img src="{{ 'images/v25.2/detect-hotspots-cpu-percent.png' | relative_url }}" alt="graph of CPU Percent utilization per node showing hot key" style="border:1px solid #eee;max-width:100%" />

- Is there a node with a maximum value that is a clear outlier in the cluster for the CPU percent metric?

- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected).
- If **No**, check for a node outlier in the [Runnable Goroutines per CPU](#c-runnable-goroutines-per-cpu) metric.
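
If you export per-node CPU readings for a single timestamp, the outlier check can be scripted. The following is a minimal sketch under stated assumptions: the function name `find_cpu_outliers` is hypothetical, the input is a plain mapping of node ID to CPU percent that you collected yourself, and "20% or more above the cluster average" is interpreted as 20 percentage points, which is one possible reading of the guideline.

```python
from statistics import mean
from typing import Dict, List


def find_cpu_outliers(cpu_percent_by_node: Dict[str, float], margin: float = 20.0) -> List[str]:
    """Return nodes whose CPU percent is at least `margin` points above the cluster average."""
    if not cpu_percent_by_node:
        return []
    cluster_avg = mean(cpu_percent_by_node.values())
    # Assumption: the margin is measured in percentage points, not as a relative increase.
    return [
        node
        for node, cpu in cpu_percent_by_node.items()
        if cpu - cluster_avg >= margin
    ]
```

With hypothetical readings shaped like the graph above, such as `{"n1": 22, "n2": 20, "n3": 25, "n4": 23, "n5": 87}`, only `n5` exceeds the cluster average by the margin.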

### C. Runnable Goroutines per CPU

- On the DB Console **Metrics** page **Runtime** dashboard, monitor the [**Runnable Goroutines Per CPU** graph]({% link {{ page.version.version }}/ui-runtime-dashboard.md %}#runnable-goroutines-per-cpu).
- Check if there is a significant difference between the average and maximum values of the nodes. Nodes typically hover near `0.0`, unless a node is at or near its system-configured limit of 32.
- The **Runnable Goroutines per CPU** graph rises more sharply than the [**CPU Percent** graph](#b-cpu-percent). The goroutines graph increases gradually until a node approaches its limit, after which it rises sharply. The following image shows the general shapes of the two graphs.

<img src="{{ 'images/v25.2/detect-hotspots-cpu-goroutine-graphs.png' | relative_url }}" alt="comparison of CPU percent and Runnable Goroutines per CPU graphs" style="border:1px solid #eee;max-width:100%" />

- For example, node `n5`, represented by the green line in the following **Runnable Goroutines per CPU** graph, hovers above 3 at 17:35, compared to other nodes that hover around 0.0.

<img src="{{ 'images/v25.2/detect-hotspots-goroutines.png' | relative_url }}" alt="graph of Runnable Goroutines per CPU per node showing node overload" style="border:1px solid #eee;max-width:100%" />

{{site.data.alerts.callout_success}}
Compare the **Runnable Goroutines per CPU** graph and the **CPU Percent** graph at the same timestamp to spot sharp increases.
{{site.data.alerts.end}}

- Is there a node with a maximum value that is a clear outlier in the cluster for the Runnable Goroutines per CPU metric?

- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected).
- If **No**, investigate other reasons for the metrics outlier.

## Step 2. Check for existence of `no split key found` log

The [`no split key found` log]({% link {{ page.version.version }}/load-based-splitting.md %}#monitor-load-based-splitting) is emitted in the [`KV_DISTRIBUTION` log channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). This log is not associated with a specific event type, but includes an unstructured message such as:

<a id="no-split-key-found-log-example"></a>

```
I250523 21:59:25.755283 31560 13@kv/kvserver/split/decider.go:298 ⋮ [T1,Vsystem,n5,s5,r1115/3:‹/Table/106/1/{113338-899841…}›] 2979 no split key found: insufficient counters = 0, imbalance = 20, most popular key occurs in 36% of samples, access balance right-biased 98%, popular key detected, clear direction detected
```

The unstructured message ends in one of these four string combinations:

1. `popular key detected, clear direction detected`
1. `popular key detected, no clear direction`
1. `no popular key, clear direction detected`
1. `no popular key, no clear direction`
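
The four endings map onto two independent flags. The following is a minimal sketch of that mapping, assuming you already have the message text as a string; the function name `split_decision_flags` is hypothetical. It relies on the fact that `no popular key` does not contain the substring `popular key detected`, and `no clear direction` does not contain `clear direction detected`.

```python
from typing import Tuple


def split_decision_flags(message: str) -> Tuple[bool, bool]:
    """Return (popular_key, clear_direction) for a `no split key found` message."""
    if "no split key found" not in message:
        raise ValueError("not a 'no split key found' message")
    # The negative forms ("no popular key", "no clear direction") do not
    # contain these substrings, so a plain substring test is sufficient.
    popular_key = "popular key detected" in message
    clear_direction = "clear direction detected" in message
    return popular_key, clear_direction
```

Applied to the log example above, which ends in `popular key detected, clear direction detected`, both flags are true.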

### A. `popular key detected`

- To check whether a `popular key detected` log exists, search for `popular key detected` in the `KV_DISTRIBUTION` logs on the hot node that you noted in Step 1, within the time range that you noted. In the [preceding log example](#no-split-key-found-log-example), the log is on node 5 (`n5` in the tag section in square brackets) at timestamp `250523 21:59:25.755283`.

- Once you identify a relevant log, note the range ID in the tag section. In the [preceding log example](#no-split-key-found-log-example), the range is 1115 (`r1115`), as shown in the tag section in square brackets.

{{site.data.alerts.callout_info}}
There may be false positives of the `popular key detected` log.
{{site.data.alerts.end}}
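
To pull the node ID, range ID, and timestamp out of a log line programmatically, a small parser can help. The following is a minimal sketch that assumes lines are shaped like the preceding log example (a severity letter, a `YYMMDD HH:MM:SS.ffffff` timestamp, and a comma-separated tag section in square brackets). The names `parse_split_log` and `SplitLogInfo` are hypothetical, and other log formats would need a different pattern.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Severity letter, timestamp, then the first bracketed tag section on the line.
LINE_RE = re.compile(r"^[IWEF](\d{6} \d{2}:\d{2}:\d{2}\.\d{6}) [^\[]*\[([^\]]*)\]")


class SplitLogInfo(NamedTuple):
    timestamp: datetime
    node_id: Optional[int]
    range_id: Optional[int]


def parse_split_log(line: str) -> Optional[SplitLogInfo]:
    """Extract the timestamp, node ID (nN tag), and range ID (rN/... tag) from a log line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S.%f")
    node_id = None
    range_id = None
    for tag in match.group(2).split(","):
        if node_id is None and re.fullmatch(r"n\d+", tag):
            node_id = int(tag[1:])
        elif range_id is None and re.match(r"r\d+/", tag):
            range_id = int(tag[1:tag.index("/")])
    return SplitLogInfo(timestamp, node_id, range_id)
```

You can then keep only the lines whose `node_id` matches the hot node and whose `timestamp` falls in the time range noted in Step 1.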

- The outlier was in the latch conflict wait durations metric. Does a `popular key detected` log exist?

- If **Yes**, it is a [write hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#write-hotspot). Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log).
- If **No**, investigate other reasons for the latch conflict wait durations metric outlier.

- The outlier was in the CPU percent or the Runnable Goroutines per CPU metric. Does a `popular key detected` log exist?

- If **Yes**, it is a [read hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#read-hotspot). Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log).
- If **No**, proceed to check whether a [`clear direction detected` log](#b-clear-direction-detected) exists.

### B. `clear direction detected`

- To check whether a `clear direction detected` log exists, search for `clear direction detected` in the `KV_DISTRIBUTION` logs on the hot node and in the time range that you noted in Step 1. Once you identify a relevant log, note the range ID in the tag section.

- The outlier was in the CPU percent or the Runnable Goroutines per CPU metric, and no `popular key detected` log exists. Does a `clear direction detected` log exist?

- If **Yes**, it is an [index hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot). Proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log).
- If **No**, investigate other possible causes for CPU skew.

## Step 3. Find hot ranges log

A hot ranges log is a log of an event of type `hot_ranges_stats` emitted to the [`HEALTH` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). Because this log corresponds to an event type, it includes a structured message such as:

```
I250602 04:46:54.752464 2023 2@util/log/event_log.go:39 ⋮ [T1,Vsystem,n5] 31977 ={"Timestamp":1748839613749807000,"EventType":"hot_ranges_stats","RangeID":1115,"Qps":0,"LeaseholderNodeID":5,"WritesPerSecond":0.0012048123820978134,"CPUTimePerSecond":251.30338109510822,"Databases":["kv"],"Tables":["kv"],"Indexes":["kv_pkey"]}
```

- To find the relevant hot ranges log, search the `HEALTH` logs for entries that contain all of `"EventType":"hot_ranges_stats"`, `"RangeID":{range ID noted in Step 2}`, and `"LeaseholderNodeID":{node ID from metric outlier}` within the noted time range of the metric outlier.
- Once you find the relevant hot ranges log, note the values for `Databases`, `Tables`, and `Indexes`.
- For a write hotspot or read hotspot, proceed to [Mitigation for hot key](#mitigation-1-hot-key).
- For an index hotspot, proceed to [Mitigation for hot index](#mitigation-2-hot-index).
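
The search above can be sketched as a filter over log lines. This is a minimal sketch under stated assumptions: the function name `find_hot_ranges_events` is hypothetical, the structured payload is assumed to be the JSON object that follows the `=` marker as in the example above, and the `Timestamp` field is assumed to be nanoseconds since the Unix epoch in UTC (which is consistent with the example line, but verify it against your own logs).

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator


def find_hot_ranges_events(
    lines: Iterable[str],
    range_id: int,
    node_id: int,
    start: datetime,
    end: datetime,
) -> Iterator[dict]:
    """Yield hot_ranges_stats events for the given range and leaseholder within [start, end]."""
    for line in lines:
        idx = line.find("={")
        if idx == -1:
            continue  # No structured payload on this line.
        try:
            event = json.loads(line[idx + 1:])
        except json.JSONDecodeError:
            continue
        if event.get("EventType") != "hot_ranges_stats":
            continue
        if event.get("RangeID") != range_id or event.get("LeaseholderNodeID") != node_id:
            continue
        nanos = event.get("Timestamp")
        if not isinstance(nanos, int):
            continue
        # Assumption: Timestamp is nanoseconds since the Unix epoch (UTC).
        emitted = datetime.fromtimestamp(nanos // 1_000_000_000, tz=timezone.utc)
        if start <= emitted <= end:
            yield event
```

Each yielded event is a dictionary, so `event["Databases"]`, `event["Tables"]`, and `event["Indexes"]` give the values to carry into the mitigation step. Pass timezone-aware `start` and `end` values so that the comparison is valid.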

## Mitigation 1 - hot key

To mitigate a [hot key]({% link {{ page.version.version }}/understand-hotspots.md %}#row-hotspot) (whether a write hotspot or read hotspot), identify the problematic queries, and then refactor your application accordingly. Use the [SQL Activity Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}) in the DB Console to help identify the corresponding statements by the values noted for `Databases`, `Tables`, and `Indexes` in the hot ranges log.

## Mitigation 2 - hot index

To mitigate a hot index, update the index schema using the values noted for `Databases`, `Tables`, and `Indexes` in the hot ranges log. Refer to [Resolving index hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}#resolving-index-hotspots).

## See also

- [Understand Hotspots]({% link {{ page.version.version }}/understand-hotspots.md %})
- [**Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics)
- [**Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %})
- [Logging channels]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels)
- [Load-based splitting]({% link {{ page.version.version }}/load-based-splitting.md %})
- [**SQL Activity Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %})
3 changes: 3 additions & 0 deletions src/current/v25.2/understand-hotspots.md
@@ -8,6 +8,8 @@ In distributed SQL, hotspots refer to bottlenecks that limit a cluster's ability

The page also offers best practices for [reducing hotspots](#reduce-hotspots), including a [video demo](#video-demo).

To troubleshoot common hotspots, refer to the [Detect Hotspots page]({% link {{ page.version.version }}/detect-hotspots.md %}).

## Terminology

### Hotspot
@@ -335,5 +337,6 @@ For a demo on hotspot reduction, watch the following video:

## See also

- [Detect Hotspots]({% link {{ page.version.version }}/detect-hotspots.md %})
- [Performance Tuning Recipes: Hotspots]({% link {{ page.version.version }}/performance-recipes.md %}#hotspots)
- [Single hot node]({% link {{ page.version.version }}/query-behavior-troubleshooting.md %}#single-hot-node)