DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577
Merged
0173eb8
initial draft.
florence-crl 6dbaa47
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 5a2c18f
first revision
florence-crl c31804e
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 9bd27e8
fixed link
florence-crl db76289
fixed summary
florence-crl b12f81c
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 4b1cf7a
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl e78be2d
draft 2
florence-crl 74965b4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 57fa244
draft 3
florence-crl 8fc6e2c
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 80e592f
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 0aab4d9
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl e8411a4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 30fd1c0
full draft
florence-crl 6a5609b
fixed link
florence-crl c7c0a9e
fix file names
florence-crl 4c3b13f
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 3744bc2
restart deploy-preview
florence-crl 98122e5
incorporated Brian’s feedback
florence-crl f724d59
Incorporated Brian’s feedback 2. Deleted unused images.
florence-crl afbd9ff
Added detect-hotspots-workflow.svg.
florence-crl 1886437
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 8fd4609
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 7e7f289
Incorporated Brian’s feedback 3.
florence-crl 7d32151
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 502cc35
Incorporated Kevin’s feedback.
florence-crl f0370b3
Copied files to v25.3.
florence-crl 96f0169
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl d48a24a
Incorporated Brian’s feedback.
florence-crl 9fd2a63
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl db82336
Incorporated Rich’s feedback.
florence-crl 90b09c4
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl 42673a1
Merge remote-tracking branch 'origin/main' into DOC-11497
florence-crl efad758
Fix links.
florence-crl
Binary file added
BIN
+101 KB
src/current/images/v25.2/detect-hotspots-latch-conflict-wait-durations.png
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
--- | ||
title: Detect Hotspots | ||
summary: Learn how to detect hotspots using real-time monitoring and historical logs in CockroachDB. | ||
toc: true | ||
--- | ||
|
||
This page provides practical guidance for identifying common [hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}) in CockroachDB clusters using real-time monitoring and historical logs. | ||
|
||
## Before you begin | ||
|
||
- Review the [Understand hotspots page]({% link {{ page.version.version }}/understand-hotspots.md %}) for definitions and concepts. | ||
- Ensure you have access to the DB Console Metrics and the relevant logs. | ||
|
||
- Confirm that you have the necessary permissions to modify the application or schema. | ||
|
||
|
||
## Troubleshooting overview | ||
|
||
Identify potential hotspots and optimize query and schema performance. The following sections provide details for each step. | ||
|
||
|
||
``` | ||
[Start] | ||
| | ||
[Is there a node outlier in metrics?] | ||
| | ||
|-- Yes --> [Is the outlier in the latch conflict wait durations metric?] | ||
| | | ||
| |-- Yes --> [Does a popular key detected log exist?] | ||
| | | ||
| |-- Yes (write hotspot) --> [Find hot ranges log, find table and index] ------| | ||
| | | | ||
| |-- No --> [Some other reason for latch conflict] | | ||
| | | ||
|-- Yes --> [Is the outlier in the CPU percent or the Runnable Goroutines per CPU metric?] | | ||
| | | | ||
| |-- Yes --> [Does a popular key detected log exist?] | | ||
| | v | ||
| |-- Yes (read hotspot) --> [Find hot ranges log, find table and index] --> [Mitigate hot key (find queries and refactor app)] | ||
| | | ||
| |-- No --> [Does a clear direction detected log exist?] | ||
| | | ||
| |-- Yes (index hotspot) --> [Find hot ranges log, find table and index] --> [Mitigate hot index (change schema)] | ||
| | | ||
| |-- No --> [Some other reason for CPU skew] | ||
| | ||
|-- No --> [Some other reason for metrics outlier] | ||
``` | ||
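The decision flow above can also be read as a small function. The following Python sketch is illustrative only: the function name, parameter names, and return strings are hypothetical and are not part of CockroachDB. It assumes you have already answered each question in the flow by hand.

```python
def classify_hotspot(
    latch_outlier: bool,
    cpu_or_goroutine_outlier: bool,
    popular_key_log: bool,
    clear_direction_log: bool,
) -> str:
    """Mirror the troubleshooting flow: metrics outlier first, then logs."""
    if latch_outlier:
        # Outlier in the latch conflict wait durations metric.
        if popular_key_log:
            return "write hotspot"
        return "other reason for latch conflict"
    if cpu_or_goroutine_outlier:
        # Outlier in CPU percent or Runnable Goroutines per CPU.
        if popular_key_log:
            return "read hotspot"
        if clear_direction_log:
            return "index hotspot"
        return "other reason for CPU skew"
    return "other reason for metrics outlier"


if __name__ == "__main__":
    print(classify_hotspot(False, True, False, True))
```

Write and read hotspots lead to the hot key mitigation, and an index hotspot leads to the hot index mitigation.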
|
||
## Step 1. Check for a node outlier in metrics | ||
|
||
|
||
To identify a [hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}), monitor the following metrics on the [DB Console **Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics) and the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}). A node with a maximum value that is a clear outlier in the cluster may indicate a potential hotspot. | ||
|
||
### A. Latch conflict wait durations | ||
|
||
- On the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}), if a virtual cluster dropdown is present in the upper right corner, select `system`. | ||
|
||
- Create a custom chart to monitor the `kv.concurrency.latch_conflict_wait_durations-avg` metric, which tracks time spent on [latch acquisition]({% link {{ page.version.version }}/architecture/transaction-layer.md %}#latch-manager) waiting for conflicts with other latches. For example, a [sequence]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-sequence) that writes to the same row must wait to acquire the latch. | ||
- To display the metric per node, select the `PER NODE/STORE` checkbox. | ||
|
||
For example: | ||
|
||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-latch-conflict-wait-durations.png' | relative_url }}" alt="kv.concurrency.latch_conflict_wait_durations-avg" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
- Is there a node with a maximum value that is a clear outlier in the cluster for the latch conflict wait durations metric? | ||
|
||
- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected). | |
- If **No**, check for a node outlier in the [CPU percent](#b-cpu-percent) metric. | |
|
||
|
||
### B. CPU percent | ||
|
||
- On the DB Console **Metrics** page **Hardware** dashboard, monitor the [**CPU Percent** graph]({% link {{ page.version.version }}/ui-hardware-dashboard.md %}#cpu-percent). | ||
- CPU usage typically increases with traffic volume. | ||
|
||
- Check if the CPU usage of the hottest node is 20% or more above the cluster average. For example, node `n5`, represented by the green line in the following **CPU Percent** graph, hovers at around 87% at time 17:35 compared to other nodes that hover around 20% to 25%. | ||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-cpu-percent.png' | relative_url }}" alt="graph of CPU Percent utilization per node showing hot key" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
- Is there a node with a maximum value that is a clear outlier in the cluster for the CPU percent metric? | ||
|
||
- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected). | |
- If **No**, check for a node outlier in the [Runnable Goroutines per CPU](#c-runnable-goroutines-per-cpu) metric. | |
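If you export per-node CPU values to an external tool, the outlier check can be approximated in code. The following Python sketch is a rough illustration, not a CockroachDB API. It assumes that "20% or more above the cluster average" means 20 percentage points, and the readings for `n1` through `n4` are made-up values in the 20% to 25% band described in the example (only the 87% value for `n5` comes from the graph).

```python
from statistics import mean
from typing import Dict, List


def find_cpu_outliers(
    cpu_percent_by_node: Dict[str, float],
    threshold_points: float = 20.0,
) -> List[str]:
    """Return nodes whose CPU percent exceeds the cluster average
    by at least threshold_points percentage points (an assumption)."""
    cluster_avg = mean(cpu_percent_by_node.values())
    return sorted(
        node
        for node, cpu in cpu_percent_by_node.items()
        if cpu - cluster_avg >= threshold_points
    )


if __name__ == "__main__":
    # Hypothetical snapshot at a single timestamp.
    snapshot = {"n1": 22.0, "n2": 20.0, "n3": 25.0, "n4": 23.0, "n5": 87.0}
    print(find_cpu_outliers(snapshot))
```

The cluster average includes the hot node itself, so a very small cluster may need a lower threshold or a median-based comparison.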
|
||
### C. Runnable Goroutines per CPU | ||
|
||
|
||
- On the DB Console **Metrics** page **Runtime** dashboard, monitor the [**Runnable Goroutines Per CPU** graph]({% link {{ page.version.version }}/ui-runtime-dashboard.md %}#runnable-goroutines-per-cpu). | ||
- Check if there is a significant difference between the average and maximum values of the nodes. Nodes typically hover near `0.0`, unless a node is at or near its system-configured limit of 32. | ||
- The **Runnable Goroutines per CPU** graph rises more sharply than the [**CPU Percent** graph](#b-cpu-percent). The goroutines graph increases gradually until a node approaches its limit, after which it rises sharply. The following image shows the general shapes of the two graphs. | ||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-cpu-goroutine-graphs.png' | relative_url }}" alt="comparison of CPU percent and Runnable Goroutines per CPU graphs" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
- For example, node `n5`, represented by the green line in the following **Runnable Goroutines per CPU** graph, hovers above 3 at 17:35, compared to other nodes hovering around 0.0. | |
|
||
<img src="{{ 'images/v25.2/detect-hotspots-goroutines.png' | relative_url }}" alt="graph of Runnable Goroutines per CPU per node showing node overload" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
{{site.data.alerts.callout_success}} | ||
Compare the **Runnable Goroutines per CPU** graph and the **CPU Percent** graph at the same timestamp to spot sharp increases. | ||
{{site.data.alerts.end}} | ||
|
||
- Is there a node with a maximum value that is a clear outlier in the cluster for the Runnable Goroutines per CPU metric? | |
|
||
- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time range when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected). | |
- If **No**, investigate other reasons for the metrics outlier. | ||
|
||
## Step 2. Check for existence of `no split key found` log | ||
|
||
The [`no split key found` log]({% link {{ page.version.version }}/load-based-splitting.md %}#monitor-load-based-splitting) is emitted in the [`KV_DISTRIBUTION` log channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). This log is not associated with a specific event type, but includes an unstructured message such as: | ||
|
||
|
||
<a id="no-split-key-found-log-example"></a> | ||
|
||
``` | ||
I250523 21:59:25.755283 31560 13@kv/kvserver/split/decider.go:298 ⋮ [T1,Vsystem,n5,s5,r1115/3:‹/Table/106/1/{113338-899841…}›] 2979 no split key found: insufficient counters = 0, imbalance = 20, most popular key occurs in 36% of samples, access balance right-biased 98%, popular key detected, clear direction detected | ||
``` | ||
|
||
The unstructured message ends in one of the following string combinations: | |
|
||
1. `popular key detected, clear direction detected` | ||
1. `popular key detected, no clear direction` | ||
1. `no popular key, clear direction detected` | ||
1. `no popular key, no clear direction` | ||
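When scanning many log lines, the four endings can be reduced to two boolean flags. The following Python sketch is a minimal illustration. The function name is hypothetical, and it relies only on the substrings listed above.

```python
from typing import Tuple


def split_log_flags(message: str) -> Tuple[bool, bool]:
    """Return (popular_key, clear_direction) for a `no split key found` message.

    The negative forms ("no popular key", "no clear direction") do not
    contain the positive substrings, so plain substring checks suffice.
    """
    popular_key = "popular key detected" in message
    clear_direction = "clear direction detected" in message
    return popular_key, clear_direction


if __name__ == "__main__":
    tail = "access balance right-biased 98%, popular key detected, clear direction detected"
    print(split_log_flags(tail))
```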
|
||
### A. `popular key detected` | ||
|
||
|
||
- To check whether a `popular key detected` log exists, search for `popular key detected` in the `KV_DISTRIBUTION` logs on the hot node that you noted in Step 1, within the time range that you noted. In the [preceding log example](#no-split-key-found-log-example), the log was emitted on node 5 (`n5` in the tag section in square brackets) at timestamp `250523 21:59:25.755283`. | |
|
||
|
||
- Once you identify a relevant log, note the range ID in the tag section. In the [preceding log example](#no-split-key-found-log-example), the range is 1115 (`r1115`), as shown in the tag section in square brackets. | ||
|
||
{{site.data.alerts.callout_info}} | ||
The `popular key detected` log can produce false positives. | |
|
||
{{site.data.alerts.end}} | ||
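The node ID, range ID, and timestamp can be pulled out of a log line with regular expressions. The following Python sketch is a starting point only. The function name and the patterns are assumptions based on the single example line on this page, and other log layouts may need different patterns.

```python
import re
from typing import Dict, Optional

# Example line copied from this page.
EXAMPLE = (
    "I250523 21:59:25.755283 31560 13@kv/kvserver/split/decider.go:298 ⋮ "
    "[T1,Vsystem,n5,s5,r1115/3:‹/Table/106/1/{113338-899841…}›] 2979 "
    "no split key found: insufficient counters = 0, imbalance = 20, "
    "most popular key occurs in 36% of samples, access balance right-biased 98%, "
    "popular key detected, clear direction detected"
)


def extract_tags(line: str) -> Optional[Dict[str, object]]:
    """Extract the timestamp, node ID, and range ID from one log line."""
    ts = re.match(r"[IWEF](\d{6} \d{2}:\d{2}:\d{2}\.\d{6})", line)
    tags = re.search(r"\[([^\]]*)\]", line)  # first bracketed tag section
    if not ts or not tags:
        return None
    node = re.search(r"(?:^|,)n(\d+)(?:,|$)", tags.group(1))
    rng = re.search(r"(?:^|,)r(\d+)/", tags.group(1))
    if not node or not rng:
        return None
    return {
        "timestamp": ts.group(1),
        "node_id": int(node.group(1)),
        "range_id": int(rng.group(1)),
    }


if __name__ == "__main__":
    print(extract_tags(EXAMPLE))
```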
|
||
- The outlier was in the latch conflict wait durations metric. Does a `popular key detected` log exist? | ||
|
||
- If **Yes**, it is a [write hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#write-hotspot). Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | |
|
||
- If **No**, investigate other reasons for the latch conflict wait durations metric outlier. | ||
|
||
- The outlier was CPU percent or the Runnable Goroutines per CPU metric. Does a `popular key detected` log exist? | ||
|
||
- If **Yes**, it is a [read hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#read-hotspot). Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | |
- If **No**, proceed to check whether a [`clear direction detected` log](#b-clear-direction-detected) exists. | |
|
||
### B. `clear direction detected` | ||
|
||
|
||
- To determine whether a `clear direction detected` log exists, check the unstructured message of the `popular key detected` log. Does it end with `clear direction detected`? | ||
|
||
|
||
- The outlier was CPU percent or the Runnable Goroutines per CPU metric. A `popular key detected` log exists. Does a `clear direction detected` log exist? | ||
|
||
|
||
- If **Yes**, it is an [index hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot). Proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | ||
- If **No**, investigate other possible causes for CPU skew. | ||
|
||
## Step 3. Find hot ranges log | ||
|
||
A hot ranges log records an event of type `hot_ranges_stats` and is emitted to the [`HEALTH` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). Because this log corresponds to an event type, it includes a structured message such as: | |
|
||
``` | ||
I250602 04:46:54.752464 2023 2@util/log/event_log.go:39 ⋮ [T1,Vsystem,n5] 31977 ={"Timestamp":1748839613749807000,"EventType":"hot_ranges_stats","RangeID":1115,"Qps":0,"LeaseholderNodeID":5,"WritesPerSecond":0.0012048123820978134,"CPUTimePerSecond":251.30338109510822,"Databases":["kv"],"Tables":["kv"],"Indexes":["kv_pkey"]} | ||
``` | ||
|
||
- To find the relevant hot ranges log, search for `"EventType":"hot_ranges_stats"` and `"RangeID":{range ID from popular key detected log}` and `"LeaseholderNodeID":{node ID from metric outlier}` in the noted time range of the metric outlier. | ||
- Once you find the relevant hot ranges log, note the values for `Databases`, `Tables`, and `Indexes`. | ||
- For a write hotspot or read hotspot, proceed to [Mitigation for hot key](#mitigation-1-hot-key). | ||
- For an index hotspot, proceed to [Mitigation for hot index](#mitigation-2-hot-index). | ||
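Because the hot ranges message is structured JSON, the search can be scripted. The following Python sketch is one possible approach under stated assumptions: the function names are hypothetical, the JSON payload is assumed to follow the first `={` in the line (as in the example on this page), and the input lines are assumed to be already limited to the time range of the metric outlier.

```python
import json
from typing import Dict, Iterable, List, Optional

# Example line copied from this page.
EXAMPLE = (
    'I250602 04:46:54.752464 2023 2@util/log/event_log.go:39 ⋮ [T1,Vsystem,n5] 31977 '
    '={"Timestamp":1748839613749807000,"EventType":"hot_ranges_stats","RangeID":1115,'
    '"Qps":0,"LeaseholderNodeID":5,"WritesPerSecond":0.0012048123820978134,'
    '"CPUTimePerSecond":251.30338109510822,"Databases":["kv"],"Tables":["kv"],'
    '"Indexes":["kv_pkey"]}'
)


def parse_hot_ranges_event(line: str) -> Optional[Dict]:
    """Return the decoded event if the line is a hot_ranges_stats log."""
    _, sep, payload = line.partition("={")
    if not sep:
        return None
    try:
        event = json.loads("{" + payload.strip())
    except json.JSONDecodeError:
        return None
    if event.get("EventType") != "hot_ranges_stats":
        return None
    return event


def find_hot_ranges_logs(lines: Iterable[str], range_id: int, node_id: int) -> List[Dict]:
    """Keep events that match the noted range ID and leaseholder node ID."""
    events = (parse_hot_ranges_event(line) for line in lines)
    return [
        e for e in events
        if e and e.get("RangeID") == range_id and e.get("LeaseholderNodeID") == node_id
    ]


if __name__ == "__main__":
    for event in find_hot_ranges_logs([EXAMPLE], range_id=1115, node_id=5):
        print(event["Databases"], event["Tables"], event["Indexes"])
```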
|
||
## Mitigation 1 - hot key | ||
|
||
To mitigate a [hot key]({% link {{ page.version.version }}/understand-hotspots.md %}#row-hotspot) (whether a write hotspot or read hotspot), identify the problematic queries, and then refactor your application accordingly. Use the [SQL Activity Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}) in the DB Console to help identify the corresponding statements by the values noted for `Databases`, `Tables`, and `Indexes` in the hot ranges log. | ||
|
||
|
||
## Mitigation 2 - hot index | ||
|
||
To mitigate a hot index, update the index schema using the values noted for `Databases`, `Tables`, and `Indexes` in the hot ranges log. Refer to [Resolving index hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}#resolving-index-hotspots). | ||
|
||
|
||
|
||
## See also | ||
|
||
- [Understand Hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}) | ||
- [**Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics) | ||
- [**Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) | ||
- [Logging channels]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels) | ||
- [Load-based splitting]({% link {{ page.version.version }}/load-based-splitting.md %}) | ||
- [**SQL Activity Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %}) |