Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -231,137 +231,137 @@ prometheusRules:
In some cases for components like vmagent or vminsert the alert might trigger if there are too many clients making write attempts.
If vmagent's or vminsert's CPU usage and network saturation are at normal level, then it might be worth adjusting `-maxConcurrentInserts` cmd-line flag."

vmCluster:
- alert: DiskRunsOutOfSpaceIn3Days
expr: |
vm_free_disk_space_bytes / ignoring(path)
(
rate(vm_rows_added_to_storage_total[1d])
* scalar(
sum(vm_data_size_bytes{type!~"indexdb.*"}) /
sum(vm_rows{type!~"indexdb.*"})
)
) < 3 * 24 * 3600 > 0
for: 30m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} will run out of disk space in 3 days"
description: "Taking into account current ingestion rate, free disk space will be enough only for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.
Consider to limit the ingestion rate, decrease retention or scale the disk space up if possible."
- alert: DiskRunsOutOfSpace
expr: |
sum(vm_data_size_bytes) by(job, instance) /
(
sum(vm_free_disk_space_bytes) by(job, instance) +
sum(vm_data_size_bytes) by(job, instance)
) > 0.8
for: 30m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} (job={{ $labels.job }}) will run out of disk space soon"
description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.
Having less than 20% of free disk space could cripple merges processes and overall performance.
Consider to limit the ingestion rate, decrease retention or scale the disk space if possible."
- alert: RequestErrorsToAPI
expr: increase(vm_http_request_errors_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors. Please verify if clients are sending correct requests."
- alert: RPCErrors
expr: |
(
sum(increase(vm_rpc_connection_errors_total[5m])) by(job, instance)
+
sum(increase(vm_rpc_dial_errors_total[5m])) by(job, instance)
+
sum(increase(vm_rpc_handshake_errors_total[5m])) by(job, instance)
) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
description: "RPC errors are interconnection errors between cluster components.
Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."
- alert: RowsRejectedOnIngestion
expr: rate(vm_rows_ignored_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
description: "VM is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the following reason: \"{{ $labels.reason }}\""
- alert: TooHighChurnRate
expr: |
(
sum(rate(vm_new_timeseries_created_total[5m]))
/
sum(rate(vm_rows_inserted_total[5m]))
) > 0.1
for: 15m
labels:
severity: warning
annotations:
summary: "Churn rate is more than 10% for the last 15m"
description: "VM constantly creates new time series. This effect is known as Churn Rate.
High Churn Rate tightly connected with database performance and may result in unexpected OOM's or slow queries."
- alert: TooHighChurnRate24h
expr: |
sum(increase(vm_new_timeseries_created_total[24h]))
>
(sum(vm_cache_entries{type="storage/hour_metric_ids"}) * 3)
for: 15m
labels:
severity: warning
annotations:
summary: "Too high number of new series created over last 24h"
description: "The number of created new time series over last 24h is 3x times higher than current number of active series.
This effect is known as Churn Rate. High Churn Rate tightly connected with database performance and may result in unexpected OOM's or slow queries."
- alert: TooHighSlowInsertsRate
expr: |
(
sum(rate(vm_slow_row_inserts_total[5m]))
/
sum(rate(vm_rows_inserted_total[5m]))
) > 0.05
for: 15m
labels:
severity: warning
annotations:
summary: "Percentage of slow inserts is more than 5% for the last 15m"
description: "High rate of slow inserts may be a sign of resource exhaustion for the current load.
It is likely more RAM is needed for optimal handling of the current number of active time series.
See also https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183"
- alert: ProcessNearFDLimits
expr: (process_max_fds - process_open_fds) < 100
for: 5m
labels:
severity: critical
annotations:
summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
description: "Exhausting OS file descriptors limit can cause severe degradation of the process.
Consider to increase the limit as fast as possible."
- alert: LabelsLimitExceededOnIngestion
expr: increase(vm_metrics_with_dropped_labels_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Metrics ingested to vminsert on {{ $labels.instance }} are exceeding labels limit"
description: "VictoriaMetrics limits the number of labels per each metric with `-maxLabelsPerTimeseries` command-line flag.
This prevents from ingesting metrics with too many labels.
Please verify that `-maxLabelsPerTimeseries` is configured correctly or that clients which send these metrics aren't misbehaving."
- alert: VminsertVmstorageConnectionIsSaturated
expr: rate(vm_rpc_send_duration_seconds_total[5m]) > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
description: "The connection between vminsert (instance {{ $labels.instance }}) and vmstorage (instance {{ $labels.addr }}) is saturated by more than 90% and vminsert won't be able to keep up.
This usually means that more vminsert or vmstorage nodes must be added to the cluster in order to increase the total number of vminsert -> vmstorage links."
vmCluster:
- alert: DiskRunsOutOfSpaceIn3Days
expr: |
vm_free_disk_space_bytes / ignoring(path)
(
rate(vm_rows_added_to_storage_total[1d])
* scalar(
sum(vm_data_size_bytes{type!~"indexdb.*"}) /
sum(vm_rows{type!~"indexdb.*"})
)
) < 3 * 24 * 3600 > 0
for: 30m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} will run out of disk space in 3 days"
description: "Taking into account current ingestion rate, free disk space will be enough only for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.
Consider to limit the ingestion rate, decrease retention or scale the disk space up if possible."
- alert: DiskRunsOutOfSpace
expr: |
sum(vm_data_size_bytes) by(job, instance) /
(
sum(vm_free_disk_space_bytes) by(job, instance) +
sum(vm_data_size_bytes) by(job, instance)
) > 0.8
for: 30m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} (job={{ $labels.job }}) will run out of disk space soon"
description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.
Having less than 20% of free disk space could cripple merges processes and overall performance.
Consider to limit the ingestion rate, decrease retention or scale the disk space if possible."
- alert: RequestErrorsToAPI
expr: increase(vm_http_request_errors_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors. Please verify if clients are sending correct requests."
- alert: RPCErrors
expr: |
(
sum(increase(vm_rpc_connection_errors_total[5m])) by(job, instance)
+
sum(increase(vm_rpc_dial_errors_total[5m])) by(job, instance)
+
sum(increase(vm_rpc_handshake_errors_total[5m])) by(job, instance)
) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
description: "RPC errors are interconnection errors between cluster components.
Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."
- alert: RowsRejectedOnIngestion
expr: rate(vm_rows_ignored_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
description: "VM is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the following reason: \"{{ $labels.reason }}\""
- alert: TooHighChurnRate
expr: |
(
sum(rate(vm_new_timeseries_created_total[5m]))
/
sum(rate(vm_rows_inserted_total[5m]))
) > 0.1
for: 15m
labels:
severity: warning
annotations:
summary: "Churn rate is more than 10% for the last 15m"
description: "VM constantly creates new time series. This effect is known as Churn Rate.
High Churn Rate tightly connected with database performance and may result in unexpected OOM's or slow queries."
- alert: TooHighChurnRate24h
expr: |
sum(increase(vm_new_timeseries_created_total[24h]))
>
(sum(vm_cache_entries{type="storage/hour_metric_ids"}) * 3)
for: 15m
labels:
severity: warning
annotations:
summary: "Too high number of new series created over last 24h"
description: "The number of created new time series over last 24h is 3x times higher than current number of active series.
This effect is known as Churn Rate. High Churn Rate tightly connected with database performance and may result in unexpected OOM's or slow queries."
- alert: TooHighSlowInsertsRate
expr: |
(
sum(rate(vm_slow_row_inserts_total[5m]))
/
sum(rate(vm_rows_inserted_total[5m]))
) > 0.05
for: 15m
labels:
severity: warning
annotations:
summary: "Percentage of slow inserts is more than 5% for the last 15m"
description: "High rate of slow inserts may be a sign of resource exhaustion for the current load.
It is likely more RAM is needed for optimal handling of the current number of active time series.
See also https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183"
- alert: ProcessNearFDLimits
expr: (process_max_fds - process_open_fds) < 100
for: 5m
labels:
severity: critical
annotations:
summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
description: "Exhausting OS file descriptors limit can cause severe degradation of the process.
Consider to increase the limit as fast as possible."
- alert: LabelsLimitExceededOnIngestion
expr: increase(vm_metrics_with_dropped_labels_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Metrics ingested to vminsert on {{ $labels.instance }} are exceeding labels limit"
description: "VictoriaMetrics limits the number of labels per each metric with `-maxLabelsPerTimeseries` command-line flag.
This prevents from ingesting metrics with too many labels.
Please verify that `-maxLabelsPerTimeseries` is configured correctly or that clients which send these metrics aren't misbehaving."
- alert: VminsertVmstorageConnectionIsSaturated
expr: rate(vm_rpc_send_duration_seconds_total[5m]) > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
description: "The connection between vminsert (instance {{ $labels.instance }}) and vmstorage (instance {{ $labels.addr }}) is saturated by more than 90% and vminsert won't be able to keep up.
This usually means that more vminsert or vmstorage nodes must be added to the cluster in order to increase the total number of vminsert -> vmstorage links."