Commit 6a0435a

don't remove metrics
1 parent d91138c commit 6a0435a

2 files changed

+701
-0
lines changed

website/content/en/docs/reference/metrics.md

Lines changed: 350 additions & 0 deletions
@@ -8,6 +8,328 @@ description: >
88
---
99
<!-- this document is generated from hack/docs/metrics_gen/main.go -->
1010
Karpenter makes several metrics available in Prometheus format to allow monitoring cluster provisioning status. By default, these metrics are served at `karpenter.kube-system.svc.cluster.local:8080/metrics`; the port is configurable via the `METRICS_PORT` environment variable, documented [here](../settings).
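
As an illustration of consuming this endpoint, the following is a minimal sketch of parsing Prometheus text exposition output into samples. It works on an inline string rather than a live cluster; the sample values and the `version` label value are hypothetical, and the parser deliberately ignores timestamps and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escaped quotes allowed).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output; real label names and values may differ.
SAMPLE_TEXT = """\
# TYPE karpenter_build_info gauge
karpenter_build_info{version="0.0.0-example"} 1
karpenter_cluster_state_node_count 3
"""

samples = parse_metrics(SAMPLE_TEXT)
for name, labels, value in samples:
    print(name, labels, value)
```

Against a running deployment, the text could instead be fetched from the `/metrics` URL above (for example with `urllib.request.urlopen`), assuming the service is reachable from where the script runs.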
11+
### `karpenter_ignored_pod_count`
12+
Number of pods ignored during scheduling by Karpenter
13+
- Stability Level: ALPHA
14+
15+
### `karpenter_build_info`
16+
A metric with a constant '1' value labeled by version from which karpenter was built.
17+
- Stability Level: STABLE
18+
19+
## Nodeclaims Metrics
20+
21+
### `karpenter_nodeclaims_termination_duration_seconds`
22+
Duration of NodeClaim termination in seconds.
23+
- Stability Level: BETA
24+
25+
### `karpenter_nodeclaims_terminated_total`
26+
Number of nodeclaims terminated in total by Karpenter. Labeled by the owning nodepool.
27+
- Stability Level: STABLE
28+
29+
### `karpenter_nodeclaims_instance_termination_duration_seconds`
30+
Duration of CloudProvider Instance termination in seconds.
31+
- Stability Level: BETA
32+
33+
### `karpenter_nodeclaims_disrupted_total`
34+
Number of nodeclaims disrupted in total by Karpenter. Labeled by reason the nodeclaim was disrupted and the owning nodepool.
35+
- Stability Level: ALPHA
36+
37+
### `karpenter_nodeclaims_created_total`
38+
Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool.
39+
- Stability Level: STABLE
40+
41+
### `operator_nodeclaim_status_condition_transitions_total`
42+
The count of transitions of a nodeclaim, type and status. Labeled by the type, reason, and status.
43+
- Stability Level: BETA
44+
45+
### `operator_nodeclaim_status_condition_transition_seconds`
46+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
47+
- Stability Level: BETA
48+
49+
### `operator_nodeclaim_status_condition_current_status_seconds`
50+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
51+
- Stability Level: BETA
52+
53+
### `operator_nodeclaim_status_condition_count`
54+
The number of a condition for a nodeclaim, type and status. Labeled by the name, namespace, type, status, and reason.
55+
- Stability Level: BETA
56+
57+
### `operator_nodeclaim_termination_current_time_seconds`
58+
The current amount of time in seconds that a nodeclaim has been in terminating state. Labeled by name, and namespace.
59+
- Stability Level: BETA
60+
61+
### `operator_nodeclaim_termination_duration_seconds`
62+
The amount of time taken by a nodeclaim to terminate completely.
63+
- Stability Level: BETA
64+
65+
## Nodes Metrics
66+
67+
### `karpenter_nodes_total_pod_requests`
68+
Node total pod requests are the resources requested by pods bound to nodes, including the DaemonSet pods.
69+
- Stability Level: BETA
70+
71+
### `karpenter_nodes_total_pod_limits`
72+
Node total pod limits are the resources specified by pod limits, including the DaemonSet pods.
73+
- Stability Level: BETA
74+
75+
### `karpenter_nodes_total_daemon_requests`
76+
Node total daemon requests are the resources requested by DaemonSet pods bound to nodes.
77+
- Stability Level: BETA
78+
79+
### `karpenter_nodes_total_daemon_limits`
80+
Node total daemon limits are the resources specified by DaemonSet pod limits.
81+
- Stability Level: BETA
82+
83+
### `karpenter_nodes_termination_duration_seconds`
84+
The time taken between a node's deletion request and the removal of its finalizer
85+
- Stability Level: BETA
86+
87+
### `karpenter_nodes_terminated_total`
88+
Number of nodes terminated in total by Karpenter. Labeled by owning nodepool.
89+
- Stability Level: STABLE
90+
91+
### `karpenter_nodes_system_overhead`
92+
Node system daemon overhead is the resources reserved for system overhead: the difference between the node's capacity and allocatable values, as reported by the status.
93+
- Stability Level: BETA
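
The description above amounts to a per-resource subtraction. A minimal sketch, assuming capacity and allocatable are available as plain dictionaries (the resource keys and numbers below are hypothetical, and this is not Karpenter's own code):

```python
def system_overhead(capacity, allocatable):
    """Return per-resource overhead: capacity minus allocatable."""
    return {
        resource: capacity[resource] - allocatable.get(resource, 0)
        for resource in capacity
    }


# Hypothetical node status values.
capacity = {"cpu_millicores": 4000, "memory_bytes": 16 * 1024**3}
allocatable = {"cpu_millicores": 3920, "memory_bytes": 15 * 1024**3}

overhead = system_overhead(capacity, allocatable)
print(overhead)
```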
94+
95+
### `karpenter_nodes_lifetime_duration_seconds`
96+
The lifetime duration of the nodes since creation.
97+
- Stability Level: ALPHA
98+
99+
### `karpenter_nodes_eviction_requests_total`
100+
The total number of eviction requests made by Karpenter
101+
- Stability Level: ALPHA
102+
103+
### `karpenter_nodes_drained_total`
104+
The total number of nodes drained by Karpenter
105+
- Stability Level: ALPHA
106+
107+
### `karpenter_nodes_current_lifetime_seconds`
108+
Node age in seconds
109+
- Stability Level: ALPHA
110+
111+
### `karpenter_nodes_created_total`
112+
Number of nodes created in total by Karpenter. Labeled by owning nodepool.
113+
- Stability Level: STABLE
114+
115+
### `karpenter_nodes_allocatable`
116+
Node allocatable are the resources allocatable by nodes.
117+
- Stability Level: BETA
118+
119+
### `operator_node_status_condition_transitions_total`
120+
The count of transitions of a node, type and status.
121+
- Stability Level: BETA
122+
123+
### `operator_node_status_condition_transition_seconds`
124+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
125+
- Stability Level: BETA
126+
127+
### `operator_node_status_condition_current_status_seconds`
128+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
129+
- Stability Level: BETA
130+
131+
### `operator_node_status_condition_count`
132+
The number of a condition for a node, type and status. Labeled by the name, namespace, type, status, and reason.
133+
- Stability Level: BETA
134+
135+
### `operator_node_termination_current_time_seconds`
136+
The current amount of time in seconds that a node has been in terminating state. Labeled by name, and namespace.
137+
- Stability Level: BETA
138+
139+
### `operator_node_termination_duration_seconds`
140+
The amount of time taken by a node to terminate completely.
141+
- Stability Level: BETA
142+
143+
### `operator_node_event_count`
144+
The number of events for a node.
145+
- Stability Level: BETA
146+
147+
## Pods Metrics
148+
149+
### `karpenter_pods_state`
150+
Pod state is the current state of pods. This metric can be used several ways as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type and pod phase.
151+
- Stability Level: BETA
152+
153+
### `karpenter_pods_startup_duration_seconds`
154+
The time from pod creation until the pod is running.
155+
- Stability Level: STABLE
156+
157+
## Termination Metrics
158+
159+
### `operator_termination_duration_seconds`
160+
The amount of time taken by an object to terminate completely.
161+
- Stability Level: DEPRECATED
162+
163+
### `operator_termination_current_time_seconds`
164+
The current amount of time in seconds that an object has been in terminating state.
165+
- Stability Level: DEPRECATED
166+
167+
## Voluntary Disruption Metrics
168+
169+
### `karpenter_voluntary_disruption_queue_failures_total`
170+
The number of times that an enqueued disruption decision failed. Labeled by disruption method.
171+
- Stability Level: BETA
172+
173+
### `karpenter_voluntary_disruption_eligible_nodes`
174+
Number of nodes eligible for disruption by Karpenter. Labeled by disruption reason.
175+
- Stability Level: BETA
176+
177+
### `karpenter_voluntary_disruption_decisions_total`
178+
Number of disruption decisions performed. Labeled by disruption decision, reason, and consolidation type.
179+
- Stability Level: STABLE
180+
181+
### `karpenter_voluntary_disruption_decision_evaluation_duration_seconds`
182+
Duration of the disruption decision evaluation process in seconds. Labeled by method and consolidation type.
183+
- Stability Level: BETA
184+
185+
### `karpenter_voluntary_disruption_consolidation_timeouts_total`
186+
Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type.
187+
- Stability Level: BETA
188+
189+
## Scheduler Metrics
190+
191+
### `karpenter_scheduler_scheduling_duration_seconds`
192+
Duration of scheduling simulations used for deprovisioning and provisioning in seconds.
193+
- Stability Level: STABLE
194+
195+
### `karpenter_scheduler_queue_depth`
196+
The number of pods currently waiting to be scheduled.
197+
- Stability Level: BETA
198+
199+
## Nodepools Metrics
200+
201+
### `karpenter_nodepools_usage`
202+
The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type.
203+
- Stability Level: ALPHA
204+
205+
### `karpenter_nodepools_limit`
206+
Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type.
207+
- Stability Level: ALPHA
208+
209+
### `karpenter_nodepools_allowed_disruptions`
210+
The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point.
211+
- Stability Level: ALPHA
212+
213+
### `operator_nodepool_status_condition_transitions_total`
214+
The count of transitions of a nodepool, type and status. Labeled by the type, reason, and status.
215+
- Stability Level: BETA
216+
217+
### `operator_nodepool_status_condition_transition_seconds`
218+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
219+
- Stability Level: BETA
220+
221+
### `operator_nodepool_status_condition_current_status_seconds`
222+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
223+
- Stability Level: BETA
224+
225+
### `operator_nodepool_status_condition_count`
226+
The number of a condition for a nodepool, type and status. Labeled by the name, namespace, type, status, and reason.
227+
- Stability Level: BETA
228+
229+
### `operator_nodepool_termination_current_time_seconds`
230+
The current amount of time in seconds that a nodepool has been in terminating state. Labeled by name, and namespace.
231+
- Stability Level: BETA
232+
233+
### `operator_nodepool_termination_duration_seconds`
234+
Duration of NodePool termination in seconds.
235+
- Stability Level: BETA
236+
237+
## EC2NodeClass Metrics
238+
239+
### `operator_ec2nodeclass_status_condition_transitions_total`
240+
The count of transitions of an ec2nodeclass, type and status. Labeled by the type, reason, and status.
241+
- Stability Level: BETA
242+
243+
### `operator_ec2nodeclass_status_condition_transition_seconds`
244+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
245+
- Stability Level: BETA
246+
247+
### `operator_ec2nodeclass_status_condition_current_status_seconds`
248+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
249+
- Stability Level: BETA
250+
251+
### `operator_ec2nodeclass_status_condition_count`
252+
The number of a condition for an ec2nodeclass, type and status. Labeled by the name, namespace, type, status, and reason.
253+
- Stability Level: BETA
254+
255+
### `operator_ec2nodeclass_termination_current_time_seconds`
256+
The current amount of time in seconds that an ec2nodeclass has been in terminating state. Labeled by name, and namespace.
257+
- Stability Level: BETA
258+
259+
### `operator_ec2nodeclass_termination_duration_seconds`
260+
Duration of ec2nodeclass termination in seconds.
261+
- Stability Level: BETA
262+
263+
## Interruption Metrics
264+
265+
### `karpenter_interruption_received_messages_total`
266+
Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.
267+
- Stability Level: STABLE
268+
269+
### `karpenter_interruption_message_queue_duration_seconds`
270+
Amount of time an interruption message is on the queue before it is processed by karpenter.
271+
- Stability Level: STABLE
272+
273+
### `karpenter_interruption_deleted_messages_total`
274+
Count of messages deleted from the SQS queue.
275+
- Stability Level: STABLE
276+
277+
## Cluster Metrics
278+
279+
### `karpenter_cluster_utilization_percent`
280+
Utilization of allocatable resources by pod requests
281+
- Stability Level: ALPHA
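
Read literally, the description suggests a ratio of requested to allocatable resources expressed as a percentage. The sketch below shows that arithmetic for a single resource; the function name and the zero-allocatable handling are assumptions, and Karpenter's exact calculation is not shown in this document.

```python
def utilization_percent(requested, allocatable):
    """Percentage of allocatable resources consumed by pod requests."""
    # Assumption: report 0 when nothing is allocatable, to avoid dividing by zero.
    if allocatable <= 0:
        return 0.0
    return 100.0 * requested / allocatable


# Hypothetical cluster totals: 6 CPU cores requested out of 8 allocatable.
print(utilization_percent(6.0, 8.0))
```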
282+
283+
## Cluster State Metrics
284+
285+
### `karpenter_cluster_state_unsynced_time_seconds`
286+
The time for which cluster state is not synced
287+
- Stability Level: ALPHA
288+
289+
### `karpenter_cluster_state_synced`
290+
Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter's cluster state
291+
- Stability Level: STABLE
292+
293+
### `karpenter_cluster_state_node_count`
294+
Current count of nodes in cluster state
295+
- Stability Level: STABLE
296+
297+
## Cloudprovider Metrics
298+
299+
### `karpenter_cloudprovider_instance_type_offering_price_estimate`
300+
Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone.
301+
- Stability Level: BETA
302+
303+
### `karpenter_cloudprovider_instance_type_offering_available`
304+
Instance type offering availability, based on instance type, capacity type, and zone
305+
- Stability Level: BETA
306+
307+
### `karpenter_cloudprovider_instance_type_memory_bytes`
308+
Memory, in bytes, for a given instance type.
309+
- Stability Level: BETA
310+
311+
### `karpenter_cloudprovider_instance_type_cpu_cores`
312+
vCPU cores for a given instance type.
313+
- Stability Level: BETA
314+
315+
### `karpenter_cloudprovider_errors_total`
316+
Total number of errors returned from CloudProvider calls.
317+
- Stability Level: BETA
318+
319+
### `karpenter_cloudprovider_duration_seconds`
320+
Duration of cloud provider method calls. Labeled by the controller, method name and provider.
321+
- Stability Level: BETA
322+
323+
## Cloudprovider Batcher Metrics
324+
325+
### `karpenter_cloudprovider_batcher_batch_time_seconds`
326+
Duration of the batching window per batcher
327+
- Stability Level: BETA
328+
329+
### `karpenter_cloudprovider_batcher_batch_size`
330+
Size of the request batch per batcher
331+
- Stability Level: BETA
332+
11333
## Controller Runtime Metrics
12334

13335
### `controller_runtime_terminal_reconcile_errors_total`
@@ -72,6 +394,34 @@ Current depth of workqueue by workqueue and priority
72394
Total number of adds handled by workqueue
73395
- Stability Level: STABLE
74396

397+
## Status Condition Metrics
398+
399+
### `operator_status_condition_transitions_total`
400+
The count of transitions of a given object, type and status.
401+
- Stability Level: DEPRECATED
402+
403+
### `operator_status_condition_transition_seconds`
404+
The amount of time a condition was in a given state before transitioning. e.g. Alarm := P99(Updated=False) > 5 minutes
405+
- Stability Level: DEPRECATED
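
The alarm expression above (P99 greater than 5 minutes) can be illustrated with a small sketch. It assumes raw transition durations in seconds are at hand and uses the nearest-rank percentile method; a Prometheus setup would more likely estimate the quantile from histogram buckets, so treat the helper names and the sample data as hypothetical.

```python
import math

FIVE_MINUTES_SECONDS = 5 * 60


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Rank is 1-based: the smallest rank covering pct percent of the samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alarm(durations_seconds, threshold=FIVE_MINUTES_SECONDS):
    """Return True when the P99 duration exceeds the threshold."""
    if not durations_seconds:
        return False
    return percentile(durations_seconds, 99) > threshold


# Hypothetical durations (seconds) spent with Updated=False before transitioning.
print(should_alarm([10, 20, 30, 400]))
print(should_alarm([10, 20, 30]))
```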
406+
407+
### `operator_status_condition_current_status_seconds`
408+
The current amount of time in seconds that a status condition has been in a specific state. Alarm := P99(Updated=Unknown) > 5 minutes
409+
- Stability Level: DEPRECATED
410+
411+
### `operator_status_condition_count`
412+
The number of a condition for a given object, type and status. e.g. Alarm := Available=False > 0
413+
- Stability Level: DEPRECATED
414+
415+
## Client Go Metrics
416+
417+
### `client_go_request_total`
418+
Number of HTTP requests, partitioned by status code and method.
419+
- Stability Level: STABLE
420+
421+
### `client_go_request_duration_seconds`
422+
Request latency in seconds. Broken down by verb, group, version, kind, and subresource.
423+
- Stability Level: STABLE
424+
75425
## AWS SDK Go Metrics
76426

77427
### `aws_sdk_go_request_total`
