> Among these, the only metric currently exposed with controller-runtime is rest_client_requests_total. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest cardinality we get is the host dimension (which is presumably just however many apiserver host:ports you have).
This was not the last/only time we reverted this :). xref: #2298
The metrics that we reverted in #2298 seem to be identical to the ones in your table, so I think we'll get the same result as last time if we add them again.
I think the problem is that you get really high cardinality if you communicate with a lot of clusters. For example in Cluster API if you communicate with 2k workload clusters you end up with 2000 times the metrics compared to if you only communicate with one apiserver.
I think the fundamental issue that we really have to resolve is that there is no way for consumers of controller-runtime to register metrics themselves or configure the metrics that should be registered in some way.
This was surfaced in multiple places but I think we didn't really make progress:
Summary
Expose more of the client-side metrics offered by client-go in the controller process by default, similar to what the Kubernetes built-in controllers and apiserver do.
Time and time again, the lack of these metrics in our internal controllers has prevented us from monitoring how long we are stuck in the client-side rate limiter, or what the observed latency of REST client requests in the controller is (without writing our own instrumented REST transport wrapper).
Details
client-go currently exposes the following hooks that a metrics collector can register to https://github.com/kubernetes/client-go/blob/v0.33.0/tools/metrics/metrics.go#L114-L127:
| Metric | Labels | Buckets |
| --- | --- | --- |
| rest_client_request_duration_seconds | verb, host | [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_dns_resolution_duration_seconds | host | [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0] |
| rest_client_request_size_bytes | verb, host | [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_response_size_bytes | verb, host | [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_rate_limiter_duration_seconds | verb, host | [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_requests_total | code, method, host | - |
| rest_client_request_retries_total | code, verb, host | - |
| rest_client_transport_cache_entries | - | - |
| rest_client_transport_create_calls_total | result | - |

rest_client_transport_cache_entries is the number of transport entries in the internal cache. rest_client_transport_create_calls_total is a counter of the number of calls to get a new transport, partitioned by the result of the operation.
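For context on what "writing our own instrumented REST transport wrapper" looks like without these hooks, here is a minimal stdlib-only sketch. It is not how client-go or component-base implement the metric: `latencyHistogram`, `instrumentedTransport`, and the helper names are hypothetical, the HTTP method stands in for the `verb` label, and only the bucket bounds are taken from the list given above for rest_client_request_duration_seconds.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// latencyBuckets holds the upper bounds (in seconds) listed in the issue for
// rest_client_request_duration_seconds.
var latencyBuckets = []float64{0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0}

// seriesKey identifies one (verb, host) label combination.
type seriesKey struct{ verb, host string }

// latencyHistogram is a hypothetical, minimal stand-in for a Prometheus
// histogram vector. Each key holds one non-cumulative counter per bucket,
// plus a final slot for observations above the largest bound (+Inf).
type latencyHistogram struct {
	mu     sync.Mutex
	counts map[seriesKey][]uint64
}

func newLatencyHistogram() *latencyHistogram {
	return &latencyHistogram{counts: map[seriesKey][]uint64{}}
}

// bucketIndex returns the index of the first bucket whose upper bound is
// >= seconds, or len(latencyBuckets) for the +Inf slot.
func bucketIndex(seconds float64) int {
	for i, upper := range latencyBuckets {
		if seconds <= upper {
			return i
		}
	}
	return len(latencyBuckets)
}

// observe records one latency sample for the given label values.
func (h *latencyHistogram) observe(verb, host string, seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	k := seriesKey{verb, host}
	if h.counts[k] == nil {
		h.counts[k] = make([]uint64, len(latencyBuckets)+1)
	}
	h.counts[k][bucketIndex(seconds)]++
}

// total returns the number of samples recorded for the given label values.
func (h *latencyHistogram) total(verb, host string) uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	var n uint64
	for _, c := range h.counts[seriesKey{verb, host}] {
		n += c
	}
	return n
}

// instrumentedTransport wraps another RoundTripper and records how long each
// round trip took (until response headers arrive, not until the body is read).
type instrumentedTransport struct {
	next http.RoundTripper
	hist *latencyHistogram
}

func (t *instrumentedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	// Assumption: the HTTP method is used where client-go would report a verb.
	t.hist.observe(req.Method, req.URL.Host, time.Since(start).Seconds())
	return resp, err
}

func main() {
	// A local test server plays the role of an apiserver.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	hist := newLatencyHistogram()
	client := &http.Client{Transport: &instrumentedTransport{next: http.DefaultTransport, hist: hist}}

	resp, err := client.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	fmt.Println("observations:", hist.total(http.MethodGet, srv.Listener.Addr().String()))
}
```

Every controller that wants this visibility would have to carry and maintain something like the wrapper above, which is the duplication the proposal is trying to avoid.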
Among these, the only metric currently exposed with controller-runtime is rest_client_requests_total. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest cardinality we get is the host dimension (which is presumably just however many apiserver host:ports you have).
Proposal
controller-runtime starts exposing all of the listed metrics by default (by copying them from k8s.io/component-base).
The existing rest_client_requests_total metric should remain unmodified.
The ExecPluginCalls hook (i.e. the rest_client_exec_plugin_call_total metric) should be left out, as it is very rarely, if ever, useful for a controller process.
Considerations
Stability: ALL of the metrics listed above are at the ALPHA stage in component-base and in the k8s.io Metrics Documentation, presumably for components like kube-scheduler, kube-controller-manager, etc. Do we also offer them as stable? Or do we break users later?
Cardinality: Some histogram metrics have 10-12 buckets. In a large cluster setup with 10 apiservers x 4 verbs, it can easily reach 400+ time series per metric (still bounded though).
Future improvements: client-go offers a url value in one of the hook functions. This url is actually a value that's free of resource {namespace,name} (i.e. it's bounded cardinality for us!) but is available only in one metric hook 😢. component-base basically uses that url.URL value to find the host label. However, if client-go some day starts providing a url label for every metric, it would be even more useful, but we'd likely need to break the metrics.
/kind design
/cc @alvaroaleman
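The cardinality estimate in the considerations, and the multi-cluster concern raised in the discussion, can be checked with a little arithmetic. The sketch below assumes the Prometheus exposition format, where a histogram emits one series per finite bucket plus an implicit +Inf bucket, a _sum, and a _count for each label combination. The figure of 4 verbs comes from the issue text; treating each remote cluster as a single host value is an assumption, and the function names are illustrative only.

```go
package main

import "fmt"

// bucketSeries counts only the finite-bucket series of one histogram metric,
// which is the figure the issue's "400+" estimate refers to.
func bucketSeries(hosts, verbs, buckets int) int {
	return hosts * verbs * buckets
}

// histogramSeries counts every series one histogram metric produces: per
// (host, verb) pair, the finite buckets plus +Inf, _sum, and _count.
func histogramSeries(hosts, verbs, buckets int) int {
	return hosts * verbs * (buckets + 3)
}

func main() {
	// 10 apiserver hosts x 4 verbs, as in the issue.
	fmt.Println(bucketSeries(10, 4, 10))    // 400
	fmt.Println(bucketSeries(10, 4, 12))    // 480
	fmt.Println(histogramSeries(10, 4, 12)) // 600

	// 2000 workload clusters, assuming one host value per cluster.
	fmt.Println(histogramSeries(2000, 4, 12)) // 120000
}
```

The count stays bounded for a fixed set of apiservers, but it grows linearly with the number of distinct hosts, which is why a controller that talks to thousands of clusters sees a very different picture from one that talks to a single apiserver.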