Proposal: Expose all client-go metrics by default #3202

Open
ahmetb opened this issue Apr 29, 2025 · 2 comments · May be fixed by #3205
Labels
kind/design Categorizes issue or PR as related to design.

Comments

@ahmetb
Member

ahmetb commented Apr 29, 2025

Summary

Expose more of the client-side metrics offered by client-go in the controller process by default, similar to what the Kubernetes built-in controllers and apiserver do.

Time and time again, the lack of these metrics in our internal controllers has prevented us from monitoring how long we get stuck in the client-side rate limiter, or what latency the controller observes on its REST client requests (short of writing our own instrumented REST transport wrapper).

Details

client-go currently exposes the following hooks that a metrics collector can register with (https://github.com/kubernetes/client-go/blob/v0.33.0/tools/metrics/metrics.go#L114-L127):

| Metric Name | Type | Dimensions | Description |
|---|---|---|---|
| rest_client_request_duration_seconds | Histogram | verb, host | Request latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_dns_resolution_duration_seconds | Histogram | host | DNS resolver latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0] |
| rest_client_request_size_bytes | Histogram | verb, host | Request size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_response_size_bytes | Histogram | verb, host | Response size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_rate_limiter_duration_seconds | Histogram | verb, host | Client-side rate limiter latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_requests_total | Counter | code, method, host | Number of HTTP requests. |
| rest_client_request_retries_total | Counter | code, verb, host | Number of request retries. |
| rest_client_transport_cache_entries | Gauge | (none) | Number of transport entries in the internal cache. |
| rest_client_transport_create_calls_total | Counter | result | Number of calls to get a new transport, partitioned by the result of the operation. |

Among these, the only metric currently exposed with controller-runtime is rest_client_requests_total. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest cardinality we get is the host dimension (which is presumably just however many apiserver host:ports you have).

Proposal

  1. controller-runtime starts exposing all of the listed metrics by default (by copying their definitions from k8s.io/component-base).

  2. Existing rest_client_requests_total metric should remain unmodified.

  3. The ExecPluginCalls hook (i.e. the rest_client_exec_plugin_call_total metric) should be left out, as it is rarely, if ever, useful for a controller process.

Considerations

  1. Stability: ALL of the metrics above are marked ALPHA in component-base and in the k8s.io Metrics Documentation, presumably for components like kube-scheduler, kube-controller-manager etc. Do we also offer them as stable? Or do we break users later?

  2. Cardinality: The histogram metrics above have 11-12 buckets each. In a large cluster setup with 10 apiservers x 4 verbs, a single histogram can easily reach 400+ time series (still bounded though).

  3. Future improvements: client-go passes a url value to one of the hook functions. This url is free of the resource {namespace,name} (i.e. it has bounded cardinality for us!), but it is available in only one metric hook 😢. component-base basically uses that url.URL value to find the host label.

    However, if client-go someday starts providing a url label for every metric, the metrics would be even more useful, but we would likely need to break the existing metrics to adopt it.
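
For the cardinality consideration above, a back-of-the-envelope sketch. It assumes the usual Prometheus histogram layout: per label combination, one series per explicit bucket, plus the implicit +Inf bucket, plus _sum and _count. The function name `histogramSeries` is made up for illustration.

```go
package main

import "fmt"

// histogramSeries estimates the time series produced by one histogram metric
// with (verb, host) labels. Each label combination yields one series per
// explicit bucket, one for the +Inf bucket, and one each for _sum and _count.
func histogramSeries(hosts, verbs, buckets int) int {
	perCombination := buckets + 3
	return hosts * verbs * perCombination
}

func main() {
	// 10 apiservers x 4 verbs, as in the example above.
	fmt.Println(histogramSeries(10, 4, 12)) // 600 for a 12-bucket histogram
	fmt.Println(histogramSeries(10, 4, 11)) // 560 for an 11-bucket histogram
}
```

Counting only the explicit buckets gives 10 x 4 x 11 = 440 to 10 x 4 x 12 = 480, which is where the "400+" figure comes from; the extra three series per combination push it somewhat higher.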

/kind design
/cc @alvaroaleman

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Apr 29, 2025
@sbueringer
Member

sbueringer commented May 12, 2025

Among these, the only metric currently exposed with controller-runtime is rest_client_requests_total. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest cardinality we get is the host dimension (which is presumably just however many apiserver host:ports you have).

This was not the last/only time we reverted this :). xref: #2298

The metrics that we reverted in #2298 seem to be identical to the ones in your table. So I think we'll get the same result as last time if we add them again.

I think the problem is that you get really high cardinality if you communicate with a lot of clusters. For example, in Cluster API, if you communicate with 2k workload clusters you end up with 2000 times as many metric series as when you communicate with only one apiserver.
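
A rough sketch of that scaling, summed over the five histogram metrics from the table in the issue description. The bucket counts are taken from that table. The 4 verbs and the Prometheus layout (explicit buckets, +Inf, _sum, _count) are assumptions, and `totalSeries` is a made-up helper.

```go
package main

import "fmt"

// histogram describes one histogram metric from the table.
type histogram struct {
	name    string
	buckets int  // explicit buckets listed in the table
	perVerb bool // true if the metric has a verb label in addition to host
}

var histograms = []histogram{
	{"rest_client_request_duration_seconds", 12, true},
	{"rest_client_dns_resolution_duration_seconds", 11, false},
	{"rest_client_request_size_bytes", 11, true},
	{"rest_client_response_size_bytes", 11, true},
	{"rest_client_rate_limiter_duration_seconds", 12, true},
}

// totalSeries estimates the series across all histograms for a given number
// of apiserver hosts and verbs (assumed values, not measured ones).
func totalSeries(hosts, verbs int) int {
	total := 0
	for _, h := range histograms {
		combinations := hosts
		if h.perVerb {
			combinations *= verbs
		}
		// explicit buckets + the +Inf bucket + _sum + _count
		total += combinations * (h.buckets + 3)
	}
	return total
}

func main() {
	fmt.Println(totalSeries(1, 4))    // 246 with a single apiserver
	fmt.Println(totalSeries(2000, 4)) // 492000 with 2k workload clusters
}
```

The count grows linearly with the number of hosts, so the host label is bounded per cluster but not bounded for a controller that talks to an arbitrary number of clusters.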

@sbueringer
Member

I think the fundamental issue that we really have to resolve is that there is no way for consumers of controller-runtime to register metrics themselves or configure the metrics that should be registered in some way.

This was surfaced in multiple places but I think we didn't really make progress:

Overall I collected issues with our metrics here: #3054

I think this needs a general overhaul, but also some changes in upstream to make that possible.

3 participants