controller histogram buckets configuration #258
Comments
The 1st option makes sense to me.
I think we can probably choose some better buckets, too. This same discussion is being had right now in the main k/k repo ;-).
It seems to me like exposing the exact mechanics of how we observe metrics is exposing an implementation detail as public API (at least, that's my gut feeling), particularly since the only way to actually change the bucket definition would be to expose the individual metrics at an API level. That would mean that we'd be tied to a particular set of metrics with a particular interface for measuring them (and it would mean that only a major revision could change/remove metrics, but it's unclear if that's a bad thing). If we ever wanted to switch to, say, contextual metrics measurement, we couldn't without a major version rev, even if we didn't actually change the metrics exposed.

My instinct here is to fix our buckets and wait for more evidence that a "one-size-fits-all" solution won't work here, especially since we have little evidence that our current buckets are correct for many cases. The Kubernetes SLO for request latency against the API server is …
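To make the trade-off concrete, here is a purely hypothetical Go sketch of what exposing bucket configuration as a public API could look like. Neither `SetReconcileTimeBuckets` nor this package layout exists in controller-runtime, and the metric name is only illustrative; the point is that once such a setter is exported, the library is committed to that metric and its measurement interface.

```go
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	bucketsMu        sync.Mutex
	reconcileBuckets = prometheus.DefBuckets // today's behaviour: library-chosen defaults
)

// SetReconcileTimeBuckets is a hypothetical public knob letting callers override
// the histogram buckets before the controller wires up its metrics. Exporting it
// ties the library to this particular metric and interface, so changing or
// removing the metric later becomes a breaking change.
func SetReconcileTimeBuckets(b []float64) {
	bucketsMu.Lock()
	defer bucketsMu.Unlock()
	reconcileBuckets = b
}

// newReconcileTimeHistogram is what the controller would call internally once
// user configuration is done.
func newReconcileTimeHistogram() *prometheus.HistogramVec {
	bucketsMu.Lock()
	defer bucketsMu.Unlock()
	return prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "controller_runtime_reconcile_time_seconds", // illustrative name
		Help:    "Length of time per reconcile per controller",
		Buckets: reconcileBuckets,
	}, []string{"controller"})
}
```

Callers would have to invoke `SetReconcileTimeBuckets` before starting their manager, which is exactly the kind of ordering and compatibility contract the comment above is wary of committing to.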
Ref: kubernetes/kubernetes#63750 and kubernetes/kubernetes#67476 are the upstream issue and PR.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale

looks like kube has fixed this upstream, so we can follow suit

/good-first-issue
@DirectXMan12: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority awaiting-more-evidence
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Hi @DirectXMan12
…gram metrics

The current metric uses the Prometheus default buckets for the reconcile time histogram. Those buckets are not sufficient to reason about the percentile of requests that take less than x seconds when x falls outside the largest default bucket of 10 seconds. It's also hard to infer when the reconcile loops are fairly fast, as mentioned in this issue: kubernetes-sigs#258. This PR attempts to define explicit buckets for the metrics; the values are chosen based on the apiserver request latency buckets defined here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L103. The default Prometheus histogram buckets have also been added (wherever missing) to ensure backward compatibility.
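As a rough illustration of that approach (not the actual diff from the PR), a histogram with explicit buckets might look like the sketch below. The metric name and the values past 10s are assumptions; the first eleven values reproduce the Prometheus defaults so existing bucket boundaries are preserved.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// reconcileTime defines explicit buckets instead of relying on
// prometheus.DefBuckets. Buckets must be strictly increasing; the first eleven
// values mirror the Prometheus defaults (5ms .. 10s) for backward
// compatibility, and the tail adds coverage for slow reconcile loops.
var reconcileTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "controller_runtime_reconcile_time_seconds", // illustrative name
	Help: "Length of time per reconcile per controller",
	Buckets: []float64{
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, // DefBuckets
		15, 20, 30, 45, 60, // illustrative additions
	},
}, []string{"controller"})

func init() {
	// Registered with the default registry here only to keep the sketch
	// self-contained; controller-runtime exposes its own metrics registry.
	prometheus.MustRegister(reconcileTime)
}
```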
Metrics are created for the controller's workqueue, using Prometheus's default buckets. Unfortunately, the default buckets are poorly chosen for event processing.
This can easily result in metrics that are extremely coarse. In a controller I was working on today, every single reconcile was faster than the smallest bucket.
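For context, these are the default buckets shipped in `prometheus/client_golang` that the histogram falls back to; anything under 5ms lands in the very first bucket, which is why a fast reconcile loop produces such a coarse histogram:

```go
// DefBuckets as defined in github.com/prometheus/client_golang/prometheus:
// the smallest boundary is 5ms and the largest 10s, so a controller whose
// reconciles all finish in under 5ms reports every observation in one bucket.
var DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
```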
It is possible to change the default buckets by modifying `DefBuckets` in `init`, as my `init` will be called after the `DefBuckets` variable has been initialized, but before the controller metrics package's `init`. But this is a very heavy-handed brush, changing the defaults of all histograms.
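A minimal sketch of that workaround, under the assumption that this package's `init` runs after `client_golang` has initialized `DefBuckets` but before the controller metrics package constructs its histograms (which depends entirely on package initialization order):

```go
// Package fixbuckets mutates Prometheus's global default buckets. This is the
// heavy-handed approach described above: it changes the defaults for every
// histogram subsequently constructed with prometheus.DefBuckets, not just the
// controller metrics.
package fixbuckets

import "github.com/prometheus/client_golang/prometheus"

func init() {
	// Illustrative values: add sub-millisecond buckets for fast reconcile loops
	// while keeping coverage up to 10s.
	prometheus.DefBuckets = []float64{
		0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025,
		0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
	}
}
```

Such a package would typically be pulled in with a blank import for its side effect, and whether it actually runs early enough is at the mercy of the import graph, which is part of why it is such a blunt instrument.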
I propose two paths forward:

My preference would be the first option, as the change to the system would be more easily understood.