
Increase the number and granularity of buckets for some of kube-scheduler's alpha metrics. Make exporting flag-gated #131551


Open
ania-borowiec wants to merge 3 commits into master

Conversation

ania-borowiec
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This is part of fixing #127384
Before this PR, the values exported in the scheduler_plugin_execution_duration_seconds_bucket metric did not provide substantial information to kube-scheduler users or to the sig-scheduling team, while also being a headache for cluster admins because the extra time series take up memory.
This PR introduces more granular buckets for:
- scheduler_event_handling_duration_seconds
- scheduler_plugin_execution_duration_seconds
- scheduler_queueing_hint_execution_duration_seconds
- scheduler_scheduling_algorithm_duration_seconds
It also introduces a feature gate, SchedulerHighPrecisionMetrics (disabled by default), that toggles exporting the following metrics, to help save memory in clusters:
- scheduler_event_handling_duration_seconds
- scheduler_plugin_execution_duration_seconds
- scheduler_queueing_hint_execution_duration_seconds
The bucket boundaries have been adjusted to fit both the scheduler's performance tests and users' requests.
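
For reviewers, here is a minimal sketch of the shape of the change (not the actual diff in this PR): a more granular alpha histogram registered via component-base/metrics, with registration gated so that clusters that do not opt in pay no extra memory. The bucket boundaries, the label set, and the features.SchedulerHighPrecisionMetrics constant path are assumptions for illustration only.

```go
// Sketch only: assumed bucket boundaries and an assumed feature-gate constant.
package metrics

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
	"k8s.io/kubernetes/pkg/features"
)

// PluginExecutionDuration uses exponential buckets from ~10µs up to ~10s so
// that both very fast and unusually slow plugin calls land in distinct buckets.
var PluginExecutionDuration = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem:      "scheduler",
		Name:           "plugin_execution_duration_seconds",
		Help:           "Duration for running a plugin at a specific extension point.",
		Buckets:        metrics.ExponentialBuckets(0.00001, 1.5, 35), // assumed values
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"plugin", "extension_point", "status"},
)

// Register exports the high-precision histogram only when the gate is enabled,
// so clusters that do not opt in pay no memory cost for the extra time series.
// The features.SchedulerHighPrecisionMetrics constant path is assumed here.
func Register() {
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerHighPrecisionMetrics) {
		legacyregistry.MustRegister(PluginExecutionDuration)
	}
}
```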

Which issue(s) this PR fixes:

Fixes #131209

Special notes for your reviewer:

I will create issues with very specific descriptions for refactoring these tests to use mocks (and thus actually become unit tests). Hopefully I can phrase them clearly enough to be 'good first issues'.

Does this PR introduce a user-facing change?

Increased accuracy of scheduler metrics: 
- scheduler_event_handling_duration_seconds
- scheduler_plugin_execution_duration_seconds
- scheduler_queueing_hint_execution_duration_seconds
- scheduler_scheduling_algorithm_duration_seconds

Export of metrics:
- scheduler_event_handling_duration_seconds
- scheduler_plugin_execution_duration_seconds
- scheduler_queueing_hint_execution_duration_seconds
is gated by the SchedulerHighPrecisionMetrics feature gate (disabled by default).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 30, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 30, 2025
@k8s-ci-robot
Contributor

Hi @ania-borowiec. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 30, 2025
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 30, 2025
@ania-borowiec
Contributor Author

/cc dgrisonnet richabanker sanposhiho

@sanposhiho
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2025
@sanposhiho
Member

Why do they have to be feature-gated? Like, can we just mark new ones as alpha metrics, and then drop the old ones later? That's the common approach of replacing the metrics.

@richabanker
Contributor

richabanker commented Apr 30, 2025

Why do they have to be feature-gated? Like, can we just mark new ones as alpha metrics, and then drop the old ones later? That's the common approach of replacing the metrics.

I think the reasoning was that we do not want to expose these high-cardinality metrics (since the buckets have been made more granular) by default unless opted in, particularly since these metrics would contain granular data more likely to be used for development/perf-testing than for general monitoring of clusters/the scheduler. Also, these will eventually be replaced with native-histogram metrics, ref

@sanposhiho
Member

sanposhiho commented Apr 30, 2025

The idea here sounds like: some clusters (+ the scheduler perf test) want to have this feature, but others might not, so a feature gate is used to control ON/OFF.

I am not sure this is a suitable use case for a feature gate. A feature gate, IMO, generally means "the feature is going to be rolled out to everyone eventually".
It's not a config parameter, and it's not meant to implement an on/off switch for a case like this.
So I would prefer to have this in the scheduler config (or a scheduler flag?).

@ania-borowiec
Contributor Author

The idea here sounds like: some clusters (+ the scheduler perf test) want to have this feature, but others might not, so a feature gate is used to control ON/OFF.

I am not sure this is a suitable use case for a feature gate. A feature gate, IMO, generally means "the feature is going to be rolled out to everyone eventually". It's not a config parameter, and it's not meant to implement an on/off switch for a case like this. So I would prefer to have this in the scheduler config (or a scheduler flag?).

SGTM, I'll rework the code and ping you, thanks

@sanposhiho
Member

Given that scheduler-perf, which doesn't start the scheduler via the command, would use it, I am thinking that the scheduler config would probably be better than a command flag. But I am open to discussion if anyone has a different opinion.

these will eventually be replaced with native-histogram metrics

I want to dig a bit more before designing the config parameter: do we think those native-histogram metrics would completely replace the existing metrics for all users? Or will some use cases still need the existing old ones, so that we need to keep both after all?

Also, do we know the timeline for when those new features are going to arrive?

@ania-borowiec
Contributor Author

AFAIU the main difference (from our perspective) between the new histograms and the ones we currently use is that they can optimize memory usage. As a BIG simplification: you can configure them to use exponential buckets with a given growth factor, but if your values only fall into 4 buckets, then only those 4 time series get exported.
So I think native histograms could be a solution for all use cases.
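
For context, a minimal sketch of what such a native histogram looks like with Prometheus client_golang (assumed metric name and settings; this is not scheduler code): you set a growth factor instead of explicit boundaries, and only the buckets that actually receive observations exist.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Native histogram: no explicit bucket list. Each bucket is at most 10%
	// wider than the previous one, and unused buckets simply do not exist.
	dur := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:                            "demo_plugin_execution_duration_seconds",
		Help:                            "Example native histogram with sparse exponential buckets.",
		NativeHistogramBucketFactor:     1.1,
		NativeHistogramMaxBucketNumber:  100,       // resolution is reduced if this is exceeded
		NativeHistogramMinResetDuration: time.Hour, // earliest point at which a reset may happen
	})
	prometheus.MustRegister(dur)

	// Only the buckets these observations fall into are tracked and exported.
	dur.Observe(0.004)
	dur.Observe(0.025)
}
```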

The timeline is not yet known; AFAIR there is hope for a KEP in 1.34 - @richabanker please chime in if I got that wrong, thanks!

@sanposhiho
Member

sanposhiho commented May 5, 2025

there is hope for a KEP in 1.34

If it's planned for the next release and will completely replace the existing ones, why do we need this PR, which modifies the existing metrics and will also be released in 1.34?

@ania-borowiec
Contributor Author

That's mostly because I wasn't sure how to go about the feature gates. Ideally we'd like to get this ready for users to use as soon as possible.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ania-borowiec
Once this PR has been reviewed and has the lgtm label, please assign macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label May 5, 2025
@sanposhiho
Member

This change would, either way, be published to users in 1.34, hence the same timing.
So I would prefer to just wait for the new feature to be available and adopt it as soon as possible, so that it lands in the scheduler in 1.34. That is much better than adding a parameter to control old metrics that are going to be removed very soon.

@sanposhiho
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2025
@ania-borowiec
Contributor Author

Do you think there's any way we could deliver it sooner, so that users can use this right away, e.g. in their tests?
The user in #131209 explicitly requested fixing the plugin execution metric; I wish we could help them sooner rather than relying on the native histograms timeline.

@sanposhiho
Member

We cannot deliver any new features without a new release, and the soonest new release is 1.34.

@ania-borowiec
Contributor Author

So I guess the key question now is: @richabanker how confident are you about native histograms being available to use in 1.34? :)

@sanposhiho
Member

sanposhiho commented May 5, 2025

I would probably say the same even if we end up delaying it to 1.35, though.
I was guessing it was in the far-away future, but apparently it is already on the timeline. So I cannot agree to adding this feature along with a config param only to fix the problem one release faster, when it will no longer be needed in the next release (1.35). This metric has existed for a long time and we haven't received many complaints; that also indicates the fix priority is not that high.
I'd rather just park this problem until the native histograms arrive, as long as it's on SIG Instrumentation's work table.

For the scheduler-perf purpose as well, for now we're mostly looking at scheduling throughput, which isn't impacted by this bucket problem. If we need to utilize those problematic metrics at some point before the native histograms come, then we can start considering something, probably just providing an internal option so that scheduler_perf can increase the granularity. There would be no need to add a new API to the config or a flag even in that case.
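
For illustration only, a hypothetical sketch of that kind of internal-only option (all names here are invented, nothing is proposed API): a registration helper that scheduler_perf could call with finer buckets, without exposing any config or flag to users.

```go
// Hypothetical sketch: an internal-only bucket override, no public API.
package metrics

import "k8s.io/component-base/metrics"

// histogramOption tweaks HistogramOpts before a metric is built.
type histogramOption func(*metrics.HistogramOpts)

// WithBuckets overrides the default bucket boundaries; a harness such as
// scheduler_perf could pass metrics.ExponentialBuckets(...) here.
func WithBuckets(buckets []float64) histogramOption {
	return func(o *metrics.HistogramOpts) { o.Buckets = buckets }
}

// newPluginExecutionHistogram builds the plugin-execution histogram with the
// default (coarse) buckets unless an override is supplied.
func newPluginExecutionHistogram(opts ...histogramOption) *metrics.HistogramVec {
	ho := &metrics.HistogramOpts{
		Subsystem:      "scheduler",
		Name:           "plugin_execution_duration_seconds",
		Help:           "Duration for running a plugin at a specific extension point.",
		Buckets:        metrics.ExponentialBuckets(0.00001, 1.5, 20), // assumed default
		StabilityLevel: metrics.ALPHA,
	}
	for _, opt := range opts {
		opt(ho)
	}
	return metrics.NewHistogramVec(ho, []string{"plugin", "extension_point", "status"})
}
```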

@k8s-ci-robot
Contributor

@ania-borowiec: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-integration
Commit: 8c9c2fe
Details: link
Required: true
Rerun command: /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@richabanker
Contributor

I imagine the native-histogram support will be a long-running effort where we gradually onboard different control plane components and enable the support one by one for each. Even while doing so, I think we want to continue supporting the old format along with the new one, just so the migration is not breaking for users, before defaulting to always exporting native histograms. I can only point you to this slack thread atm, but maybe @dgrisonnet can shed more light on the timeline once he's back.

Till then, I think having a scheduler config option to turn on the more granular buckets is a neat way to make these metrics usable for end users in the short term.
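
Purely to make that concrete, a hypothetical sketch of what such a knob could look like in the component config types (the field name and placement are invented; no such field exists in kubescheduler.config.k8s.io today):

```go
// Hypothetical only: no such field exists in the scheduler config API today.
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// KubeSchedulerConfiguration (excerpt) with an invented opt-in field.
type KubeSchedulerConfiguration struct {
	metav1.TypeMeta

	// HighPrecisionMetrics, when true, would register the finer-grained
	// duration histograms; it would default to false so that clusters that
	// do not opt in keep the current metric memory footprint.
	HighPrecisionMetrics bool

	// ... existing fields elided ...
}
```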

@sanposhiho
Member

the native-histogram support will be a long-running effort where we gradually onboard different control plane components and enable the support one by one for each.

Then, can we just be a first migrator and adopt it ASAP, instead of putting effort into this feature?
That would surely save us from wasted duplicate effort.

we want to continue supporting the old format along with the new one, just so the migration is not breaking for users, before defaulting to always exporting native histograms.

I know we won't remove the old ones immediately.
But once the native-hist feature is implemented, for users who need granular buckets we can just tell them to use the new metrics by enabling the native-hist feature flag.
There is no need to implement this parameter for them in addition to the native-hist feature.

Rather, why would we add a new config parameter and suggest that users start using it, while knowing that a better native-hist feature is coming and those old metrics will eventually be deprecated?
That would cause duplicate effort and confusion not only for us but also for those users.


For those reasons, I cannot say yes to adding a new config parameter just as a temporary, short-term mitigation for the problem, especially given that it has existed for a long time and not many people have complained (i.e., it is a low-priority issue).

@richabanker
Contributor

Rather, why would we add a new config parameter and suggest that users start using it, while knowing that a better native-hist feature is coming and those old metrics will eventually be deprecated?
That would cause duplicate effort and confusion not only for us but also for those users.

For sure, we definitely do not want any wasted effort here, which is why we (sig-instrumentation) had initially asked if it's ok to wait on this until native-hist becomes available, ref - though it was mentioned at the time that the current state of these metrics is broken and they are basically unusable for users, because the buckets as they exist today do not capture meaningful data. Which is not a good state either. Hence this PR was agreed upon as a short-term fix to unblock users of these metrics. But if that's not the case, and since there haven't been a lot of complaints, maybe it's ok to wait for the native-hist introduction after all?

Then, can we just be a first migrator and adopt it ASAP?

It is actually great that we already have a use case and an early adopter ready to play with native histograms once they are ready. We will make a note of this in the KEP and involve you in the relevant discussions. cc @dgrisonnet

@krisztianfekete

Hey, so I've recently added native histogram support to controller-runtime here: kubernetes-sigs/controller-runtime#3165 (and across our projects, too).

As described there, these are opt-in features for users (well, Prometheus instances): today you explicitly need to enable --enable-feature=native-histograms, and without that, Prometheus will ignore the native ones. There's also a way to scrape both, so there are various ways to avoid breaking anything (you can even have a compatible solution with OR queries in Grafana dashboards).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alaypatel07
Contributor

alaypatel07 commented May 16, 2025

@sanposhiho @richabanker @ania-borowiec since the current metrics are not usable for plugins that take longer than 0.022 seconds, can we treat this as a bug, fix it with a config, and backport it to supported releases?

The newer histogram metrics, when available, can replace the existing ones. But I want to emphasize that even if the new histogram metrics become available, they will be in 1.34+ at best. We still have a granularity issue for users running supported 1.30-1.33 clusters, which needs attention.

@sanposhiho
Member

sanposhiho commented May 16, 2025

I know (and am sorry) it is very inconvenient for your case, but, as I mentioned:

  1. We basically cannot introduce a new config in a patch release; that would be regarded as a new feature.
  2. A patch release can only include fixes for critical bugs, and we cannot justify this problem as critical enough. This metric has existed since 2019 and we hadn't noticed the problem until recently; that also indicates it is not critical.

Successfully merging this pull request may close these issues.

plugin execution metric buckets are not useful for debugging high latency plugins