KEP-4346: Add metrics for informer #129160

Open
wants to merge 1 commit into
base: master
from

xigang
Member

@xigang xigang commented Dec 11, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

  1. Adds reflector metrics
  2. Adds informer metrics
  3. Exposes informer reflector/queue/eventHandler metrics

KEP-4346
https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4346-informer-metrics

Which issue(s) this PR fixes:

#121474
#129795
#117123
#122067 (comment)
#130767
kubernetes/client-go#1027
kubernetes-sigs/controller-runtime#817
kubernetes-sigs/controller-runtime#3189
kubernetes-sigs/controller-runtime#3182

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added metrics for reflectors and informers, covering reflector operations, queue processing, and event handling.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4346-informer-metrics

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 11, 2024
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.32 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.32.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Dec 11 12:08:11 UTC 2024.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 11, 2024
@k8s-ci-robot
Contributor

Hi @xigang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 11, 2024
p.metrics.processDuration.Observe(time.Since(startTime).Seconds())
//TODO: This requires implementing Len() and Capacity() for ring growing
// p.metrics.numberOfPendingNotifications.Set(float64(p.pendingNotifications.Len()))
// p.metrics.sizeOfRingGrowing.Set(float64(p.pendingNotifications.Capacity()))
Member Author

We need to wait for the Len() and Capacity() methods in the ring growing package to be merged.
PR: kubernetes/utils#321

Member

is this single-threaded? (is calling Len and Capacity independently and not under lock safe here, given the pendingNotifications is not thread-safe?)

Member Author

@xigang xigang Dec 18, 2024

Yes, pendingNotifications is not thread-safe. The pop() and run() goroutines read and write it concurrently, so atomic operations are needed to avoid data races.

It can be fixed as follows:

            metricsUpdateCounter++
            if metricsUpdateCounter >= metricsUpdateBatch || time.Since(lastMetricsUpdate) >= metricsUpdateInterval {
                p.metrics.processDuration.Observe(time.Since(startTime).Seconds())
                // Read count using atomic operation
                p.metrics.numberOfPendingNotifications.Set(float64(atomic.LoadInt64(&p.pendingNotificationsCount)))
                p.metrics.sizeOfRingGrowing.Set(float64(p.pendingNotifications.Cap()))

                metricsUpdateCounter = 0
                lastMetricsUpdate = time.Now()
            }
        }()
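
The idea in the snippet above can be sketched in isolation. This is a minimal, self-contained illustration and not the PR's code: `pendingBuffer`, `gauge`, and `run` are hypothetical names, and a plain slice stands in for the ring buffer. The owning goroutine is the only one that touches the slice, while a reporter goroutine reads an atomic counter that mirrors its length.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gauge is a hypothetical stand-in for a metrics gauge; it keeps the last value set.
type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(n int64) { g.v.Store(n) }
func (g *gauge) Get() int64  { return g.v.Load() }

// pendingBuffer wraps a plain slice that is NOT safe for concurrent use.
// Only the owning goroutine may touch items; other goroutines read count.
type pendingBuffer struct {
	items []int
	count atomic.Int64 // mirrors len(items) so readers never touch items
}

func (b *pendingBuffer) push(v int) {
	b.items = append(b.items, v)
	b.count.Add(1)
}

func (b *pendingBuffer) pop() (int, bool) {
	if len(b.items) == 0 {
		return 0, false
	}
	v := b.items[0]
	b.items = b.items[1:]
	b.count.Add(-1)
	return v, true
}

// Pending is safe to call from any goroutine.
func (b *pendingBuffer) Pending() int64 { return b.count.Load() }

// run pushes five items, pops two, and returns the last value the reporter published.
func run() int64 {
	buf := &pendingBuffer{}
	pending := &gauge{}
	done := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	// Reporter goroutine: reads only the atomic counter, never buf.items.
	go func() {
		defer wg.Done()
		for {
			select {
			case <-done:
				pending.Set(buf.Pending()) // final reading after the owner has finished
				return
			default:
				pending.Set(buf.Pending())
			}
		}
	}()

	// Owner: the only code that mutates buf.items.
	for i := 0; i < 5; i++ {
		buf.push(i)
	}
	buf.pop()
	buf.pop()
	close(done)
	wg.Wait()
	return pending.Get()
}

func main() {
	fmt.Println("pending:", run())
}
```

Under these assumptions the reporter never reads the unsynchronized slice, so the sketch should pass `go run -race`. Any other value read from a second goroutine, such as a capacity, would need the same treatment.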

Member Author

@xigang xigang Dec 19, 2024

Done

@xigang xigang changed the title [WIP] clent-go: Add metrics for informer clent-go: Add metrics for informer Dec 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 12, 2024
@xigang xigang changed the title clent-go: Add metrics for informer client-go: Add metrics for informer Dec 12, 2024
@xigang
Member Author

xigang commented Dec 12, 2024

/sig api-machinery
/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Dec 12, 2024
@xigang xigang changed the title client-go: Add metrics for informer KEP-4346: Add metrics for informer Dec 12, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 12, 2024
@dgrisonnet
Member

for sig-instrumentation review

/assign

@Jefftree
Member

/cc @richabanker
/triage accepted

@xigang
Member Author

xigang commented Apr 28, 2025

@pohly Yes. Based on the current input from the informer, I don't have a good way to handle this special case. If there isn't a good solution, in the short term, can we accept this special case?

Member

@dgrisonnet dgrisonnet left a comment

Two additional comments:

  • The informer and reflector provider should probably be exposed in their respective options to allow consumers to override the metrics
  • I think it could be useful if controllers propagated their name to informers, reflectors, and FIFOs. Maybe do that in a new owner label, or something of the sort, to identify the responsible controller more easily?

@@ -602,6 +603,10 @@ func newInformer(clientState Store, options InformerOptions) Controller {
KnownObjects: clientState,
EmitDeltaTypeReplaced: true,
Transformer: options.Transform,
Metrics: newInformerMetrics(InformerIdentifier{
Member

@dgrisonnet dgrisonnet Apr 28, 2025

It is probably better to decouple FIFO metrics from the informer ones. I don't think Kubernetes uses the FIFO outside of informers, but some users of client-go might.

Member Author

@xigang xigang May 4, 2025

The FIFO metrics have been decoupled into the FIFOMetricsProvider interface in fifo_metrics.go. Additionally, FIFOMetricsProvider has been exposed in DeltaFIFOOptions to allow custom providers to override the default metrics.

done.

return ringGrowingCapacity.WithLabelValues(name, resourceType, handlerName)
}

func (informerMetricsProvider) NewPrcoessDurationMetric(name string, resourceType string, handlerName string) cache.HistogramMetric {
Member

Suggested change
func (informerMetricsProvider) NewPrcoessDurationMetric(name string, resourceType string, handlerName string) cache.HistogramMetric {
func (informerMetricsProvider) NewProcessDurationMetric(name string, resourceType string, handlerName string) cache.HistogramMetric {

Member Author

Done.

@@ -58,6 +58,9 @@ type DeltaFIFOOptions struct {

// If set, log output will go to this logger instead of klog.Background().
Logger *klog.Logger

// If set, metrics will be collected for the informer.
Metrics *informerMetrics
Member

I think we should pass the provider here so that consumers of the library can override the metrics if they need. I know that controller-runtime does that with other packages. For example https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/controller/priorityqueue/priorityqueue.go#L58-L60

Member Author

@xigang xigang May 4, 2025

The FIFOMetricsProvider interface has been exposed in DeltaFIFOOptions, allowing users to provide custom providers. Additionally, the relevant provider has been exposed in both the informer and reflector options.
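
As a rough illustration of the provider-in-options pattern discussed in this thread, here is a minimal sketch. All type and method names (`GaugeMetric`, `MetricsProvider`, `QueueOptions`, `Queue`, `recordingProvider`) are hypothetical and are not the interfaces defined in this PR. It only shows how an options struct can carry a provider that falls back to a no-op default when unset.

```go
package main

import "fmt"

// GaugeMetric and MetricsProvider are hypothetical names for this sketch.
type GaugeMetric interface{ Set(float64) }

type MetricsProvider interface {
	NewDepthMetric(name string) GaugeMetric
}

// noopProvider is the default: every metric call is dropped.
type noopGauge struct{}

func (noopGauge) Set(float64) {}

type noopProvider struct{}

func (noopProvider) NewDepthMetric(string) GaugeMetric { return noopGauge{} }

// QueueOptions carries the provider so library consumers can override it.
type QueueOptions struct {
	Name            string
	MetricsProvider MetricsProvider // nil means "use the no-op default"
}

type Queue struct {
	items []string
	depth GaugeMetric
}

func NewQueue(opts QueueOptions) *Queue {
	p := opts.MetricsProvider
	if p == nil {
		p = noopProvider{}
	}
	return &Queue{depth: p.NewDepthMetric(opts.Name)}
}

// Add appends an item and publishes the new depth.
func (q *Queue) Add(item string) {
	q.items = append(q.items, item)
	q.depth.Set(float64(len(q.items)))
}

// recordingProvider is a consumer-supplied override that keeps the last value per name.
type recordingProvider struct{ last map[string]float64 }

type recordingGauge struct {
	name string
	p    *recordingProvider
}

func (g recordingGauge) Set(v float64) { g.p.last[g.name] = v }

func (p *recordingProvider) NewDepthMetric(name string) GaugeMetric {
	return recordingGauge{name: name, p: p}
}

func main() {
	rec := &recordingProvider{last: map[string]float64{}}
	q := NewQueue(QueueOptions{Name: "pods", MetricsProvider: rec})
	q.Add("a")
	q.Add("b")
	fmt.Println("pods depth:", rec.last["pods"])

	// No provider supplied: the queue still works, metrics go nowhere.
	NewQueue(QueueOptions{Name: "nodes"}).Add("x")
}
```

Resolving the default inside the constructor keeps the queue free of nil checks at every metric call site, which is one reason this shape is common in libraries.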

done.

Comment on lines 29 to 30
// makeValidPrometheusLabelValue converts a string into a valid Prometheus label value.
// A valid label value must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*
Member

This is a best practice for label and metric names, not values. Label values can be any UTF-8 sequence.
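
To make the distinction concrete, here is a small sketch using only the Go standard library. The helper names `validLabelName` and `validLabelValue` are hypothetical. The pattern is the label-name rule from the classic Prometheus data model; the variant with colons quoted in the code comment above is the metric-name rule.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// labelNameRE is the label-name pattern from the classic Prometheus data model.
// Colons are allowed only in metric names, not in label names.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validLabelName reports whether s is usable as a label name.
func validLabelName(s string) bool { return labelNameRE.MatchString(s) }

// validLabelValue reports whether s is usable as a label value:
// any valid UTF-8 string, so no sanitizing of punctuation is needed.
func validLabelValue(s string) bool { return utf8.ValidString(s) }

func main() {
	fmt.Println(validLabelName("resource_type"))              // true
	fmt.Println(validLabelName("resource-type"))              // false
	fmt.Println(validLabelValue("apps/v1, Kind=Deployment"))  // true
	fmt.Println(validLabelValue("\xff"))                      // false
}
```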

Member Author

@xigang xigang Apr 28, 2025

The makeValidPrometheusLabelValue code has been removed.

done.

@xigang xigang force-pushed the informer_metrics branch 4 times, most recently from 1effb46 to 0592ce6 Compare April 29, 2025 04:26
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 29, 2025
@xigang xigang force-pushed the informer_metrics branch from 0592ce6 to a90b2ab Compare April 29, 2025 04:33
@xigang
Member Author

xigang commented Apr 29, 2025

Thanks, @dgrisonnet . Comments addressed. PTAL.

@xigang
Member Author

xigang commented May 4, 2025

The informer and reflector provider should probably be exposed in their respective options to allow consumers to override the metrics

Done.

@xigang
Member Author

xigang commented May 4, 2025

I think it could be useful if controllers propagated their name to informers, reflectors, and FIFOs. Maybe do that in a new owner label, or something of the sort, to identify the responsible controller more easily?

The Name field has been exposed in both SharedIndexInformerOptions and DeltaFIFOOptions, making it easier to identify the controller associated with the informer and DeltaFIFO metrics.
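
A small sketch of why an owner name helps, assuming it ends up as a label on each series. Everything here is hypothetical (`seriesKey`, `depthGauges`, and the controller names). Two owners watching the same resource type produce two distinct series, so each one can be traced back to the controller responsible for it.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesKey is a hypothetical label set: the owner's name plus the resource type.
type seriesKey struct {
	Name         string
	ResourceType string
}

// depthGauges stores one value per label set, like a labeled gauge would.
type depthGauges struct {
	values map[seriesKey]float64
}

func newDepthGauges() *depthGauges {
	return &depthGauges{values: map[seriesKey]float64{}}
}

// Set records the depth for the series identified by (name, resourceType).
func (g *depthGauges) Set(name, resourceType string, depth float64) {
	g.values[seriesKey{Name: name, ResourceType: resourceType}] = depth
}

// Owners lists, in sorted order, every owner name that reported a series for resourceType.
func (g *depthGauges) Owners(resourceType string) []string {
	var owners []string
	for k := range g.values {
		if k.ResourceType == resourceType {
			owners = append(owners, k.Name)
		}
	}
	sort.Strings(owners)
	return owners
}

func main() {
	g := newDepthGauges()
	// Hypothetical controller names, for illustration only.
	g.Set("job-controller", "pods", 12)
	g.Set("deployment-controller", "pods", 3)
	g.Set("node-lifecycle", "nodes", 1)
	fmt.Println(g.Owners("pods"))
}
```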

Done.

@xigang
Member Author

xigang commented May 4, 2025

@dgrisonnet , I’ve addressed all the comments above. Could you please take another look when you have time? Thanks!

@xigang xigang force-pushed the informer_metrics branch from a90b2ab to c2b132f Compare May 6, 2025 06:10
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 6, 2025
@xigang xigang force-pushed the informer_metrics branch from c2b132f to 0840ea4 Compare May 6, 2025 06:20
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 6, 2025
@xigang xigang force-pushed the informer_metrics branch from 0840ea4 to 43cdcf6 Compare May 6, 2025 06:32
@k8s-ci-robot
Contributor

k8s-ci-robot commented May 6, 2025

@xigang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-apidiff-client-go | 43cdcf6 | link | false | /test pull-kubernetes-apidiff-client-go |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@xigang
Member Author

xigang commented May 6, 2025

/test pull-kubernetes-e2e-gce

@xigang
Member Author

xigang commented May 13, 2025

@dgrisonnet, just following up on this small fix PR that you’ve partially reviewed.

Also looping in @richabanker @sbueringer and @alvaroaleman — if you have time, a quick look would be much appreciated. Thanks! 🙇

@RainbowMango
Member

@xigang there is a failing check that needs to be resolved.

@xigang
Member Author

xigang commented May 17, 2025

@RainbowMango Once this PR is merged, the client-go staging code will be synced to the kubernetes/client-go repository’s main branch, and the next run of apidiff will no longer report any Incompatible changes errors. Some of the interface changes are necessary.

See the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4346-informer-metrics#reflector-metrics

@xigang
Member Author

xigang commented May 20, 2025

@richabanker This PR has been blocked for a while — could you take a look? @dgrisonnet hasn’t responded recently. Thanks!

@richabanker
Contributor

@richabanker This PR has been blocked for a while — could you take a look? @dgrisonnet hasn’t responded recently. Thanks!

Queuing up, will try my best to get to it this week

Labels
area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Needs Triage