Fix: Reduce histogram bucket cardinality from 500 to 12 (#799)
lalitadithya merged 3 commits into NVIDIA:main from
Conversation
Changed platform-connector histogram metrics from LinearBuckets(0, 10, 500) to ExponentialBuckets(10, 2, 12) to dramatically reduce metric cardinality while maintaining useful precision for latency measurements.

Impact:
- Reduces from 500 buckets to 12 buckets per histogram (96% reduction)
- Bucket range: 10ms to 20.48s with exponential spacing
- Reduces series per pod from 3,012 to ~132 (96% reduction)
- Total cluster reduction: ~500k series eliminated
- Better precision at low latencies, where it matters most

This fixes the excessive cardinality that was causing Prometheus remote write issues (PrometheusRemoteWriteDesiredShards alerts) in production clusters.

Refs: NVIDIA#793

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
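For reference, the two bucket layouts can be compared with a small standalone sketch. The helpers below mirror what prometheus.LinearBuckets and prometheus.ExponentialBuckets compute (the real code calls the client_golang functions directly; these reimplementations just avoid the dependency):

```go
package main

import "fmt"

// linearBuckets mirrors prometheus.LinearBuckets(start, width, count):
// count buckets, each width apart, starting at start.
func linearBuckets(start, width float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start += width
	}
	return buckets
}

// exponentialBuckets mirrors prometheus.ExponentialBuckets(start, factor, count):
// count buckets, each factor times the previous, starting at start.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	old := linearBuckets(0, 10, 500)     // 500 bounds: 0, 10, 20, ..., 4990
	cur := exponentialBuckets(10, 2, 12) // 12 bounds: 10, 20, 40, ..., 20480
	fmt.Println(len(old), len(cur))      // → 500 12
	fmt.Println(cur)
}
```

The exponential layout packs its resolution into the low end of the range (half the buckets sit below 320), which is why the commit message can claim better precision at low latencies despite using far fewer buckets.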
📝 Walkthrough

Replaced high-cardinality linear histogram buckets with exponential buckets in two metrics files and added a single blank line for formatting; no exported APIs or signatures were changed.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
platform-connectors/pkg/ringbuffer/metrics.go (1)
41-47: ⚠️ Potential issue | 🟡 Minor

Fix bucket values to match the intended millisecond-to-second range.

The bucket configuration has a unit mismatch. The commit message states "Bucket range: 10ms to 20.48s with exponential spacing," but ExponentialBuckets(10, 2, 12) produces [10, 20, 40, ..., 20480] in seconds (based on the .Seconds() conversion before Observe()). This creates buckets from 10 to 20,480 seconds, which is unrealistic for queue latency.

To achieve the intended 10ms–20.48s coverage, use ExponentialBuckets(0.01, 2, 12) instead.
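The reviewer's suggested fix can be sanity-checked in isolation. This sketch reimplements the exponential-bucket formula rather than importing client_golang, and assumes observations are recorded in seconds via time.Since(start).Seconds(), as the review describes:

```go
package main

import "fmt"

// secondsBuckets reproduces the reviewer's suggestion,
// prometheus.ExponentialBuckets(0.01, 2, 12): 12 upper bounds starting at
// 0.01 (10ms, in seconds) and doubling each step.
func secondsBuckets() []float64 {
	start, factor := 0.01, 2.0
	buckets := make([]float64, 12)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Spans roughly 10ms to 20.48s, matching the range the
	// commit message intended.
	fmt.Println(secondsBuckets())
}
```

Because Observe() receives seconds, the bucket bounds must also be expressed in seconds; the original ExponentialBuckets(10, 2, 12) only makes sense if observations were in milliseconds.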
Summary
Reduces platform-connector histogram bucket count from 500 to 12, eliminating ~500k metric series cluster-wide and resolving Prometheus remote write bottlenecks.
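The headline numbers can be reproduced with a back-of-the-envelope calculation. The inputs (500 buckets, 6 histograms per pod, 174 nodes) come from this PR's description; the implicit +Inf bucket is ignored so the result matches the quoted 3,012 series per pod:

```go
package main

import "fmt"

// seriesCount estimates exported time series: one series per bucket bound
// plus the _sum and _count series, per histogram, per pod. The implicit
// +Inf bucket is deliberately not counted here.
func seriesCount(buckets, histogramsPerPod, pods int) (perPod, total int) {
	perPod = (buckets + 2) * histogramsPerPod
	return perPod, perPod * pods
}

func main() {
	// LinearBuckets(0, 10, 500), 6 histogram metrics, 174-node DaemonSet.
	perPod, total := seriesCount(500, 6, 174)
	fmt.Println(perPod, total) // → 3012 524088, i.e. the ~500k series cited
}
```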
Problem
Platform-connector histograms were using LinearBuckets(0, 10, 500), creating 500 buckets per histogram. With 6 histogram metrics running on 174 nodes (DaemonSet), this generated:

- PrometheusRemoteWriteDesiredShards alerts in production

Solution
Changed to ExponentialBuckets(10, 2, 12):

Impact
Files Changed
- platform-connectors/pkg/ringbuffer/metrics.go: Updated NewLatencyMetric and NewWorkDurationMetric
- platform-connectors/pkg/connectors/kubernetes/metrics.go: Updated nodeConditionUpdateDuration and nodeEventUpdateCreateDuration
- make lint

References
🤖 Generated with Claude Code
Summary by CodeRabbit
Refactor
Style