Fix: Reduce histogram bucket cardinality from 500 to 12 by jtschelling · Pull Request #799 · NVIDIA/NVSentinel

jtschelling · 2026-02-04T18:10:46Z

Summary

Reduces platform-connector histogram bucket count from 500 to 12, eliminating ~500k metric series cluster-wide and resolving Prometheus remote write bottlenecks.

Problem

Platform-connector histograms were using LinearBuckets(0, 10, 500), creating 500 buckets per histogram. With 6 histogram metrics running on 174 nodes (DaemonSet), this generated:

523,896 series (6 metrics × 174 nodes × 502 series each)
10-15% of total cluster cardinality
Contributed to PrometheusRemoteWriteDesiredShards alerts in production

Solution

Changed to ExponentialBuckets(10, 2, 12):

12 buckets: [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480] milliseconds
Range: 10ms to 20.48 seconds
Better precision at low latencies where it matters most
96% reduction in bucket count (500 → 12)

Impact

Series per histogram: 502 → 14 (96% reduction)
Series per pod: 3,012 → 132 (96% reduction)
Total cluster savings: ~500,928 series eliminated
Faster Prometheus queries (fewer series to scan)
Reduced remote write bandwidth
Lower memory usage in Prometheus

Files Changed

platform-connectors/pkg/ringbuffer/metrics.go: Updated NewLatencyMetric and NewWorkDurationMetric
platform-connectors/pkg/connectors/kubernetes/metrics.go: Updated nodeConditionUpdateDuration and nodeEventUpdateCreateDuration

Testing

✅ All linting checks pass (make lint)
✅ All unit tests pass (176 tests)
✅ Verified exponential buckets provide appropriate coverage for latency measurements

References

Closes Excessive histogram bucket configuration in platform-connectors causing high cardinality #793
Prometheus Histogram Best Practices
Internal investigation: prometheus-remote-write-investigation.md

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Updated histogram bucket configurations for platform connector and workqueue metrics to improve monitoring granularity and better represent latency distributions.
Style
- Minor formatting adjustments with no runtime behavior changes.

Changed platform-connector histogram metrics from LinearBuckets(0, 10, 500) to ExponentialBuckets(10, 2, 12) to dramatically reduce metric cardinality while maintaining useful precision for latency measurements. Impact: - Reduces from 500 buckets to 12 buckets per histogram (96% reduction) - Bucket range: 10ms to 20.48s with exponential spacing - Reduces series per pod from 3,012 to ~132 (96% reduction) - Total cluster reduction: ~500k series eliminated - Better precision at low latencies where it matters most This fixes excessive cardinality that was causing Prometheus remote write issues (PrometheusRemoteWriteDesiredShards alerts) in production clusters. Refs: NVIDIA#793 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai · 2026-02-04T18:11:05Z

📝 Walkthrough

Walkthrough

Replaced high-cardinality linear histogram buckets with exponential buckets in two metrics files and added a single blank line for formatting; no exported APIs or signatures were changed. (48 words)

Changes

Cohort / File(s)	Summary
Histogram Bucket Optimization `platform-connectors/pkg/connectors/kubernetes/metrics.go`, `platform-connectors/pkg/ringbuffer/metrics.go`	Replaced `prometheus.LinearBuckets(0, 10, 500)` with `prometheus.ExponentialBuckets(10, 2, 12)` for relevant histograms, reducing bucket count and expected Prometheus cardinality.
Formatting `platform-connectors/pkg/ringbuffer/ring_buffer.go`	Inserted a blank line after `workqueue.SetProvider(prometheusMetricsProvider{})` (formatting only).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 I hopped through code with gentle thump,

Swapped many buckets for a nimble jump,
Fewer bars but wisdom stays—
Prometheus breathes easier these days,
A rabbit's nod for tidy ways 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: reducing histogram bucket cardinality from 500 to 12, which is the primary objective of the PR.
Linked Issues check	✅ Passed	The PR fully implements all coding requirements from issue `#793`: replaced LinearBuckets(0, 10, 500) with ExponentialBuckets(10, 2, 12) in both required files, achieving the 96% cardinality reduction target.
Out of Scope Changes check	✅ Passed	All changes are directly within scope. Metric configuration updates target the two specified files. The blank line insertion in ring_buffer.go is a minor formatting change related to the metric provider setup, not extraneous.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

platform-connectors/pkg/ringbuffer/metrics.go (1)

41-47: ⚠️ Potential issue | 🟡 Minor

Fix bucket values to match the intended millisecond-to-second range.

The bucket configuration has a unit mismatch. The commit message states "Bucket range: 10ms to 20.48s with exponential spacing," but ExponentialBuckets(10, 2, 12) produces [10, 20, 40, ..., 20480] in seconds (based on the .Seconds() conversion before Observe()). This creates buckets from 10 to 20,480 seconds—unrealistic for queue latency.

To achieve the intended 10ms–20.48s coverage, use ExponentialBuckets(0.01, 2, 12) instead.

coderabbitai bot reviewed Feb 4, 2026

View reviewed changes

jtschelling added 2 commits February 4, 2026 10:23

Merge branch 'main' into fix/histogram-bucket-cardinality

1603f42

fix: workqueue uses seconds, need to account for float values

3434d63

lalitadithya approved these changes Feb 4, 2026

View reviewed changes

lalitadithya merged commit c7121b5 into NVIDIA:main Feb 4, 2026
62 of 64 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Reduce histogram bucket cardinality from 500 to 12#799

Fix: Reduce histogram bucket cardinality from 500 to 12#799
lalitadithya merged 3 commits intoNVIDIA:mainfrom
jtschelling:fix/histogram-bucket-cardinality

jtschelling commented Feb 4, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jtschelling commented Feb 4, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Impact

Files Changed

Testing

References

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jtschelling commented Feb 4, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 4, 2026 •

edited

Loading