
Conversation

@eeldaly (Contributor) commented Aug 28, 2025

What this PR does:
Adds a metric that counts how many series have had samples discarded. The metric includes a label with the reason the samples were discarded.

Which issue(s) this PR fixes:
Fixes #6995

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

"per_user_series_limit",
"per_labelset_series_limit",
"per_metric_series_limit",
}
Contributor:

Can we avoid hardcoding the discarded reasons? More reasons can be added in the future, and it would be easy to miss updating this list.

@@ -87,6 +88,8 @@ type ValidateMetrics struct {

DiscardedSamplesPerLabelSet *prometheus.CounterVec
LabelSetTracker *labelset.LabelSetTracker
DiscardedSeriesGauge *prometheus.GaugeVec
Contributor:

Let's avoid having the metric type (Gauge) in the name.

@@ -145,6 +148,14 @@ func NewValidateMetrics(r prometheus.Registerer) *ValidateMetrics {
NativeHistogramMinResetDuration: 1 * time.Hour,
}, []string{"user"})
registerCollector(r, labelSizeBytes)
discardedSeriesGauge := prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "cortex_discarded_series_total",
Contributor:

The _total suffix is mainly used for counters. We can remove it here.

@SungJin1212 (Member) commented Aug 29, 2025:

Isn't the intended metric a counter? It seems to me that it behaves like one.

@@ -434,4 +447,7 @@ func DeletePerUserValidationMetrics(validateMetrics *ValidateMetrics, userID str
if err := util.DeleteMatchingLabels(validateMetrics.LabelSizeBytes, filter); err != nil {
level.Warn(log).Log("msg", "failed to remove cortex_label_size_bytes metric for user", "user", userID, "err", err)
}
if err := util.DeleteMatchingLabels(validateMetrics.DiscardedSeriesGauge, filter); err != nil {
Contributor:

Can we add this to the existing test case for cleaning up user metrics?

return &DiscardedSeriesTracker{labelUserMap: labelUserMap, discardedSeriesGauge: discardedSeriesGauge}
}

func (t *DiscardedSeriesTracker) Track(label string, user string, series uint64) {
Contributor:

label is a bit confusing. Can we call it reason?

go func() {
for {
time.Sleep(vendMetricsInterval)
t.UpdateMetrics()
Contributor:

Can we use time.Ticker?

Also, let's rename StartDiscardedSeriesGoroutine; this is not a good method name. I also don't see where it is being called.

// Check if the error is a soft error we can proceed on. If so, we keep track
// of it, so that we can return it back to the distributor, which will return a
// 400 error to the client. The client (Prometheus) will not retry on 400, and
// we actually ingested all samples which haven't failed.
switch cause := errors.Cause(err); {
case errors.Is(cause, storage.ErrOutOfBounds):
sampleOutOfBoundsCount++
i.validateMetrics.DiscardedSeriesTracker.Track("sample_out_of_bounds", userID, seriesHash)
Contributor:

We can probably reuse the same consts we use for the DiscardedSamples metric.

const (
sampleOutOfOrder = "sample-out-of-order"
newValueForTimestamp = "new-value-for-timestamp"
sampleOutOfBounds = "sample-out-of-bounds"
sampleTooOld = "sample-too-old"
nativeHistogramSample = "native-histogram-sample"
)

and
const (
perUserSeriesLimit = "per_user_series_limit"
perUserNativeHistogramSeriesLimit = "per_user_native_histogram_series_limit"
perMetricSeriesLimit = "per_metric_series_limit"
perLabelsetSeriesLimit = "per_labelset_series_limit"
)

func NewDiscardedSeriesTracker(discardedSeriesGauge *prometheus.GaugeVec) *DiscardedSeriesTracker {
labelUserMap := make(map[string]*UserCounter)
for _, label := range trackedLabels {
labelUserMap[label] = &UserCounter{
Contributor:

Should we create the entry when receiving a new reason? In Track, it seems we ignore reasons that were not initially created.

RWMutex: &sync.RWMutex{},
seriesCountMap: make(map[uint64]struct{}),
}
userCounter.Lock()
Contributor:

Do we need to take the lock before the check? Otherwise two goroutines could pass the check concurrently, and one would overwrite the other's content, no?


func (t *DiscardedSeriesTracker) UpdateMetrics() {
for label, userCounter := range t.labelUserMap {
userCounter.Lock()
Contributor:

We may want to consider making the map UserLabel instead of LabelUser. When emitting this metric for a label, we currently block requests from every user; keying by user first would give finer-grained locking.

}

func (t *DiscardedSeriesTracker) Track(reason string, user string, series uint64) {
userCounter, ok := t.labelUserMap[reason]
Contributor:

Do we need a lock here?

t.Unlock()
}

seriesCounter, ok := userCounter.userSeriesMap[user]
Contributor:

We need a lock for the user series map.

userCounter.Unlock()
}

if _, ok := seriesCounter.seriesCountMap[series]; !ok {
Contributor:

Same here


func (t *DiscardedSeriesTracker) UpdateMetrics() {
usersToDelete := make([]string, 0)
for label, userCounter := range t.labelUserMap {
Contributor:

We need a read lock here.

type UserCounter struct {
*sync.RWMutex
userSeriesMap map[string]*SeriesCounter
}
Contributor:

Let's not export these structs unless we need access from other packages or have a good reason to make them public.

@eeldaly (Contributor, Author) commented Sep 11, 2025

I will recommit my changes, since the PR now includes files I did not change after I went back to sign old commits.

@eeldaly eeldaly reopened this Sep 11, 2025
Signed-off-by: Essam Eldaly <[email protected]>
Signed-off-by: Essam Eldaly <[email protected]>
Successfully merging this pull request may close these issues.

Metric for number of series with discarded samples
4 participants