-
Notifications
You must be signed in to change notification settings - Fork 831
Add metric for number of series with discarded samples #6996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
"per_user_series_limit", | ||
"per_labelset_series_limit", | ||
"per_metric_series_limit", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid hardcoding the discarded reason. There can be more reasons in the future and it is very easy to miss it
pkg/util/validation/validate.go
Outdated
@@ -87,6 +88,8 @@ type ValidateMetrics struct { | |||
|
|||
DiscardedSamplesPerLabelSet *prometheus.CounterVec | |||
LabelSetTracker *labelset.LabelSetTracker | |||
DiscardedSeriesGauge *prometheus.GaugeVec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid having metric type Gauge
in the name
pkg/util/validation/validate.go
Outdated
@@ -145,6 +148,14 @@ func NewValidateMetrics(r prometheus.Registerer) *ValidateMetrics { | |||
NativeHistogramMinResetDuration: 1 * time.Hour, | |||
}, []string{"user"}) | |||
registerCollector(r, labelSizeBytes) | |||
discardedSeriesGauge := prometheus.NewGaugeVec( | |||
prometheus.GaugeOpts{ | |||
Name: "cortex_discarded_series_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_total
suffix is used for counter mainly. We can remove that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the intended metric a counter? It seems to me that this metric is a counter.
pkg/util/validation/validate.go
Outdated
@@ -434,4 +447,7 @@ func DeletePerUserValidationMetrics(validateMetrics *ValidateMetrics, userID str | |||
if err := util.DeleteMatchingLabels(validateMetrics.LabelSizeBytes, filter); err != nil { | |||
level.Warn(log).Log("msg", "failed to remove cortex_label_size_bytes metric for user", "user", userID, "err", err) | |||
} | |||
if err := util.DeleteMatchingLabels(validateMetrics.DiscardedSeriesGauge, filter); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this to the existing test case of cleaning up user metrics?
pkg/util/discardedseries/tracker.go
Outdated
return &DiscardedSeriesTracker{labelUserMap: labelUserMap, discardedSeriesGauge: discardedSeriesGauge} | ||
} | ||
|
||
func (t *DiscardedSeriesTracker) Track(label string, user string, series uint64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label is a bit confusing. Can we call it reason
?
go func() { | ||
for { | ||
time.Sleep(vendMetricsInterval) | ||
t.UpdateMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use time.Ticker?
Also let's rename StartDiscardedSeriesGoroutine
. This is not a good method name. I don't see where it is being called as well
pkg/ingester/ingester.go
Outdated
// Check if the error is a soft error we can proceed on. If so, we keep track | ||
// of it, so that we can return it back to the distributor, which will return a | ||
// 400 error to the client. The client (Prometheus) will not retry on 400, and | ||
// we actually ingested all samples which haven't failed. | ||
switch cause := errors.Cause(err); { | ||
case errors.Is(cause, storage.ErrOutOfBounds): | ||
sampleOutOfBoundsCount++ | ||
i.validateMetrics.DiscardedSeriesTracker.Track("sample_out_of_bounds", userID, seriesHash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably can use the same const we use for the DiscardedSamples metric.
Lines 3 to 9 in 746b40e
const ( | |
sampleOutOfOrder = "sample-out-of-order" | |
newValueForTimestamp = "new-value-for-timestamp" | |
sampleOutOfBounds = "sample-out-of-bounds" | |
sampleTooOld = "sample-too-old" | |
nativeHistogramSample = "native-histogram-sample" | |
) |
and
cortex/pkg/ingester/user_state.go
Lines 18 to 23 in 746b40e
const ( | |
perUserSeriesLimit = "per_user_series_limit" | |
perUserNativeHistogramSeriesLimit = "per_user_native_histogram_series_limit" | |
perMetricSeriesLimit = "per_metric_series_limit" | |
perLabelsetSeriesLimit = "per_labelset_series_limit" | |
) |
pkg/util/discardedseries/tracker.go
Outdated
func NewDiscardedSeriesTracker(discardedSeriesGauge *prometheus.GaugeVec) *DiscardedSeriesTracker { | ||
labelUserMap := make(map[string]*UserCounter) | ||
for _, label := range trackedLabels { | ||
labelUserMap[label] = &UserCounter{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create when receiving a new reason? On track it seems we ignore reason not initially created
pkg/util/discardedseries/tracker.go
Outdated
RWMutex: &sync.RWMutex{}, | ||
seriesCountMap: make(map[uint64]struct{}), | ||
} | ||
userCounter.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the lock to be before the check? In this case we would still override the content, no?
pkg/util/discardedseries/tracker.go
Outdated
|
||
func (t *DiscardedSeriesTracker) UpdateMetrics() { | ||
for label, userCounter := range t.labelUserMap { | ||
userCounter.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We maybe want to consider changing it to be UserLabel instead of LabelUser.
When emitting this metric for a Label, we would block global requests from every user.
It seems this would have faster lock if we have UserLabel.
pkg/util/discardedseries/tracker.go
Outdated
} | ||
|
||
func (t *DiscardedSeriesTracker) Track(reason string, user string, series uint64) { | ||
userCounter, ok := t.labelUserMap[reason] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a lock here?
t.Unlock() | ||
} | ||
|
||
seriesCounter, ok := userCounter.userSeriesMap[user] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a lock for the user series map
pkg/util/discardedseries/tracker.go
Outdated
userCounter.Unlock() | ||
} | ||
|
||
if _, ok := seriesCounter.seriesCountMap[series]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
pkg/util/discardedseries/tracker.go
Outdated
|
||
func (t *DiscardedSeriesTracker) UpdateMetrics() { | ||
usersToDelete := make([]string, 0) | ||
for label, userCounter := range t.labelUserMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need read lock here
type UserCounter struct { | ||
*sync.RWMutex | ||
userSeriesMap map[string]*SeriesCounter | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to not expose those structs if we don't need access from other packages. Unless we have a good reason to mark them as public
Will recommit my changes since the pr now includes files I did not change when I went back to sign old commits |
Signed-off-by: Essam Eldaly <[email protected]>
Signed-off-by: Essam Eldaly <[email protected]>
Signed-off-by: Essam Eldaly <[email protected]>
What this PR does:
Adds a metric that counts how many series have discarded samples in them. This includes a label with the reason the sample was discarded.
Which issue(s) this PR fixes:
Fixes #6995
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]