Samples can get out of order between distributor and ingester #670
This can indeed happen; is it a problem? This error shouldn't get through to the user.
Hrm, I'd leave this open, cuz it just hit me that the scrape interval is 15s, so unless the goroutine is hanging around for 15s, this should never happen, hrm.
@gouthamve the context is "suppose some client Prometheus has hundreds of samples queued up for remote write", e.g. the sender is catching up after a network outage, so the writes don't need to be 15s apart. We don't see this so much now, maybe because I put a cap on the number of shards in the sender.
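For context, a minimal sketch of the sharding idea, under the assumption that the sender pins each series to a shard by hashing its labels (toy code, not the actual Prometheus queue manager): per-series order is then preserved inside the sender, and the cap on shards only bounds how many requests are in flight at once, so a reorder for a single series would have to happen further downstream.

```go
// A hedged sketch (toy code, not Prometheus' actual queue manager): each series
// is assumed to be pinned to one remote-write shard by a hash of its labels, so
// the sender keeps a single series' samples in order, and the shard cap only
// bounds how many shards (and concurrent requests) exist at once.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardFor hashes a series' sorted label pairs and maps the series to a shard.
func shardFor(lbls map[string]string, numShards int) int {
	keys := make([]string, 0, len(lbls))
	for k := range lbls {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New32a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator so ("ab","c") and ("a","bc") differ
		h.Write([]byte(lbls[k]))
		h.Write([]byte{0xff})
	}
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	up := map[string]string{"__name__": "up", "job": "node", "instance": "a:9100"}
	// For a fixed shard count, this series always maps to the same shard, so its
	// samples leave the sender through one queue, in timestamp order.
	fmt.Println("shard out of 10:", shardFor(up, 10))
	fmt.Println("shard out of 4: ", shardFor(up, 4))
}
```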
@weeco do you have the other conditions described?
I have what looks to be a similar situation. I have three Prometheus clusters (each with just one replica) sending data to the same Cortex (all sharing the same tenant/org ID), but just one of them is seeing "sample out of order" errors. This Prometheus has 133157 series in total, but it's seeing errors from just a handful of metrics: scrape_duration_seconds, scrape_samples_post_metric_relabeling, scrape_samples_scraped, scrape_series_added and "up". Here's an example error message: If I compare the timestamps in the error messages, the difference is either 0.036 or 6.25 seconds. The sending Prometheus version is 2.17.1.
I am actually also seeing
@garo This should be unrelated. The most common cause is a tenant remote-writing clashing series. This can happen when you have relabelling rules in Prometheus that remove some labels from series, which leads to clashing series (i.e.
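As a hedged illustration of that kind of clash (toy code, not Cortex's ingester): two series that differ only in a label removed by relabelling collapse into the same series key, and whichever sample arrives later with an older timestamp is then rejected.

```go
// A hedged sketch: assume a relabel rule has dropped the "instance" label, so
// samples from two different targets share the key up{job="node"} and their
// interleaved timestamps are no longer monotonic.
package main

import "fmt"

// appender keeps the newest timestamp seen per series key and rejects anything
// older or equal, mimicking the check that produces the errors quoted above.
type appender struct{ lastTS map[string]int64 }

func (a *appender) append(key string, ts int64) error {
	if last, ok := a.lastTS[key]; ok && ts <= last {
		return fmt.Errorf("sample timestamp out of order for series %s: last=%d new=%d", key, last, ts)
	}
	a.lastTS[key] = ts
	return nil
}

func main() {
	app := &appender{lastTS: map[string]int64{}}

	type sample struct {
		key string
		ts  int64 // milliseconds
	}
	// Arrival order at the ingester; the comments show the label that was dropped.
	arrivals := []sample{
		{`up{job="node"}`, 15000}, // was instance="a"
		{`up{job="node"}`, 14000}, // was instance="b", scraped slightly earlier
		{`up{job="node"}`, 30000}, // instance="a" again
		{`up{job="node"}`, 29000}, // instance="b" again
	}
	for _, s := range arrivals {
		if err := app.append(s.key, s.ts); err != nil {
			fmt.Println("rejected:", err)
		}
	}
}
```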
For some instances we see "sample timestamp out of order for series" in our logs, with a gap between the previous and new timestamps of 15 or 30 seconds.
If this were going wrong in the sending Prometheus we would see the same error on all ingester replicas. We do not: the errors are reported sporadically on one ingester at a time. From this I deduce the out-of-ordering is happening inside Cortex.
Here is my best theory: suppose some client Prometheus has hundreds of samples queued up for remote write; then the following can happen:
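One possible shape of that race, sketched below as toy Go code under stated assumptions rather than taken from the actual distributor or ingester: because the backlog means consecutive requests for the same series leave the sender almost back-to-back, a delay on the path carrying the older sample lets the newer one be appended first, and the older one is then rejected as out of order.

```go
// A hedged sketch of the race, assuming only that two requests carrying samples
// for the same series can be in flight to an ingester at roughly the same time.
package main

import (
	"fmt"
	"sync"
	"time"
)

// ingester accepts a sample only if its timestamp is newer than the last one
// appended for the series, as in the errors above.
type ingester struct {
	mu     sync.Mutex
	lastTS int64
}

func (i *ingester) push(ts int64) error {
	i.mu.Lock()
	defer i.mu.Unlock()
	if ts <= i.lastTS {
		return fmt.Errorf("sample timestamp out of order: last=%d new=%d", i.lastTS, ts)
	}
	i.lastTS = ts
	return nil
}

func main() {
	ing := &ingester{}
	var wg sync.WaitGroup
	wg.Add(2)

	// Request A carries the older sample (t=15000) but is delayed in flight,
	// e.g. by a slow hop or a retry somewhere between distributor and ingester.
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond)
		if err := ing.push(15000); err != nil {
			fmt.Println("request A rejected:", err)
		}
	}()

	// Request B carries the newer sample (t=30000) and gets through promptly,
	// because the catching-up sender issued it almost immediately after A.
	go func() {
		defer wg.Done()
		if err := ing.push(30000); err != nil {
			fmt.Println("request B rejected:", err)
		}
	}()

	wg.Wait()
}
```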