Samples can get out of order between distributor and ingester #670


Open
bboreham opened this issue Jan 25, 2018 · 8 comments

Comments

@bboreham
Contributor

For some instances we see "sample timestamp out of order for series" in our logs, with a gap between the previous and new timestamps of 15 or 30 seconds.
If this were going wrong in the sending Prometheus we would see the same error on all ingester replicas. We do not: the errors are reported sporadically, on one ingester at a time. From this I deduce the reordering is happening inside Cortex.

Here is my best theory: suppose some client Prometheus has hundreds of samples queued up for remote write. Then the following can happen (sketched in code after the list):

  • Prometheus sends 100 samples to the distributor.
  • The distributor replicates the data three times and fires up three goroutines to deliver it.
  • Once two of the calls have returned from ingesters, the distributor returns success to Prometheus.
  • The third call continues, on its goroutine.
  • Prometheus sends the next 100 samples; the distributor (likely on another node) fires up another three goroutines.
  • One of those goroutines can overtake the third one from the previous call.
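A minimal Go sketch of that race, assuming a quorum-of-two write. The names and signatures here are illustrative, not the actual Cortex code:

```go
package main

import "fmt"

// pushToIngester stands in for the real gRPC Push call to one ingester
// (illustrative name and signature, not the Cortex API).
func pushToIngester(ingester int, batch string) error {
	fmt.Printf("ingester %d got %s\n", ingester, batch)
	return nil
}

// replicatePush fires one goroutine per replica and returns to the caller
// (i.e. reports success to Prometheus) as soon as two of the three replicas
// acknowledge. The slowest goroutine keeps running after the function
// returns, so the next batch, possibly handled by a different distributor,
// can overtake it on the shared ingester and trigger
// "sample timestamp out of order" there.
func replicatePush(batch string) {
	done := make(chan error, 3)
	for i := 0; i < 3; i++ {
		go func(ingester int) {
			done <- pushToIngester(ingester, batch)
		}(i)
	}
	// Wait for a quorum of two acknowledgements, then return; the third
	// call is still in flight in the background.
	for acks := 0; acks < 2; acks++ {
		<-done
	}
}

func main() {
	replicatePush("batch-1")
	replicatePush("batch-2") // may reach the straggler's ingester first
}
```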
@tomwilkie
Contributor

This can indeed happen; is it a problem? This error shouldn't get through to the user.

@gouthamve
Contributor

Hrm, I'd leave this open: it just hit me that the scrape interval is 15s, so unless the goroutine is hanging around for 15s, this should never happen.

@bboreham
Contributor Author

@gouthamve the context is "suppose some client Prometheus has hundreds of samples queued up for remote write", e.g. the sender is catching up after a network outage. Consecutive batches then go out back-to-back, so the straggler goroutine doesn't need to hang around for 15s for the overtake to happen.

We don't see this so much now, maybe because I put a cap on the number of shards in the sender.
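For reference, the shard cap mentioned above lives in the sending Prometheus's remote_write queue configuration. A minimal sketch, with an illustrative endpoint and value:

```yaml
remote_write:
  - url: http://cortex.example/api/prom/push   # illustrative endpoint
    queue_config:
      max_shards: 10   # cap the number of concurrent remote-write shards
```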

@weeco
Contributor

weeco commented Mar 23, 2020

I do see this as well. I just noticed that distributors reported hundreds of "sample out of order" errors after one or two ingesters were consuming significantly more CPU than usual. Sometimes I also observe this behaviour during rolling updates of ingesters.

[screenshot attached: 2020-03-23 at 10:45:52]

@bboreham
Contributor Author

@weeco do you have the other conditions described?
Does the error message give two timestamps close to one scrape interval apart?
Does your sending Prometheus have a lot of data queued up?

@garo
Contributor

garo commented Apr 17, 2020

I have what looks to be a similar situation. I have three Prometheus clusters (all with just one replica) sending data to the same Cortex (all sharing the same tenant/orgid), but just one of them is seeing "sample out of order" errors. This Prometheus has 133157 series in total, yet it's seeing errors from just a handful of metrics: scrape_duration_seconds, scrape_samples_post_metric_relabeling, scrape_samples_scraped, scrape_series_added and "up".

Here's an example error message:
ts=2020-04-17T09:19:10.407113092Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=399.64µs err="rpc error: code = Code(400) desc = user=services: sample timestamp out of order; last timestamp: 1587115147.329, incoming timestamp: 1587115141.08 for series {__name__=\"scrape_duration_seconds\", __prom=\"eks-services-prod\", endpoint=\"https-metrics\", instance=\"10.107.166.33:10250\", job=\"kubelet\", namespace=\"kube-system\", node=\"ip-10-107-166-33.ec2.internal\", service=\"mon-prometheus-operator-kubelet\"}" msg="gRPC\n"

If I compare the timestamps in the error messages, the difference is either 0.036 or 6.25 seconds. The sending Prometheus version is 2.17.1.

@garo
Contributor

garo commented Apr 17, 2020

I am actually also seeing "sample with repeated timestamp but different value" from another tenant. This error comes from the same file as the "sample timestamp out of order" error: https://github.com/cortexproject/cortex/blob/master/pkg/ingester/series.go#L71

@pracucci
Contributor

I am actually also seeing sample with repeated timestamp but different value from another tenant.

@garo This should be unrelated. The most common cause is a tenant remote-writing clashing series. This can happen when you have relabelling rules in Prometheus which remove some labels from series, leading to clashes: e.g. if you have series_1{a="1",b="2"} and series_1{a="1",b="3"} and a relabelling rule that removes the label b, you end up with two clashing copies of series_1{a="1"}. I'm mentioning it because this is an issue I've already seen a few times with our customers.
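For illustration, a write_relabel_configs rule of the kind that can produce this, matching the hypothetical example above (endpoint and label names are illustrative):

```yaml
remote_write:
  - url: http://cortex.example/api/prom/push   # illustrative endpoint
    write_relabel_configs:
      # Dropping label "b" collapses series_1{a="1",b="2"} and
      # series_1{a="1",b="3"} into the same series series_1{a="1"},
      # so the ingester sees repeated timestamps with different values.
      - action: labeldrop
        regex: b
```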
