
Ruler performance frequently degrades #702


Closed
csmarchbanks opened this issue Feb 13, 2018 · 25 comments · Fixed by #741


@csmarchbanks
Contributor

The ruler service in our cluster frequently (every day) runs into issues that result in no rules being processed. The main symptom is that upper-percentile (90th percentile and above) ruler query durations increase to 10-20 seconds, which causes the ruler to hit the group timeout (left at the default 10s in our cluster). Since we evaluate ~100 rules per tenant, these high-percentile latencies cause every evaluation to fail.

[screenshot: ruler query duration graph]
Queries for this graph look like:

histogram_quantile(0.99, sum(rate(cortex_distributor_query_duration_seconds_bucket{name="ruler"}[1m])) by (le))

Lots of log messages like:

ts=2018-02-13T09:38:55.273063356Z caller=log.go:108 level=error org_id=0 msg="error in mergeQuerier.selectSamples" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
ts=2018-02-13T09:38:55.274565552Z caller=log.go:108 level=warn msg="context error" error="context deadline exceeded"
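
For context on where these errors come from, here is a minimal Go sketch (not Cortex's actual code; queryIngesters below is a hypothetical stand-in for the mergeQuerier -> ingester gRPC path) of how a single per-group deadline propagates to every query made for that group and surfaces as DeadlineExceeded once the 10s budget is spent:

    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // queryIngesters stands in for a slow ingester query: it blocks until the
    // shared context's deadline fires, then reports the gRPC timeout error.
    func queryIngesters(ctx context.Context) error {
        <-ctx.Done()
        return status.Error(codes.DeadlineExceeded, "context deadline exceeded")
    }

    func main() {
        // One deadline covers the whole rule-group evaluation, like the
        // ruler's 10s group timeout.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        if err := queryIngesters(ctx); status.Code(err) == codes.DeadlineExceeded {
            // Same shape as the log lines above: once the budget is used up,
            // every remaining query for the group fails.
            log.Println("error in mergeQuerier.selectSamples:", err)
        }
    }
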
@jml
Contributor

jml commented Feb 14, 2018

Yup. We've seen this too.

I think the answer is:

  • split scheduler & worker to separate processes
    • worker API is a single gRPC / HTTP endpoint that takes rules and evaluates them before returning a response
  • have the worker persist firing state to an external (consistent?) store
  • switch to multiple workers

#310 has some notes.

I've been intending to do this for a while, but management responsibilities preclude any serious coding. I want to do #619 first anyway, but it's not a strict prerequisite.
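
As a rough illustration of what that worker endpoint could look like (all names and types here are hypothetical, not an existing Cortex API), a minimal Go sketch of an HTTP handler that accepts a batch of rules, evaluates them, and returns the firing state so it can be persisted externally:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // EvalRequest and EvalResponse are hypothetical wire types for the worker API.
    type EvalRequest struct {
        UserID string   `json:"user_id"`
        Rules  []string `json:"rules"` // rule expressions to evaluate
    }

    type EvalResponse struct {
        Firing []string `json:"firing"` // rules currently firing, to persist externally
    }

    // evaluate is a stand-in for running the rules through the query path.
    func evaluate(req EvalRequest) EvalResponse {
        return EvalResponse{}
    }

    func main() {
        // The scheduler posts a group here; the worker evaluates it and replies,
        // so several workers can sit behind a load balancer.
        http.HandleFunc("/evaluate", func(w http.ResponseWriter, r *http.Request) {
            var req EvalRequest
            if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            json.NewEncoder(w).Encode(evaluate(req))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

A gRPC service with the same request/response shape would work just as well; HTTP is used here only to keep the sketch short.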

@bboreham
Contributor

Given that a group is typically evaluated every 15s, and you are hitting a 10s timeout on individual rule groups, I'd say the solution here is not to split up the processes but either to make the queries run faster or to split up the groups.

In the short term you could just raise the timeout. Even raise it above 15s and live with not meeting the expectation of running every 15s.

@bboreham
Contributor

@csmarchbanks how many workers do you have?

(-ruler.num-workers, defaults to 1)

More workers will mean a slow query does not hold up everything behind it.
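
For reference, a sketch of the flags involved in both suggestions (only -ruler.num-workers is confirmed in this thread; the group-timeout flag name below is an assumption and may differ in your version):

    # -ruler.num-workers is confirmed above (defaults to 1).
    # The group-timeout flag name is an assumption; check your version's -help output.
    -ruler.num-workers=20
    -ruler.group-timeout=20s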

@csmarchbanks
Contributor Author

We have 10 workers right now, some of which are always idle since we don't have that many tenants yet.

We are looking into splitting up the groups, which will help with running more rules. However, the spike in query request time above happened without any change to the rules or anything else. Restarting the ruler brought the query request time back down, so something else is still happening.

@bboreham
Contributor

@csmarchbanks thanks for the update. I agree, definitely worth drilling into the root cause. I see elsewhere you have installed Jaeger, which I have found very useful - if you want to post some traces, maybe I can help talk through what is happening.

@leth
Contributor

leth commented Feb 19, 2018

FWIW the ruler idle count metric suffers from sampling problems: rules are evaluated every 15s from when the ruler starts, so if you sample it at the start of that window the workers will all look busy, and if you sample it at the end they'll all look idle. I'll work on a PR to spread the work over the window!
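
A minimal Go sketch of the counter-based approach (the metric name matches the cortex_worker_idle_seconds_total that appears later in this thread, but the surrounding worker code is hypothetical): accumulating idle time as a counter lets rate() cover the whole scrape interval instead of whatever instant a gauge happens to be scraped at:

    package main

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // idleSeconds accumulates how long workers spend waiting for work. Sampled
    // via rate(), it reflects the whole interval rather than the moment of the
    // scrape, which is what trips up an instantaneous "idle workers" gauge.
    var idleSeconds = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "cortex_worker_idle_seconds_total",
        Help: "Total time workers have spent idle, waiting for rule groups.",
    })

    func init() {
        prometheus.MustRegister(idleSeconds)
    }

    // workerLoop is a hypothetical worker: it records the time spent blocked on
    // the work channel, then evaluates the item it received.
    func workerLoop(work <-chan func()) {
        for {
            start := time.Now()
            item, ok := <-work
            idleSeconds.Add(time.Since(start).Seconds())
            if !ok {
                return
            }
            item()
        }
    }

    func main() {
        work := make(chan func(), 1)
        work <- func() {}
        close(work)
        workerLoop(work)
    }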

@bboreham
Contributor

I note these failures are essentially invisible to the end-user; it might be good to create a synthetic metric similar to that described in #577 that could be viewed (and alerted on, if the thing doing the alerting wasn't broken).

@csmarchbanks
Contributor Author

All the Jaeger traces from the ruler showed was that /cortex.Ingester/Query would eventually time out at 10s for some ingesters. I will try to dig up some traces, but currently we are mitigating this issue by frequently restarting the ruler, so it might be a while until it happens again. It is purely hypothetical, but I wonder if some of the ingester clients in the ruler are getting into a bad state.

@csmarchbanks
Contributor Author

Also, these failures are definitely visible to the end-user. The ruler ends up in a state where every single rule evaluation times out, so no customer rule results are visible.

@csmarchbanks
Contributor Author

@bboreham Got some Jaeger traces:
[screenshots: Jaeger traces from the ruler]

Queries and sending samples are the only things taking longer than 1s according to Jaeger. Let me know if there is anything specific you would like me to post.

@bboreham
Contributor

Makes me wonder if gRPC gets stuck somehow. Can you get traces from the ingester side of the call?

@cboggs
Contributor

cboggs commented Feb 21, 2018

We've not wired up the ingester for tracing yet, but can do that shortly.

In the meantime, here's some very similar behavior from the Distributor:
[screenshot: similar latency behavior on the Distributor]

@csmarchbanks
Contributor Author

In case they are helpful, here are some goroutine traces of the problem distributor:
goroutine-distributor.pdf

goroutine-dump-distributor.txt
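
For anyone else chasing this, a small Go sketch of how such goroutine dumps can be captured with the standard pprof tooling (assuming the process exposes, or can be made to expose, the net/http/pprof handlers):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
        "os"
        "runtime/pprof"
    )

    func main() {
        // Option 1: serve the standard pprof endpoints, then fetch
        // /debug/pprof/goroutine?debug=2 for a full text stack dump.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // Option 2: write the goroutine profile directly from inside the process.
        if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 2); err != nil {
            log.Fatal(err)
        }
    }

With the HTTP variant, curl http://localhost:6060/debug/pprof/goroutine?debug=2 produces a text dump like the one attached above.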

@bboreham
Contributor

Once I drilled in a bit I realised the traces were not connected up, so I did #720 to fix that. You should get a much richer view when you update.

Can you see an error message when you click on the long traces? It would be under “logs” in the Jaeger UI.

@csmarchbanks
Contributor Author

csmarchbanks commented Feb 21, 2018

They are all context deadline exceeded errors like:

"rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@bboreham
Contributor

Ok, still mysterious.

Is it possible the link from ingester to ruler is flat-out (i.e. saturated)? Do you have bytes/sec stats from when it’s working and when it’s broken?

Similarly, does the cpu usage change? Or memory?

@csmarchbanks
Contributor Author

Ruler network, CPU, and memory usage all drop significantly, but not to zero:
[screenshots: ruler network, CPU, and memory usage graphs]

It does seem possible that the link from the ruler to one or two of the ingesters is dead, but some queries do go through, so probably not all ingesters. Since queries are taking so much longer, it makes sense that network, CPU, and memory all drop: more time is spent idle and less data is processed.

It is hard to screenshot, but looking at a few of the improved Jaeger traces, there are lots of timeouts for both ingester pushes and queries within a single evaluation. I can send some raw JSON traces over if that would be helpful.

@csmarchbanks
Contributor Author

csmarchbanks commented Feb 22, 2018

Here's a screenshot of an example evaluation:
[screenshots: Jaeger trace of an example evaluation]

In case it is helpful, we have 5 ingesters running with a replication factor of 3.

@bboreham
Contributor

I suspect this can be caused by #723 - could you check your ruler logs for "updating rules for user" messages mid-execution, and see if the timing matches (some of) your slow-downs?

The way things go back to normal when the ruler is restarted is a very good match for this theory; the network and CPU usage dropping, not so much.

@csmarchbanks
Contributor Author

@bboreham I have looked at a couple of the most recent failures and do not see any "updating rules" log messages corresponding to the slow-downs.

@cboggs
Contributor

cboggs commented Feb 27, 2018

For the moment, our rules are entirely generated and managed by us. Updates to the same are very infrequent, unfortunately. Looking forward to those other changes all the same, though. :-)

@leth
Contributor

leth commented Feb 28, 2018

It'd be great to know if the recent updates have improved things for you folks!
Also could you check out rate(cortex_worker_idle_seconds_total[1m]) (from #727) to see how close you are to max worker capacity :)

@csmarchbanks
Contributor Author

Sounds good! We had some demos the last couple days, but I will upgrade everything tomorrow and let you know what I find.

@csmarchbanks
Contributor Author

@leth we had the issue happen again today. Here are some of the metrics we collected, including cortex_worker_idle_seconds_total:

[screenshot: collected metrics, including cortex_worker_idle_seconds_total]

@csmarchbanks
Contributor Author

It appears this issue is caused by something going wrong with the ingester gRPC clients. Restarting the clients when they go bad seems to fix the issue (see #741).
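
A rough Go sketch of the idea behind that fix (the names here are hypothetical; see #741 for the actual change): keep a per-ingester connection cache and re-dial a gRPC client once its calls start timing out, instead of reusing a connection that has gone bad:

    package main

    import (
        "sync"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // clientCache is a hypothetical per-ingester client pool that recycles a
    // connection once it starts returning deadline errors.
    type clientCache struct {
        mu    sync.Mutex
        conns map[string]*grpc.ClientConn
    }

    func (c *clientCache) get(addr string) (*grpc.ClientConn, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if conn, ok := c.conns[addr]; ok {
            return conn, nil
        }
        conn, err := grpc.Dial(addr, grpc.WithInsecure())
        if err != nil {
            return nil, err
        }
        c.conns[addr] = conn
        return conn, nil
    }

    // onError drops the cached connection if a call timed out, so the next
    // request dials a fresh client instead of reusing the stuck one.
    func (c *clientCache) onError(addr string, err error) {
        if status.Code(err) != codes.DeadlineExceeded {
            return
        }
        c.mu.Lock()
        defer c.mu.Unlock()
        if conn, ok := c.conns[addr]; ok {
            conn.Close()
            delete(c.conns, addr)
        }
    }

    func main() {
        cache := &clientCache{conns: map[string]*grpc.ClientConn{}}
        conn, err := cache.get("ingester-0:9095")
        if err == nil {
            _ = conn // use conn for pushes and queries...
            // ...and on a timeout, recycle it so the next call gets a fresh client.
            cache.onError("ingester-0:9095", status.Error(codes.DeadlineExceeded, "context deadline exceeded"))
        }
    }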
