
Ruler performance frequently degrades #702


Closed
csmarchbanks opened this issue Feb 13, 2018 · 25 comments · Fixed by #741


@csmarchbanks
Contributor

The ruler service in our cluster frequently (every day) runs into issues that result in no rules being processed. The main symptom is that upper-percentile (90th percentile and above) ruler query durations increase to 10-20 seconds, which causes the ruler to hit the group timeout (left at the default 10s in our cluster). Since we evaluate ~100 rules per tenant, these high-percentile latencies cause every evaluation to fail.

[screenshot: ruler query duration graph]
Queries for this graph look like:

histogram_quantile(0.99, sum(rate(cortex_distributor_query_duration_seconds_bucket{name="ruler"}[1m])) by (le))

Lots of log messages like:

ts=2018-02-13T09:38:55.273063356Z caller=log.go:108 level=error org_id=0 msg="error in mergeQuerier.selectSamples" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
ts=2018-02-13T09:38:55.274565552Z caller=log.go:108 level=warn msg="context error" error="context deadline exceeded"
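
For context on where these errors come from, here is a minimal Go sketch (not Cortex's actual code; queryIngesters below is a hypothetical stand-in for the mergeQuerier -> ingester gRPC path) of how a single per-group deadline propagates to every query made for that group and surfaces as DeadlineExceeded once the 10s budget is spent:

    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // queryIngesters stands in for a slow ingester query: it blocks until the
    // shared context's deadline fires, then reports the gRPC timeout error.
    func queryIngesters(ctx context.Context) error {
        <-ctx.Done()
        return status.Error(codes.DeadlineExceeded, "context deadline exceeded")
    }

    func main() {
        // One deadline covers the whole rule-group evaluation, like the
        // ruler's 10s group timeout.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        if err := queryIngesters(ctx); status.Code(err) == codes.DeadlineExceeded {
            // Same shape as the log lines above: once the budget is used up,
            // every remaining query for the group fails.
            log.Println("error in mergeQuerier.selectSamples:", err)
        }
    }
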
@jml
Contributor

jml commented Feb 14, 2018

Yup. We've seen this too.

I think the answer is:

  • split scheduler & worker to separate processes
    • worker API is a single gRPC / HTTP endpoint that takes rules and evaluates them before returning a response
  • have the worker persist firing state to an external (consistent?) store
  • switch to multiple workers

#310 has some notes.

I've been intending to do this for a while, but management responsibilities preclude any serious coding. I want to do #619 first anyway, but it's not a strict prerequisite.
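
As a rough illustration of what that worker endpoint could look like (all names and types here are hypothetical, not an existing Cortex API), a minimal Go sketch of an HTTP handler that accepts a batch of rules, evaluates them, and returns the firing state so it can be persisted externally:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // EvalRequest and EvalResponse are hypothetical wire types for the worker API.
    type EvalRequest struct {
        UserID string   `json:"user_id"`
        Rules  []string `json:"rules"` // rule expressions to evaluate
    }

    type EvalResponse struct {
        Firing []string `json:"firing"` // rules currently firing, to persist externally
    }

    // evaluate is a stand-in for running the rules through the query path.
    func evaluate(req EvalRequest) EvalResponse {
        return EvalResponse{}
    }

    func main() {
        // The scheduler posts a group here; the worker evaluates it and replies,
        // so several workers can sit behind a load balancer.
        http.HandleFunc("/evaluate", func(w http.ResponseWriter, r *http.Request) {
            var req EvalRequest
            if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            json.NewEncoder(w).Encode(evaluate(req))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

A gRPC service with the same request/response shape would work just as well; HTTP is used here only to keep the sketch short.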

@bboreham
Contributor

Given that a group is typically evaluated every 15s, and you are hitting a 10s timeout on individual rule groups, I'd say the solution here is not to split up the processes but either to make the queries run faster or to split up the groups.

In the short term you could just raise the timeout. Even raise it above 15s and live with not meeting the expectation of running every 15s.

@bboreham
Contributor

@csmarchbanks how many workers do you have?

(-ruler.num-workers, defaults to 1)

More workers will mean a slow query does not hold up everything behind it.
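
For reference, a sketch of the flags involved in both suggestions (only -ruler.num-workers is confirmed in this thread; the group-timeout flag name below is an assumption and may differ in your version):

    # -ruler.num-workers is confirmed above (defaults to 1).
    # The group-timeout flag name is an assumption; check your version's -help output.
    -ruler.num-workers=20
    -ruler.group-timeout=20s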

@csmarchbanks
Contributor Author

We have 10 workers right now, some of which are always idle since we don't have that many tenants yet.

We are looking into splitting up the groups, which will help with running more rules. However, the spike in query request time above happened without any change to the rules or anything else. Restarting the ruler brought the query request time back down, so something else is still happening.

@bboreham
Contributor

@csmarchbanks thanks for the update. I agree, definitely worth drilling into the root cause. I see elsewhere you have installed Jaeger, which I have found very useful - if you want to post some traces, maybe I can help talk through what is happening.

@leth
Contributor

leth commented Feb 19, 2018

FWIW the ruler idle count metric suffers from sampling problems: rules are evaluated every 15s from when the ruler starts, so if you sample it at the start of that window the workers will all look busy, and if you sample it at the end they'll all look idle. I'll work on a PR to spread the work over the window!
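
A minimal Go sketch of the counter-based approach (the metric name matches the cortex_worker_idle_seconds_total that appears later in this thread, but the surrounding worker code is hypothetical): accumulating idle time as a counter lets rate() cover the whole scrape interval instead of whatever instant a gauge happens to be scraped at:

    package main

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // idleSeconds accumulates how long workers spend waiting for work. Sampled
    // via rate(), it reflects the whole interval rather than the moment of the
    // scrape, which is what trips up an instantaneous "idle workers" gauge.
    var idleSeconds = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "cortex_worker_idle_seconds_total",
        Help: "Total time workers have spent idle, waiting for rule groups.",
    })

    func init() {
        prometheus.MustRegister(idleSeconds)
    }

    // workerLoop is a hypothetical worker: it records the time spent blocked on
    // the work channel, then evaluates the item it received.
    func workerLoop(work <-chan func()) {
        for {
            start := time.Now()
            item, ok := <-work
            idleSeconds.Add(time.Since(start).Seconds())
            if !ok {
                return
            }
            item()
        }
    }

    func main() {
        work := make(chan func(), 1)
        work <- func() {}
        close(work)
        workerLoop(work)
    }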

@bboreham
Contributor

I note these failures are essentially invisible to the end-user; it might be good to create a synthetic metric similar to that described in #577 that could be viewed (and alerted on, if the thing doing the alerting wasn't broken).

@csmarchbanks
Contributor Author

All the Jaeger traces from the ruler showed was that /cortex.Ingester/Query would eventually time out at 10s for some ingesters. I will try to dig up some traces, but currently we are mitigating this issue by frequently restarting the ruler, so it might be a while until it happens again. It is purely hypothetical, but I wonder if some of the ingester clients in the ruler are getting into a bad state.

@csmarchbanks
Contributor Author

Also, these failures are definitely visible to the end-user. The ruler ends up in a state where every single rule evaluation times out, so no customer rule results are visible.

@csmarchbanks
Contributor Author

@bboreham Got some Jaeger traces:
[screenshots: Jaeger traces from the ruler]

Queries and sending samples are the only things taking longer than 1s according to Jaeger. Let me know if there is anything specific you would like me to post.

@bboreham
Contributor

Makes me wonder if gRPC gets stuck somehow. Can you get traces from the ingester side of the call?

@cboggs
Contributor

cboggs commented Feb 21, 2018

We've not wired up the ingester for tracing yet, but can do that shortly.

In the meantime, here's some very similar behavior from the Distributor:
[screenshot: similar latency behavior on the Distributor]

@csmarchbanks
Contributor Author

In case they are helpful, here are some goroutine traces of the problem distributor:
goroutine-distributor.pdf

goroutine-dump-distributor.txt
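
For anyone else chasing this, a small Go sketch of how such goroutine dumps can be captured with the standard pprof tooling (assuming the process exposes, or can be made to expose, the net/http/pprof handlers):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
        "os"
        "runtime/pprof"
    )

    func main() {
        // Option 1: serve the standard pprof endpoints, then fetch
        // /debug/pprof/goroutine?debug=2 for a full text stack dump.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // Option 2: write the goroutine profile directly from inside the process.
        if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 2); err != nil {
            log.Fatal(err)
        }
    }

With the HTTP variant, curl http://localhost:6060/debug/pprof/goroutine?debug=2 produces a text dump like the one attached above.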

@bboreham
Contributor

Once I drilled in a bit I realised the traces were not connected up, so I did #720 to fix that. You should get a much richer view when you update.

Can you see an error message when you click on the long traces? It would be under “logs” in the Jaeger UI.

@csmarchbanks
Contributor Author

csmarchbanks commented Feb 21, 2018

They are all context deadline exceeded errors like:

"rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@bboreham
Contributor

Ok, still mysterious.

Is it possible the link from ingester to ruler is flat-out (i.e. saturated)? Do you have bytes/sec stats from when it’s working and when it’s broken?

Similarly, does the cpu usage change? Or memory?

@csmarchbanks
Contributor Author

Ruler network, CPU, and memory usage all drop significantly, but not to zero:
[screenshots: ruler network, CPU, and memory usage graphs]

It does seem possible that the link from the ruler to one or two of the ingesters is dead, but some queries do go through, so probably not all ingesters. Since queries are taking so much longer, it makes sense that network, CPU, and memory all drop: more time is spent idle and less data is processed.

It is hard to screenshot, but looking at a few of the improved Jaeger traces, there are lots of timeouts for both ingester pushes and queries within a single evaluation. I can send some raw JSON traces over if that would be helpful.

@csmarchbanks
Contributor Author

csmarchbanks commented Feb 22, 2018

Here's a screenshot of an example evaluation:
[screenshots: Jaeger trace of an example evaluation]

In case it is helpful, we have 5 ingesters running with a replication factor of 3.

@bboreham
Contributor

I suspect this can be caused by #723 - could you check your ruler logs for "updating rules for user" messages mid-execution, and see if the timing matches (some of) your slow-downs?

The way things go back to normal when the ruler is restarted is a very good match for this theory; the network and CPU usage dropping, not so much.

@csmarchbanks
Contributor Author

@bboreham I have looked at a couple of the most recent failures and do not see any "updating rules" log messages corresponding to the slow-downs.

@cboggs
Contributor

cboggs commented Feb 27, 2018

For the moment, our rules are entirely generated and managed by us. Updates to the same are very infrequent, unfortunately. Looking forward to those other changes all the same, though. :-)

@leth
Contributor

leth commented Feb 28, 2018

It'd be great to know if the recent updates have improved things for you folks!
Also could you check out rate(cortex_worker_idle_seconds_total[1m]) (from #727) to see how close you are to max worker capacity :)

@csmarchbanks
Contributor Author

Sounds good! We had some demos the last couple days, but I will upgrade everything tomorrow and let you know what I find.

@csmarchbanks
Contributor Author

@leth we had the issue happen again today. Here are some of the metrics we collected, including cortex_worker_idle_seconds_total:

[screenshot: collected metrics, including cortex_worker_idle_seconds_total]

@csmarchbanks
Contributor Author

It appears this issue is caused by something going wrong with the ingester gRPC clients. Restarting the clients when they go bad seems to fix the issue (see #741).
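
A rough Go sketch of the idea behind that fix (the names here are hypothetical; see #741 for the actual change): keep a per-ingester connection cache and re-dial a gRPC client once its calls start timing out, instead of reusing a connection that has gone bad:

    package main

    import (
        "sync"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // clientCache is a hypothetical per-ingester client pool that recycles a
    // connection once it starts returning deadline errors.
    type clientCache struct {
        mu    sync.Mutex
        conns map[string]*grpc.ClientConn
    }

    func (c *clientCache) get(addr string) (*grpc.ClientConn, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if conn, ok := c.conns[addr]; ok {
            return conn, nil
        }
        conn, err := grpc.Dial(addr, grpc.WithInsecure())
        if err != nil {
            return nil, err
        }
        c.conns[addr] = conn
        return conn, nil
    }

    // onError drops the cached connection if a call timed out, so the next
    // request dials a fresh client instead of reusing the stuck one.
    func (c *clientCache) onError(addr string, err error) {
        if status.Code(err) != codes.DeadlineExceeded {
            return
        }
        c.mu.Lock()
        defer c.mu.Unlock()
        if conn, ok := c.conns[addr]; ok {
            conn.Close()
            delete(c.conns, addr)
        }
    }

    func main() {
        cache := &clientCache{conns: map[string]*grpc.ClientConn{}}
        conn, err := cache.get("ingester-0:9095")
        if err == nil {
            _ = conn // use conn for pushes and queries...
            // ...and on a timeout, recycle it so the next call gets a fresh client.
            cache.onError("ingester-0:9095", status.Error(codes.DeadlineExceeded, "context deadline exceeded"))
        }
    }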
