Ruler performance frequently degrades #702
Comments
Yup. We've seen this too. I think the answer is:
#310 has some notes. I've been intending to do this for a while, but management responsibilities preclude any serious coding. I want to do #619 first anyway, but it's not a strict prerequisite.
Given that a group is typically evaluated every 15s and you are hitting a 10s timeout on single groups of rules, I'd say the solution here is not to split up the processes, but either to make the queries run faster or to split up the groups. In the short term you could just raise the timeout, even above 15s, and live with not meeting the expectation of running every 15s.
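For concreteness, here is a minimal sketch of the relationship between the evaluation interval and the group timeout. This is not Cortex's actual code; the group name, interval, and timeout values are illustrative, matching the 15s interval and 10s timeout discussed in this issue.

```go
package main

import (
	"context"
	"log"
	"time"
)

// evaluateGroup stands in for running every rule in one group; in the real
// ruler these are queries against the Cortex query path.
func evaluateGroup(ctx context.Context, group string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the queries took 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runGroup evaluates one group on a fixed interval, bounding each pass with
// a timeout. If the timeout is shorter than the query latency, every pass
// ends up in the error branch, which is the failure mode in this issue.
func runGroup(group string, interval, timeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		if err := evaluateGroup(ctx, group); err != nil {
			log.Printf("group %s: evaluation failed: %v", group, err)
		}
		cancel()
	}
}

func main() {
	go runGroup("tenant-1;alerts", 15*time.Second, 10*time.Second)
	select {} // keep the sketch running
}
```

Raising the timeout past the interval simply means a slow pass can overrun into the next tick instead of failing outright.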
@csmarchbanks how many workers do you have? (More workers will mean a slow query does not hold up everything behind it.)
We have 10 workers right now, some of which are always idle since we don't have that many tenants yet. We are looking into splitting up the groups, which will help with running more rules. However, the spike in query request time above happened when no rules or anything else was changed. Restarting ruler brought the query request time back down, so something else is still happening. |
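To illustrate the head-of-line blocking point above, here is a toy sketch (not the ruler's actual worker pool; the durations are stand-ins, not measurements) of why more workers keep one slow group from delaying everything queued behind it:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run drains a queue of "group evaluations" with the given number of workers
// and returns the wall-clock time taken.
func run(workers int, groups []time.Duration) time.Duration {
	work := make(chan time.Duration, len(groups))
	for _, d := range groups {
		work <- d
	}
	close(work)

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range work {
				time.Sleep(d) // stand-in for evaluating one group
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	groups := []time.Duration{
		1200 * time.Millisecond, // one slow group
		100 * time.Millisecond, 100 * time.Millisecond, 100 * time.Millisecond,
	}
	fmt.Println("1 worker: ", run(1, groups)) // ~1.5s: everything waits behind the slow group
	fmt.Println("3 workers:", run(3, groups)) // ~1.2s: the slow group no longer blocks the rest
}
```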
@csmarchbanks thanks for the update. I agree, definitely worth drilling into the root cause. I see elsewhere you have installed Jaeger which I have found very useful - if you want to post some traces maybe I can help talk through what is happening. |
FWIW the ruler idle count metric suffers from sampling problems: rules are evaluated every 15s from when the ruler starts - if you sample it at the start of that window they'll all be busy, if you sample it at the end they'll all be idle. I'll work on a PR to spread the work over the window! |
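A minimal sketch of what spreading the work over the window could look like (purely illustrative, not the PR itself): give each group a stable offset within the interval, derived from a hash of its name, so evaluations are not all clumped at the moment the ruler starts.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// startOffset returns a deterministic delay in [0, interval) for a group, so
// each group starts at a different point in the evaluation window.
func startOffset(group string, interval time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(group))
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	interval := 15 * time.Second
	for _, g := range []string{"tenant-1;alerts", "tenant-2;recording"} {
		fmt.Printf("%s starts %v into each window\n", g, startOffset(g, interval))
	}
}
```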
I note these failures are essentially invisible to the end-user; it might be good to create a synthetic metric similar to that described in #577 that could be viewed (and alerted on, if the thing doing the alerting wasn't broken).
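Something along these lines would make the failures visible from outside the ruler. This is only a sketch of the general shape, not the design in #577; the metric name and label are made up for illustration.

```go
package main

import (
	"errors"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// evalFailures is a hypothetical counter: incremented whenever a group
// evaluation fails, so an *external* Prometheus can alert on its rate even
// when the ruler itself can no longer evaluate alerting rules.
var evalFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ruler_group_evaluation_failures_total",
		Help: "Rule group evaluations that failed or timed out.",
	},
	[]string{"user"},
)

func recordEvaluation(user string, err error) {
	if err != nil {
		evalFailures.WithLabelValues(user).Inc()
	}
}

func main() {
	prometheus.MustRegister(evalFailures)
	recordEvaluation("tenant-1", errors.New("context deadline exceeded"))

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```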
All that was in the Jaeger traces from the ruler was |
Also, these failures are definitely visible to the end-user. The ruler ends up in a state where every single rule evaluation is timing out, so no customer rules are visible. |
@bboreham Got some Jaeger traces: Queries, and sending samples are the only things taking longer than 1s according to Jaeger. Let me know if there is anything specific you would like me to post |
Makes me wonder if gRPC gets stuck somehow. Can you get traces from the ingester side of the call? |
In case they are helpful, here are some goroutine traces of the problem distributor: |
Once I drilled in a bit I realised the traces were not connected up, so I did #720 to fix that. You should get a much richer view when you update. Can you see an error message when you click on the long traces? It would be under “logs” in the Jaeger UI.
They are all context deadline exceeded errors like:
|
Ok, still mysterious. Is it possible the link from ingester to ruler is flat-out? Do you have bytes/sec stats from when it’s working and when it’s broken? Similarly, does the cpu usage change? Or memory? |
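For reference, a saturated link and a stuck connection both surface on the caller as the same "context deadline exceeded" error. A minimal sketch of how the client-side deadline produces it (the target address and the use of the standard gRPC health service are assumptions for illustration, not Cortex code):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

func main() {
	conn, err := grpc.Dial("ingester.cortex.svc:9095", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Bound the call; if the connection is stuck or the link is saturated,
	// the call returns once the 10s deadline expires.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	_, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if status.Code(err) == codes.DeadlineExceeded {
		// This is what shows up in the ruler/distributor logs and traces.
		log.Printf("call timed out: %v", err)
	}
}
```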
I suspect this can be caused by #723 - could you check your […]? The way things go back to normal when […].
@bboreham I have looked at a couple of the most recent failures and do not see any […].
For the moment, our rules are entirely generated and managed by us. Updates to the same are very infrequent, unfortunately. Looking forward to those other changes all the same, though. :-) |
It'd be great to know if the recent updates have improved things for you folks! |
Sounds good! We had some demos the last couple days, but I will upgrade everything tomorrow and let you know what I find. |
@leth, we had the issue happen again today. Here are some of the metrics we collected, including cortex_worker_idle_seconds_total:
It appears this issue is caused by something going wrong with the ingester gRPC clients. It seems that restarting the clients when they go bad fixes this issue (see #741).
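The general shape of that workaround, as a hedged sketch (this is not necessarily how #741 implements it; the address, timings, and state checks are illustrative): watch the connection state and re-dial when it sits in a failed state.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr, grpc.WithInsecure())
}

// watch re-creates the client connection whenever it ends up in a failed
// state, which is the "restart the clients when they go bad" idea.
func watch(addr string, conn *grpc.ClientConn) {
	for {
		state := conn.GetState()
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			log.Printf("connection to %s is %s, re-dialling", addr, state)
			conn.Close()
			if newConn, err := dial(addr); err == nil {
				conn = newConn
			}
		}
		// Block until the state changes, or re-check after a minute.
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		conn.WaitForStateChange(ctx, conn.GetState())
		cancel()
	}
}

func main() {
	addr := "ingester.cortex.svc:9095"
	conn, err := dial(addr)
	if err != nil {
		log.Fatal(err)
	}
	watch(addr, conn)
}
```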
The ruler service in our cluster frequently (every day) runs into issues that end up meaning no rules are processed. The main issue seen is that upper-percentile (90th percentile and above) ruler query durations increase to 10-20 seconds, which causes the ruler to hit the group timeout (left at the default 10s in our cluster). Since we evaluate ~100 rules per tenant, these high-percentile latencies cause every evaluation to fail.
Queries for this graph look like:
Lots of log messages like: