Cortex reads are slow #132
Now that we have better metrics around queries, it looks like our expectations around the number of chunks per query were wrong. For something like this, I suspect it's because the query hits a few hundred different timeseries (due to lots of labels/values), so there's nothing we can improve there. We should probably look at the logic for merging these chunks together, and maybe make it lazy? We could also exclude chunks that lie outside the time range using info purely from the index. I think we'll need to do some kind of tracing to get a better understanding of this.
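On the "exclude chunks outside the time range" idea, a minimal sketch of what that filter might look like — the types here (`ChunkRef`, `filterByTimeRange`) are hypothetical stand-ins, not Cortex's actual index schema:

```go
package chunkfilter

import "time"

// ChunkRef is a hypothetical stand-in for the per-chunk metadata the index
// already returns: an identifier plus the chunk's first and last timestamps.
type ChunkRef struct {
	ID   string
	From time.Time
	To   time.Time
}

// filterByTimeRange drops chunks that lie entirely outside [from, through]
// before any chunk data is fetched from the store.
func filterByTimeRange(refs []ChunkRef, from, through time.Time) []ChunkRef {
	var keep []ChunkRef
	for _, r := range refs {
		if r.To.Before(from) || r.From.After(through) {
			continue // no overlap with the query window; skip the fetch
		}
		keep = append(keep, r)
	}
	return keep
}
```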
Also notice that memcache's hit rate starts off low and trends up - I wonder if the absence of any queries is making this artificially low? Perhaps we can tell memcache to evict old chunks first (in the absence of any other data - or maybe it does this already)?
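Memcached already evicts least-recently-used items under memory pressure by default, but setting an explicit expiry when chunks are written would also bound how long cold chunks sit in the cache. A minimal sketch using the gomemcache client, with an arbitrary example TTL:

```go
package chunkcache

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// writeChunk stores a chunk with an explicit expiry so cold chunks
// eventually age out even if the LRU never reaches them. The 7-day TTL is
// an arbitrary example, not a recommendation.
func writeChunk(client *memcache.Client, key string, buf []byte) error {
	return client.Set(&memcache.Item{
		Key:        key,
		Value:      buf,
		Expiration: int32((7 * 24 * time.Hour).Seconds()), // seconds
	})
}
```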
Ideas to try:
After the first round of optimizations, reported query performance (from directly instrumenting the ...).

Things to consider:
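As an aside on the "directly instrumenting" part: a rough sketch of how query latency could be recorded with a Prometheus histogram (the metric name and wrapper function are hypothetical, not the actual Cortex code):

```go
package querier

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// queryDuration is a hypothetical metric; the real Cortex instrumentation
// may use different names, labels and buckets.
var queryDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "cortex",
	Name:      "query_duration_seconds",
	Help:      "Time taken to execute queries.",
	Buckets:   prometheus.DefBuckets,
}, []string{"status"})

func init() {
	prometheus.MustRegister(queryDuration)
}

// instrument runs a query and records its wall-clock duration.
func instrument(query func() error) error {
	start := time.Now()
	err := query()
	status := "success"
	if err != nil {
		status = "error"
	}
	queryDuration.WithLabelValues(status).Observe(time.Since(start).Seconds())
	return err
}
```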
Also noticed that a query with bad syntax (that shouldn't fetch anything) still takes ~100ms.
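One cheap win might be to parse the expression up front and fail fast before touching the index or chunk store at all. A minimal sketch using the upstream PromQL parser (the import path below is from current Prometheus releases; at the time of this issue the parser lived in the `promql` package itself):

```go
package queryfrontend

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// validate rejects malformed queries before any index or chunk lookups,
// so a syntax error costs microseconds instead of a trip through the
// whole read path.
func validate(query string) error {
	if _, err := parser.ParseExpr(query); err != nil {
		return fmt.Errorf("invalid query %q: %w", query, err)
	}
	return nil
}
```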
Given the DynamoDB query took 212ms and the authfe query took 219ms, 7ms of overhead from authfe -> distributor isn't too shabby. Also, gRPC latency from distributor <-> ingester seems worse (at the best part of 12ms) than authfe <-> distributor (at ~6ms).
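To break down where that per-hop gRPC time goes, one option is a client-side unary interceptor that times each call. A minimal sketch (logging only; exporting a histogram would be the obvious next step):

```go
package middleware

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// latencyInterceptor times every unary client RPC, making the per-hop cost
// (e.g. distributor -> ingester) visible directly in the logs.
func latencyInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("gRPC %s took %s (err=%v)", method, time.Since(start), err)
	return err
}

// Wire it up when dialing:
//   conn, err := grpc.Dial(addr, grpc.WithUnaryInterceptor(latencyInterceptor))
```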
Closing this out, moving the other ideas to #209.
On prod, we're seeing 99th %ile reads at 3.3s:
https://cloud.weave.works/admin/prometheus/graph?g0.range_input=1h&g0.expr=topk(10%2C+histogram_quantile(0.99%2C+sum(rate(scope_request_duration_seconds_bucket%7Bjob%3D%22default%2Fauthfe%22%2C+route+!~+%22admin.*%22%2C+ws%3D%22false%22%7D%5B1h%5D))+by+(route%2C+le)))&g0.tab=0
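For readability, the URL-decoded expression behind that link is:

```
topk(10, histogram_quantile(0.99,
  sum(rate(scope_request_duration_seconds_bucket{job="default/authfe", route !~ "admin.*", ws="false"}[1h])) by (route, le)))
```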