
Cortex reads are slow #132


Closed
tomwilkie opened this issue Nov 11, 2016 · 8 comments

Comments

@tomwilkie
Contributor Author

Now that we have better metrics around queries, it looks like our expectations about the number of chunks per query were wrong. For something like sum by(instance, job) (rate(container_cpu_user_seconds_total{job="kubernetes-nodes"}[1m])) we're fetching hundreds of chunks.

I suspect this is because that query hits a few hundred different timeseries (due to lots of label/value combinations), so there's nothing we can improve there. We should probably look at the logic for merging these chunks together, and maybe make it lazy? We could also exclude chunks that lie outside the time range using info purely from the index (see the sketch below).

I think we'll need to do some kind of tracing to get a better understanding of this.
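To make the time-range exclusion idea concrete, here's a minimal sketch in Go. The ChunkDesc type and its fields are hypothetical, standing in for whatever chunk metadata the index actually gives us; the point is just that overlap can be checked before any chunk data is fetched.

```go
package main

import "fmt"

// ChunkDesc is a hypothetical descriptor for this sketch: just the chunk's
// time bounds as recovered from the index, with no chunk data attached.
type ChunkDesc struct {
	From, Through int64 // millisecond timestamps from the index
	ID            string
}

// filterChunksByTime drops descriptors that cannot overlap the query window
// [from, through], using only index metadata, so excluded chunks are never
// fetched from the store.
func filterChunksByTime(from, through int64, chunks []ChunkDesc) []ChunkDesc {
	filtered := make([]ChunkDesc, 0, len(chunks))
	for _, c := range chunks {
		if c.Through >= from && c.From <= through {
			filtered = append(filtered, c)
		}
	}
	return filtered
}

func main() {
	chunks := []ChunkDesc{
		{From: 0, Through: 100, ID: "a"},   // ends before the window
		{From: 150, Through: 250, ID: "b"}, // overlaps the window
		{From: 400, Through: 500, ID: "c"}, // starts after the window
	}
	fmt.Println(filterChunksByTime(120, 300, chunks)) // only "b" survives
}
```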

@tomwilkie
Contributor Author

Also noticed that memcache's hit rate starts off low and trends up - I wonder if the absence of any queries is making this artificially low? Perhaps we can tell memcache to evict old chunks first (in the absence of any other data - or maybe it does this already)?

@tomwilkie
Contributor Author

tomwilkie commented Nov 23, 2016

Ideas to try:

@tomwilkie
Contributor Author

After the first round of optimizations, reported query performance (from directly instrumenting the Query function, roughly as in the sketch after the list below) shows "good" (<100ms 99th %ile) latency, but real-world latency, and that observed by authfe, is still high.

Things to consider

  • Cost of doing serialisation? Instant queries are much quicker than 'graph' / range queries
  • This query seems to contradict that: sum(cortex_request_duration_seconds_count) for @jml
  • Still seeing high dynamodb latency in ruler, although not seeing high query latency.
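For reference, this is roughly what "directly instrumenting the Query function" means above; a minimal sketch using the Prometheus Go client, with an illustrative metric name and wrapper rather than the actual Cortex code:

```go
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// queryDuration records how long each query takes, measured inside the
// querier itself (so it excludes authfe, serialisation and network time).
var queryDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "query_duration_seconds",
	Help:    "Time taken to execute a query, measured inside the querier.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(queryDuration)
}

// instrumentedQuery wraps an arbitrary query function and records its latency.
func instrumentedQuery(ctx context.Context, run func(context.Context) error) error {
	start := time.Now()
	err := run(ctx)
	queryDuration.Observe(time.Since(start).Seconds())
	return err
}

func main() {
	_ = instrumentedQuery(context.Background(), func(ctx context.Context) error {
		time.Sleep(10 * time.Millisecond) // stand-in for real query work
		return nil
	})
}
```

Comparing this histogram against the latency authfe reports for the same requests should show where the unexplained time is going.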

@tomwilkie
Contributor Author

Also noticed a query with bad syntax (that shouldn't fetch anything) still takes ~100ms.
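If that ~100ms is being spent before the query is even parsed, one cheap win might be validating the PromQL up front so malformed queries fail before any storage work. A sketch, assuming the parser package from a recent Prometheus release (the package path has moved since 2016, so treat the import as illustrative):

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// validateQuery parses the expression without touching the store, so a
// syntax error can be rejected in microseconds rather than after a fetch.
func validateQuery(q string) error {
	if _, err := parser.ParseExpr(q); err != nil {
		return fmt.Errorf("invalid query %q: %w", q, err)
	}
	return nil
}

func main() {
	// Unbalanced parentheses: should be rejected immediately.
	fmt.Println(validateQuery(`sum by(instance) (rate(foo[5m]`))
}
```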

@tomwilkie
Contributor Author

From local testing, I don't see any sizable latency introduced from dynamodb:

[screenshot attached: screen shot 2016-11-24 at 17 04 45]

@tomwilkie
Contributor Author

Given the dynamodb query took 212ms, and the authfe query took 219ms, 7ms overhead from authfe -> distributor isn't too shabby. Also, gRPC latency from distributor <-> ingester seems worse (at the best part of 12ms) than authfe <-> distributor (at ~6ms).
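A minimal sketch of how that per-hop latency could be surfaced, assuming a plain grpc-go unary client interceptor on the distributor -> ingester connection (illustrative only; a real setup would record a histogram or attach a tracing span rather than log):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// latencyInterceptor times every unary call made through the connection.
func latencyInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("gRPC %s took %s", method, time.Since(start))
	return err
}

// dialIngester wires the interceptor into the client connection.
// grpc.WithInsecure is deprecated in newer releases but still illustrates the wiring.
func dialIngester(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(latencyInterceptor))
}

func main() {
	// "ingester:9095" is a placeholder address for this sketch.
	conn, err := dialIngester("ingester:9095")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Every unary call made on conn will now log its round-trip time.
}
```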

@tomwilkie tomwilkie removed their assignment Jan 5, 2017
@tomwilkie
Contributor Author

Closing this out, moving the other ideas to #209
