
Cortex reads are slow #132


Closed
tomwilkie opened this issue Nov 11, 2016 · 8 comments

Comments

@tomwilkie
Contributor Author

Now that we have better metrics around queries, it looks like our expectations about the number of chunks per query were wrong. For something like sum by(instance, job) (rate(container_cpu_user_seconds_total{job="kubernetes-nodes"}[1m])) we're fetching hundreds of chunks.

I suspect this is because that query hits a few hundred different timeseries (due to lots of label/value combinations), so there's nothing we can improve there. We should probably look at the logic for merging these chunks together, and maybe make it lazy? We could also exclude chunks that lie outside the time range using info purely from the index (see the sketch below).

I think we'll need to do some kind of tracing to get a better understanding of this.
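To make the time-range exclusion idea concrete, here's a minimal sketch in Go. The ChunkDesc type and its fields are hypothetical, standing in for whatever chunk metadata the index actually gives us; the point is just that overlap can be checked before any chunk data is fetched.

```go
package main

import "fmt"

// ChunkDesc is a hypothetical descriptor for this sketch: just the chunk's
// time bounds as recovered from the index, with no chunk data attached.
type ChunkDesc struct {
	From, Through int64 // millisecond timestamps from the index
	ID            string
}

// filterChunksByTime drops descriptors that cannot overlap the query window
// [from, through], using only index metadata, so excluded chunks are never
// fetched from the store.
func filterChunksByTime(from, through int64, chunks []ChunkDesc) []ChunkDesc {
	filtered := make([]ChunkDesc, 0, len(chunks))
	for _, c := range chunks {
		if c.Through >= from && c.From <= through {
			filtered = append(filtered, c)
		}
	}
	return filtered
}

func main() {
	chunks := []ChunkDesc{
		{From: 0, Through: 100, ID: "a"},   // ends before the window
		{From: 150, Through: 250, ID: "b"}, // overlaps the window
		{From: 400, Through: 500, ID: "c"}, // starts after the window
	}
	fmt.Println(filterChunksByTime(120, 300, chunks)) // only "b" survives
}
```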

@tomwilkie
Contributor Author

Also noticed that memcache's hit rate starts off low and trends up - I wonder if the absence of any queries is making this artificially low? Perhaps we can tell memcache to evict old chunks first (in the absence of any other data - or maybe it does this already)?

@tomwilkie
Contributor Author

tomwilkie commented Nov 23, 2016

Ideas to try:

@tomwilkie
Contributor Author

After the first round of optimizations, reported query performance (from directly instrumenting the Query function, roughly as in the sketch after the list below) shows "good" (<100ms 99th %ile) latency, but real-world latency, and that observed by authfe, is still high.

Things to consider

  • Cost of doing serialisation? Instant queries are much quicker than 'graph' / range queries
  • This query seems to contradict that: sum(cortex_request_duration_seconds_count) for @jml
  • Still seeing high dynamodb latency in ruler, although not seeing high query latency.
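For reference, this is roughly what "directly instrumenting the Query function" means above; a minimal sketch using the Prometheus Go client, with an illustrative metric name and wrapper rather than the actual Cortex code:

```go
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// queryDuration records how long each query takes, measured inside the
// querier itself (so it excludes authfe, serialisation and network time).
var queryDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "query_duration_seconds",
	Help:    "Time taken to execute a query, measured inside the querier.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(queryDuration)
}

// instrumentedQuery wraps an arbitrary query function and records its latency.
func instrumentedQuery(ctx context.Context, run func(context.Context) error) error {
	start := time.Now()
	err := run(ctx)
	queryDuration.Observe(time.Since(start).Seconds())
	return err
}

func main() {
	_ = instrumentedQuery(context.Background(), func(ctx context.Context) error {
		time.Sleep(10 * time.Millisecond) // stand-in for real query work
		return nil
	})
}
```

Comparing this histogram against the latency authfe reports for the same requests should show where the unexplained time is going.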

@tomwilkie
Contributor Author

Also noticed a query with bad syntax (that shouldn't fetch anything) still takes ~100ms.
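If that ~100ms is being spent before the query is even parsed, one cheap win might be validating the PromQL up front so malformed queries fail before any storage work. A sketch, assuming the parser package from a recent Prometheus release (the package path has moved since 2016, so treat the import as illustrative):

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// validateQuery parses the expression without touching the store, so a
// syntax error can be rejected in microseconds rather than after a fetch.
func validateQuery(q string) error {
	if _, err := parser.ParseExpr(q); err != nil {
		return fmt.Errorf("invalid query %q: %w", q, err)
	}
	return nil
}

func main() {
	// Unbalanced parentheses: should be rejected immediately.
	fmt.Println(validateQuery(`sum by(instance) (rate(foo[5m]`))
}
```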

@tomwilkie
Contributor Author

From local testing, I don't see any sizable latency introduced from dynamodb:

[screenshot attached: screen shot 2016-11-24 at 17 04 45]

@tomwilkie
Contributor Author

Given the dynamodb query took 212ms, and the authfe query took 219ms, 7ms overhead from authfe -> distributor isn't too shabby. Also, gRPC latency from distributor <-> ingester seems worse (at the best part of 12ms) than authfe <-> distributor (at ~6ms).
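A minimal sketch of how that per-hop latency could be surfaced, assuming a plain grpc-go unary client interceptor on the distributor -> ingester connection (illustrative only; a real setup would record a histogram or attach a tracing span rather than log):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// latencyInterceptor times every unary call made through the connection.
func latencyInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("gRPC %s took %s", method, time.Since(start))
	return err
}

// dialIngester wires the interceptor into the client connection.
// grpc.WithInsecure is deprecated in newer releases but still illustrates the wiring.
func dialIngester(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(latencyInterceptor))
}

func main() {
	// "ingester:9095" is a placeholder address for this sketch.
	conn, err := dialIngester("ingester:9095")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Every unary call made on conn will now log its round-trip time.
}
```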

@tomwilkie tomwilkie removed their assignment Jan 5, 2017
@tomwilkie
Contributor Author

Closing this out, moving the other ideas to #209
