Add extensible and cacheable statistics to instant and range queries #4630

Open
aughr opened this issue Jan 20, 2022 · 8 comments
Labels
keepalive Skipped by stale bot

@aughr
Contributor

aughr commented Jan 20, 2022

Is your feature request related to a problem? Please describe.
At AWS, we want to provide our customers and our operators visibility into the resource cost of queries. We believe there are two kinds of statistics:

  1. Semantic: statistics that semantically remain the same whether cached or not. For example, the number of series or samples that contributed to the result.
  2. Runtime: statistics about the specific query run. For example, the time the query actually took this run, or the number of samples that were (not) cached.
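The distinction above could be captured by keeping the two categories in separate structs. The following Python sketch is purely illustrative; the field names are hypothetical and are not the actual Prometheus or Cortex stats structs:

```python
from dataclasses import dataclass


@dataclass
class SemanticStats:
    # Identical whether the result came from a cache or a fresh evaluation.
    total_series: int = 0
    total_samples: int = 0


@dataclass
class RuntimeStats:
    # Specific to this particular run of the query.
    exec_time_seconds: float = 0.0
    samples_from_cache: int = 0


@dataclass
class QueryStats:
    semantic: SemanticStats
    runtime: RuntimeStats
```

Under this split, a result served partly from cache would report the same `semantic` values as an uncached run, while `runtime` values would differ between runs.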

Today, statistics are limited to timing. They're also only available for instant queries because of Cortex's range query caching system.

We specifically want to expose sample counts, series counts, and time for both range and instant queries, which we're working on upstream in Prometheus. However, we think it's entirely plausible that the community will find more useful statistics to expose, so we want to make that easy.

Describe the solution you'd like
Assuming our Prometheus work goes forward, the Prometheus engine will support recording sample and series stats, for either the query as a whole or by step in a range query. The statistics struct will be extensible.

Building on that, we propose integrating those statistics into Cortex, exposing them for both instant and range queries. We'd like to extend the extent cache to record stats by step so that partially cached queries can correctly report their semantic statistics. We'll also expose timing statistics for range queries. Finally, we'll make it as easy as possible to add a new kind of statistic in Cortex, ideally by identifying another useful stat as part of this work.
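To illustrate why recording stats by step matters for partially cached range queries, here is a minimal, hypothetical sketch: per-step sample counts from cached extents are combined with counts from freshly evaluated steps, and because the counts are semantic, the merged total matches what a fully uncached run would report. None of these function names come from Cortex.

```python
def merge_step_stats(cached, fresh):
    """Merge per-step sample counts.

    cached/fresh: dicts mapping step timestamp -> sample count for that step.
    On overlapping steps the fresh evaluation wins (the values should agree
    anyway, since semantic stats are cache-invariant).
    """
    merged = dict(cached)
    merged.update(fresh)
    return merged


def total_samples(step_stats):
    """Whole-query semantic total: the sum over all steps."""
    return sum(step_stats.values())
```

For example, merging cached steps `{0: 10, 60: 12}` with freshly evaluated steps `{60: 12, 120: 9}` yields three steps and a total of 31 samples, exactly as if the whole range had been evaluated from scratch.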

Describe alternatives you've considered
We've thought about implementing this purely in Prometheus, but the interaction between cached extents and semantic statistics means that we think we need to change Cortex too.

Additional context
prometheus/prometheus#10181 is the corresponding Prometheus issue.

A question for maintainers and the community: are there other statistics that would be useful for us to capture in Cortex specifically?

@aughr
Contributor Author

aughr commented Jan 20, 2022

@bboreham As promised, here's our Cortex-side issue for the stats work we're looking to do. Let me know what you think!

@aughr
Contributor Author

aughr commented Jan 21, 2022

I asked in CNCF's Slack about possible other stats to expose on a per-query basis. From @thejosephstevens:

  * `number_of_queriers`: how many queriers worked on the query
  * `redundant_block_reads`: not sure of the exact way to measure this, but basically a metric that can tell you whether queriers are doing redundant work; it could tell you that you should change the query split interval
  * ratio of cached to uncached results: would be great to get this at all cache levels (index, blocks, block metadata, frontend/results)
  * `processing_time_breakdown`: also not a convenient metric name, but basically how much time is spent fetching the index, how much fetching blocks, and how much on aggregations
  * `percentage_queries_failed`: of all the query tasks sent out, how many failed and had to be retried; might be useful to indicate that queriers should be resized

I'll try to grab at least a couple of these as a way to show the stats are extensible.
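As a rough picture of what "extensible" could mean here, a per-query stats registry where adding a new counter is a single call might look like the following. This is a hypothetical sketch, not Cortex code, and the stat names are just the suggestions quoted above:

```python
class QueryStatsRegistry:
    """Toy per-query stats collector: any named counter can be added
    without changing the registry itself."""

    def __init__(self):
        self._counters = {}

    def inc(self, name, value=1):
        # Creating a new statistic is just incrementing a new name.
        self._counters[name] = self._counters.get(name, 0) + value

    def as_dict(self):
        return dict(self._counters)


stats = QueryStatsRegistry()
stats.inc("number_of_queriers", 4)
stats.inc("redundant_block_reads")
stats.inc("redundant_block_reads")
```

In a real implementation the set of statistics would more likely be typed struct fields (as in the Prometheus engine work referenced above) rather than a string-keyed map, but the goal is the same: adding a new stat should not require touching unrelated code.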

@alanprot alanprot mentioned this issue Apr 6, 2022
@stale

stale bot commented Apr 24, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 24, 2022
@friedrich-at-adobe
Contributor

Still interesting.

@stale stale bot removed the stale label Jun 2, 2022
@stale

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 21, 2022
@alvinlin123 alvinlin123 added the keepalive Skipped by stale bot label Sep 21, 2022
@stale stale bot removed the stale label Sep 21, 2022
@jeromeinsf jeromeinsf added this to the Release 1.15 milestone Dec 7, 2022
@jeromeinsf
Contributor

@alanprot what's left to do here?

@alanprot
Member

Hi,

I think these stats are already extensible.

Do you have any statistic you still want to add @friedrich-at-adobe?

@friedrich-at-adobe
Contributor

I think we can close this one @alanprot. I don't have any specific statistic in mind.
#5198 was also recently merged.

A more specific issue could be created later if there is a need.
