Limit per-user metric cardinality #47
There's a question here of whether we want to limit the total number of series per user or per ingester. Maybe both. From a system safety perspective, we'll want to limit the number of series per ingester. But that is hard to communicate to the user. They will want to know how many series they can create overall, because they don't understand the uneven distribution of their series over ingesters (similarly to how we stumbled over Dynamo table throughput issues with uneven table shards). Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high. I'm not sure how we would track the total number of series for a user as another limit. The distributor would either have to get stats from the ingesters on every append (probably infeasible) or the distributor would have to track metric cardinality itself. Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog. |
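For illustration, a minimal sketch of the "per-user limit on each ingester" option, using entirely hypothetical names (userSeriesLimiter, maxSeriesPerUser) rather than actual Cortex code: the ingester already knows how many series each user has locally, so the check is just a counter compared against a cap whenever an append would create a new series.

```go
package main

import (
	"fmt"
	"sync"
)

// userSeriesLimiter is an illustrative per-ingester limiter: it tracks how many
// series each user currently has on this ingester and rejects the creation of
// new series once a configurable cap is reached.
type userSeriesLimiter struct {
	mtx              sync.Mutex
	seriesPerUser    map[string]int
	maxSeriesPerUser int
}

func newUserSeriesLimiter(max int) *userSeriesLimiter {
	return &userSeriesLimiter{seriesPerUser: map[string]int{}, maxSeriesPerUser: max}
}

// onNewSeries is called when an append would create a series that doesn't exist
// yet; it returns an error if the user is already at the limit.
func (l *userSeriesLimiter) onNewSeries(userID string) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.seriesPerUser[userID] >= l.maxSeriesPerUser {
		return fmt.Errorf("per-user series limit (%d) exceeded", l.maxSeriesPerUser)
	}
	l.seriesPerUser[userID]++
	return nil
}

// onSeriesClosed is called when an inactive series is purged, giving the user
// head-room back under the limit.
func (l *userSeriesLimiter) onSeriesClosed(userID string) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.seriesPerUser[userID] > 0 {
		l.seriesPerUser[userID]--
	}
}

func main() {
	l := newUserSeriesLimiter(2)
	fmt.Println(l.onNewSeries("user-1")) // <nil>
	fmt.Println(l.onNewSeries("user-1")) // <nil>
	fmt.Println(l.onNewSeries("user-1")) // per-user series limit (2) exceeded
}
```

A distributor-side total (for example via HyperLogLog) would sit on top of a safety check like this, not replace it.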
Yes, I think this is the best place to start. We can always offer users this limit as a lower bound.
That's an interesting idea; it would need some kind of moving average as well, as cardinality over time can be virtually unlimited - we close inactive series after 1h. |
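To make the "moving" part concrete, here is a small hypothetical sketch (not Cortex code) of counting only currently active series: each series' last-seen timestamp is recorded, and anything idle for longer than the inactivity window (1h in the comment above) stops counting against the limit.

```go
package main

import (
	"fmt"
	"time"
)

// activeSeriesTracker is an illustrative tracker of currently-active series:
// a series counts against a limit only while it keeps receiving samples.
type activeSeriesTracker struct {
	lastSeen map[string]time.Time // series fingerprint -> last sample time
	maxAge   time.Duration        // e.g. 1h, after which a series is considered closed
}

func newActiveSeriesTracker(maxAge time.Duration) *activeSeriesTracker {
	return &activeSeriesTracker{lastSeen: map[string]time.Time{}, maxAge: maxAge}
}

// observe records a sample for the given series fingerprint.
func (t *activeSeriesTracker) observe(fp string, now time.Time) {
	t.lastSeen[fp] = now
}

// purge removes series that have been inactive for longer than maxAge and
// returns how many series remain active.
func (t *activeSeriesTracker) purge(now time.Time) int {
	for fp, ts := range t.lastSeen {
		if now.Sub(ts) > t.maxAge {
			delete(t.lastSeen, fp)
		}
	}
	return len(t.lastSeen)
}

func main() {
	t := newActiveSeriesTracker(time.Hour)
	now := time.Now()
	t.observe("series-a", now.Add(-2*time.Hour)) // stale, will be purged
	t.observe("series-b", now)                   // still active
	fmt.Println(t.purge(now)) // 1
}
```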
What are we proposing to do when the limit is exceeded? Throw away the data cortex has received? How will a user know that (and why) this is happening? |
@rade Throw away the samples and return an error to the user. For the rate-limiting case, we do the same and return an error status. At first, this would require a user to meta-monitor their Prometheus scraper for failed remote writes, but eventually we could notify users automatically when we see that they are being continuously denied. |
We need to make it clear that it's the product of the label cardinalities for a given metric that's problematic for us. If we detect such a metric, we should blacklist it, but not drop the entire batch. |
Getting something to show up in the Weave Cloud cortex UI would be nice. And, going totally meta... feed a metric into the instance's cortex, which the user can set an alert on :) |
This is all post MVP, obviously. Let's be safe before being nice. |
We could still store the other samples, but should still return an HTTP error, because a success status would hide that data was dropped. Yeah, in the future we can have nice UI features and meta-metrics for this. |
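A sketch of that behaviour with made-up types and names (sample, filterBlacklisted), not the actual distributor code: samples for an over-limit metric are dropped, the rest are kept for storage, and an error naming the dropped metrics is still returned so the write gets a non-success status.

```go
package main

import "fmt"

type sample struct {
	metricName string
	value      float64
}

// filterBlacklisted splits a batch into samples to store and an error listing
// the metrics whose samples were dropped. Storing what we can while still
// returning an error keeps the good data, but signals to the sender that
// something was thrown away.
func filterBlacklisted(batch []sample, blacklisted map[string]bool) ([]sample, error) {
	kept := make([]sample, 0, len(batch))
	dropped := map[string]int{}
	for _, s := range batch {
		if blacklisted[s.metricName] {
			dropped[s.metricName]++
			continue
		}
		kept = append(kept, s)
	}
	if len(dropped) > 0 {
		return kept, fmt.Errorf("dropped samples for over-limit metrics: %v", dropped)
	}
	return kept, nil
}

func main() {
	batch := []sample{
		{"http_requests_total", 1},
		{"high_cardinality_metric", 2},
		{"http_requests_total", 3},
	}
	kept, err := filterBlacklisted(batch, map[string]bool{"high_cardinality_metric": true})
	fmt.Println(len(kept), err) // 2 dropped samples for over-limit metrics: map[high_cardinality_metric:1]
}
```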
We can say what did/didn't happen in the response body. That won't be easy for the user to track down (would the sending prom even log it?), but it's better than nothing. |
The Prometheus server doesn't look at the response body at all, but yeah, theoretically a user could inspect it manually. On a technical level, reporting these details back would require changing the gRPC response from the ingesters to include this information, and the distributor to then merge it and send the failed series back to the user. Since I don't believe it'll ever even be seen by anyone, I doubt the value of that. |
That seems wrong. Surely it should log any errors it gets back.
What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.
Fair enough. Not part of the MVP then. As I said, let's be safe before being nice. |
It logs the remote write send failure based on the HTTP status code, but does not inspect the response body, or expect anything to be in it.
True. Though if they hit this situation, they will be more interested in finding out which metric of theirs is currently causing the blowup.
Yup. |
That's what I meant by "wrong" :) |
Well, it's not part of our generic write protocol to return anything in the response body... but anyways :) |
#273 implemented a total series limit per user and ingester. Tom suggested also limiting the per-metric cardinality, which I'm looking at next. |
@tomwilkie for checking the current number of series for a metric, the index has a nested map map[model.LabelName]map[model.LabelValue][]model.Fingerprint (https://github.com/weaveworks/cortex/blob/master/ingester/index.go#L13), but I think it'd be unwise to iterate through it and add up the number of fingerprints at the leaves for every sample we ingest. So I propose adding another seriesPerMetric map[model.LabelName]int map to the index that just tracks how many series there currently are for which metric. Sounds good? |
+1
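For concreteness, a rough, simplified sketch of the seriesPerMetric idea (hypothetical code; the real index lives in ingester/index.go and uses the model.* types): the index keeps a flat per-metric-name counter that is bumped when a series is added and decremented when one is purged, so the per-metric limit check stays O(1) per ingested sample instead of walking the nested fingerprint maps.

```go
package main

import "fmt"

// metricIndex is an illustrative stand-in for the ingester index. Alongside
// whatever per-label structures it already keeps, it maintains a flat count of
// series per metric name for cheap limit checks.
type metricIndex struct {
	seriesPerMetric    map[string]int // metric name -> number of live series
	maxSeriesPerMetric int
}

func newMetricIndex(max int) *metricIndex {
	return &metricIndex{seriesPerMetric: map[string]int{}, maxSeriesPerMetric: max}
}

// addSeries registers a new series for the given metric name, enforcing the
// per-metric limit.
func (i *metricIndex) addSeries(metricName string) error {
	if i.seriesPerMetric[metricName] >= i.maxSeriesPerMetric {
		return fmt.Errorf("per-metric series limit (%d) exceeded for %q",
			i.maxSeriesPerMetric, metricName)
	}
	i.seriesPerMetric[metricName]++
	return nil
}

// removeSeries is called when a series for the metric is purged.
func (i *metricIndex) removeSeries(metricName string) {
	if i.seriesPerMetric[metricName] > 0 {
		i.seriesPerMetric[metricName]--
	}
}

func main() {
	idx := newMetricIndex(1)
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // per-metric series limit (1) exceeded
}
```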
* Limit series per metric (fixes #47)
* Fix error wrapping for series-per-metric exceeded
* Move num-metrics-per-series handling to userState
* Review feedback
Even just by accident, it's really easy for a user to overload or hotspot an ingester right now, by either sending too many series with the same metric name, or just too many time series in general (especially when accidentally misusing a label with unbounded value cardinality).
We should have some way of limiting the user's number of series.
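For context, the kind of accident described above usually looks like instrumenting with a label whose values are effectively unbounded. A hypothetical example using the Prometheus Go client, where a request_id label creates a brand-new series for every request:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Each distinct request_id value creates a new series, so the series
	// count for this single metric grows without bound - and, as the issue
	// notes, many series sharing one metric name can hotspot an ingester.
	requests := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_requests_total",
			Help: "Total API requests.",
		},
		[]string{"path", "request_id"}, // request_id is the mistake
	)
	prometheus.MustRegister(requests)

	for i := 0; i < 3; i++ {
		requests.WithLabelValues("/api/v1/query", fmt.Sprintf("req-%d", i)).Inc()
	}
}
```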