Limit per-user metric cardinality #47
There's a question here of whether we want to limit the total number of series per user or per ingester. Maybe both. From a system safety perspective, we'll want to limit the number of series per ingester. But that is hard to communicate to the user. They will want to know how many series they can create overall, because they don't understand the uneven distribution of their series over ingesters (similarly to how we stumbled over Dynamo table throughput issues with uneven table shards). Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high. I'm not sure how we would track the total number of series for a user as another limit. The distributor would either have to get stats from the ingesters on every append (probably infeasible) or the distributor would have to track metric cardinality itself. Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog. |
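For illustration, a minimal sketch of the "per-user limit on each ingester" option, using entirely hypothetical names (userSeriesLimiter, maxSeriesPerUser) rather than actual Cortex code: the ingester already knows how many series each user has locally, so the check is just a counter compared against a cap whenever an append would create a new series.

```go
package main

import (
	"fmt"
	"sync"
)

// userSeriesLimiter is an illustrative per-ingester limiter: it tracks how many
// series each user currently has on this ingester and rejects the creation of
// new series once a configurable cap is reached.
type userSeriesLimiter struct {
	mtx              sync.Mutex
	seriesPerUser    map[string]int
	maxSeriesPerUser int
}

func newUserSeriesLimiter(max int) *userSeriesLimiter {
	return &userSeriesLimiter{seriesPerUser: map[string]int{}, maxSeriesPerUser: max}
}

// onNewSeries is called when an append would create a series that doesn't exist
// yet; it returns an error if the user is already at the limit.
func (l *userSeriesLimiter) onNewSeries(userID string) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.seriesPerUser[userID] >= l.maxSeriesPerUser {
		return fmt.Errorf("per-user series limit (%d) exceeded", l.maxSeriesPerUser)
	}
	l.seriesPerUser[userID]++
	return nil
}

// onSeriesClosed is called when an inactive series is purged, giving the user
// head-room back under the limit.
func (l *userSeriesLimiter) onSeriesClosed(userID string) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.seriesPerUser[userID] > 0 {
		l.seriesPerUser[userID]--
	}
}

func main() {
	l := newUserSeriesLimiter(2)
	fmt.Println(l.onNewSeries("user-1")) // <nil>
	fmt.Println(l.onNewSeries("user-1")) // <nil>
	fmt.Println(l.onNewSeries("user-1")) // per-user series limit (2) exceeded
}
```

A distributor-side total (for example via HyperLogLog) would sit on top of a safety check like this, not replace it.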
Yes, I think this is the best place to start. We can always offer users this limit as a lower bound.
That's an interesting idea; it would need some kind of moving average as well, as cardinality over time can be virtually unlimited - we close inactive series after 1h. |
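To make the "moving" part concrete, here is a small hypothetical sketch (not Cortex code) of counting only currently active series: each series' last-seen timestamp is recorded, and anything idle for longer than the inactivity window (1h in the comment above) stops counting against the limit.

```go
package main

import (
	"fmt"
	"time"
)

// activeSeriesTracker is an illustrative tracker of currently-active series:
// a series counts against a limit only while it keeps receiving samples.
type activeSeriesTracker struct {
	lastSeen map[string]time.Time // series fingerprint -> last sample time
	maxAge   time.Duration        // e.g. 1h, after which a series is considered closed
}

func newActiveSeriesTracker(maxAge time.Duration) *activeSeriesTracker {
	return &activeSeriesTracker{lastSeen: map[string]time.Time{}, maxAge: maxAge}
}

// observe records a sample for the given series fingerprint.
func (t *activeSeriesTracker) observe(fp string, now time.Time) {
	t.lastSeen[fp] = now
}

// purge removes series that have been inactive for longer than maxAge and
// returns how many series remain active.
func (t *activeSeriesTracker) purge(now time.Time) int {
	for fp, ts := range t.lastSeen {
		if now.Sub(ts) > t.maxAge {
			delete(t.lastSeen, fp)
		}
	}
	return len(t.lastSeen)
}

func main() {
	t := newActiveSeriesTracker(time.Hour)
	now := time.Now()
	t.observe("series-a", now.Add(-2*time.Hour)) // stale, will be purged
	t.observe("series-b", now)                   // still active
	fmt.Println(t.purge(now)) // 1
}
```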
What are we proposing to do when the limit is exceeded? Throw away the data cortex has received? How will a user know that (and why) this is happening? |
@rade Throw away the samples and return an error to the user. For the rate-limiting case, we do the same and return an error status. At first, this would require a user to meta-monitor their Prometheus scraper for failed remote writes, but eventually we could notify users automatically when we see that they are being continuously denied. |
We need to make it clear that it's the product of the label cardinalities for a given metric that's problematic for us. If we detect such a metric, we should blacklist it, but not drop the entire batch. |
Getting something to show up in the Weave Cloud cortex UI would be nice. And, going totally meta... feed a metric into the instance's cortex, which the user can set an alert on :) |
This is all post MVP, obviously. Let's be safe before being nice. |
We could still store the other samples, but should still return an HTTP error, because a success status would hide that data was dropped. Yeah, in the future we can have nice UI features and meta-metrics for this. |
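A sketch of that behaviour with made-up types and names (sample, filterBlacklisted), not the actual distributor code: samples for an over-limit metric are dropped, the rest are kept for storage, and an error naming the dropped metrics is still returned so the write gets a non-success status.

```go
package main

import "fmt"

type sample struct {
	metricName string
	value      float64
}

// filterBlacklisted splits a batch into samples to store and an error listing
// the metrics whose samples were dropped. Storing what we can while still
// returning an error keeps the good data, but signals to the sender that
// something was thrown away.
func filterBlacklisted(batch []sample, blacklisted map[string]bool) ([]sample, error) {
	kept := make([]sample, 0, len(batch))
	dropped := map[string]int{}
	for _, s := range batch {
		if blacklisted[s.metricName] {
			dropped[s.metricName]++
			continue
		}
		kept = append(kept, s)
	}
	if len(dropped) > 0 {
		return kept, fmt.Errorf("dropped samples for over-limit metrics: %v", dropped)
	}
	return kept, nil
}

func main() {
	batch := []sample{
		{"http_requests_total", 1},
		{"high_cardinality_metric", 2},
		{"http_requests_total", 3},
	}
	kept, err := filterBlacklisted(batch, map[string]bool{"high_cardinality_metric": true})
	fmt.Println(len(kept), err) // 2 dropped samples for over-limit metrics: map[high_cardinality_metric:1]
}
```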
We can say what did/didn't happen in the response body. That won't be easy for the user to track down (would the sending prom even log it?), but it's better than nothing. |
The Prometheus server doesn't look at the response body at all, but yeah, theoretically a user could inspect it manually. On a technical level, reporting these details back would require changing the gRPC response from the ingesters to include this information, and the distributor to then merge it and send the failed series back to the user. Since I don't believe it'll ever even be seen by anyone, I doubt the value of that. |
That seems wrong. Surely it should log any errors it gets back.
What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.
Fair enough. Not part of the MVP then. As I said, let's be safe before being nice. |
It logs the remote write send failure based on the HTTP status code, but does not inspect the response body, or expect anything to be in it.
True. Though if they hit this situation, they will be more interested in finding out which metric of theirs is currently causing the blowup.
Yup. |
That's what I meant by "wrong" :) |
Well, it's not part of our generic write protocol to return anything in the response body... but anyways :) |
#273 implemented a total series limit per user and ingester. Tom suggested also limiting the per-metric cardinality, which I'm looking at next. |
@tomwilkie for checking the current number of series for a metric, the index has a nested map map[model.LabelName]map[model.LabelValue][]model.Fingerprint (https://github.com/weaveworks/cortex/blob/master/ingester/index.go#L13), but I think it'd be unwise to iterate through it and add up the number of fingerprints at the leaves for every sample we ingest. So I propose adding another seriesPerMetric map[model.LabelName]int map to the index that just tracks how many series there currently are for which metric. Sounds good? |
+1
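For concreteness, a rough, simplified sketch of the seriesPerMetric idea (hypothetical code; the real index lives in ingester/index.go and uses the model.* types): the index keeps a flat per-metric-name counter that is bumped when a series is added and decremented when one is purged, so the per-metric limit check stays O(1) per ingested sample instead of walking the nested fingerprint maps.

```go
package main

import "fmt"

// metricIndex is an illustrative stand-in for the ingester index. Alongside
// whatever per-label structures it already keeps, it maintains a flat count of
// series per metric name for cheap limit checks.
type metricIndex struct {
	seriesPerMetric    map[string]int // metric name -> number of live series
	maxSeriesPerMetric int
}

func newMetricIndex(max int) *metricIndex {
	return &metricIndex{seriesPerMetric: map[string]int{}, maxSeriesPerMetric: max}
}

// addSeries registers a new series for the given metric name, enforcing the
// per-metric limit.
func (i *metricIndex) addSeries(metricName string) error {
	if i.seriesPerMetric[metricName] >= i.maxSeriesPerMetric {
		return fmt.Errorf("per-metric series limit (%d) exceeded for %q",
			i.maxSeriesPerMetric, metricName)
	}
	i.seriesPerMetric[metricName]++
	return nil
}

// removeSeries is called when a series for the metric is purged.
func (i *metricIndex) removeSeries(metricName string) {
	if i.seriesPerMetric[metricName] > 0 {
		i.seriesPerMetric[metricName]--
	}
}

func main() {
	idx := newMetricIndex(1)
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // per-metric series limit (1) exceeded
}
```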
* Limit series per metric (fixes #47)
* Fix error wrapping for series-per-metric exceeded
* Move num-metrics-per-series handling to userState
* Review feedback
Even just by accident, it's really easy for a user to overload or hotspot an ingester right now, by either sending too many series with the same metric name, or just too many time series in general (especially when accidentally misusing a label with unbounded value cardinality).
We should have some way of limiting the user's number of series.
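For context, the kind of accident described above usually looks like instrumenting with a label whose values are effectively unbounded. A hypothetical example using the Prometheus Go client, where a request_id label creates a brand-new series for every request:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Each distinct request_id value creates a new series, so the series
	// count for this single metric grows without bound - and, as the issue
	// notes, many series sharing one metric name can hotspot an ingester.
	requests := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_requests_total",
			Help: "Total API requests.",
		},
		[]string{"path", "request_id"}, // request_id is the mistake
	)
	prometheus.MustRegister(requests)

	for i := 0; i < 3; i++ {
		requests.WithLabelValues("/api/v1/query", fmt.Sprintf("req-%d", i)).Inc()
	}
}
```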