
Ruler should be protected against high-cardinality output #1396


Open

bboreham opened this issue May 17, 2019 · 6 comments

Comments

@bboreham
Contributor

bboreham commented May 17, 2019

Suppose someone creates a rule, either recording rule or alert, that generates 100,000 output series.

Currently, all series will be sent in one request, which will hit the distributor rate-limit (defaults to 50,000 burst size) and be dropped.
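For context on why the whole request is dropped rather than throttled: with a token-bucket limiter, a single request larger than the burst size can never be admitted, no matter how long the client waits. A minimal illustration using golang.org/x/time/rate (the actual distributor limiter differs in detail, so treat this as a sketch):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Roughly the default ingestion limit described above:
	// 50,000 samples/sec with a burst of 50,000.
	limiter := rate.NewLimiter(50000, 50000)

	// A single push carrying 100,000 samples exceeds the burst,
	// so AllowN fails regardless of the current token count.
	fmt.Println(limiter.AllowN(time.Now(), 100000)) // false

	// A push at or under the burst size can be admitted.
	fmt.Println(limiter.AllowN(time.Now(), 50000)) // true
}
```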

This creates problems:

  • ruler consumes a lot of CPU and memory
  • it's hard for the operator to figure out what happened. The log line (caller=manager.go:539 msg="rule sample appending failed" err="rpc error: code = Code(429) desc = ingestion rate limit (50000) exceeded while adding 100000 samples") doesn't include the tenant ID
  • it's completely invisible to the tenant's user

I'm thinking the ruler should cap the size of its output and generate some signal (a synthetic series, perhaps?) that can be used to tell when the cap was hit.
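A rough sketch of what such a cap could look like, wrapping the `rules.QueryFunc` the ruler already evaluates through. The metric name `cortex_ruler_output_truncated` and the limit plumbing are illustrative assumptions, not an agreed design; types match the promql API of this era (newer Prometheus moved `pkg/labels` to `model/labels` and reshaped `promql.Sample`):

```go
package ruler

import (
	"context"
	"time"

	"github.com/prometheus/prometheus/pkg/labels" // model/labels in newer Prometheus
	"github.com/prometheus/prometheus/promql"
	"github.com/prometheus/prometheus/rules"
)

// cappedQueryFunc truncates any rule output larger than limit and
// appends a synthetic marker series, so the truncation is visible to
// the tenant rather than only in the ruler's logs.
func cappedQueryFunc(inner rules.QueryFunc, limit int) rules.QueryFunc {
	return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) {
		v, err := inner(ctx, qs, t)
		if err != nil || len(v) <= limit {
			return v, err
		}
		v = v[:limit]
		// Hypothetical marker series; name and shape are illustrative only.
		v = append(v, promql.Sample{
			Metric: labels.FromStrings(labels.MetricName, "cortex_ruler_output_truncated"),
			Point:  promql.Point{T: t.UnixMilli(), V: 1},
		})
		return v, nil
	}
}
```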

If we want to handle outputs from rules in the hundreds of thousands, we should batch them up so they don't choke the distributor.
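The batching itself is mechanically simple; something along these lines (the sample type is a placeholder, and the real write path would also need per-batch error handling and retries):

```go
package ruler

// splitBatches chunks one rule evaluation's output so that no single
// push to the distributor exceeds maxBatch samples.
func splitBatches[S any](samples []S, maxBatch int) [][]S {
	var batches [][]S
	for len(samples) > maxBatch {
		batches = append(batches, samples[:maxBatch])
		samples = samples[maxBatch:]
	}
	if len(samples) > 0 {
		batches = append(batches, samples)
	}
	return batches
}
```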

The channel to alertmanager is also limited: caller=notifier.go:371 msg="Alert batch larger than queue capacity, dropping alerts" num_dropped=30973

@jtlisi
Contributor

jtlisi commented May 24, 2019

I agree! The implementation is going to be a bit tricky, I believe. I think we are going to have to write our own rule manager instead of using the upstream Prometheus one.


@bboreham
Contributor Author

Perhaps this could be implemented via engineQueryFunc.
Limits could be integrated with the existing limit/overrides system.
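One possible shape for that, reusing the `cappedQueryFunc` sketch from the issue description above; the `Overrides` interface and its method name are hypothetical stand-ins for Cortex's per-tenant limits plumbing:

```go
package ruler

import (
	"github.com/prometheus/prometheus/promql"
	"github.com/prometheus/prometheus/rules"
	"github.com/prometheus/prometheus/storage"
)

// Overrides is a hypothetical stand-in for the existing per-tenant
// limits/overrides lookup.
type Overrides interface {
	RulerMaxOutputSeries(userID string) int
}

// tenantQueryFunc builds the per-tenant QueryFunc handed to the rule
// manager: the standard engine evaluation, wrapped with that tenant's
// output cap (cappedQueryFunc is sketched earlier in this thread).
func tenantQueryFunc(engine *promql.Engine, q storage.Queryable, o Overrides, userID string) rules.QueryFunc {
	inner := rules.EngineQueryFunc(engine, q)
	return cappedQueryFunc(inner, o.RulerMaxOutputSeries(userID))
}
```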

@bboreham
Contributor Author

Possibly addressed by prometheus/prometheus#9260

@jeromeinsf
Contributor

@krishnateja325 is this something you are looking at?

@krishnateja325
Contributor

Yes, I pulled in prometheus/prometheus#9260 and added support for the limit field in this PR: #5528
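For reference, that upstream change adds a per-group `limit` field to the Prometheus rule file format; when a rule produces more series (or alerts) than the limit, the evaluation is marked failed instead of pushing an oversized batch. Roughly:

```yaml
groups:
  - name: example
    limit: 1000   # 0 (the default) means no limit
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```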

@jeromeinsf
Contributor

/assign @krishnateja325
