Skip to content

Split alertmanager & ruler configs #619

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jml opened this issue Dec 6, 2017 · 3 comments
Closed

Split alertmanager & ruler configs #619

jml opened this issue Dec 6, 2017 · 3 comments
Labels
component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. stale

Comments

@jml
Copy link
Contributor

jml commented Dec 6, 2017

Currently there's one table, configs, keyed by (id, orgID), where id is a version number. The value is a JSON blob which has both the alertmanager config and the rules. Further, although there are separate endpoints, the implementations behind these endpoints are identical.

Instead, the config systems for alertmanager and ruler should be fully disjoint.

This means we should have:

  • one table with rules
  • one table with alertmanager configs
  • endpoints for alertmanager (that have nothing to do with ruler) for
    • getting a config
    • setting a config
    • getting all configs
    • getting all configs since a version
  • likewise endpoints for ruler configs

We should also separate the client libraries, embedding them in their respective call-sites (move ruler config client to ruler, alertmanager config client to alertmanager).

This will need to be multiple PRs in order to handle backwards compatibility between UI <-> configs <-> DB.

At a guess:

  1. New endpoints
    • create new endpoints with desired behaviour
    • update clients in cortex to use & rely on new behaviour
    • still write to existing conflated DB structure
    • old endpoints still work as intended
  2. Update Weave Cloud UI to use new endpoints
  3. Remove old endpoints
  4. New tables
    • create new tables
    • write to new tables
    • read from new tables and fall back to old
  5. Migrate old data
    • wait until data only being written to new tables
    • migrate old data to new tables
  6. Remove old tables
    • remove old tables
    • remove fallback code

LMK if that needs refining (can we do it in fewer steps? would even these steps have backwards compat problems?).

I'd kind of like to kill off the configs service completely by moving the endpoints to alertmanager and ruler respectively, and having both of those services connect directly to the database. The most convenient place to make that decision is in PR 1 above, since we'll have to update UI endpoints. It will also require more PRs for juggling flags to our deployed versions.

Other stuff that we might want to consider:

  • use gRPC rather than REST
    • not a massive advantage, as we would still have to re-parse configs at client side, because the configs will still be strings
    • since we'll want new endpoints, might as well have new endpoints with better protocol
    • probably don't want to use gRPC if we're moving config endpoints to ruler & alertmanager, because we'll need to talk to them from JS anyway
  • could theoretically split DBs, or even move to different DB backend (e.g. DynamoDB)
    • not under this GH issue
@jml jml self-assigned this Dec 6, 2017
@jml jml mentioned this issue Dec 6, 2017
@bboreham
Copy link
Contributor

What is the expected gain from this change?

@jml
Copy link
Contributor Author

jml commented Dec 11, 2017

Not a great deal (which is why we've not done it yet).

Gains are:

  1. conceptual simplicity
  2. save a roundtrip when setting configurations (less room for error)
  3. getting rid of an unnecessary service (configs)

Expanding on 2. Currently, all the configurations for an organization share a row. If you want to set the alertmanager config without affecting the recording/alerting rules config, you have to:

  • fetch all the configs for an organization
  • make your changes to the alertmanager config
  • submit both the new alertmanager config and the old rules config

All our clients now do this, so you could argue that it's a solved problem. I feel it's a wart, that it's hard for newcomers to the code-base to understand, and that it's likely to break with some future change (e.g. to the UI).

With this change, the way to change an alertmanager config will be:

  • submit the new alertmanager config

Which is simpler.


Writing the above, I recalled #330, which demonstrates a clear user-facing benefit: less likelihood of losing your edits to a concurrent editor.

Currently, #620 does not add CAS apis. Perhaps it should.

@stale
Copy link

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. stale
Projects
None yet
Development

No branches or pull requests

3 participants