Update Ruler to use upstream Prom Rule Manager #1571

Merged
merged 11 commits into cortexproject:master from 20190806_prommanager_ruler on Dec 5, 2019

Conversation

@jtlisi (Contributor) commented Aug 9, 2019

This PR is a refactor of #1532 to utilize the Prometheus Rule Manager to schedule and evaluate rule groups.

Overview

  • Utilize the upstream Prometheus rule manager to schedule and evaluate rule groups
  • Sync and map rules from an external service (configdb) to temporary files that are synced with each user's Prometheus rule manager
  • Utilize afero to map rule files, allowing mocked tests with in-memory filesystems and potentially an in-memory filesystem in production if upstream changes can be made
  • Utilize a protobuf interchange format for rules, providing a versionable format for storing and communicating rules

Fixes #477
Fixes #493
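The afero point in the overview can be illustrated with a minimal sketch. This is not the PR's code: `RuleFS` and `MemFS` are hypothetical stand-ins for the `afero.Fs` abstraction and `afero.NewMemMapFs`, showing how writing rule files through an interface lets tests run entirely against an in-memory filesystem.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// RuleFS is a hypothetical stand-in for the afero.Fs abstraction:
// rule files are written through an interface so tests can swap in
// an in-memory implementation.
type RuleFS interface {
	WriteFile(path string, data []byte) error
	ReadFile(path string) ([]byte, error)
	List() []string
}

// MemFS is an in-memory RuleFS, analogous in spirit to afero's MemMapFs.
type MemFS struct {
	mu    sync.RWMutex
	files map[string][]byte
}

func NewMemFS() *MemFS { return &MemFS{files: map[string][]byte{}} }

func (m *MemFS) WriteFile(path string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[path] = append([]byte(nil), data...) // copy to avoid aliasing
	return nil
}

func (m *MemFS) ReadFile(path string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	data, ok := m.files[path]
	if !ok {
		return nil, fmt.Errorf("no such file: %s", path)
	}
	return data, nil
}

func (m *MemFS) List() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	paths := make([]string, 0, len(m.files))
	for p := range m.files {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths
}

func main() {
	fs := NewMemFS()
	fs.WriteFile("/rules/user1/group1.yaml", []byte("groups: []"))
	fmt.Println(fs.List()) // prints [/rules/user1/group1.yaml]
}
```

In production the same code path could be backed by a real on-disk implementation, which is the flexibility the afero bullet is after.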

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from cd4db11 to 6a74045 Compare August 19, 2019 13:37
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 6 times, most recently from 470a180 to ebf8caf Compare September 6, 2019 14:25
@jtlisi jtlisi closed this Sep 6, 2019
@jtlisi jtlisi reopened this Sep 6, 2019
@jtlisi jtlisi marked this pull request as ready for review September 6, 2019 20:45
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 4 times, most recently from bf31fbd to d65c96d Compare September 10, 2019 10:47
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 2 times, most recently from 6b87cd6 to d3fec66 Compare September 19, 2019 20:56
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 5fc8a74 to e7fd239 Compare September 27, 2019 15:59
@@ -163,3 +119,52 @@ func (c ConfigsResponse) GetLatestConfigID() configs.ID {
}
return latest
}

// ListAllRuleGroups polls the configdb server and returns the updated rule groups
func (c *ConfigDBClient) ListAllRuleGroups(ctx context.Context) (map[string]rules.RuleGroupList, error) {
Contributor:
Did we drop the since functionality for a reason? Any concern for extremely large rule sets that this will return very large amounts of data. Otherwise I like removing it for simplicity.

Contributor:

Should this method continue returning map[string]configs.VersionedRulesConfig and the translation to the new type occur in pkg/ruler/rules?

Contributor:

I would say since was for large numbers of tenants, regardless of size of rule set. And I am a bit nervous about dropping it.

Contributor Author:

I do understand the concern. The main reason since was dropped is that the entire ruleset must be hashed when generating mapped files; otherwise changes to the ring will not be reflected in the scheduled evaluations. However, polling such a large payload is not ideal. One option would be for the config client to cache the previous response and use the since variable internally. That way it can keep an up-to-date set of the active rules configs and only poll for changes.
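The caching idea in that comment can be sketched as follows. All names here (`cachingStore`, `VersionedRules`, `pollFn`) are hypothetical simplifications, not the PR's types; the sketch only illustrates keeping a complete in-memory view while polling the config service incrementally with an internal since.

```go
package main

import "fmt"

// VersionedRules is a simplified stand-in for a versioned rules config,
// with a monotonically increasing ID and a tombstone flag.
type VersionedRules struct {
	ID      int
	Deleted bool
}

// pollFn fetches only configs newer than `since`, mirroring the config
// service's incremental polling.
type pollFn func(since int) map[string]VersionedRules

// cachingStore keeps the full rule set in memory and only polls for
// changes, so callers always see complete state while the wire traffic
// stays incremental.
type cachingStore struct {
	poll  pollFn
	since int
	all   map[string]VersionedRules
}

func newCachingStore(p pollFn) *cachingStore {
	return &cachingStore{poll: p, all: map[string]VersionedRules{}}
}

// ListAll merges the latest delta into the cached full view.
func (c *cachingStore) ListAll() map[string]VersionedRules {
	for user, cfg := range c.poll(c.since) {
		if cfg.ID > c.since {
			c.since = cfg.ID // advance the internal since watermark
		}
		if cfg.Deleted {
			delete(c.all, user)
			continue
		}
		c.all[user] = cfg
	}
	return c.all
}

func main() {
	backend := map[string]VersionedRules{
		"user1": {ID: 1}, "user2": {ID: 2},
	}
	poll := func(since int) map[string]VersionedRules {
		out := map[string]VersionedRules{}
		for u, c := range backend {
			if c.ID > since {
				out[u] = c
			}
		}
		return out
	}
	s := newCachingStore(poll)
	fmt.Println(len(s.ListAll())) // 2: full view on first poll
	backend["user2"] = VersionedRules{ID: 3, Deleted: true}
	fmt.Println(len(s.ListAll())) // 1: delta removed the deleted config
}
```

The key property is that every call still returns the full ruleset (so each ruler can hash everything), while only changed configs cross the wire.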

Contributor:

Per discussions I'm re-adding since support now.


// decomposeGroupSlug breaks the group slug from Parse
// into its group name and file name
func decomposeGroupSlug(slug string) (string, string) {
Contributor:

unreferenced?

Contributor Author:

Good catch, I'll remove this.


for user, cfg := range configs {
userRules := rules.RuleGroupList{}
if cfg.IsDeleted() {
Contributor:

Is this like a diff of rules you're parsing and rebuilding the final state?

Contributor Author:

Yes, that is essentially what is going on.

}

func (a *appendableAppender) Appender() (storage.Appender, error) {
return a, nil
}

func (a *appendableAppender) Add(l labels.Labels, t int64, v float64) (uint64, error) {
a.Lock()
Contributor:

Why are we locking here? None of the Appender implementations in prometheus lock suggesting these functions are not reentrant.

Contributor Author:

A lock is required here because we pool the samples for a user to the same appendable.

Contributor:

I think previously in Cortex we created a separate appendableAppender for each group, which was run on a single goroutine. So there must also be some change that means we have more goroutines talking to the same appendableAppender now?

Contributor Author:

Yea the rule groups for each user will all share an appendableAppender. There are some advantages to this approach. Primarily, it will make #1396 easier to solve since output limits for a user can be configured in the same place.
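A minimal sketch of why the lock is needed, under the design described above where all of a user's rule groups share one appender. `sharedAppender` is a hypothetical stand-in, not Cortex's `appendableAppender`; it only demonstrates that concurrent `Add` calls from multiple goroutines require mutual exclusion.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a simplified stand-in for a Prometheus sample.
type sample struct {
	ts  int64
	val float64
}

// sharedAppender illustrates the locking concern: every rule group for
// a tenant appends through the same instance, so Add can be called
// from many goroutines at once and must serialize its slice append.
type sharedAppender struct {
	sync.Mutex
	samples []sample
}

func (a *sharedAppender) Add(ts int64, val float64) {
	a.Lock()
	defer a.Unlock()
	a.samples = append(a.samples, sample{ts, val})
}

func main() {
	app := &sharedAppender{}
	var wg sync.WaitGroup
	// Simulate several rule groups evaluating concurrently.
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				app.Add(int64(i), float64(g))
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(len(app.samples)) // 400: no appends lost to races
}
```

Without the mutex, concurrent slice appends would race and drop samples, which is exactly why the previous one-appender-per-group design didn't need a lock and this shared design does.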

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from ea8d2c4 to 7e394fa Compare October 7, 2019 19:54
@joe-elliott (Contributor):

@jtlisi I restored the configs Client functionality to what it was before and made a ConfigRuleStore in rules/store.go. This implements the RuleStore interface and bridges the gap to the config Client by maintaining state.

This will be easy to extend in the future by adding other implementations such as GCSRuleStore or whatever.

}

// ConfigRuleStore is a concrete implementation of RuleStore that sources rules from the config service
type ConfigRuleStore struct {
Contributor Author:

This implementation looks good to me. @bboreham, are you ok with abstracting since into a concrete type that lives behind the interface? That way a full set of rules can be returned on each poll, while only new rules are fetched from the config service. I don't think we can avoid returning the entire ruleset currently, since the horizontally sharded ruler needs to hash each rule group to ensure it is evaluating the appropriate set of rules.

Contributor:

The thing that I worry about is like this: say we have 10,000 rule groups across all tenants, and one tenant changes one of them, does the program do 10,000 things or 1 thing?

@joe-elliott (Contributor) commented Oct 10, 2019:

Feel free to correct any errors @jtlisi.

As it is currently designed, once per polling cycle each ruler will calculate a hash for every rule group to determine which groups it should process locally. This will happen 10,000 times per ruler in your scenario.

Next it will take this subset of the rule groups and compare them to locally stored files on disk only updating those files on disk that have changed. This will happen 10,000 / n times per ruler where n is the number of rulers.

Then, if any files have changed or been added for a given user, it will clear the old prometheus rules manager for that user and build a new one pointed at the new set of files.

Even if we do not perform this process once per polling cycle, we would need to at least do it when certain events happen (such as a ruler joining or leaving the ring). I believe @jtlisi preferred the straightforward nature of this approach.
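The per-group ownership check described above can be sketched with a simple hash. The modulo scheme here is an illustration only: the real ruler consults its hash ring rather than numbered shards, and `ownsGroup` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownsGroup sketches the per-group check each ruler runs once per
// polling cycle: hash the rule group's identity and keep only groups
// that map to this ruler's shard. Illustrative modulo sharding, not
// the real ring lookup.
func ownsGroup(user, file, group string, shard, totalShards uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(user + "/" + file + "/" + group))
	return h.Sum32()%totalShards == shard
}

func main() {
	// With 2 rulers, every group lands on exactly one of them.
	groups := []string{"g1", "g2", "g3", "g4"}
	for _, g := range groups {
		owned := 0
		for shard := uint32(0); shard < 2; shard++ {
			if ownsGroup("user1", "rules.yaml", g, shard, 2) {
				owned++
			}
		}
		fmt.Println(g, owned == 1) // each group owned by exactly one ruler
	}
}
```

This is why the full ruleset must be visible to every ruler each cycle: the hash must be computed over all groups to decide ownership, even though each ruler only writes its own subset to disk.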

@joe-elliott joe-elliott force-pushed the 20190806_prommanager_ruler branch from 004373f to 1c47e52 Compare October 10, 2019 12:33
@joe-elliott (Contributor):

@jtlisi I rewrote the MapRules method as discussed. Please review when you get a chance.

@bboreham (Contributor) left a comment:

I haven't finished reading all the changes, but I have some notes.
Particularly when I tried it, it seemed to barf on the "v1" rules:

level=error ts=2019-11-01T17:29:22.664302075Z caller=ruler.go:331 msg="unable to poll for rules" err="yaml: unmarshal errors:\n  line 53: cannot unmarshal !!str `ALERT D...` into rulefmt.RuleGroups"

CHANGELOG.md Outdated
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
* `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`
* `ruler.group-timeout` has been removed
* `ruler.num-workers` has been removed
Contributor:

That's going to cause disruption - can we map it to "deprecated" (ignored) first?
Any advice to the end-user what to do instead?

Contributor Author:

I updated it so the flags aren't removed and are instead deprecated with a message.

resp, err := client.Do(req)
if err != nil {
return nil, err
}
configsRequestDuration.WithLabelValues(operation, resp.Status).Observe(time.Since(start).Seconds())
Contributor:

Could this be done using CollectedRequest() ?

Contributor Author:

👍

@@ -0,0 +1,40 @@
syntax = "proto3";
Contributor:

could we have an introductory comment here saying what these protobuf definitions are for?

Contributor Author:

The use of protos made more sense before I split this out from a larger PR a few months back. The proto format is used to store rule groups in a denormalized way in an object store backend. It is also used to communicate between rulers to fulfill the /api/v1/rules endpoint, which reports the status of rules along with their rule health. Since each ruler only knows the state of the rules it is currently responsible for, it needs to communicate with each ruler in the ring to get a complete view of rule health. To implement this feature, a gRPC service is implemented by each ruler.

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 37858a4 to 06e54b2 Compare November 1, 2019 18:59
CHANGELOG.md Outdated
@@ -12,6 +12,13 @@
* [ENHANCEMENT] Allocation improvements in adding samples to Chunk. #1706
* [ENHANCEMENT] Consul client now follows recommended practices for blocking queries wrt returned Index value. #1708
* [ENHANCEMENT] Consul client can optionally rate-limit itself during Watch (used e.g. by ring watchers) and WatchPrefix (used by HA feature) operations. Rate limiting is disabled by default. New flags added: `--consul.watch-rate-limit`, and `--consul.watch-burst-size`. #1708
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
Contributor:

Changes should be on top.

Contributor Author:

All set

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from a842199 to 52aa0ea Compare November 4, 2019 18:28
@jtlisi (Contributor Author) commented Nov 4, 2019:

@bboreham I fixed V1 rule loading and refactored based on your comments. This should be good for a second look.

cfg.StoreConfig.RegisterFlags(f)

// Deprecated Flags that will be maintained to avoid user disruption
flagext.DeprecatedFlag(f, "ruler.client-timeout", "This flag has been renamed to ruler.configs.client-timeout")
Contributor:

ruler.configs.url appears to be missing from this list of deprecated flags. It was part of the deleted pkg/configs/client/config.go file.

Contributor Author:

This flag also still exists. It got moved a bit and is registered in pkg/ruler/storage.go https://github.com/cortexproject/cortex/pull/1571/files#diff-16c509ab46b783eb193e10999f09ed31R21

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 1540040 to eb76b8b Compare November 21, 2019 16:17
@bboreham (Contributor) left a comment:

Looks good enough to me.

jtlisi and others added 11 commits December 5, 2019 11:27
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from eb76b8b to 4418ae9 Compare December 5, 2019 19:28
@jtlisi jtlisi merged commit 6c63039 into cortexproject:master Dec 5, 2019
@jtlisi jtlisi deleted the 20190806_prommanager_ruler branch December 5, 2019 19:43