Update Ruler to use upstream Prom Rule Manager #1571

Merged
merged 11 commits into cortexproject:master from 20190806_prommanager_ruler on Dec 5, 2019

Conversation

@jtlisi (Contributor) commented Aug 9, 2019

This PR is a refactor of #1532 to utilize the Prometheus Rule Manager to schedule and evaluate rule groups.

Overview

  • Utilize the upstream Prometheus rule manager to schedule and evaluate rule groups
  • Sync and map rules from an external service (configdb) to temporary files that are synced with each user's Prometheus rule manager
  • Utilize afero to map rule files, allowing mocked tests with in-memory filesystems and potentially an in-memory filesystem in production if upstream changes can be made
  • Utilize a protobuf interchange format for rules, providing a versionable format for storing and communicating rules

Fixes #477
Fixes #493
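The afero point in the overview can be illustrated with a minimal sketch. This is not the PR's code: `RuleFS` and `MemFS` are hypothetical stand-ins for the `afero.Fs` abstraction and `afero.NewMemMapFs`, showing how writing rule files through an interface lets tests run entirely against an in-memory filesystem.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// RuleFS is a hypothetical stand-in for the afero.Fs abstraction:
// rule files are written through an interface so tests can swap in
// an in-memory implementation.
type RuleFS interface {
	WriteFile(path string, data []byte) error
	ReadFile(path string) ([]byte, error)
	List() []string
}

// MemFS is an in-memory RuleFS, analogous in spirit to afero's MemMapFs.
type MemFS struct {
	mu    sync.RWMutex
	files map[string][]byte
}

func NewMemFS() *MemFS { return &MemFS{files: map[string][]byte{}} }

func (m *MemFS) WriteFile(path string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[path] = append([]byte(nil), data...) // copy to avoid aliasing
	return nil
}

func (m *MemFS) ReadFile(path string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	data, ok := m.files[path]
	if !ok {
		return nil, fmt.Errorf("no such file: %s", path)
	}
	return data, nil
}

func (m *MemFS) List() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	paths := make([]string, 0, len(m.files))
	for p := range m.files {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths
}

func main() {
	fs := NewMemFS()
	fs.WriteFile("/rules/user1/group1.yaml", []byte("groups: []"))
	fmt.Println(fs.List()) // prints [/rules/user1/group1.yaml]
}
```

In production the same code path could be backed by a real on-disk implementation, which is the flexibility the afero bullet is after.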

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from cd4db11 to 6a74045 Compare August 19, 2019 13:37
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 6 times, most recently from 470a180 to ebf8caf Compare September 6, 2019 14:25
@jtlisi jtlisi closed this Sep 6, 2019
@jtlisi jtlisi reopened this Sep 6, 2019
@jtlisi jtlisi marked this pull request as ready for review September 6, 2019 20:45
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 4 times, most recently from bf31fbd to d65c96d Compare September 10, 2019 10:47
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch 2 times, most recently from 6b87cd6 to d3fec66 Compare September 19, 2019 20:56
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 5fc8a74 to e7fd239 Compare September 27, 2019 15:59
@@ -163,3 +119,52 @@ func (c ConfigsResponse) GetLatestConfigID() configs.ID {
}
return latest
}

// ListAllRuleGroups polls the configdb server and returns the updated rule groups
func (c *ConfigDBClient) ListAllRuleGroups(ctx context.Context) (map[string]rules.RuleGroupList, error) {
Contributor:
Did we drop the since functionality for a reason? Any concern for extremely large rule sets that this will return very large amounts of data. Otherwise I like removing it for simplicity.

Contributor:

Should this method continue returning map[string]configs.VersionedRulesConfig and the translation to the new type occur in pkg/ruler/rules?

Contributor:

I would say since was for large numbers of tenants, regardless of size of rule set. And I am a bit nervous about dropping it.

Contributor Author:

I do understand the concern. The main reason since was dropped is that the entire ruleset must be hashed when generating mapped files; otherwise changes to the ring will not be reflected in the scheduled evaluations. However, polling such a large payload is not ideal. One option would be for the config client to cache the previous response and use the since variable internally. That way it can keep an up-to-date set of the active rules configs and only poll for changes.
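The caching idea in that comment can be sketched as follows. All names here (`cachingStore`, `VersionedRules`, `pollFn`) are hypothetical simplifications, not the PR's types; the sketch only illustrates keeping a complete in-memory view while polling the config service incrementally with an internal since.

```go
package main

import "fmt"

// VersionedRules is a simplified stand-in for a versioned rules config,
// with a monotonically increasing ID and a tombstone flag.
type VersionedRules struct {
	ID      int
	Deleted bool
}

// pollFn fetches only configs newer than `since`, mirroring the config
// service's incremental polling.
type pollFn func(since int) map[string]VersionedRules

// cachingStore keeps the full rule set in memory and only polls for
// changes, so callers always see complete state while the wire traffic
// stays incremental.
type cachingStore struct {
	poll  pollFn
	since int
	all   map[string]VersionedRules
}

func newCachingStore(p pollFn) *cachingStore {
	return &cachingStore{poll: p, all: map[string]VersionedRules{}}
}

// ListAll merges the latest delta into the cached full view.
func (c *cachingStore) ListAll() map[string]VersionedRules {
	for user, cfg := range c.poll(c.since) {
		if cfg.ID > c.since {
			c.since = cfg.ID // advance the internal since watermark
		}
		if cfg.Deleted {
			delete(c.all, user)
			continue
		}
		c.all[user] = cfg
	}
	return c.all
}

func main() {
	backend := map[string]VersionedRules{
		"user1": {ID: 1}, "user2": {ID: 2},
	}
	poll := func(since int) map[string]VersionedRules {
		out := map[string]VersionedRules{}
		for u, c := range backend {
			if c.ID > since {
				out[u] = c
			}
		}
		return out
	}
	s := newCachingStore(poll)
	fmt.Println(len(s.ListAll())) // 2: full view on first poll
	backend["user2"] = VersionedRules{ID: 3, Deleted: true}
	fmt.Println(len(s.ListAll())) // 1: delta removed the deleted config
}
```

The key property is that every call still returns the full ruleset (so each ruler can hash everything), while only changed configs cross the wire.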

Contributor:

Per discussions I'm re-adding since support now.


// decomposeGroupSlug breaks the group slug from Parse
// into its group name and file name
func decomposeGroupSlug(slug string) (string, string) {
Contributor:

unreferenced?

Contributor Author:

Good catch, I'll remove this.


for user, cfg := range configs {
userRules := rules.RuleGroupList{}
if cfg.IsDeleted() {
Contributor:

Is this like a diff of rules you're parsing and rebuilding the final state?

Contributor Author:

Yes, that is essentially what is going on.

}

func (a *appendableAppender) Appender() (storage.Appender, error) {
return a, nil
}

func (a *appendableAppender) Add(l labels.Labels, t int64, v float64) (uint64, error) {
a.Lock()
Contributor:

Why are we locking here? None of the Appender implementations in prometheus lock suggesting these functions are not reentrant.

Contributor Author:

A lock is required here because we pool the samples for a user to the same appendable.

Contributor:

I think previously in Cortex we created a separate appendableAppender for each group, which was run on a single goroutine. So there must also be some change that means we have more goroutines talking to the same appendableAppender now?

Contributor Author:

Yea the rule groups for each user will all share an appendableAppender. There are some advantages to this approach. Primarily, it will make #1396 easier to solve since output limits for a user can be configured in the same place.
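A minimal sketch of why the lock is needed, under the design described above where all of a user's rule groups share one appender. `sharedAppender` is a hypothetical stand-in, not Cortex's `appendableAppender`; it only demonstrates that concurrent `Add` calls from multiple goroutines require mutual exclusion.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a simplified stand-in for a Prometheus sample.
type sample struct {
	ts  int64
	val float64
}

// sharedAppender illustrates the locking concern: every rule group for
// a tenant appends through the same instance, so Add can be called
// from many goroutines at once and must serialize its slice append.
type sharedAppender struct {
	sync.Mutex
	samples []sample
}

func (a *sharedAppender) Add(ts int64, val float64) {
	a.Lock()
	defer a.Unlock()
	a.samples = append(a.samples, sample{ts, val})
}

func main() {
	app := &sharedAppender{}
	var wg sync.WaitGroup
	// Simulate several rule groups evaluating concurrently.
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				app.Add(int64(i), float64(g))
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(len(app.samples)) // 400: no appends lost to races
}
```

Without the mutex, concurrent slice appends would race and drop samples, which is exactly why the previous one-appender-per-group design didn't need a lock and this shared design does.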

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from ea8d2c4 to 7e394fa Compare October 7, 2019 19:54
@joe-elliott (Contributor):

@jtlisi I restored the configs Client functionality to what it was before and made a ConfigRuleStore in rules/store.go. This implements the RuleStore interface and bridges the gap to the config Client by maintaining state.

This will be easy to extend in the future by adding other implementations such as GCSRuleStore or whatever.

}

// ConfigRuleStore is a concrete implementation of RuleStore that sources rules from the config service
type ConfigRuleStore struct {
Contributor Author:

This implementation looks good to me. @bboreham, are you ok with abstracting since into a concrete type that lives behind the interface? That way a full set of rules can be returned on each poll, while only new rules are fetched from the config service. I don't think we can avoid returning the entire ruleset currently, since the horizontally sharded ruler needs to hash each rule group to ensure it is evaluating the appropriate set of rules.

Contributor:

The thing that I worry about is like this: say we have 10,000 rule groups across all tenants, and one tenant changes one of them, does the program do 10,000 things or 1 thing?

@joe-elliott (Contributor) commented Oct 10, 2019:

Feel free to correct any errors @jtlisi.

As it is currently designed, once per polling cycle each ruler will calculate a hash for every rule group to determine which groups it should process locally. This will happen 10,000 times per ruler in your scenario.

Next it will take this subset of the rule groups and compare them to locally stored files on disk only updating those files on disk that have changed. This will happen 10,000 / n times per ruler where n is the number of rulers.

Then, if any files have changed or been added for a given user, it will clear the old prometheus rules manager for that user and build a new one pointed at the new set of files.

Even if we do not perform this process once per polling cycle, we would need to at least do it when certain events happen (such as a ruler joining or leaving the ring). I believe @jtlisi preferred the straightforward nature of this approach.
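The per-group ownership check described above can be sketched with a simple hash. The modulo scheme here is an illustration only: the real ruler consults its hash ring rather than numbered shards, and `ownsGroup` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownsGroup sketches the per-group check each ruler runs once per
// polling cycle: hash the rule group's identity and keep only groups
// that map to this ruler's shard. Illustrative modulo sharding, not
// the real ring lookup.
func ownsGroup(user, file, group string, shard, totalShards uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(user + "/" + file + "/" + group))
	return h.Sum32()%totalShards == shard
}

func main() {
	// With 2 rulers, every group lands on exactly one of them.
	groups := []string{"g1", "g2", "g3", "g4"}
	for _, g := range groups {
		owned := 0
		for shard := uint32(0); shard < 2; shard++ {
			if ownsGroup("user1", "rules.yaml", g, shard, 2) {
				owned++
			}
		}
		fmt.Println(g, owned == 1) // each group owned by exactly one ruler
	}
}
```

This is why the full ruleset must be visible to every ruler each cycle: the hash must be computed over all groups to decide ownership, even though each ruler only writes its own subset to disk.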

@joe-elliott joe-elliott force-pushed the 20190806_prommanager_ruler branch from 004373f to 1c47e52 Compare October 10, 2019 12:33
@joe-elliott (Contributor):

@jtlisi I rewrote the MapRules method as discussed. Please review when you get a chance.

@bboreham (Contributor) left a comment:

I haven't finished reading all the changes, but I have some notes.
Particularly when I tried it, it seemed to barf on the "v1" rules:

level=error ts=2019-11-01T17:29:22.664302075Z caller=ruler.go:331 msg="unable to poll for rules" err="yaml: unmarshal errors:\n  line 53: cannot unmarshal !!str `ALERT D...` into rulefmt.RuleGroups"

CHANGELOG.md Outdated
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
* `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`
* `ruler.group-timeout` has been removed
* `ruler.num-workers` has been removed
Contributor:

That's going to cause disruption - can we map it to "deprecated" (ignored) first?
Any advice to the end-user what to do instead?

Contributor Author:

I updated it so the flags aren't removed and are instead deprecated with a message.

resp, err := client.Do(req)
if err != nil {
return nil, err
}
configsRequestDuration.WithLabelValues(operation, resp.Status).Observe(time.Since(start).Seconds())
Contributor:

Could this be done using CollectedRequest() ?

Contributor Author:

👍

@@ -0,0 +1,40 @@
syntax = "proto3";
Contributor:

could we have an introductory comment here saying what these protobuf definitions are for?

Contributor Author:

The use of protos made more sense before I split this out from a larger PR a few months back. The proto format is used to store rule groups in a denormalized way in an object store backend. It is also used to communicate between rulers to fulfill the /api/v1/rules endpoint, which reports the status of rules along with their rule health. Since each ruler only knows the state of the rules it is currently responsible for, it needs to communicate with each ruler in the ring to get a complete view of rule health. To implement this feature, a gRPC service is implemented by each ruler.

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 37858a4 to 06e54b2 Compare November 1, 2019 18:59
CHANGELOG.md Outdated
@@ -12,6 +12,13 @@
* [ENHANCEMENT] Allocation improvements in adding samples to Chunk. #1706
* [ENHANCEMENT] Consul client now follows recommended practices for blocking queries wrt returned Index value. #1708
* [ENHANCEMENT] Consul client can optionally rate-limit itself during Watch (used e.g. by ring watchers) and WatchPrefix (used by HA feature) operations. Rate limiting is disabled by default. New flags added: `--consul.watch-rate-limit`, and `--consul.watch-burst-size`. #1708
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
Contributor:

Changes should be on top.

Contributor Author:

All set

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from a842199 to 52aa0ea Compare November 4, 2019 18:28
@jtlisi (Contributor Author) commented Nov 4, 2019:

@bboreham I fixed V1 rule loading and refactored based on your comments. This should be good for a second look.

cfg.StoreConfig.RegisterFlags(f)

// Deprecated Flags that will be maintained to avoid user disruption
flagext.DeprecatedFlag(f, "ruler.client-timeout", "This flag has been renamed to ruler.configs.client-timeout")
Contributor:

ruler.configs.url appears to be missing from this list of deprecated flags. It was part of the deleted pkg/configs/client/config.go file.

Contributor Author:

This flag also still exists. It got moved a bit and is registered in pkg/ruler/storage.go https://github.com/cortexproject/cortex/pull/1571/files#diff-16c509ab46b783eb193e10999f09ed31R21

@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from 1540040 to eb76b8b Compare November 21, 2019 16:17
@bboreham (Contributor) left a comment:

Looks good enough to me.

jtlisi and others added 11 commits December 5, 2019 11:27
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Jacob Lisi <[email protected]>
@jtlisi jtlisi force-pushed the 20190806_prommanager_ruler branch from eb76b8b to 4418ae9 Compare December 5, 2019 19:28
@jtlisi jtlisi merged commit 6c63039 into cortexproject:master Dec 5, 2019
@jtlisi jtlisi deleted the 20190806_prommanager_ruler branch December 5, 2019 19:43