Update Ruler to use upstream Prom Rule Manager #1571
Conversation
Force-pushed from cd4db11 to 6a74045
Force-pushed from 470a180 to ebf8caf
Force-pushed from bf31fbd to d65c96d
Force-pushed from 6b87cd6 to d3fec66
Force-pushed from 5fc8a74 to e7fd239
pkg/configs/client/client.go (outdated diff):

```go
@@ -163,3 +119,52 @@ func (c ConfigsResponse) GetLatestConfigID() configs.ID {
	}
	return latest
}

// ListAllRuleGroups polls the configdb server and returns the updated rule groups
func (c *ConfigDBClient) ListAllRuleGroups(ctx context.Context) (map[string]rules.RuleGroupList, error) {
```
Did we drop the `since` functionality for a reason? Any concern that, for extremely large rule sets, this will return very large amounts of data? Otherwise I like removing it for simplicity.
Should this method continue returning `map[string]configs.VersionedRulesConfig` and the translation to the new type occur in `pkg/ruler/rules`?
I would say `since` was for large numbers of tenants, regardless of the size of the rule set, and I am a bit nervous about dropping it.
I do understand the concern. The main reason `since` was dropped is that the entire ruleset must be hashed when generating mapped files; otherwise, changes to the ring will not be reflected in the scheduled evaluations. However, polling such a large payload is not ideal. One option would be for the config client to cache the previous response and use the `since` variable internally. That way it can keep an up-to-date set of the active rules configs and only poll for changes.
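Not code from this PR, just a rough sketch of the caching idea described above, under assumed simplified types: the client keeps its own `since` cursor, fetches only the delta, and merges it into a cached full view that callers can hash every cycle.

```go
package client

import (
	"context"
	"sync"
)

// Hypothetical minimal stand-ins for the real types in pkg/configs.
type ID int

type VersionedRulesConfig struct {
	ID      ID
	Deleted bool
}

// cachingClient keeps the last full view of rule configs and only asks the
// config service for changes newer than the highest ID it has seen.
type cachingClient struct {
	mtx   sync.Mutex
	since ID
	cache map[string]VersionedRulesConfig // keyed by user

	// fetch returns configs changed since the given ID (assumed to exist
	// in some form on the underlying HTTP client).
	fetch func(ctx context.Context, since ID) (map[string]VersionedRulesConfig, error)
}

// AllConfigs returns the full, up-to-date set of configs while only polling
// the config service for the delta since the previous call.
func (c *cachingClient) AllConfigs(ctx context.Context) (map[string]VersionedRulesConfig, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	delta, err := c.fetch(ctx, c.since)
	if err != nil {
		return nil, err
	}
	if c.cache == nil {
		c.cache = map[string]VersionedRulesConfig{}
	}
	for user, cfg := range delta {
		if cfg.Deleted {
			delete(c.cache, user)
		} else {
			c.cache[user] = cfg
		}
		if cfg.ID > c.since {
			c.since = cfg.ID
		}
	}
	// Return a copy so callers can hash and iterate without racing the cache.
	out := make(map[string]VersionedRulesConfig, len(c.cache))
	for user, cfg := range c.cache {
		out[user] = cfg
	}
	return out, nil
}
```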
Per discussions, I'm re-adding `since` support now.
pkg/configs/client/client.go (outdated diff):

```go

// decomposeGroupSlug breaks the group slug from Parse
// into it's group name and file name
func decomposeGroupSlug(slug string) (string, string) {
```
unreferenced?
Good catch, I'll remove this.
pkg/configs/client/client.go (outdated diff):

```go

	for user, cfg := range configs {
		userRules := rules.RuleGroupList{}
		if cfg.IsDeleted() {
```
Is this like a diff of rules you're parsing and rebuilding the final state?
Yes, that is essentially what is going on.
pkg/ruler/compat.go (outdated diff):

```go
}

func (a *appendableAppender) Appender() (storage.Appender, error) {
	return a, nil
}

func (a *appendableAppender) Add(l labels.Labels, t int64, v float64) (uint64, error) {
	a.Lock()
```
Why are we locking here? None of the Appender implementations in Prometheus lock, suggesting these functions are not reentrant.
A lock is required here because we pool the samples for a user into the same appendable.
I think previously in Cortex we created a separate `appendableAppender` for each group, which was run on a single goroutine. So there must also be some change that means we have more goroutines talking to the same `appendableAppender` now?
Yeah, the rule groups for each user will all share an `appendableAppender`. There are some advantages to this approach. Primarily, it will make #1396 easier to solve, since output limits for a user can be configured in the same place.
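To illustrate why the lock is needed once multiple groups share one appender, here is a rough, hypothetical sketch (not the PR's actual types) of a per-user appender that is safe under concurrent group evaluations and that could also carry a per-tenant output limit as discussed for #1396:

```go
package ruler

import (
	"errors"
	"sync"
)

// Minimal stand-ins for Prometheus' label and sample types, for illustration only.
type labelSet map[string]string

type sample struct {
	labels labelSet
	ts     int64
	value  float64
}

// userAppender collects samples produced by all rule groups of one tenant.
// Because the Prometheus rule manager evaluates groups on separate goroutines,
// every method that touches the shared slice must take the mutex.
type userAppender struct {
	mtx     sync.Mutex
	samples []sample
	limit   int // hypothetical per-tenant output limit (cf. #1396); 0 = unlimited
}

var errOutputLimit = errors.New("per-tenant rule output limit exceeded")

// Add appends one evaluated sample, enforcing the shared per-tenant limit.
func (a *userAppender) Add(ls labelSet, ts int64, v float64) error {
	a.mtx.Lock()
	defer a.mtx.Unlock()
	if a.limit > 0 && len(a.samples) >= a.limit {
		return errOutputLimit
	}
	a.samples = append(a.samples, sample{labels: ls, ts: ts, value: v})
	return nil
}

// Flush hands the pooled samples to a push function (e.g. the distributor)
// and resets the buffer for the next evaluation cycle.
func (a *userAppender) Flush(push func([]sample) error) error {
	a.mtx.Lock()
	buf := a.samples
	a.samples = nil
	a.mtx.Unlock()
	return push(buf)
}
```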
Force-pushed from ea8d2c4 to 7e394fa
@jtlisi I restored the configs Client functionality to what it was before and made a `RuleStore` interface around it. This will be easy to extend in the future by adding other implementations.
```go
}

// ConfigRuleStore is a concrete implementation of RuleStore that sources rules from the config service
type ConfigRuleStore struct {
```
This implementation looks good to me. @bboreham are you OK with abstracting `since` into a concrete type that lives behind the interface? That way a full set of rules can be returned on each poll, while only new rules are actually fetched from the config service. I don't think we can work around polling the entire ruleset currently, since the horizontally sharded ruler will need to hash each rule group to ensure it is evaluating the appropriate set of rules.
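For reference, a minimal sketch (not the PR's actual code) of what such a store abstraction might look like, with the `since` bookkeeping hidden inside the concrete config-service implementation; `RuleGroupList` is stubbed out here:

```go
package rules

import "context"

// RuleGroupList is a stub standing in for the PR's rule group slice type.
type RuleGroupList []interface{}

// RuleStore is the abstraction the ruler polls. Callers always receive the
// full, current set of rule groups per user, regardless of how the concrete
// implementation fetches them.
type RuleStore interface {
	ListAllRuleGroups(ctx context.Context) (map[string]RuleGroupList, error)
}

// A ConfigRuleStore implementation would wrap the config-service client and
// keep its own "since" cursor plus cached state, so each call only fetches
// the delta from the config service while still returning everything.
```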
The thing that I worry about is this: say we have 10,000 rule groups across all tenants and one tenant changes one of them; does the program do 10,000 things or 1 thing?
Feel free to correct any errors @jtlisi.
As it is currently designed, once per polling cycle each ruler will calculate a hash for every rule group to determine which groups it should process locally. This will happen 10,000 times per ruler in your scenario.
Next it will take this subset of the rule groups and compare them to locally stored files on disk, only updating those files that have changed. This will happen 10,000 / n times per ruler, where n is the number of rulers.
Then, if any files have changed or been added for a given user, it will clear the old Prometheus rules manager for that user and build a new one pointed at the new set of files.
Even if we do not perform this process once per polling cycle, we would need to do it at least when certain events happen (such as a ruler joining or leaving the ring). I believe @jtlisi preferred the straightforward nature of this approach.
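A simplified sketch of the per-cycle ownership step being described: every ruler hashes each rule group's identity and keeps only the groups whose token it owns in the ring. The `ring` interface and names here are stand-ins, not Cortex's actual API.

```go
package ruler

import "hash/fnv"

// ring is a stand-in for the Cortex ring: given a token, it reports whether
// this ruler instance is responsible for it.
type ring interface {
	OwnsToken(token uint32) bool
}

type ruleGroup struct {
	User string
	Name string
}

// tokenFor hashes a rule group's identity into a ring token. Every ruler runs
// the same hash, so they agree on ownership without coordinating.
func tokenFor(g ruleGroup) uint32 {
	h := fnv.New32a()
	h.Write([]byte(g.User + "/" + g.Name))
	return h.Sum32()
}

// filterOwned is the "10,000 hashes per poll" step: each ruler hashes every
// group but keeps only the roughly 1/n share it owns. The owned groups are
// then synced to files on disk, and the per-user Prometheus managers are
// rebuilt only if any of those files changed.
func filterOwned(r ring, groups []ruleGroup) []ruleGroup {
	var owned []ruleGroup
	for _, g := range groups {
		if r.OwnsToken(tokenFor(g)) {
			owned = append(owned, g)
		}
	}
	return owned
}
```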
Force-pushed from 004373f to 1c47e52
@jtlisi I rewrote the
I haven't finished reading all the changes, but I have some notes.
In particular, when I tried it, it seemed to barf on the "v1" rules:

```
level=error ts=2019-11-01T17:29:22.664302075Z caller=ruler.go:331 msg="unable to poll for rules" err="yaml: unmarshal errors:\n line 53: cannot unmarshal !!str `ALERT D...` into rulefmt.RuleGroups"
```
CHANGELOG.md (outdated diff):

* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
  * `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`
  * `ruler.group-timeout` has been removed
  * `ruler.num-workers` has been removed
That's going to cause disruption; can we map it to "deprecated" (ignored) first?
Any advice to the end user on what to do instead?
I updated it so the flags aren't removed and are instead deprecated with a message.
pkg/configs/client/client.go (outdated diff):

```go
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	configsRequestDuration.WithLabelValues(operation, resp.Status).Observe(time.Since(start).Seconds())
```
Could this be done using `CollectedRequest()`?
👍
```proto
@@ -0,0 +1,40 @@
syntax = "proto3";
```
could we have an introductory comment here saying what these protobuf definitions are for?
The use of protos made more sense before I split this out from a larger PR a few months back. The proto format is used to store rule groups in a denormalized way in an object store backend. It is also used to communicate between rulers to fulfill the `/api/v1/rules` endpoint, which reports the status of rules along with their rule health. Since each ruler only knows the state of the rules it is currently responsible for, it needs to communicate with each ruler in the ring to get a complete view of rule health. To implement this feature, a gRPC service is implemented by each ruler.
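A rough sketch of the fan-out described above: to answer `/api/v1/rules`, a ruler asks every peer in the ring for the groups it is evaluating and merges the answers. The `rulesClient` interface and `GroupState` type here are stand-ins for the generated gRPC client and protobuf message, not the PR's actual API.

```go
package ruler

import (
	"context"
	"sync"
)

// GroupState is a placeholder for the protobuf message describing one rule
// group's current rules and health.
type GroupState struct {
	User  string
	Name  string
	Rules []string
}

// rulesClient is a stand-in for the gRPC client each ruler exposes; the real
// service is defined by the protobuf file being discussed.
type rulesClient interface {
	Rules(ctx context.Context, user string) ([]GroupState, error)
}

// allRules fans out to every ruler in the ring and merges the per-ruler
// answers, since each ruler only knows about the groups it evaluates itself.
func allRules(ctx context.Context, user string, peers []rulesClient) ([]GroupState, error) {
	var (
		mtx    sync.Mutex
		merged []GroupState
		errs   []error
		wg     sync.WaitGroup
	)
	for _, p := range peers {
		wg.Add(1)
		go func(c rulesClient) {
			defer wg.Done()
			states, err := c.Rules(ctx, user)
			mtx.Lock()
			defer mtx.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			merged = append(merged, states...)
		}(p)
	}
	wg.Wait()
	if len(errs) > 0 {
		return nil, errs[0]
	}
	return merged, nil
}
```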
Force-pushed from 37858a4 to 06e54b2
CHANGELOG.md (outdated diff):

@@ -12,6 +12,13 @@
* [ENHANCEMENT] Allocation improvements in adding samples to Chunk. #1706
* [ENHANCEMENT] Consul client now follows recommended practices for blocking queries wrt returned Index value. #1708
* [ENHANCEMENT] Consul client can optionally rate-limit itself during Watch (used e.g. by ring watchers) and WatchPrefix (used by HA feature) operations. Rate limiting is disabled by default. New flags added: `--consul.watch-rate-limit`, and `--consul.watch-burst-size`. #1708
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
Changes should be on top.
All set
Force-pushed from a842199 to 52aa0ea
@bboreham I fixed V1 rule loading and refactored based on your comments. This should be good for a second look.
Force-pushed from 2c9dc72 to 166c094
```go
	cfg.StoreConfig.RegisterFlags(f)

	// Deprecated Flags that will be maintained to avoid user disruption
	flagext.DeprecatedFlag(f, "ruler.client-timeout", "This flag has been renamed to ruler.configs.client-timeout")
```
`ruler.configs.url` appears to be missing from this list of deprecated flags. It was part of the deleted `pkg/configs/client/config.go` file.
This flag also still exists. It got moved a bit and is registered in `pkg/ruler/storage.go`: https://github.com/cortexproject/cortex/pull/1571/files#diff-16c509ab46b783eb193e10999f09ed31R21
Force-pushed from 1540040 to eb76b8b
Looks good enough to me.
Force-pushed from eb76b8b to 4418ae9
This PR is a refactor of #1532 to utilize the Prometheus Rule Manager to schedule and evaluate rule groups.
Overview
Fixes #477
Fixes #493