Spread rule evaluation over the evaluation interval #716

leth · 2018-02-19T11:26:39Z

Rules are currently evaluated every $evaluationInterval (e.g. 15s) from when the ruler starts.
This results in a burst of rules evaluated every 15 seconds, after which the ruler workers sit idle.

If we instead spread rule evaluation across the whole $evaluationInterval, the work and the resulting read and write load are spread out with it.

I didn't put much thought into which hashing algorithm to use; any should do. Advice greatly appreciated!

See also #702

leth · 2018-02-19T11:28:01Z

pkg/ruler/scheduler.go

 func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
 	// TODO: instrument how many configs we have, both valid & invalid.
 	level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
+	nextEvalCycle := time.Unix(0, int64(
+		math.Ceil(
+			math.Mod(


This bit is plain wrong 🤦‍♂️

jml

Sorry, I don't understand this well enough. At the very least, it needs more comments before it gets merged.

jml · 2018-02-19T17:54:22Z

pkg/ruler/scheduler.go

+			math.Mod(
+				float64(hasher.Sum64()),
+				cycleNanos)))
+	}


Why is this defined inline? Why not a method?

Mostly due to where I put the hashing object, see the other comment.

I would still lean towards having a separate method that takes the hasher & now, or even a separate function that takes those two and evaluationInterval. That would make it much easier to test (hint hint).

Not going to insist on it though.

jml · 2018-02-19T17:56:16Z

pkg/ruler/scheduler.go

@@ -182,6 +184,18 @@ func (s *scheduler) poll() (map[string]configs.VersionedRulesConfig, error) {
 func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
 	// TODO: instrument how many configs we have, both valid & invalid.
 	level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
+	cycleNanos := float64(s.evaluationInterval.Nanoseconds())
+	nextEvalCycle := time.Unix(0, int64(math.Ceil(float64(now.UnixNano())/cycleNanos)*cycleNanos))
+	hasher := fnv.New64a()


What's up with constructing this out here, rather than inside the function? I'm guessing it's an optimization, but if so, why are we constructing it per addNewConfigs invocation rather than once per process?

Yes, it's at that level as an optimisation. The hashing object is stateful; I wanted to keep it as local as possible to avoid the need to add mutexes.

jml · 2018-02-19T18:00:48Z

pkg/ruler/scheduler.go

@@ -182,6 +184,18 @@ func (s *scheduler) poll() (map[string]configs.VersionedRulesConfig, error) {
 func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
 	// TODO: instrument how many configs we have, both valid & invalid.
 	level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
+	cycleNanos := float64(s.evaluationInterval.Nanoseconds())
+	nextEvalCycle := time.Unix(0, int64(math.Ceil(float64(now.UnixNano())/cycleNanos)*cycleNanos))


I don't understand this.

I don't get the purpose, what it's trying to achieve

I can't tell how it's meaningfully different from time.Unix(0, now.UnixNano())

It's trying to compute the start of the next evaluation cycle, so that we can distribute the load independently of when the ruler started, and evaluations per instance are consistently spaced across ruler restarts.

If now is 20 and the evaluation cycle is 15, we want to compute 30 as the start of the next evaluation cycle: 20 / 15 * 15 == 20, but ceil(20 / 15) * 15 == 30.

It occurs to me that if now is 1, then this will delay all rule evaluation until 15, rather than starting those evaluations which land sooner (e.g. at 2-14). I'll tweak it to avoid waiting once we settle the other questions.

Thanks for the explanation. Looking forward to your tweak.

leth · 2018-02-20T10:56:27Z

Here's roughly what I'm aiming at:

jml

Thanks, I understand this much better now.

leth · 2018-02-20T21:48:23Z

The tweak to prevent dead time obscured the desired behaviour even more, so I factored it out to a function and added tests, at the cost of some minor repeated calculations.

jml

Suggestions

- Make hashResult an int64 and do the string conversion in the helper function.
- Comment that we are telling the hash function what to return, and that this relies on the fakeHasher implementation.
- Maybe use some variables and arithmetic explicitly in the evalTime calls in the test.

bboreham · 2018-02-26T17:58:39Z

Given #719 would you want to hash by user and filename, or just stick with user?

leth · 2018-02-27T09:37:35Z

I think I'd just stick with user, because if the file execution order were more spread out, it might lead to more confusion.

leth requested a review from bboreham February 19, 2018 11:26

leth force-pushed the ruler-eval-spread branch from 0d79b1e to c9d9450 Compare February 19, 2018 11:32

leth requested a review from jml February 19, 2018 15:42

jml suggested changes Feb 19, 2018

jml approved these changes Feb 20, 2018

Marcus Cobden added 4 commits February 21, 2018 09:25

Spread rule evaluation over the evaluation interval

4d746d9

Swap CRC hash for FNV-1a

a75490a

Add some comments, rename some variables

e0b7c94

Avoid a 'dead' eval period at startup

31aa246

leth force-pushed the ruler-eval-spread branch from 81bef7c to b77f41c Compare February 21, 2018 09:25

jml approved these changes Feb 21, 2018

Factor out to a method and add tests

e76fc5e

leth force-pushed the ruler-eval-spread branch from b77f41c to e76fc5e Compare February 21, 2018 10:33

leth mentioned this pull request Feb 26, 2018

Ruler group execution can overlap with itself after rules update #723

Closed

Merge commit 'c740f5f00ef2eb89b88d6fe87c8ea29336a2c811' into ruler-ev…

6639dfd

…al-spread

bboreham approved these changes Feb 27, 2018

bboreham merged commit 1547ede into master Feb 27, 2018

leth deleted the ruler-eval-spread branch February 27, 2018 10:02
