Spread rule evaluation over the evaluation interval #716

Merged
merged 6 commits into from
Feb 27, 2018

Contributor

@leth leth commented Feb 19, 2018

Rules are currently evaluated every $evaluationInterval (e.g. 15s) from when the ruler starts.
This results in a burst of rules evaluated every 15 seconds, after which the ruler workers sit idle.

If we instead spread the rules over the whole $evaluationInterval, the evaluation work, and the read and write load it generates, is spread out too.

I didn't put much thought into the choice of hashing algorithm; any should do. Advice greatly appreciated!

See also #702

@leth leth requested a review from bboreham February 19, 2018 11:26
func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
// TODO: instrument how many configs we have, both valid & invalid.
level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
nextEvalCycle := time.Unix(0, int64(
math.Ceil(
math.Mod(
Contributor Author

This bit is plain wrong 🤦‍♂️

@leth leth force-pushed the ruler-eval-spread branch from 0d79b1e to c9d9450 Compare February 19, 2018 11:32
@leth leth requested a review from jml February 19, 2018 15:42
Contributor

@jml jml left a comment

Sorry, I don't understand this well enough. At the least, it needs more comments before it gets merged.

math.Mod(
float64(hasher.Sum64()),
cycleNanos)))
}
Contributor

Why is this defined inline? Why not a method?

Contributor Author

Mostly due to where I put the hashing object, see the other comment.

Contributor

I would still lean towards having a separate method that takes the hasher & now, or even a separate function that takes those two and evaluationInterval. Would make it much easier to test (hint hint).

Not going to insist on it though.

@@ -182,6 +184,18 @@ func (s *scheduler) poll() (map[string]configs.VersionedRulesConfig, error) {
func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
// TODO: instrument how many configs we have, both valid & invalid.
level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
cycleNanos := float64(s.evaluationInterval.Nanoseconds())
nextEvalCycle := time.Unix(0, int64(math.Ceil(float64(now.UnixNano())/cycleNanos)*cycleNanos))
hasher := fnv.New64a()
Contributor

What's up with constructing this out here, rather than inside the function? I'm guessing it's an optimization, but if so, why are we constructing it per addNewConfigs invocation rather than once per process?

Contributor Author

Yes, it's at that level as an optimisation. The hashing object is stateful; I wanted to keep it as local as possible to avoid the need to add mutexes.

@@ -182,6 +184,18 @@ func (s *scheduler) poll() (map[string]configs.VersionedRulesConfig, error) {
func (s *scheduler) addNewConfigs(now time.Time, cfgs map[string]configs.VersionedRulesConfig) {
// TODO: instrument how many configs we have, both valid & invalid.
level.Debug(util.Logger).Log("msg", "adding configurations", "num_configs", len(cfgs))
cycleNanos := float64(s.evaluationInterval.Nanoseconds())
nextEvalCycle := time.Unix(0, int64(math.Ceil(float64(now.UnixNano())/cycleNanos)*cycleNanos))
Contributor

I don't understand this.

  1. I don't get the purpose, what it's trying to achieve
  2. I can't tell how it's meaningfully different from time.Unix(0, now.UnixNano())

Contributor Author

It's trying to compute the start of the next evaluation cycle, so that we can distribute the load independently of when the ruler started, and evaluations per instance are consistently spaced across ruler restarts.

If now is 20 and the evaluation cycle is 15, we want to compute 30 as the start of the next evaluation cycle: 20 / 15 * 15 == 20, but ceil(20 / 15) * 15 == 30.
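The rounding in that example can be checked with a few lines of Go. This sketch uses integer arithmetic instead of the float64 and math.Ceil form in the diff, which is an assumption made for illustration; the function name `nextCycleStart` is hypothetical:

```go
package main

import "fmt"

// nextCycleStart rounds now up to the next multiple of interval.
// It is the integer equivalent of ceil(now/interval)*interval and
// assumes now >= 0 and interval > 0.
func nextCycleStart(now, interval int64) int64 {
	return (now + interval - 1) / interval * interval
}

func main() {
	fmt.Println(nextCycleStart(20, 15)) // 30
	fmt.Println(nextCycleStart(30, 15)) // 30: already on a boundary
	fmt.Println(nextCycleStart(31, 15)) // 45
}
```

One reason to consider the integer form: present-day UnixNano values exceed 2^53, so a float64 cannot represent them exactly and the float version can be off by a few hundred nanoseconds. For a 15s cycle that is unlikely to matter.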

Contributor Author

It occurs to me that if now is 1, then this will delay all rule evaluation until 15, rather than starting those evaluations which land sooner (e.g. at 2-14). I'll tweak it to avoid waiting once we settle the other questions.

Contributor

Thanks for the explanation. Looking forward to your tweak.

Contributor Author

leth commented Feb 20, 2018

Here's roughly what I'm aiming at:
[Image: ruler eval tick timeline 1]

Contributor

@jml jml left a comment

Thanks, I understand this much better now.

Contributor Author

leth commented Feb 20, 2018

The tweak to prevent dead time obscured the desired behaviour even more, so I factored it out to a function and added tests, at the cost of some minor repeated calculations.

@leth leth force-pushed the ruler-eval-spread branch from 81bef7c to b77f41c Compare February 21, 2018 09:25
Contributor

@jml jml left a comment

Suggestions

  • make hashResult an int64 and do the string conversion in the helper function
  • comment that we are telling the hash function what to return, and that this relies on fakeHasher impl.
  • maybe use some variables & arithmetic explicitly in the evalTime calls in the test

@bboreham
Contributor

Given #719 would you want to hash by user and filename, or just stick with user?

Contributor Author

leth commented Feb 27, 2018

I think I'd just stick with user, because if the file execution order were more spread out it might lead to more confusion.

@bboreham bboreham merged commit 1547ede into master Feb 27, 2018
@leth leth deleted the ruler-eval-spread branch February 27, 2018 10:02
3 participants