Switch backoff implementation from a 'Full Jitter' to a 'Ranged Jitter' #1599

Merged (1 commit, Aug 27, 2019)

Conversation

@pracucci (Contributor)

The current backoff algorithm doesn't honor MinBackoff: each sleep is basically a random value between zero and a cap that starts at MinBackoff and doubles up to MaxBackoff.

Following up on the discussion in #792, in this PR I'm proposing to switch to a different algorithm which honors both the min and the max, while keeping some jitter.

The proposed idea is to work with a ranged jitter. For each consecutive retry, the delay is picked as a random value within a range whose boundaries double each time.
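
To make the idea concrete, here is a minimal sketch in Go of both behaviours. It is illustrative only: the function names and the doubling-with-cap loops are shorthand, not the actual Cortex code.

```go
package backoff

import (
	"math/rand"
	"time"
)

// currentDelay sketches the existing behaviour ("Full Jitter"): a random
// value between 0 and a cap that starts at minBackoff and doubles on each
// attempt, up to maxBackoff. Note it does not honour minBackoff as a floor.
func currentDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	limit := minBackoff
	for i := 0; i < attempt && limit < maxBackoff; i++ {
		limit *= 2
	}
	if limit > maxBackoff {
		limit = maxBackoff
	}
	return time.Duration(rand.Int63n(int64(limit) + 1))
}

// nextDelay sketches the proposed "Ranged Jitter": the delay for attempt N
// (0-based) is a random value picked within a range that starts at
// [minBackoff, 2*minBackoff] and whose boundaries double on each attempt,
// both capped at maxBackoff.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	low, high := minBackoff, 2*minBackoff
	for i := 0; i < attempt && high < maxBackoff; i++ {
		low, high = 2*low, 2*high
	}
	if low > maxBackoff {
		low = maxBackoff
	}
	if high > maxBackoff {
		high = maxBackoff
	}
	return low + time.Duration(rand.Int63n(int64(high-low)+1))
}
```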

Simulation

I've run a simulation of 100 clients all starting the backoff at the same time 0 (which should be the worst-case scenario) with 10 max retries. The following charts show the distribution over time (Y-axis, in milliseconds) of the retries, comparing the current and the new algorithm.

Current algorithm

  • Y-axis: time in milliseconds
  • X-axis: clients
  • Colors: each different color identifies a retry for all clients (10 retries, 10 colors)

[Chart: retry distribution for the current algorithm (Screen Shot 2019-08-21 at 18 56 52)]

New algorithm

  • Y-axis: time in milliseconds
  • X-axis: clients
  • Colors: each different color identifies a retry for all clients (10 retries, 10 colors)

[Chart: retry distribution for the new algorithm (Screen Shot 2019-08-21 at 18 57 11)]
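
The simulation script itself isn't part of this PR; the following is a self-contained sketch of one way to reproduce it, reusing the illustrative nextDelay from above. The min/max backoff values are arbitrary example assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextDelay is the illustrative ranged-jitter sketch shown earlier.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	low, high := minBackoff, 2*minBackoff
	for i := 0; i < attempt && high < maxBackoff; i++ {
		low, high = 2*low, 2*high
	}
	if low > maxBackoff {
		low = maxBackoff
	}
	if high > maxBackoff {
		high = maxBackoff
	}
	return low + time.Duration(rand.Int63n(int64(high-low)+1))
}

func main() {
	const (
		clients    = 100
		maxRetries = 10
		minBackoff = 100 * time.Millisecond // arbitrary example value
		maxBackoff = 10 * time.Second       // arbitrary example value
	)

	// All clients start retrying at time 0; print the cumulative time at
	// which each retry fires, to be plotted per client.
	for c := 0; c < clients; c++ {
		elapsed := time.Duration(0)
		for retry := 0; retry < maxRetries; retry++ {
			elapsed += nextDelay(retry, minBackoff, maxBackoff)
			fmt.Printf("client=%d retry=%d at_ms=%d\n", c, retry, elapsed.Milliseconds())
		}
	}
}
```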

@pracucci force-pushed the improve-backoff-algorithm branch from f8ffa6b to 652117f on August 21, 2019 17:02
@bboreham (Contributor) left a comment:

Feels over-complicated, but it looks like it should work.

@bboreham (Contributor)

I think this PR also solves the issue that #1334 is aimed at.

@pracucci (Contributor, Author) commented Aug 26, 2019

@bboreham et al., thanks for your review. Is there any plan to merge it? I'm asking just because it will unlock another little PR I have on the Loki project (which vendors Cortex master). Thanks 🙏

@pstibrany (Contributor) commented Dec 29, 2019

I’m curious what we have achieved by this change. “Full Jitter” aims at evenly distributing attempts from different clients across the entire time range, thus reducing the total number of required retries (which helps the target service) and minimizing the time needed to complete all operations. (As discussed in this paper.)

This change has made delays between retries more predictable, which goes against the idea of adding jitter in the first place. It feels to me that we have optimized for the wrong thing here. The correct solution would have been to remove the MinBackoff option (or simply add it to the random value, but not scale it with the number of attempts), and not worry about the randomness observed in #792.

@pracucci (Contributor, Author)

The correct solution would have been to remove the MinBackoff option

Removing - or not honoring - the MinBackoff removes any guarantee on the minimum period over which it will keep retrying. Retries are usually used to recover from temporary conditions/errors (i.e. short networking issues). If the user sets X retries but has no guarantee on the minimum waiting time (especially when X is low), it could be difficult to predict how "short" the temporary condition/error should be for the retries to have any effect.

@pstibrany (Contributor)

Removing - or not honoring - the MinBackoff removes any guarantee on the minimum period over which it will keep retrying.

Fair enough. What would you say about using the random delay like before, but simply adding the min wait time without scaling it up? Your first graph would essentially shift up, but otherwise stay the same.
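
Something along these lines, for instance (a rough sketch only, assuming the random part keeps the same doubling cap as before; not a patch against the actual Cortex code):

```go
package backoff

import (
	"math/rand"
	"time"
)

// nextDelay sketches the suggestion above: keep the "Full Jitter" random
// pick over a cap that doubles per attempt, but add minBackoff as a fixed
// offset instead of scaling it with the number of attempts.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	limit := minBackoff
	for i := 0; i < attempt && limit < maxBackoff; i++ {
		limit *= 2
	}
	if limit > maxBackoff {
		limit = maxBackoff
	}
	// Full jitter over [0, limit], shifted up by minBackoff so the minimum
	// wait is still honoured; this is the "first graph shifted up".
	return minBackoff + time.Duration(rand.Int63n(int64(limit)+1))
}
```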

When I see the second graph, it shows exactly the issue that jitter was supposed to solve, namely that retry times are too deterministic.

@bboreham (Contributor) commented Jan 7, 2020

retry times are too deterministic

Do you have details of a specific case where this has a negative effect?

I have spent many hours dealing with problems that were caused by whatever jitter we were using at the time, so I don't feel good about making further changes without a strong reason.

BTW the AWS paper you cited (which gave us our first implementation) is aimed at optimistic concurrency and Cortex mostly uses jitter for different reasons.

@pstibrany (Contributor)

retry times are too deterministic

Do you have details of a specific case where this has a negative effect?

No, not really.

I came across the idea in an AWS re:Invent 2019 talk, where Marc Brooker (who also wrote that paper) talks about it in the context of distributed services talking to each other, retries and backoff [starting around 31:30]. Other than "this makes sense, why don't we do that... oh wait, we did that before!", I don't have any specific case where it is currently a problem.

I have spent many hours dealing with problems that were caused by whatever jitter we were using at the time, so I don't feel good about making further changes without a strong reason.

OK, that's fair. I'll try to educate myself more about problems we were facing, before giving more suggestions.

BTW the AWS paper you cited (which gave us our first implementation) is aimed at optimistic concurrency and Cortex mostly uses jitter for different reasons.

@bboreham (Contributor) commented Jan 7, 2020

Maybe it would be helpful for me to point out that Cortex is usually more interested in the backoff side than the jitter side. If we've tried a few times to transfer chunks then we want to slow down, because most likely the target won't show up for a while. Similarly for rate-limiting on DynamoDB.
This is completely different to the optimistic locking case where it's quite likely you can get in quickly.
