Switch backoff implementation from a 'Full Jitter' to a 'Ranged Jitter' #1599
Conversation
Feels over-complicated, but it looks like it should work.
I think this PR also solves the issue that #1334 is aimed at.
@bboreham et al., thanks for your review. Is there any plan to merge it? I'm asking just because it will unlock another little PR I have on the Loki project (which vendors Cortex master). Thanks 🙏
I'm curious: what have we achieved by this change? "Full Jitter" aims at evenly distributing attempts from different clients across the entire time range, thus reducing the total number of required retries (which helps the target service) and minimizing the time needed to complete all operations (as discussed in this paper). This change has made the delays between retries more predictable, which goes against the idea of adding jitter in the first place. It feels to me that we have optimized for the wrong thing here. The correct solution would have been to remove the MinBackoff option (or simply add it to the random value, but not scale it with the attempts), and not worry about the randomness observed in #792.
Removing - or not honoring - the |
Fair enough. What would you say about using a random value like before, but simply adding the min wait time, without scaling it up? Your first graph would essentially shift up, but otherwise stay the same. When I look at the second graph, it shows exactly the issue that jitter was supposed to solve, namely that retry times are too deterministic.
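A minimal sketch of that variant in Go, assuming the jittered part still grows exponentially from `MinBackoff` up to `MaxBackoff` while the minimum itself is added as a flat offset; names and details are illustrative only, not Cortex code:

```go
package backoff

import (
	"math/rand"
	"time"
)

// fullJitterPlusMin keeps "full jitter" over an exponentially growing range,
// but shifts the result up by a fixed minimum instead of scaling the minimum
// with the attempt number.
func fullJitterPlusMin(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	// The upper bound of the jittered part grows exponentially with the
	// attempt, capped so that minBackoff + jitter never exceeds maxBackoff.
	rangeMax := minBackoff << uint(attempt)
	if rangeMax <= 0 || rangeMax > maxBackoff-minBackoff {
		rangeMax = maxBackoff - minBackoff
	}
	if rangeMax <= 0 {
		return minBackoff
	}
	// Full jitter in [0, rangeMax), shifted up by the fixed minimum.
	return minBackoff + time.Duration(rand.Int63n(int64(rangeMax)))
}
```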
Do you have details of a specific case where this has a negative effect? I have spent many hours dealing with problems that were caused by whatever jitter we were using at the time, so I don't feel good about making further changes without a strong reason. BTW, the AWS paper you cited (which gave us our first implementation) is aimed at optimistic concurrency, and Cortex mostly uses jitter for different reasons.
No, not really. I came across the idea in an AWS re:Invent 2019 talk, where Marc Brooker (who wrote that paper too) talks about it in the context of distributed services talking to each other, retries and backoff [starting around 31:30]. Other than "this makes sense, why don't we do that... oh wait, we did that before!", I don't have any specific case where it is currently a problem.
OK, that's fair. I'll try to educate myself more about the problems we were facing before giving more suggestions.
Maybe it would be helpful for me to point out that Cortex is usually more interested in the backoff side than the jitter side. If we've tried a few times to transfer chunks, then we want to slow down, because most likely the target won't show up for a while. Similarly for rate-limiting on DynamoDB.
The current backoff algorithm doesn't honor `MinBackoff`: each sleep is basically a random value between zero and a max range value, starting from `MinBackoff` and doubling up to `MaxBackoff`.

Following up the discussion in #792, in this PR I'm proposing to switch to a different algorithm which honors both min and max, while keeping some jitter.
The proposed idea is to work with a ranged jitter: for each consecutive retry, the delay is picked as a random value within a range whose boundaries double each time.
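A minimal sketch of the ranged jitter idea in Go, assuming the range starts at `MinBackoff` and its boundaries double on every retry until they hit `MaxBackoff`; this is an illustration of the approach, not the exact code in this PR:

```go
package backoff

import (
	"math/rand"
	"time"
)

// rangedJitter picks the delay for the given attempt as a random value within
// a range whose boundaries double on every consecutive retry.
func rangedJitter(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	// Range boundaries for this attempt:
	// [minBackoff * 2^attempt, minBackoff * 2^(attempt+1)), both capped at maxBackoff.
	low := minBackoff << uint(attempt)
	high := low * 2
	if low <= 0 || low > maxBackoff {
		low = maxBackoff
	}
	if high <= 0 || high > maxBackoff {
		high = maxBackoff
	}
	if high <= low {
		return low
	}
	// Random delay within [low, high).
	return low + time.Duration(rand.Int63n(int64(high-low)))
}
```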
Simulation
I've run a simulation of 100 clients all starting the backoff at the same time `0` (which should be the worst case scenario), with `10` max retries. The following charts show the distribution over time (Y-axis, in milliseconds) of the retries, comparing the current and the new algorithm.

Current algorithm

New algorithm
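For reference, a rough sketch of the kind of simulation described above. The `minBackoff`/`maxBackoff` values and the output format are assumptions, not taken from the PR, and the `rangedJitter` helper is the illustrative sketch from earlier, repeated so this compiles on its own:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// rangedJitter is the same illustrative sketch as above.
func rangedJitter(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	low := minBackoff << uint(attempt)
	high := low * 2
	if low <= 0 || low > maxBackoff {
		low = maxBackoff
	}
	if high <= 0 || high > maxBackoff {
		high = maxBackoff
	}
	if high <= low {
		return low
	}
	return low + time.Duration(rand.Int63n(int64(high-low)))
}

func main() {
	const (
		clients    = 100
		maxRetries = 10
		minBackoff = 100 * time.Millisecond // assumed value, not stated in the PR
		maxBackoff = 10 * time.Second       // assumed value, not stated in the PR
	)

	// Every client starts backing off at time 0 (worst case scenario) and
	// performs maxRetries retries; the accumulated delay at each retry is the
	// kind of value plotted on the Y-axis of the charts above.
	for c := 0; c < clients; c++ {
		elapsed := time.Duration(0)
		for attempt := 0; attempt < maxRetries; attempt++ {
			elapsed += rangedJitter(attempt, minBackoff, maxBackoff)
			fmt.Printf("client=%d retry=%d t=%dms\n", c, attempt+1, elapsed.Milliseconds())
		}
	}
}
```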