Switch backoff implementation from a 'Full Jitter' to a 'Ranged Jitter' #1599

Merged (1 commit, Aug 27, 2019)

Conversation

@pracucci (Contributor)

The current backoff algorithm doesn't honor MinBackoff: each sleep is basically a random value between zero and a cap that starts at MinBackoff and doubles up to MaxBackoff.

Following up on the discussion in #792, in this PR I'm proposing to switch to a different algorithm which honors both the min and the max, while keeping some jitter.

The proposed idea is to work with a ranged jitter. For each consecutive retry, the delay is picked as a random value within a range whose boundaries double each time.
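
To make the idea concrete, here is a minimal sketch in Go of both behaviours. It is illustrative only: the function names and the doubling-with-cap loops are shorthand, not the actual Cortex code.

```go
package backoff

import (
	"math/rand"
	"time"
)

// currentDelay sketches the existing behaviour ("Full Jitter"): a random
// value between 0 and a cap that starts at minBackoff and doubles on each
// attempt, up to maxBackoff. Note it does not honour minBackoff as a floor.
func currentDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	limit := minBackoff
	for i := 0; i < attempt && limit < maxBackoff; i++ {
		limit *= 2
	}
	if limit > maxBackoff {
		limit = maxBackoff
	}
	return time.Duration(rand.Int63n(int64(limit) + 1))
}

// nextDelay sketches the proposed "Ranged Jitter": the delay for attempt N
// (0-based) is a random value picked within a range that starts at
// [minBackoff, 2*minBackoff] and whose boundaries double on each attempt,
// both capped at maxBackoff.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	low, high := minBackoff, 2*minBackoff
	for i := 0; i < attempt && high < maxBackoff; i++ {
		low, high = 2*low, 2*high
	}
	if low > maxBackoff {
		low = maxBackoff
	}
	if high > maxBackoff {
		high = maxBackoff
	}
	return low + time.Duration(rand.Int63n(int64(high-low)+1))
}
```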

Simulation

I've run a simulation of 100 clients all starting the backoff at the same time 0 (which should be the worst-case scenario) with 10 max retries. The following charts show the distribution over time (Y-axis, in milliseconds) of the retries, comparing the current and the new algorithm.

Current algorithm

  • Y-axis: time in milliseconds
  • X-axis: clients
  • Colors: each different color identifies a retry for all clients (10 retries, 10 colors)

[Chart: retry distribution for the current algorithm (Screen Shot 2019-08-21 at 18 56 52)]

New algorithm

  • Y-axis: time in milliseconds
  • X-axis: clients
  • Colors: each different color identifies a retry for all clients (10 retries, 10 colors)

[Chart: retry distribution for the new algorithm (Screen Shot 2019-08-21 at 18 57 11)]
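
The simulation script itself isn't part of this PR; the following is a self-contained sketch of one way to reproduce it, reusing the illustrative nextDelay from above. The min/max backoff values are arbitrary example assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextDelay is the illustrative ranged-jitter sketch shown earlier.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	low, high := minBackoff, 2*minBackoff
	for i := 0; i < attempt && high < maxBackoff; i++ {
		low, high = 2*low, 2*high
	}
	if low > maxBackoff {
		low = maxBackoff
	}
	if high > maxBackoff {
		high = maxBackoff
	}
	return low + time.Duration(rand.Int63n(int64(high-low)+1))
}

func main() {
	const (
		clients    = 100
		maxRetries = 10
		minBackoff = 100 * time.Millisecond // arbitrary example value
		maxBackoff = 10 * time.Second       // arbitrary example value
	)

	// All clients start retrying at time 0; print the cumulative time at
	// which each retry fires, to be plotted per client.
	for c := 0; c < clients; c++ {
		elapsed := time.Duration(0)
		for retry := 0; retry < maxRetries; retry++ {
			elapsed += nextDelay(retry, minBackoff, maxBackoff)
			fmt.Printf("client=%d retry=%d at_ms=%d\n", c, retry, elapsed.Milliseconds())
		}
	}
}
```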

@pracucci force-pushed the improve-backoff-algorithm branch from f8ffa6b to 652117f on August 21, 2019 17:02
@bboreham (Contributor) left a comment:

Feels over-complicated, but it looks like it should work.

@bboreham (Contributor)

I think this PR also solves the issue that #1334 is aimed at.

@pracucci (Contributor, Author) commented Aug 26, 2019

@bboreham et al., thanks for your review. Is there any plan to merge it? I'm asking just because it will unlock another little PR I have on the Loki project (which vendors Cortex master). Thanks 🙏

@pstibrany (Contributor) commented Dec 29, 2019

I’m curious what we have achieved by this change. “Full Jitter” aims at evenly distributing attempts from different clients across the entire time range, thus reducing the total number of required retries (which helps the target service) and minimizing the time needed to complete all operations. (As discussed in this paper.)

This change has made delays between retries more predictable, which goes against the idea of adding jitter in the first place. It feels to me that we have optimized for the wrong thing here. The correct solution would have been to remove the MinBackoff option (or simply add it to the random value, but not scale it with the number of attempts), and not worry about the randomness observed in #792.

@pracucci (Contributor, Author)

The correct solution would have been to remove the MinBackoff option

Removing - or not honoring - the MinBackoff removes any guarantee on the minimum period over which it will keep retrying. Retries are usually used to recover from temporary conditions/errors (i.e. short networking issues). If the user sets X retries but has no guarantee on the minimum waiting time (especially when X is low), it could be difficult to predict how "short" the temporary condition/error should be for the retries to have any effect.

@pstibrany (Contributor)

Removing - or not honoring - the MinBackoff removes any guarantee on the minimum period over which it will keep retrying.

Fair enough. What would you say about using the random delay like before, but simply adding the min wait time without scaling it up? Your first graph would essentially shift up, but otherwise stay the same.
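
Something along these lines, for instance (a rough sketch only, assuming the random part keeps the same doubling cap as before; not a patch against the actual Cortex code):

```go
package backoff

import (
	"math/rand"
	"time"
)

// nextDelay sketches the suggestion above: keep the "Full Jitter" random
// pick over a cap that doubles per attempt, but add minBackoff as a fixed
// offset instead of scaling it with the number of attempts.
func nextDelay(attempt int, minBackoff, maxBackoff time.Duration) time.Duration {
	limit := minBackoff
	for i := 0; i < attempt && limit < maxBackoff; i++ {
		limit *= 2
	}
	if limit > maxBackoff {
		limit = maxBackoff
	}
	// Full jitter over [0, limit], shifted up by minBackoff so the minimum
	// wait is still honoured; this is the "first graph shifted up".
	return minBackoff + time.Duration(rand.Int63n(int64(limit)+1))
}
```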

When I see the second graph, it shows exactly the issue that jitter was supposed to solve, namely that retry times are too deterministic.

@bboreham (Contributor) commented Jan 7, 2020

retry times are too deterministic

Do you have details of a specific case where this has a negative effect?

I have spent many hours dealing with problems that were caused by whatever jitter we were using at the time, so I don't feel good about making further changes without a strong reason.

BTW the AWS paper you cited (which gave us our first implementation) is aimed at optimistic concurrency and Cortex mostly uses jitter for different reasons.

@pstibrany (Contributor)

retry times are too deterministic

Do you have details of a specific case where this has a negative effect?

No, not really.

I came across the idea in an AWS re:Invent 2019 talk, where Marc Brooker (who also wrote that paper) talks about it in the context of distributed services talking to each other, retries and backoff [starting around 31:30]. Other than "this makes sense, why don't we do that... oh wait, we did that before!", I don't have any specific case where it is currently a problem.

I have spent many hours dealing with problems that were caused by whatever jitter we were using at the time, so I don't feel good about making further changes without a strong reason.

OK, that's fair. I'll try to educate myself more about problems we were facing, before giving more suggestions.

BTW the AWS paper you cited (which gave us our first implementation) is aimed at optimistic concurrency and Cortex mostly uses jitter for different reasons.

@bboreham (Contributor) commented Jan 7, 2020

Maybe it would be helpful for me to point out that Cortex is usually more interested in the backoff side than the jitter side. If we've tried a few times to transfer chunks then we want to slow down, because most likely the target won't show up for a while. Similarly for rate-limiting on DynamoDB.
This is completely different to the optimistic locking case where it's quite likely you can get in quickly.
