Stop early-return after majority success in distributor writes #732
Conversation
Force-pushed from b8e6b11 to 67ce380, then from 67ce380 to 2ac4b1f.
pkg/distributor/distributor.go
```go
}

func shortCircuit(err error, sampleTracker *sampleTracker, pushTracker *pushTracker) {
	if err != nil {
```
Can the error handling be moved up into sendSamples so that waitForAll and shortCircuit use the same error handler?
Probably, but there are some subtle logic differences between the two (returning after an error, vs. always incrementing sampleTracker.finished). We decided a little duplication was OK, since it allowed us to leave the previous logic intact. We can refactor it out if you think that is worthwhile, however.
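For illustration, here is a minimal sketch of the two handler shapes being discussed. Aside from the sampleTracker.finished counter mentioned above, the types, fields, and names are assumptions for this sketch, not the actual Cortex code, and concurrency control is omitted.

```go
package distributor

// Hypothetical, simplified trackers; only the finished counter comes from
// the discussion above, the rest is assumed for illustration.
type sampleTracker struct {
	finished  int // responses counted so far (success or failure)
	succeeded int // successful responses
}

type pushTracker struct {
	firstErr error // first error seen across the whole push, if any
}

// shortCircuit-style handling: return as soon as an error is seen, so
// finished only advances on success.
func shortCircuit(err error, s *sampleTracker, p *pushTracker) {
	if err != nil {
		if p.firstErr == nil {
			p.firstErr = err
		}
		return
	}
	s.succeeded++
	s.finished++
}

// waitForAll-style handling: always advance finished, even on error, so a
// caller can block until every ingester has replied.
func waitForAll(err error, s *sampleTracker, p *pushTracker) {
	s.finished++
	if err != nil {
		if p.firstErr == nil {
			p.firstErr = err
		}
		return
	}
	s.succeeded++
}
```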
I think I see what you mean; pushed another commit doing the refactor.
Looks plausible, but it would need to be re-done after #681. And I'm not sure it should be optional; as noted at #730, this behaviour means that we can't bear a single ingester failure with 3x replication. I think the code will be somewhat simpler without the early return. What sort of impact did you see when running it?
I don't quite understand why it means we can't bear a single ingester outage with a replication factor of 3, given that we have to see more than MaxFailures errors before a push errors out. If I recall correctly (@csmarchbanks, correct me if I'm way off), we tested it with a set of 3 ingesters, replication factor 3, scaled down to 2 pods, and watched the pushes continue as expected. We didn't test a single-ingester timeout, though, as I'm not sure how to cause that particular failure case.
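For reference, a tiny sketch of the quorum arithmetic described above; this is purely illustrative and not the actual ring code, but it shows how a push only errors out once failures exceed MaxFailures.

```go
package distributor

// pushSucceeds is a hypothetical helper illustrating the quorum check
// discussed above; with replication factor 3, one failed ingester is absorbed.
func pushSucceeds(replicationFactor, failures int) bool {
	maxFailures := replicationFactor / 2 // e.g. 3 -> 1, 5 -> 2
	return failures <= maxFailures       // fail only once failures exceed maxFailures
}
```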
Oh, I think I misunderstood - it seems you mean that currently (master) doesn't allow for a single failure with replication factor 3. I took your comment as "this PR introduces this bad behaviour". Oops. :-) The new code should alleviate this, yes?
What I meant to highlight was that this PR makes the bad behaviour optional.
I'm slowly getting caught up, my apologies. :-) Agreed, we can just yank the option to behave badly.
But it also introduces other bad behaviour, namely less tolerance of single failures, in particular single failures that cause a request to be delayed or never replied to. Those now delay the return, which presumably reduces throughput, increases buffer sizes and memory usage, etc.
What @rade said was why I made this optional. Here are some timing graphs for before/after the flag was turned on.
Not a fan of letting the administrator choose between two bad options. A slow-down is visible and recovers if transient; dropping data is silent and unrecoverable.
Surely the correct fix is to return after majority success, but don't cancel the downstream request. Or make ingesters ignore the cancel. Under what circumstances is it actually desirable to cancel an ingester write?
That sounds plausible. It's the single shared context that causes the cancellation. We would want to cancel after a reasonable timeout to avoid #672.
So it sounds like: make a new context for each request with a timeout of, say, 10s, and keep the returning logic intact?
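Roughly, that proposal could look like the sketch below. This is hedged: the 10s value comes from the comment above, while sendSamples, sendToIngester, and the surrounding structure are hypothetical stand-ins, not the code that ended up in #736.

```go
package distributor

import (
	"context"
	"time"
)

// sendSamples gives each ingester request its own context with a fixed
// timeout, detached from the caller's cancellation, so returning to the
// client after quorum does not cancel the remaining writes.
func sendSamples(parent context.Context, ingesters []string) {
	// parent is deliberately not used to derive the per-request contexts,
	// so the client returning early does not cancel in-flight writes.
	_ = parent
	for _, addr := range ingesters {
		go func(addr string) {
			// The timeout still bounds how long a slow ingester can
			// hold resources (cf. #672).
			ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
			defer cancel()
			_ = sendToIngester(ctx, addr) // hypothetical push call
		}(addr)
	}
	// ...quorum tracking and the early return to the client happen elsewhere...
}

// sendToIngester is a stand-in for the real gRPC push to one ingester.
func sendToIngester(ctx context.Context, addr string) error { return nil }
```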
If so, take a look at #736.
Closing in favor of #736.
Allows you to configure the distributor to wait for all ingester requests to complete before returning to the client. Since this will cause a slowdown, we made it a configuration option.
See issue: #730
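Wiring up such an option might look roughly like the sketch below; the flag name and config field are hypothetical stand-ins for illustration, not necessarily what this PR used.

```go
package distributor

import "flag"

// Config holds distributor options; only the illustrative field is shown.
type Config struct {
	WaitForAllIngesters bool
}

// RegisterFlags registers the option on a FlagSet; the flag name here is a
// hypothetical example.
func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&cfg.WaitForAllIngesters, "distributor.wait-for-all-ingesters", false,
		"Wait for every ingester write to finish before replying to the client, instead of returning after quorum success.")
}
```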