Fix intermittently failing TestIngesterTransfer #661

jml · 2018-01-18T09:22:55Z

This fixes the intermittently failing TestIngesterTransfer.

The issue was that the TransferChunks method (called as an RPC by the departing ingester) would signal that it was complete (SendAndClose) before it claimed the ring and updated its state to active.

The test, however, runs a query immediately after the departing ingester has shut down.

Thus, there's a very small window after Shutdown has terminated but before ClaimTokensFor and ChangeState have run. When we run the query in this window, we get no results, because there are no ingesters that have the chunks we need.

I've fixed this by moving the call to SendAndClose to the end of TransferChunks. I think this is the right approach, but am not 100% sure.
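
To make the ordering issue concrete, here is a minimal, self-contained sketch. It is not the actual Cortex code: the `closer` interface, `recordingStream`, the `ingester` struct, and the string states are all hypothetical stand-ins, and the real `ClaimTokensFor` and `ChangeState` calls are reduced to plain field assignments. It only shows what the departing ingester can observe at the moment `SendAndClose` unblocks it.

```go
package main

import "fmt"

// closer is a hypothetical stand-in for the receiving side of the transfer
// stream; only the final acknowledgement matters for this sketch.
type closer interface {
	SendAndClose() error
}

// ingester is a hypothetical, heavily reduced receiving ingester.
type ingester struct {
	state  string
	tokens []uint32
}

// recordingStream records the receiver's state at the moment the
// acknowledgement is sent, i.e. what the sender sees when it is unblocked.
type recordingStream struct {
	ing          *ingester
	stateAtClose string
}

func (s *recordingStream) SendAndClose() error {
	s.stateAtClose = s.ing.state
	return nil
}

// transferChunksOld mirrors the old ordering: acknowledge first, then claim
// the tokens and become active.
func (i *ingester) transferChunksOld(s closer, tokens []uint32) error {
	if err := s.SendAndClose(); err != nil {
		return err
	}
	i.tokens = tokens  // stands in for ClaimTokensFor
	i.state = "ACTIVE" // stands in for ChangeState
	return nil
}

// transferChunksFixed mirrors the new ordering: claim the tokens and become
// active, and only then acknowledge.
func (i *ingester) transferChunksFixed(s closer, tokens []uint32) error {
	i.tokens = tokens
	i.state = "ACTIVE"
	return s.SendAndClose()
}

func main() {
	oldIng := &ingester{state: "PENDING"}
	oldStream := &recordingStream{ing: oldIng}
	if err := oldIng.transferChunksOld(oldStream, []uint32{1, 2, 3}); err != nil {
		panic(err)
	}
	fmt.Println("old ordering, state at close:", oldStream.stateAtClose)

	newIng := &ingester{state: "PENDING"}
	newStream := &recordingStream{ing: newIng}
	if err := newIng.transferChunksFixed(newStream, []uint32{1, 2, 3}); err != nil {
		panic(err)
	}
	fmt.Println("new ordering, state at close:", newStream.stateAtClose)
}
```

With the old ordering the sender is released while the receiver is still `PENDING`, which is the window the test was hitting. With the new ordering the receiver is already `ACTIVE` by the time the sender is told it can go away.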

An alternative would be to prevent the race in the test without changing the production code. To do this, we'd add the following snippet just after the call to Shutdown():

```go
// ing2 might not have claimed the ring before it's told ing1 it's OK to
// shut down.
poll(t, 10*time.Millisecond, ring.ACTIVE, func() interface{} {
    return ing2.state
})
```

It was writing the comment that convinced me this was the worse approach.

The PR includes a commit that fixes some general, minor logging bugs I noticed while trying to debug the test failure.

I should point out that my snarky comment on #369 was wrong-headed: the test failure is not due to relying on the system clock, but rather to a more vanilla race condition.

Fixes #369

Things that are normal occurrences are just debugs.

This fixes a race condition in `TestIngesterTransfer`. I don't think it makes the actually-running-in-production situation any worse.

bboreham · 2018-01-18T10:49:14Z

I note this change leaves the code holding userStatesMtx for longer. But see #662 where I question that Mutex's whole existence.

bboreham

I think this change makes the sequence more correct in production - the new ingester will be active and own the tokens before the old starts to shut down.

The length of time it takes the old ingester to actually shut down is unpredictable, so events could have happened in either order before.

bboreham · 2018-01-18T11:23:08Z

If you happen to be changing this further, I'd say a log message at the end of TransferChunks saying what happened would be useful.

jml · 2018-01-19T11:11:15Z

Might conflict w/ #654

jml added 2 commits January 17, 2018 13:52

Adjust log levels

061478c

Things that are normal occurrences are just debugs.

Don't close stream until we're sure we've got the chunks

c60caf0

This fixes a race condition in `TestIngesterTransfer`. I don't think it makes the actually-running-in-production situation any worse.

jml requested a review from bboreham January 18, 2018 09:23

gofmt

79577d4

bboreham approved these changes Jan 18, 2018

Log the final result of TransferChunks

76fb076

jml merged commit 2810aa7 into master Jan 19, 2018

jml deleted the ingester-flake branch January 19, 2018 11:23
