Skip to content

Fix intermittently failing TestIngesterTransfer #661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 19, 2018
Merged

Fix intermittently failing TestIngesterTransfer #661

merged 4 commits into from
Jan 19, 2018

Conversation

jml
Copy link
Contributor

@jml jml commented Jan 18, 2018

This fixes the intermittently failing TestIngesterTransfer.

The issue was that the TransferChunks method (called as an RPC by the departing ingester) would signal that it was complete (SendAndClose) before it claimed the ring and updated its state to active.

The test, however, runs a query immediately after the departing ingester has shut down.

Thus, there's a very small window after Shutdown has terminated but before ClaimTokensFor and ChangeState have run. When we run the query in this window, we get no results, because there are no ingesters that have the chunks we need.

I've fixed this by moving the call to SendAndClose to the end of TransferChunks. I think this is the right approach, but am not 100% sure.

An alternative would be to prevent the race in the test without changing the production code. To do this, we'd add the following snippet just after the call to Shutdown():

// ing2 might not have claimed the ring before it's told ing1 it's OK to
// shut down.
poll(t, 10*time.Millisecond, ring.ACTIVE, func() interface{} {
    return ing2.state
})

It was writing the comment that convinced me this was a less good approach.

The PR includes a commit that fixes some general, minor logging bugs I noticed while trying to debug the test failure.

I should point out that my snarky comment on #369 was wrong-headed: the test failure is not due to relying on the system clock, but rather to a more vanilla race condition.

Fixes #369

jml added 2 commits January 17, 2018 13:52
Things that are normal occurences are just debugs.
This fixes a race condition in `TestIngesterTransfer`.

I don't think it makes the actually-running-in-production situation any worse.
@jml jml requested a review from bboreham January 18, 2018 09:23
@bboreham
Copy link
Contributor

I note this change leaves the code holding userStatesMtx for longer. But see #662 where I question that Mutex's whole existence.

Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change makes the sequence more correct in production - the new ingester will be active and own the tokens before the old starts to shut down.

The length of time it takes the old ingester to actually shut down is unpredictable, so events could have happened in either order before.

@bboreham
Copy link
Contributor

If you happen to be changing this further, I'd say a log message at the end of TransferChunks saying what happened would be useful.

@jml
Copy link
Contributor Author

jml commented Jan 19, 2018

Might conflict w/ #654

@jml jml merged commit 2810aa7 into master Jan 19, 2018
@jml jml deleted the ingester-flake branch January 19, 2018 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestIngesterTransfer fails intermittenly
2 participants