Fix intermittently failing TestIngesterTransfer #661
Merged
This fixes the intermittently failing `TestIngesterTransfer`.

The issue was that the `TransferChunks` method (called as an RPC by the departing ingester) would signal that it was complete (`SendAndClose`) before it claimed the ring and updated its state to active.

The test, however, runs a query immediately after the departing ingester has shut down. Thus, there's a very small window after `Shutdown` has terminated but before `ClaimTokensFor` and `ChangeState` have run. When we run the query in this window, we get no results, because there are no ingesters that have the chunks we need.

I've fixed this by moving the call to `SendAndClose` to the end of `TransferChunks`. I think this is the right approach, but am not 100% sure.

An alternative would be to prevent the race in the test without changing the production code. To do this, we'd add the following snippet just after the call to `Shutdown()`:

It was writing the comment that convinced me this was a less good approach.
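
To make the ordering issue in `TransferChunks` concrete, here is a minimal, self-contained Go sketch. It is neither the code from this PR nor the test snippet mentioned above. The names `ring`, `transferBuggy`, `transferFixed`, and `queryAtAck` are hypothetical, and the `sendAndClose` callback only stands in for the real `SendAndClose` call. The sketch checks what a query would see at the instant the transfer is acknowledged, which is the earliest point at which the departing ingester can finish shutting down.

```go
package main

import "fmt"

// ring is a hypothetical stand-in for the real ring. It only tracks whether
// the joining ingester has claimed the tokens and marked itself active.
type ring struct {
	tokensClaimed bool
	active        bool
}

// canServe reports whether a query arriving right now would find an
// ingester that owns the transferred chunks.
func (r *ring) canServe() bool { return r.tokensClaimed && r.active }

// transferBuggy models the old ordering: acknowledge the transfer first,
// then claim the tokens and change state.
func transferBuggy(r *ring, sendAndClose func()) {
	sendAndClose() // the departing ingester may shut down from here on
	r.tokensClaimed = true
	r.active = true
}

// transferFixed models the new ordering: claim the tokens and change state
// first, and acknowledge the transfer last.
func transferFixed(r *ring, sendAndClose func()) {
	r.tokensClaimed = true
	r.active = true
	sendAndClose()
}

// queryAtAck runs a transfer and records whether a query issued at the
// moment of acknowledgement would have been served.
func queryAtAck(transfer func(*ring, func())) bool {
	r := &ring{}
	served := false
	transfer(r, func() { served = r.canServe() })
	return served
}

func main() {
	fmt.Println("old ordering, query served:", queryAtAck(transferBuggy)) // false
	fmt.Println("new ordering, query served:", queryAtAck(transferFixed)) // true
}
```

In the real system the window is a timing race between goroutines rather than a deterministic callback, so the old ordering fails only occasionally. The sketch pins the query to the worst-case moment to show why acknowledging last closes the window.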
The PR includes a commit that fixes some general, minor logging bugs I noticed while trying to debug the test failure.
I should point out that my snarky comment on #369 was wrong-headed: the test failure is not due to relying on the system clock, but rather to a more vanilla race condition.
Fixes #369