[WIP]: Fix segment rebalance race #18769
Draft
Fixes #18764.
Description
Race
Segment load/drop callbacks are racing with `prepareCurrentServers()`.
Consider the following scenario: the Coordinator is moving segment `S` from server `A` to server `B`, and `S` has 1x replication. Basically, the load/drop callbacks can happen between the start and end of this function in a way where you get a "valid" load queue but an invalid load state.
Essentially, I think what's happening is that the server loop in `prepareCurrentServers()` reads the servers in a state where the LOAD has persisted in the server view (2x loaded) but the DROP has not materialized in the view yet. This causes `loaded=2`. Then, I thought the in-flight DROP (since it hasn't materialized in the view) would get picked up from the `queuedSegments` of the load queue peons (and show up as `dropping=1`), but I think that since the DROP callback returns – and clears the entry from the old `queuedSegments` of the load queue peons – before `prepareCluster()` has a chance to copy the queued action over to the new `queuedSegments`, we lose that important bit of information. Hence, you are left in a weird state with a "valid" queue but an invalid load state. In other words, I think we need to somehow synchronize the callbacks with `prepareCurrentServers()` and `prepareCluster()`.
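To make the interleaving concrete, here is a minimal, self-contained sketch (plain Java, not Druid code; all names are illustrative) of the lost-update pattern described above: the callback clears the in-flight DROP from the old queue map before the prepare step copies it into the new snapshot, so the snapshot reports `loaded=2` and `dropping=0`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LostDropSketch
{
  public static void main(String[] args) throws InterruptedException
  {
    // Peon-side bookkeeping of in-flight actions, keyed by segment id (illustrative).
    Map<String, String> oldQueuedSegments = new ConcurrentHashMap<>();
    oldQueuedSegments.put("S", "DROP");

    // Duty thread, step 1: read the server view while the LOAD is already visible
    // but the DROP has not materialized yet.
    int loaded = 2;

    // Callback thread: the DROP completes and its callback clears the entry
    // before the duty thread copies the queue into the new snapshot.
    Thread dropCallback = new Thread(() -> oldQueuedSegments.remove("S"));
    dropCallback.start();
    dropCallback.join();

    // Duty thread, step 2: copy the (now empty) queue into the new snapshot.
    Map<String, String> newQueuedSegments = new ConcurrentHashMap<>(oldQueuedSegments);

    // The snapshot now claims 2 loaded replicas and no in-flight drop for S.
    System.out.println("loaded=" + loaded + ", dropping=" + newQueuedSegments.size());
  }
}
```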
Fix
I don't really like the idea of passing a lock through to child peons, but a solution here is to synchronize the callback executions with `PrepareBalancerAndLoadQueues::run()`. This way, callbacks on `HttpLoadQueuePeon` can run concurrently with respect to each other, but during the `PrepareBalancerAndLoadQueues::run()` step, callbacks must wait (and vice versa). While this could mean "stale" accounting snapshots compared to the actual state of the cluster, it ensures valid accounting snapshots, which means that we don't do anything crazy like drop a segment from the timeline during a rebalance (assuming the rest of the rebalancing logic is correct).
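A minimal sketch of one way to get these semantics, assuming a plain `ReentrantReadWriteLock` and illustrative class/method names (not necessarily how the final patch wires it up): callbacks share the read lock so they remain concurrent with each other, while the prepare step takes the write lock so it excludes all callbacks.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CallbackSyncSketch
{
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  /** Invoked by a load queue peon when a LOAD/DROP completes (illustrative). */
  public void onSegmentActionCompleted(Runnable callback)
  {
    lock.readLock().lock();
    try {
      callback.run();   // e.g. clear the completed action from queuedSegments
    }
    finally {
      lock.readLock().unlock();
    }
  }

  /** Invoked once per coordinator run for the prepare step (illustrative). */
  public void runPrepareStep(Runnable prepare)
  {
    lock.writeLock().lock();
    try {
      prepare.run();    // prepareCurrentServers() + prepareCluster() see a consistent state
    }
    finally {
      lock.writeLock().unlock();
    }
  }
}
```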
Release note

Key changed/added classes in this PR