Description
Is your feature request related to a problem? Please describe
Segment replication, as a core building block of the community's storage-compute separation and read-write separation efforts, has become increasingly important. When segment replication is enabled, the replica shard does not build segments itself, which saves a lot of CPU overhead. Therefore, it is widely used in production environments.
In production practice, we found that during primary shard relocation, the index TPS (docs/s) may drop to 0 for a long time. Scaling nodes in a production cluster is a routine operation, and primary shard relocation happens frequently as part of it, which prompted us to investigate the cause and look for a solution.
As we all know, primary shard relocation goes through peer recovery. In the document-replication scenario, the new primary uses InternalEngine and builds segments during the migration. The segment-replication scenario is somewhat different: in order to avoid file conflicts, #5344 introduced a round of forced segment replication.
For convenience, the description of #5344 is quoted here:
Update the peer recovery logic to work with segment replication. The existing primary relocation is broken because of segment files conflicts. This happens because of target (new primary) is started with InternalEngine writes its own version of segment files which later on conflicts with file copies on replicas (copied from older primary); when this target is promoted as primary.
This change patches the recovery flow by first recovering the primary target with NRTReplicationEngine engine (to prevent target from generating segment files). Post phase 2 before finalizeRecovery and handoff, the target (new primary) is forced to refresh segment files from older primary (to catch up with other replicas) followed by switching engine to InternalEngine.
This approach has been validated for a long time and does avoid file conflicts. However, it also has costs, such as blocking writes during the forced segment replication phase (write blocking is necessary to catch up with the latest checkpoint before the hand-off phase).
We used three 16C64GB nodes, the nyc_taxis workload, and 1 primary shard with 1 replica shard for testing. During the write process, we relocated the primary shard to reproduce the write TPS drop.
Describe the solution you'd like
Background
Now Lucene supports advancing segmentInfos.counter in IndexWriter (#14417). We believe this makes it possible to avoid file conflicts without impacting writes.
Proposed Solution
Just like document replication, the new primary shard uses InternalEngine directly when it starts. At the same time, we advance the segmentInfos.counter of the new primary shard based on the totalTranslogOps estimated by the old primary shard:
new_primary_advanced_segment_counter = base_segment_counter + totalTranslogOps * retry_count
- new_primary_advanced_segment_counter is the segment counter used by the new primary shard to create its InternalEngine.
- base_segment_counter is the segment counter copied from the old primary shard in peer recovery phase 1.
- totalTranslogOps is the number of translog operations estimated by the old primary shard.
- retry_count is the number of retries; the starting value is 1.
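For illustration, here is a minimal Java sketch of the counter computation. The class and method names are hypothetical, not existing OpenSearch code; applying the resulting value when the new primary creates its InternalEngine would rely on the new Lucene capability referenced above (the exact Lucene API is not reproduced here).

```java
// Minimal sketch of the counter-advancement formula in this proposal.
// Class and method names are hypothetical, not actual OpenSearch code.
public final class SegmentCounterAdvancement {

    /**
     * @param baseSegmentCounter counter copied from the old primary in peer recovery phase 1
     * @param totalTranslogOps   translog operations estimated by the old primary
     * @param retryCount         number of retries, starting at 1
     * @return the segment counter the new primary should start its InternalEngine from
     */
    public static long advancedSegmentCounter(long baseSegmentCounter,
                                              long totalTranslogOps,
                                              int retryCount) {
        return baseSegmentCounter + totalTranslogOps * retryCount;
    }
}
```

For example, with base_segment_counter = 100, totalTranslogOps = 5000 and retry_count = 1, the new primary would start from segment counter 5100.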
Because the step size is at least totalTranslogOps, even if every write operation generated a new segment it would be hard to hit a conflict. However, because new write requests continue to arrive during peer recovery, the old primary shard needs to check whether its local segment counter has exceeded new_primary_advanced_segment_counter, and cancel the peer recovery process if it has.
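A rough sketch of that check is shown below; the method name and the exact hook point in the recovery source are assumptions for illustration, not existing OpenSearch code.

```java
// Hypothetical check on the old primary before hand-off: if local indexing has
// already consumed segment counters beyond the range reserved for the new
// primary, cancel this recovery attempt so it can be retried with a larger step.
void ensureNoSegmentCounterConflict(long localSegmentCounter,
                                    long newPrimaryAdvancedSegmentCounter) {
    if (localSegmentCounter >= newPrimaryAdvancedSegmentCounter) {
        throw new IllegalStateException(
            "local segment counter " + localSegmentCounter
                + " has caught up with the counter reserved for the new primary "
                + newPrimaryAdvancedSegmentCounter + "; cancelling peer recovery");
    }
}
```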
Failover
If a failure occurs, it means the chosen step size could not cover the incremental writes during peer recovery. We then increase retry_count to expand the step size for the next migration.
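A sketch of how that escalation could look; relocatePrimary is a purely hypothetical helper standing in for the full relocation flow.

```java
// Hypothetical retry escalation: each failed attempt widens the reserved
// counter range by another totalTranslogOps before the next migration.
long relocateWithRetry(long baseSegmentCounter, long totalTranslogOps) {
    int retryCount = 1;
    while (true) {
        long reservedCeiling = baseSegmentCounter + totalTranslogOps * retryCount;
        try {
            relocatePrimary(reservedCeiling); // hypothetical helper performing the migration
            return reservedCeiling;
        } catch (IllegalStateException counterConflict) {
            retryCount++; // expand the step size for the next attempt
        }
    }
}
```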
Because ReplicationTracker#primaryMode is set to false until the new primary shard is promoted, the new primary will not publish checkpoints to other replicas. Therefore, it is safe to cancel the recovery process when the old primary shard detects a conflict.
Results
We used the same test method to verify the effect.
Related component
Indexing:Replication
Describe alternatives you've considered
No response
Additional context
No response