
[RFC][segment replication] Introduce a heuristic method to avoid long-term write blocking during primary shard relocation #18355

@guojialiang92

Description

Is your feature request related to a problem? Please describe

Segment replication, as a core capability behind the community's storage-compute separation and read-write separation, has become increasingly important. When segment replication is enabled, replica shards do not build segments themselves, which saves substantial CPU. It is therefore widely used in production environments.

In production practice, we found that during primary shard relocation, indexing TPS (docs/s) may drop to 0 for a long time. Scaling nodes is a routine operation in production clusters, so primary shard relocation happens frequently; this prompted us to investigate the cause and find a solution.

As is well known, primary shard relocation goes through peer recovery. In the document-replication scenario, the new primary uses InternalEngine and builds segments during the migration. The segment-replication scenario is somewhat different: to avoid file conflicts, #5344 introduced a round of forced segment replication.

For convenience, the description of #5344 is quoted here:

> Update the peer recovery logic to work with segment replication. The existing primary relocation is broken because of segment file conflicts. This happens because the target (new primary) is started with InternalEngine and writes its own version of segment files, which later conflict with the file copies on replicas (copied from the older primary) when this target is promoted to primary.
> This change patches the recovery flow by first recovering the primary target with the NRTReplicationEngine (to prevent the target from generating segment files). After phase 2, before finalizeRecovery and hand-off, the target (new primary) is forced to refresh segment files from the older primary (to catch up with other replicas), followed by switching the engine to InternalEngine.

This method has been validated for a long time and does avoid file conflicts. However, it also has costs, such as blocking writes during the forced segment replication phase (write blocking is necessary so the target can catch up with the latest checkpoint before the hand-off phase).

We tested with three 16C64GB nodes, the nyc_taxis workload, 1 primary shard, and 1 replica shard. During indexing, we relocated the primary shard to reproduce the write TPS drop.

(Figure: indexing TPS dropping to 0 during primary shard relocation)

Describe the solution you'd like

Background

Lucene now supports modifying segmentInfos.counter through IndexWriter (apache/lucene#14417). We believe this makes it possible to avoid file conflicts without affecting writes.
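
To make the idea concrete, here is a minimal sketch of advancing the counter through IndexWriter. The method name advanceSegmentInfosCounter is our reading of #14417 and is an assumption; the exact signature should be confirmed against the merged PR.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class AdvanceCounterSketch {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            // Assumption: #14417 exposes segmentInfos.counter advancement as
            // advanceSegmentInfosCounter(long); verify against the actual PR.
            // New segment file names (_N) are derived from this counter, so
            // advancing it reserves a disjoint name range for the new primary.
            writer.advanceSegmentInfosCounter(100_000L);
        }
    }
}
```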

Proposed Solution

Just like document replication, the new primary shard uses InternalEngine directly when it starts. At the same time, we advance the segmentInfos.counter of the new primary shard based on the totalTranslogOps estimated by the old primary shard (a sketch follows the definitions below):

new_primary_advanced_segment_counter = base_segment_counter + totalTranslogOps * retry_count
  • new_primary_advanced_segment_counter is the segment counter the new primary shard uses to create its InternalEngine.
  • base_segment_counter is the segment counter copied from the old primary shard in peer recovery phase 1.
  • totalTranslogOps is the number of translog operations estimated by the old primary shard.
  • retry_count is the number of relocation attempts; the starting value is 1.
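
The advancement itself is simple arithmetic; the sketch below restates the formula (all names here are illustrative, not actual OpenSearch identifiers):

```java
// Illustrative sketch of the proposed counter advancement; names are ours,
// not actual OpenSearch identifiers.
final class SegmentCounterAdvancer {
    /**
     * @param baseSegmentCounter counter copied from the old primary in phase 1
     * @param totalTranslogOps   translog ops estimated by the old primary
     * @param retryCount         number of relocation attempts, starting at 1
     */
    static long advance(long baseSegmentCounter, long totalTranslogOps, int retryCount) {
        return baseSegmentCounter + totalTranslogOps * retryCount;
    }
}
```

For example, with a base counter of 5,000, an estimated 20,000 translog ops, and a first attempt (retry_count = 1), the new primary would open its InternalEngine at counter 25,000.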

Because the step size is at least totalTranslogOps, even if every write operation generated a new segment, conflicts would be unlikely. However, because new write requests keep arriving during peer recovery, the old primary shard must verify that its local segment counter has not exceeded new_primary_advanced_segment_counter, and cancel the peer recovery if it has.
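
A possible shape of that check on the old primary, shown only to pin down the condition (names are hypothetical):

```java
// Hypothetical conflict check on the old primary before hand-off. Writes keep
// arriving during peer recovery, so the old primary's local counter grows; if
// it reaches the range reserved for the new primary, fresh segment files could
// collide after promotion.
static void ensureNoCounterConflict(long oldPrimaryLocalCounter,
                                    long newPrimaryAdvancedCounter) {
    if (oldPrimaryLocalCounter >= newPrimaryAdvancedCounter) {
        // Cancel this peer recovery; the next attempt retries with a larger step.
        throw new IllegalStateException("segment counter conflict during relocation");
    }
}
```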

Failover

If such a failure occurs, it means the chosen step size could not cover the incremental writes that arrived during peer recovery. We then increase retry_count to expand the step size for the next relocation attempt.
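
In pseudocode terms, the retry handling might look like the following sketch; tryRelocate is a hypothetical placeholder for one full relocation attempt, and in practice the base counter and translog estimate would be refreshed on each attempt:

```java
import java.util.function.LongPredicate;

// Hypothetical retry loop: each failed attempt widens the reserved counter range.
static long relocateWithRetry(long baseSegmentCounter, long totalTranslogOps,
                              LongPredicate tryRelocate) {
    int retryCount = 1;
    while (true) {
        long advanced = baseSegmentCounter + totalTranslogOps * retryCount;
        if (tryRelocate.test(advanced)) {
            return advanced; // hand-off succeeded without a counter conflict
        }
        retryCount++; // step grows to totalTranslogOps * retryCount next time
    }
}
```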

Because ReplicationTracker#primaryMode remains false until the new primary shard is promoted, the new primary will not publish checkpoints to other replicas. It is therefore safe to cancel the recovery process when the old primary shard detects a conflict.

Results

We used the same test setup to verify the effect of the proposed solution.

(Figure: indexing TPS during primary shard relocation with the proposed solution)

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response
