
[RFC][segment replication] Introduce a heuristic method to avoid long-term write blocking during primary shard relocation #18355

@guojialiang92

Description

Is your feature request related to a problem? Please describe

Segment replication, as a core capability behind the community's storage-compute separation and read-write separation, has become increasingly important. When segment replication is enabled, replica shards do not build segments themselves, which saves substantial CPU. It is therefore widely used in production environments.

In production practice, we found that during primary shard relocation, indexing TPS (docs/s) may drop to 0 for a long time. Scaling nodes is a routine operation in production clusters, so primary shard relocation happens frequently; this prompted us to investigate the cause and find a solution.

As is well known, primary shard relocation goes through peer recovery. In the document-replication scenario, the new primary uses InternalEngine and builds segments during the migration. The segment-replication scenario is somewhat different: to avoid file conflicts, #5344 introduced a round of forced segment replication.

For convenience, the description of #5344 is quoted here:

> Update the peer recovery logic to work with segment replication. The existing primary relocation is broken because of segment file conflicts. This happens because the target (new primary) is started with InternalEngine and writes its own version of segment files, which later conflict with the file copies on replicas (copied from the older primary) when this target is promoted to primary.
> This change patches the recovery flow by first recovering the primary target with the NRTReplicationEngine (to prevent the target from generating segment files). After phase 2, before finalizeRecovery and hand-off, the target (new primary) is forced to refresh segment files from the older primary (to catch up with other replicas), followed by switching the engine to InternalEngine.

This method has been validated for a long time and does avoid file conflicts. However, it also has costs, such as blocking writes during the forced segment replication phase (write blocking is necessary so the target can catch up with the latest checkpoint before the hand-off phase).

We tested with three 16C64GB nodes, the nyc_taxis workload, 1 primary shard, and 1 replica shard. During indexing, we relocated the primary shard to reproduce the write TPS drop.

(Figure: indexing TPS dropping to 0 during primary shard relocation)

Describe the solution you'd like

Background

Lucene now supports modifying segmentInfos.counter through IndexWriter (apache/lucene#14417). We believe this makes it possible to avoid file conflicts without affecting writes.
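
To make the idea concrete, here is a minimal sketch of advancing the counter through IndexWriter. The method name advanceSegmentInfosCounter is our reading of #14417 and is an assumption; the exact signature should be confirmed against the merged PR.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class AdvanceCounterSketch {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            // Assumption: #14417 exposes segmentInfos.counter advancement as
            // advanceSegmentInfosCounter(long); verify against the actual PR.
            // New segment file names (_N) are derived from this counter, so
            // advancing it reserves a disjoint name range for the new primary.
            writer.advanceSegmentInfosCounter(100_000L);
        }
    }
}
```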

Proposed Solution

Just like document replication, the new primary shard uses InternalEngine directly when it starts. At the same time, we advance the segmentInfos.counter of the new primary shard based on the totalTranslogOps estimated by the old primary shard (a sketch follows the definitions below):

new_primary_advanced_segment_counter = base_segment_counter + totalTranslogOps * retry_count
  • new_primary_advanced_segment_counter is the segment counter the new primary shard uses to create its InternalEngine.
  • base_segment_counter is the segment counter copied from the old primary shard in peer recovery phase 1.
  • totalTranslogOps is the number of translog operations estimated by the old primary shard.
  • retry_count is the number of relocation attempts; the starting value is 1.
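
The advancement itself is simple arithmetic; the sketch below restates the formula (all names here are illustrative, not actual OpenSearch identifiers):

```java
// Illustrative sketch of the proposed counter advancement; names are ours,
// not actual OpenSearch identifiers.
final class SegmentCounterAdvancer {
    /**
     * @param baseSegmentCounter counter copied from the old primary in phase 1
     * @param totalTranslogOps   translog ops estimated by the old primary
     * @param retryCount         number of relocation attempts, starting at 1
     */
    static long advance(long baseSegmentCounter, long totalTranslogOps, int retryCount) {
        return baseSegmentCounter + totalTranslogOps * retryCount;
    }
}
```

For example, with a base counter of 5,000, an estimated 20,000 translog ops, and a first attempt (retry_count = 1), the new primary would open its InternalEngine at counter 25,000.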

Because the step size is at least totalTranslogOps, even if every write operation generated a new segment, conflicts would be unlikely. However, because new write requests keep arriving during peer recovery, the old primary shard must verify that its local segment counter has not exceeded new_primary_advanced_segment_counter, and cancel the peer recovery if it has.
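
A possible shape of that check on the old primary, shown only to pin down the condition (names are hypothetical):

```java
// Hypothetical conflict check on the old primary before hand-off. Writes keep
// arriving during peer recovery, so the old primary's local counter grows; if
// it reaches the range reserved for the new primary, fresh segment files could
// collide after promotion.
static void ensureNoCounterConflict(long oldPrimaryLocalCounter,
                                    long newPrimaryAdvancedCounter) {
    if (oldPrimaryLocalCounter >= newPrimaryAdvancedCounter) {
        // Cancel this peer recovery; the next attempt retries with a larger step.
        throw new IllegalStateException("segment counter conflict during relocation");
    }
}
```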

Failover

If such a failure occurs, it means the chosen step size could not cover the incremental writes that arrived during peer recovery. We then increase retry_count to expand the step size for the next relocation attempt.
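
In pseudocode terms, the retry handling might look like the following sketch; tryRelocate is a hypothetical placeholder for one full relocation attempt, and in practice the base counter and translog estimate would be refreshed on each attempt:

```java
import java.util.function.LongPredicate;

// Hypothetical retry loop: each failed attempt widens the reserved counter range.
static long relocateWithRetry(long baseSegmentCounter, long totalTranslogOps,
                              LongPredicate tryRelocate) {
    int retryCount = 1;
    while (true) {
        long advanced = baseSegmentCounter + totalTranslogOps * retryCount;
        if (tryRelocate.test(advanced)) {
            return advanced; // hand-off succeeded without a counter conflict
        }
        retryCount++; // step grows to totalTranslogOps * retryCount next time
    }
}
```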

Because ReplicationTracker#primaryMode remains false until the new primary shard is promoted, the new primary will not publish checkpoints to other replicas. It is therefore safe to cancel the recovery process when the old primary shard detects a conflict.

Results

We used the same test setup to verify the effect of the proposed solution.

(Figure: indexing TPS during primary shard relocation with the proposed solution)

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response
