Harden etcd manifests post-provisioning: set --initial-cluster-state and --initial-cluster fields

### What would you like to be added

After initial cluster provisioning completes, kubespray should update the etcd static pod manifests on each control plane node so that:

1. `--initial-cluster` lists all control plane peers (not just the local node)
2. `--initial-cluster-state` is set to `existing` (not `new`)

These flags are only consulted by etcd when it starts with an empty data directory. Once initialized, etcd reads cluster membership from its WAL. But when the data directory is lost (disk failure, filesystem corruption, node rebuild), these flags become authoritative again. The current manifests tell each node to bootstrap as an independent single-member cluster, which causes a silent split-brain.

The change should be a post-provisioning step (or a day-two reconciliation in the etcd manifest templates) that runs after all etcd members have successfully joined the cluster. The day-one behavior during initial bootstrap should remain unchanged.
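
To make the requested rewrite concrete, here is a minimal Python sketch of the transformation applied to an etcd argument list such as the `command` array of a static pod manifest. It is illustrative only, not how kubespray renders its templates: the function name `harden_etcd_args`, the member names, and the peer URLs are all hypothetical.

```python
def harden_etcd_args(args, peers):
    """Return a copy of `args` with the two bootstrap flags rewritten.

    `args` is a list of "--flag=value" strings (hypothetical input shape).
    `peers` maps member name -> peer URL for every control plane member.
    """
    # etcd expects --initial-cluster as comma-separated name=peer-url pairs.
    initial_cluster = ",".join(f"{name}={url}" for name, url in peers.items())
    overrides = {
        "--initial-cluster": initial_cluster,
        "--initial-cluster-state": "existing",
    }

    hardened = []
    seen = set()
    for arg in args:
        # Split on the first "=" only, since the value itself contains "=".
        flag, sep, _ = arg.partition("=")
        if sep and flag in overrides:
            hardened.append(f"{flag}={overrides[flag]}")
            seen.add(flag)
        else:
            hardened.append(arg)

    # Append any flag that was absent from the original argument list.
    for flag, value in overrides.items():
        if flag not in seen:
            hardened.append(f"{flag}={value}")
    return hardened


if __name__ == "__main__":
    day_one = [
        "etcd",
        "--name=cp1",
        "--initial-cluster=cp1=https://10.0.0.1:2380",
        "--initial-cluster-state=new",
    ]
    all_peers = {
        "cp1": "https://10.0.0.1:2380",
        "cp2": "https://10.0.0.2:2380",
        "cp3": "https://10.0.0.3:2380",
    }
    for line in harden_etcd_args(day_one, all_peers):
        print(line)
```

The sketch leaves every other flag untouched (including `--initial-cluster-token`, which shares a prefix with the flags being rewritten) and does not mutate its input, so the day-one argument list stays available for comparison.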

### Why is this needed

When a kubeadm/kubespray-provisioned etcd node loses its data directory and restarts, the current manifest configuration (`--initial-cluster-state=new` with only the local node in `--initial-cluster`) causes it to silently bootstrap a brand-new single-node cluster with a fresh cluster ID. This creates two distinct failure modes:

**Partial data loss (subset of nodes)**: Surviving nodes with intact data cannot automatically reform quorum because their manifests only reference themselves. They get stuck in leaderless PreVote loops despite having valid data, and recovery requires manual `--force-new-cluster` intervention followed by `etcdctl member add` for each peer. With a corrected `--initial-cluster` listing all peers, surviving nodes could potentially re-establish quorum automatically once their peers recover.

**Total data loss (all nodes)**: Each node independently bootstraps a new single-node cluster with only the bare kubeadm skeleton state (~250–300 keys), and the API server silently serves empty/wrong data. With `--initial-cluster-state=existing`, the nodes would instead crash-loop — failing loudly rather than silently serving incorrect state. This leaves operators free to choose a deliberate recovery path (restore from backup or re-bootstrap).


We've hit this failure mode repeatedly across multiple clusters, all triggered by shared storage fabric disruptions that wipe or corrupt etcd data directories on one or more control plane nodes simultaneously. Each incident required manual `--force-new-cluster` recovery and sequential member re-addition — a process that is error-prone and has itself led to inconsistent manifests (e.g. `--initial-cluster` missing a peer after an incomplete re-add).
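
The inconsistent-manifest case mentioned above (an `--initial-cluster` value missing a peer) is the kind of drift a small check could catch. The following Python sketch assumes the flag value has already been extracted from the manifest as a string. The helper names `parse_initial_cluster` and `missing_peers` and the member names are hypothetical.

```python
def parse_initial_cluster(value):
    """Parse a comma-separated list of name=peer-url pairs into a dict.

    A member name may appear more than once (one entry per peer URL),
    so each name maps to a list of URLs.
    """
    members = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first "=" only; the URL part is kept verbatim.
        name, sep, url = entry.partition("=")
        if not sep or not name or not url:
            raise ValueError(f"malformed --initial-cluster entry: {entry!r}")
        members.setdefault(name, []).append(url)
    return members


def missing_peers(value, expected_names):
    """Return the expected member names absent from the flag value, sorted."""
    present = parse_initial_cluster(value)
    return sorted(set(expected_names) - set(present))


if __name__ == "__main__":
    flag_value = "cp1=https://10.0.0.1:2380,cp2=https://10.0.0.2:2380"
    print(missing_peers(flag_value, ["cp1", "cp2", "cp3"]))
```

Running such a check on every control plane node after a manual recovery would flag an incomplete re-add before the next data-directory loss exposes it.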

**Trade-off**: A node with `--initial-cluster-state=existing` and an empty data directory will crash-loop until it is explicitly re-added via `etcdctl member add`. This is strictly better than the current behavior (silently creating an independent cluster), but recovery still requires manual intervention. The key improvement is that the failure becomes visible and deterministic rather than silent and data-corrupting.
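
The behavior described in this issue can be summarized as a small decision table. The Python sketch below is a simplified model of the outcomes as stated above, not a reproduction of etcd's startup code. The `Outcome` enum and the `startup_outcome` function are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    RESUME_FROM_DATA_DIR = "resume using membership already stored on disk"
    BOOTSTRAP_NEW_CLUSTER = "silently bootstrap a new cluster with a fresh cluster ID"
    FAIL_UNTIL_READDED = "crash-loop until re-added via etcdctl member add"


def startup_outcome(data_dir_intact: bool, initial_cluster_state: str) -> Outcome:
    """Model what a restarting etcd member does, per the issue's description."""
    if initial_cluster_state not in ("new", "existing"):
        raise ValueError(f"unknown state: {initial_cluster_state!r}")
    if data_dir_intact:
        # The bootstrap flags are ignored once the data directory exists.
        return Outcome.RESUME_FROM_DATA_DIR
    if initial_cluster_state == "new":
        # Current manifests: the failure is silent.
        return Outcome.BOOTSTRAP_NEW_CLUSTER
    # Proposed manifests: the failure is loud and needs an operator.
    return Outcome.FAIL_UNTIL_READDED


if __name__ == "__main__":
    for intact in (True, False):
        for state in ("new", "existing"):
            result = startup_outcome(intact, state)
            print(f"data_dir_intact={intact!s:5} state={state:8} -> {result.value}")
```

Only the bottom two rows of the table differ between current and proposed manifests, which is why the change has no effect on healthy nodes or on day-one bootstrap.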

Harden etcd manifests post-provisioning: set --initial-cluster-state and --initial-cluster fields #13261

What would you like to be added

Why is this needed

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Harden etcd manifests post-provisioning: set --initial-cluster-state and --initial-cluster fields #13261

Description

What would you like to be added

Why is this needed

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions