What would you like to be added
After initial cluster provisioning completes, kubespray should update the etcd static pod manifests on each control plane node so that:
--initial-cluster lists all control plane peers (not just the local node)
--initial-cluster-state is set to existing (not new)
These flags are only consulted by etcd when starting with an empty data directory — once initialized, etcd reads cluster membership from its WAL. But when the data directory is lost (disk failure, filesystem corruption, node rebuild), these flags become authoritative again. The current manifests tell each node to bootstrap as an independent single-member cluster, which causes silent split-brain.
The change should be a post-provisioning step (or a day-two reconciliation in the etcd manifest templates) that runs after all etcd members have successfully joined the cluster. The day-one behavior during initial bootstrap should remain unchanged.
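As a rough illustration of the reconciliation described above, the following Python sketch rewrites the two bootstrap flags in an etcd argument list. The function name reconcile_etcd_args, the member names, and the peer URLs are all hypothetical and not part of kubespray; the only detail taken from etcd itself is the name=peerURL,name=peerURL format of --initial-cluster. A real implementation would more likely live in the manifest templates.

```python
def reconcile_etcd_args(args, peers):
    """Return a copy of the etcd argument list with day-two bootstrap flags.

    args  -- list of strings such as the command list of the etcd container
    peers -- mapping of member name -> peer URL (hypothetical input; in
             practice this would come from the inventory or `etcdctl member list`)
    """
    # etcd expects --initial-cluster as "name1=url1,name2=url2,...".
    initial_cluster = ",".join(
        f"{name}={url}" for name, url in sorted(peers.items())
    )
    desired = {
        "--initial-cluster": initial_cluster,
        "--initial-cluster-state": "existing",
    }

    result = []
    seen = set()
    for arg in args:
        # Split on the first "=" only, since the value itself contains "=".
        flag, sep, _ = arg.partition("=")
        if sep and flag in desired:
            result.append(f"{flag}={desired[flag]}")
            seen.add(flag)
        else:
            result.append(arg)

    # Add either flag if the original argument list did not set it at all.
    for flag, value in desired.items():
        if flag not in seen:
            result.append(f"{flag}={value}")
    return result


# Hypothetical three-node control plane; names and addresses are made up.
before = [
    "etcd",
    "--name=cp1",
    "--initial-cluster=cp1=https://10.0.0.1:2380",
    "--initial-cluster-state=new",
]
peers = {
    "cp1": "https://10.0.0.1:2380",
    "cp2": "https://10.0.0.2:2380",
    "cp3": "https://10.0.0.3:2380",
}
after = reconcile_etcd_args(before, peers)
```

The sketch leaves every other argument untouched and in its original position, which matches the requirement that only these two flags change after provisioning.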
Why is this needed
When a kubeadm/kubespray-provisioned etcd node loses its data directory and restarts, the current manifest configuration (--initial-cluster-state=new with only the local node in --initial-cluster) causes it to silently bootstrap a brand-new single-node cluster with a fresh cluster ID. This creates two distinct failure modes:
Partial data loss (subset of nodes): Surviving nodes with intact data cannot automatically re-form quorum because their manifests only reference themselves. They get stuck in leaderless PreVote loops despite having valid data, and recovery requires manual --force-new-cluster intervention followed by etcdctl member add for each peer. With a corrected --initial-cluster listing all peers, surviving nodes could potentially re-establish quorum automatically once their peers recover.
Total data loss (all nodes): Each node independently bootstraps a new single-node cluster with only the bare kubeadm skeleton state (~250–300 keys), and the API server silently serves empty/wrong data. With --initial-cluster-state=existing, the nodes would instead crash-loop — failing loudly rather than silently serving incorrect state. This leaves operators free to choose a deliberate recovery path (restore from backup or re-bootstrap).
We've hit this failure mode repeatedly across multiple clusters, all triggered by shared storage fabric disruptions that wipe or corrupt etcd data directories on one or more control plane nodes simultaneously. Each incident required manual --force-new-cluster recovery and sequential member re-addition — a process that is error-prone and has itself led to inconsistent manifests (e.g. --initial-cluster missing a peer after an incomplete re-add).
Trade-off: A node with --initial-cluster-state=existing and an empty data directory will crash-loop until it is explicitly re-added via etcdctl member add. This is strictly better than the current behavior (silently creating an independent cluster), but recovery still requires manual intervention. The key improvement is that the failure becomes visible and deterministic rather than silent and data-corrupting.
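The outcomes discussed in this issue can be summarised as a small decision function. This is only a model of the behaviour as described here, not etcd's actual startup logic; the function name, parameter names, and outcome labels are invented for illustration.

```python
def expected_startup(data_dir_empty, initial_cluster_state, re_added):
    """Model the startup outcome described in this issue.

    data_dir_empty        -- True if the etcd data directory was lost
    initial_cluster_state -- "new" or "existing", as set in the manifest
    re_added              -- True if the member was re-added to the cluster
                             (e.g. via `etcdctl member add`) before restart
    """
    if not data_dir_empty:
        # Bootstrap flags are not consulted; membership comes from the WAL.
        return "rejoin-from-wal"
    if initial_cluster_state == "new":
        # Current manifests: silent bootstrap of an independent cluster.
        return "bootstrap-new-cluster"
    if re_added:
        # Proposed manifests, after the operator has re-added the member.
        return "join-existing-cluster"
    # Proposed manifests, before manual intervention: fails loudly.
    return "crash-loop"


# Enumerate the cases for a node that has lost its data directory.
for state in ("new", "existing"):
    for re_added in (False, True):
        print(state, re_added, expected_startup(True, state, re_added))
```

The point of the model is the last branch: with the proposed flags, an empty data directory leads to a visible failure rather than a fresh cluster ID.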